documentation/PRDs/fleet-infrastructure-recovery.md
jarvis 484b2e6272 DNS topology: AdGuard removed, Technitium authoritative + DoT + ad blocking
- Remove AdGuard Home from all service catalogs, deployment phases,
  persistence tables, and network architecture docs
- Update Technitium notes: authoritative .ai.home zone, recursive resolver,
  DoT forwarder to Cloudflare (tls://1.1.1.1), built-in ad blocking
- Resolve open questions #2 (Technitium upstream) and #3 (AdGuard layout)
- Add dns-topology.md: complete DNS architecture diagram, zone details,
  client assignments, Tailscale integration, troubleshooting table,
  migration history (AdGuard deployed → paused → removed)
2026-05-29 21:01:24 -04:00


# Iron Legion Fleet Infrastructure Recovery — PRD
**Date:** 2026-05-27
**Author:** Artemis
**Status:** Approved / In Progress

---
## Problem Statement
Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. The issues fail independently of one another, but some share root causes (Docker networking, TLS, service wiring).
## Success Criteria
| # | Criterion | Acceptable |
|---|-----------|------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | ~~AdGuard~~ | ~~Container stopped, Homepage shows no AdGuard tile~~ ~~Removed~~ Technitium handles ad blocking |
| 4 | Traefik TLS | HTTPS works on `*.ai.home` with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |
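
Criterion 2 can be partly spot-checked from any node. The sketch below is a minimal reachability probe, assuming only that Technitium's web/API port is 5380 as stated in the table. The helper name `port_open` is illustrative. A successful TCP connect shows that something is listening on the port, not that the API itself is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A call such as `port_open("<technitium-host>", 5380)` would be run against whichever node hosts the container; the host placeholder is deliberately left unfilled here.
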
## Scope
**In scope:** Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.
**Out of scope:** Migrating services between nodes, adding new services, re-architecting network topology.
## Constraints
- No Docker or nginx proxies — bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes
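
The rule that every swarm compose file exists on every node lends itself to a mechanical check. The sketch below assumes the per-node file listings have already been collected somehow (for example over SSH); how they are gathered is out of scope here. The function name and the file names in the usage note are hypothetical.

```python
def missing_compose_files(inventory: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each node to the compose files it lacks.

    `inventory` maps a node name to the set of compose file names found on it.
    The expected set is the union of files seen on any node, so a file that
    exists somewhere is expected everywhere.
    """
    expected: set[str] = set().union(*inventory.values())
    gaps: dict[str, list[str]] = {}
    for node, files in inventory.items():
        missing = expected - files
        if missing:
            gaps[node] = sorted(missing)
    return gaps
```

With an inventory of `{"MK7": {"traefik.yml", "beszel.yml"}, "MK33": {"traefik.yml"}}`, the function reports that MK33 is missing `beszel.yml`. An empty result means the constraint holds for the files that were listed.
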
## Execution Plan (Chunks)
| Chunk | Task | Estimated Time |
|-------|------|---------------|
| **A** | Discovery — scan fleet, identify what's running vs. configured | 15 min |
| **B** | AdGuard shutdown + Homepage cleanup | 10 min |
| **C** | Portainer admin reset | 10 min |
| **D** | Beszel agent deployment (all nodes) | 30 min |
| **E** | Prometheus 5 down targets — diagnose + fix | 20 min |
| **F** | Technitium API — container + port + auth | 15 min |
| **G** | Traefik TLS → Authelia enable | 30 min |
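
For Chunk E, the Prometheus HTTP API exposes scrape state at `/api/v1/targets`, with each entry under `data.activeTargets` carrying `health`, `labels`, `scrapeUrl`, and `lastError` fields. The sketch below filters an already-parsed response down to the targets reported as down. Fetching the JSON is left out, and the function name is illustrative rather than part of this plan.

```python
def down_targets(payload: dict) -> list[dict]:
    """Extract targets whose health is "down" from a /api/v1/targets response."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": t.get("labels", {}).get("job", ""),
            "url": t.get("scrapeUrl", ""),
            "error": t.get("lastError", ""),
        }
        for t in active
        if t.get("health") == "down"
    ]
```

An empty list corresponds to the "0 targets down" success criterion. Targets that are expected to be offline (see the open question about Shield and MK44) would still appear here until they are removed from the scrape config or otherwise accounted for.
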
## Open Questions
1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
3. Should Beszel monitor Docker containers per-node or just node-level metrics?
---
## Current Fleet State (To Be Updated by Chunk A)
| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|-------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

*Note: Populate IP/status data during Chunk A discovery.*
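
To track what Chunk A still has to fill in, the table above can be scanned for `?` placeholders. This is a minimal sketch that assumes the table keeps its current pipe-delimited layout with a header row and a separator row. The function name is hypothetical.

```python
def unpopulated_cells(table: str) -> dict[str, list[str]]:
    """Map each node name to the column headers whose cell is still "?"."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    gaps: dict[str, list[str]] = {}
    for row in body:
        missing = [col for col, cell in zip(header, row) if cell == "?"]
        if missing:
            gaps[row[0]] = missing
    return gaps
```

Run against the current table, MK7 would be reported as missing its Tailscale IP and Status, while Artemis would be missing only its Status.
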