- Remove AdGuard Home from all service catalogs, deployment phases, persistence tables, and network architecture docs
- Update Technitium notes: authoritative .ai.home zone, recursive resolver, DoT forwarder to Cloudflare (tls://1.1.1.1), built-in ad blocking
- Resolve open questions #2 (Technitium upstream) and #3 (AdGuard layout)
- Add dns-topology.md: complete DNS architecture diagram, zone details, client assignments, Tailscale integration, troubleshooting table, migration history (AdGuard deployed → paused → removed)
Iron Legion Fleet Infrastructure Recovery — PRD
Date: 2026-05-27
Author: Artemis
Status: Approved / In Progress
Problem Statement
Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each issue is independently broken, but some share root causes (Docker networking, TLS, service wiring).
Success Criteria
| # | Component | Acceptance criterion |
|---|---|---|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | ||
| 4 | Traefik TLS | HTTPS works on *.ai.home with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |
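The "0 targets down" criterion can be checked mechanically. The following is a minimal sketch, assuming Prometheus exposes its standard HTTP API (`/api/v1/targets`) and that any target whose health is not `up` (including `unknown`) counts as down. The helper names `down_targets` and `fetch_targets` are illustrative, and the PRD does not state where Prometheus listens, so the base URL is left to the caller.

```python
import json
import urllib.request


def down_targets(payload: dict) -> list[str]:
    """Return the scrape URLs of active targets whose health is not 'up'.

    `payload` is the parsed JSON body of GET /api/v1/targets.
    Assumption: 'unknown' health is treated the same as 'down'.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        t.get("scrapeUrl", "<unknown>")
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the target list from a Prometheus server at `base_url`."""
    url = f"{base_url.rstrip('/')}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `down_targets(fetch_targets(base_url))` with the real Prometheus address should return an empty list once criterion 6 is met. Targets that are expected to be down (see Open Questions) would need to be filtered out separately.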
Scope
In scope: Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.
Out of scope: Migrating services between nodes, adding new services, re-architecting network topology.
Constraints
- No Docker or nginx proxies — bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes
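The rule that every swarm compose file must exist on all nodes lends itself to a simple consistency check. The sketch below assumes an inventory has already been collected by some other means (for example, a directory listing per node); how it is gathered, the function name `missing_compose_files`, and the file names in the example are all hypothetical.

```python
def missing_compose_files(inventory: dict[str, set[str]]) -> dict[str, set[str]]:
    """Report which compose files each node lacks.

    `inventory` maps a node name to the set of compose file names found
    on that node. The reference set is the union across all nodes, so a
    file present anywhere is expected everywhere.
    """
    all_files: set[str] = set().union(*inventory.values())
    report = {}
    for node, files in inventory.items():
        missing = all_files - files
        if missing:
            report[node] = missing
    return report


# Example with hypothetical file names:
example = {
    "MK7": {"traefik.yml", "beszel.yml"},
    "MK33": {"traefik.yml"},
    "MK34": {"traefik.yml", "beszel.yml"},
}
print(missing_compose_files(example))  # {'MK33': {'beszel.yml'}}
```

An empty result means the constraint holds for the inventory supplied. The check compares file names only, not file contents.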
Execution Plan (Chunks)
| Chunk | Task | Estimated Time |
|---|---|---|
| A | Discovery — scan fleet, identify what's running vs. configured | 15 min |
| B | AdGuard shutdown + Homepage cleanup | 10 min |
| C | Portainer admin reset | 10 min |
| D | Beszel agent deployment (all nodes) | 30 min |
| E | Prometheus 5 down targets — diagnose + fix | 20 min |
| F | Technitium API — container + port + auth | 15 min |
| G | Traefik TLS → Authelia enable | 30 min |
Open Questions
- Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
- Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
- Should Beszel monitor Docker containers per-node or just node-level metrics?
Current Fleet State (To Be Updated by Chunk A)
| Node | Role | Tailscale IP | LAN IP | Status |
|---|---|---|---|---|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |
Note: Populate IP/status data during Chunk A discovery.
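One way to fill the Status column during Chunk A is a plain TCP reachability probe. This is a minimal sketch using only the standard library; the function names are illustrative, and the choice of port is an assumption (the PRD names port 5380 for the Technitium API but does not say which node hosts it).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_nodes(nodes: dict[str, str], port: int, timeout: float = 2.0) -> dict[str, bool]:
    """Probe one port on each node; `nodes` maps node name to IP address."""
    return {name: is_port_open(ip, port, timeout) for name, ip in nodes.items()}
```

For the two nodes whose LAN IPs are already known, a call such as `probe_nodes({"MK7": "192.168.7.7", "Artemis": "192.168.15.182"}, port=22)` would report SSH reachability, assuming SSH runs on its default port. A closed port only shows that nothing answered there; it does not by itself mean the node is down.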