# Iron Legion Fleet Infrastructure Recovery: PRD

**Date:** 2026-05-27
**Author:** Artemis
**Status:** Approved / In Progress

---

## Problem Statement

Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each issue is independently broken, but some share root causes (Docker networking, TLS, service wiring).

## Success Criteria

| # | Criterion | Acceptable |
|---|-----------|------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | AdGuard | Container stopped, Homepage shows no AdGuard tile |
| 4 | Traefik TLS | HTTPS works on `*.ai.home` with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |

## Scope

**In scope:** Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.

**Out of scope:** Migrating services between nodes, adding new services, re-architecting network topology.

## Constraints

- No Docker or nginx proxies; bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes

## Execution Plan (Chunks)

| Chunk | Task | Estimated Time |
|-------|------|----------------|
| **A** | Discovery: scan fleet, identify what's running vs. configured | 15 min |
| **B** | AdGuard shutdown + Homepage cleanup | 10 min |
| **C** | Portainer admin reset | 10 min |
| **D** | Beszel agent deployment (all nodes) | 30 min |
| **E** | Prometheus 5 down targets: diagnose + fix | 20 min |
| **F** | Technitium API: container + port + auth | 15 min |
| **G** | Traefik TLS → Authelia enable | 30 min |

## Open Questions

1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
3. Should Beszel monitor Docker containers per-node or just node-level metrics?

---

## Current Fleet State (To Be Updated by Chunk A)

| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|--------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

*Note: Populate IP/status data during Chunk A discovery.*
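
One possible way to check Success Criterion 6 during Chunk E is to read the Prometheus `/api/v1/targets` endpoint and split the non-healthy targets into "unexpected" and "expected" groups. The sketch below is illustrative only and rests on several assumptions that are not confirmed by this PRD: that `instance` labels use node hostnames (such as `shield:9100`) rather than IPs, that the expected-down set is Shield and MK44 (Open Question 2 is still unresolved), and that the Prometheus base URL is supplied by the caller. The helper names `down_targets` and `fetch_targets` are hypothetical.

```python
import json
import urllib.request

# Assumption: nodes that may legitimately be down (pending Open Question 2).
EXPECTED_DOWN = {"shield", "mk44"}


def down_targets(payload, expected_down=EXPECTED_DOWN):
    """Split non-healthy targets from a /api/v1/targets payload.

    Returns (unexpected, expected): two lists of dicts with the keys
    job, instance, health, and error.
    """
    unexpected, expected = [], []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") == "up":
            continue  # healthy target, nothing to report
        labels = target.get("labels", {})
        entry = {
            "job": labels.get("job", ""),
            "instance": labels.get("instance", ""),
            "health": target.get("health", "unknown"),
            "error": target.get("lastError", ""),
        }
        # Assumption: the instance label looks like "hostname:port".
        host = entry["instance"].split(":")[0].lower()
        if host in expected_down:
            expected.append(entry)
        else:
            unexpected.append(entry)
    return unexpected, expected


def fetch_targets(base_url):
    """Fetch the targets payload from a Prometheus server (URL supplied by caller)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hand-written sample payload for illustration; not real fleet data.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"job": "node", "instance": "mk7:9100"},
                 "health": "up", "lastError": ""},
                {"labels": {"job": "node", "instance": "shield:9100"},
                 "health": "down", "lastError": "connection refused"},
            ]
        },
    }
    unexpected, expected = down_targets(sample)
    print("unexpected down:", unexpected)
    print("expected down:", expected)
```

If Bobby confirms that Shield and MK44 are allowed to stay down, the criterion could be read as "the unexpected list is empty"; otherwise `expected_down` can be passed as an empty set so every non-healthy target counts.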