Iron Legion Fleet Infrastructure Recovery — PRD

Date: 2026-05-27
Author: Artemis
Status: Approved / In Progress


Problem Statement

Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each issue presents as a separate failure, but some share root causes (Docker networking, TLS, service wiring).

Success Criteria

| # | Criterion | Acceptable |
|---|-----------|------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | AdGuard | |
| 4 | Traefik TLS | HTTPS works on *.ai.home with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |
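
Criterion 6 can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the Prometheus server exposes its standard HTTP API endpoint `/api/v1/targets`. The function names are illustrative, and the base URL passed to `fetch_targets` is a placeholder that this PRD does not specify. Targets reporting `unknown` health are treated as not up, which is a choice made here and not a rule from the PRD.

```python
import json
from urllib.request import urlopen


def down_targets(payload: dict) -> list[str]:
    """Return the scrape URLs of active targets whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        t.get("scrapeUrl", "<unknown>")
        for t in targets
        if t.get("health") != "up"  # 'down' and 'unknown' both count as failing
    ]


def fetch_targets(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the targets payload; base_url is an assumed placeholder."""
    url = f"{base_url.rstrip('/')}/api/v1/targets"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Once the criterion is met, `down_targets(fetch_targets(...))` against the fleet's Prometheus instance should return an empty list.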

Scope

In scope: Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.

Out of scope: Migrating services between nodes, adding new services, re-architecting network topology.

Constraints

  • No Docker or nginx proxies — bare metal + Docker Engine only
  • All swarm compose files must exist on ALL nodes per Bobby's rule
  • Stacks deploy ONLY on MK7 (manager)
  • TLS must work for local .ai.home domains (no public DNS)
  • Bobby reviews configs before destructive changes
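
The rule that every compose file must exist on every node lends itself to an automated check. Below is a minimal sketch, assuming a per-node file listing has already been gathered by some other means (how it is collected is not covered here). The function name and the file names in the usage example are hypothetical.

```python
def missing_compose_files(files_by_node: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each node to the compose files present elsewhere but absent on it.

    Nodes that already hold every file are omitted from the result.
    """
    # The union of all listings is the set every node is expected to hold.
    expected = set().union(*files_by_node.values())
    gaps = {}
    for node, files in files_by_node.items():
        missing = expected - files
        if missing:
            gaps[node] = missing
    return gaps


# Hypothetical inventory: MK33 lacks one file that the other nodes have.
inventory = {
    "MK7": {"traefik.yml", "beszel.yml"},
    "MK33": {"traefik.yml"},
    "MK34": {"traefik.yml", "beszel.yml"},
}
print(missing_compose_files(inventory))
```

An empty result means the constraint holds for the listings supplied. It says nothing about whether the file contents match across nodes.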

Execution Plan (Chunks)

| Chunk | Task | Estimated Time |
|-------|------|----------------|
| A | Discovery: scan fleet, identify what's running vs. configured | 15 min |
| B | AdGuard shutdown + Homepage cleanup | 10 min |
| C | Portainer admin reset | 10 min |
| D | Beszel agent deployment (all nodes) | 30 min |
| E | Prometheus 5 down targets: diagnose + fix | 20 min |
| F | Technitium API: container + port + auth | 15 min |
| G | Traefik TLS → Authelia enable | 30 min |
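
For Chunk F, a quick first step is to separate "nothing is listening" from "listening but auth is failing". This is a minimal sketch using only the Python standard library; the host is a placeholder to be filled in from discovery, and 5380 is the port named in the success criteria. A successful connection only shows that something accepts TCP connections on that port, not that the Technitium API itself is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

If `port_open(host, 5380)` returns False for the node hosting Technitium, the container or port mapping is the likelier culprit. If it returns True, attention can move to authentication.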

Open Questions

  1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for *.ai.home?
  2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
  3. Should Beszel monitor Docker containers per-node or just node-level metrics?

Current Fleet State (To Be Updated by Chunk A)

| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|--------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

Note: Populate IP/status data during Chunk A discovery.
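
One way to track what Chunk A still has to fill in is to hold the table as data and list the unknowns per node. This is a minimal sketch: the `Node` class and `pending` helper are illustrative names, and only three rows from the table are included to keep it short.

```python
from dataclasses import dataclass

UNKNOWN = "?"


@dataclass
class Node:
    name: str
    role: str
    tailscale_ip: str = UNKNOWN
    lan_ip: str = UNKNOWN
    status: str = UNKNOWN

    def unknown_fields(self) -> list[str]:
        """Names of the fields still marked '?' for this node."""
        fields = ("tailscale_ip", "lan_ip", "status")
        return [f for f in fields if getattr(self, f) == UNKNOWN]


# A subset of the fleet table above, with values copied as-is.
FLEET = [
    Node("MK7", "Swarm Manager / Docker", lan_ip="192.168.7.7"),
    Node("Artemis", "Dashboard / Orchestrator", "100.100.97.18", "192.168.15.182"),
    Node("Shield", "PXE Server", status="Powered off"),
]


def pending(fleet: list[Node]) -> dict[str, list[str]]:
    """Map each node name to the fields discovery still needs to populate."""
    return {n.name: n.unknown_fields() for n in fleet if n.unknown_fields()}


print(pending(FLEET))
```

Discovery is complete for a node once it no longer appears in the `pending` result.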