documentation/PRDs/fleet-infrastructure-recovery.md
jarvis a7e70726eb CLEAN audit complete + fleet infrastructure recovery PRD
- AUDIT_REPORT.md: Hermes environment audit results (~1GB recovered)
  - 80 skills archived, 2 broken profiles removed, cron cleanup
  - ARTEMIS.md consolidated, rule deduplication completed
- PRDs/fleet-infrastructure-recovery.md: 6-item recovery plan
  - Portainer, Technitium DNS, Prometheus, Traefik TLS, Beszel, AdGuard
2026-05-27 22:15:31 -04:00


Iron Legion Fleet Infrastructure Recovery — PRD

Date: 2026-05-27
Author: Artemis
Status: Approved / In Progress


Problem Statement

Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each service is broken in its own right and can be fixed separately, but some of the failures share root causes (Docker networking, TLS, service wiring).

Success Criteria

| # | Criterion | Acceptable |
|---|-----------|------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | AdGuard | Container stopped, Homepage shows no AdGuard tile |
| 4 | Traefik TLS | HTTPS works on *.ai.home with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |
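Criterion 6 ("0 targets down") can be checked mechanically. The following is a minimal sketch, assuming the Prometheus server exposes its standard HTTP API at `/api/v1/targets`; the base URL passed to `fetch_targets` and the example scrape URLs are hypothetical, not taken from the fleet. It treats any health value other than `up` (including `unknown`) as failing, which is stricter than the criterion as written.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return scrape URLs of active targets whose health is not 'up'.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return [t.get("scrapeUrl", "<unknown>") for t in targets if t.get("health") != "up"]


def fetch_targets(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the targets payload from a Prometheus server.

    `base_url` is an assumption (for example http://<prometheus-host>:9090).
    """
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Offline example with a hand-written payload (hypothetical hosts).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://node-a.example:9100/metrics", "health": "up"},
            {"scrapeUrl": "http://node-b.example:9100/metrics", "health": "down"},
        ]
    },
}
failing = unhealthy_targets(sample)
```

An empty `failing` list would correspond to the criterion being met; the alert pipeline half of the criterion is not covered by this check.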

Scope

In scope: Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.

Out of scope: Migrating services between nodes, adding new services, re-architecting network topology.

Constraints

  • No Docker or nginx proxies — bare metal + Docker Engine only
  • All swarm compose files must exist on ALL nodes per Bobby's rule
  • Stacks deploy ONLY on MK7 (manager)
  • TLS must work for local .ai.home domains (no public DNS)
  • Bobby reviews configs before destructive changes
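The rule that every swarm compose file must exist on all nodes lends itself to a simple parity check. This is a sketch only: it assumes the per-node file listings have already been collected by some other means (for example over SSH), and the compose file names in the example are hypothetical.

```python
def missing_compose_files(inventory: dict) -> dict:
    """Map each node to the compose files it lacks.

    `inventory` maps node name -> set of compose file names found on that node.
    The expected set is the union of files seen on any node.
    """
    expected = set().union(*inventory.values())
    gaps = {}
    for node, files in inventory.items():
        absent = expected - files
        if absent:
            gaps[node] = absent
    return gaps


# Hypothetical listing: MK33 is missing one file that MK7 has.
example = {
    "MK7": {"traefik.yml", "portainer.yml"},
    "MK33": {"traefik.yml"},
}
gaps = missing_compose_files(example)
```

Using the union as the reference set means a file that is missing everywhere goes unnoticed; comparing against the manager's listing instead would be an equally reasonable design.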

Execution Plan (Chunks)

| Chunk | Task | Estimated Time |
|-------|------|----------------|
| A | Discovery: scan fleet, identify what's running vs. configured | 15 min |
| B | AdGuard shutdown + Homepage cleanup | 10 min |
| C | Portainer admin reset | 10 min |
| D | Beszel agent deployment (all nodes) | 30 min |
| E | Prometheus 5 down targets: diagnose + fix | 20 min |
| F | Technitium API: container + port + auth | 15 min |
| G | Traefik TLS → Authelia enable | 30 min |

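As a quick sanity check on the plan, the per-chunk estimates can be summed. The figures are the ones listed in the plan; the variable names are illustrative.

```python
# Estimated minutes per chunk, copied from the execution plan.
estimates_min = {"A": 15, "B": 10, "C": 10, "D": 30, "E": 20, "F": 15, "G": 30}

total_min = sum(estimates_min.values())
hours, minutes = divmod(total_min, 60)
summary = f"{total_min} min ({hours} h {minutes} min)"
```

The total assumes the chunks run sequentially with no review pauses between them.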
Open Questions

  1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for *.ai.home?
  2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
  3. Should Beszel monitor Docker containers per-node or just node-level metrics?
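Whichever certificate source is chosen in question 1, the result can be verified the same way. The sketch below uses only Python's standard `ssl` and `socket` modules; the `cafile` argument is a hypothetical path to the root CA that ends up signing the *.ai.home certificates, and no hostname from the fleet is assumed.

```python
import socket
import ssl
from typing import Optional


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between `now_epoch` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / 86400


def fetch_not_after(host: str, port: int = 443,
                    cafile: Optional[str] = None, timeout: float = 5.0) -> str:
    """Connect with full verification and return the peer cert's notAfter.

    Raises ssl.SSLCertVerificationError if the chain or hostname is invalid.
    `cafile` (assumption) points at the local root CA, if one is used.
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A successful `fetch_not_after` call means the chain and hostname verified against the given CA, which is one way to read "valid cert" in criterion 4; `days_until_expiry` can then flag certificates close to expiring.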

Current Fleet State (To Be Updated by Chunk A)

| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|--------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

Note: Populate IP/status data during Chunk A discovery.
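One way to track what Chunk A still has to fill in is to hold the fleet table as data, with unknown cells as `None`. This is an illustrative sketch, not an existing tool: the `Node` class and its field names are hypothetical, and only three rows from the table are shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    role: str
    tailscale_ip: Optional[str] = None  # None stands for "?" in the table
    lan_ip: Optional[str] = None
    status: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of the columns that discovery still needs to populate."""
        return [f for f in ("tailscale_ip", "lan_ip", "status")
                if getattr(self, f) is None]


# Three rows copied from the fleet table; the rest follow the same pattern.
fleet = [
    Node("MK7", "Swarm Manager / Docker", lan_ip="192.168.7.7"),
    Node("Artemis", "Dashboard / Orchestrator", "100.100.97.18", "192.168.15.182"),
    Node("Shield", "PXE Server", status="Powered off"),
]

# Per-node checklist of cells that are still unknown.
todo = {n.name: n.missing_fields() for n in fleet if n.missing_fields()}
```

Discovery is complete for the table once `todo` is empty.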