# Iron Legion Fleet Infrastructure Recovery — PRD

**Date:** 2026-05-27
**Author:** Artemis
**Status:** Approved / In Progress

---

## Problem Statement

Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each issue fails on its own, but some share root causes (Docker networking, TLS, service wiring).

## Success Criteria

| # | Area | Acceptance Criterion |
|---|-----------|------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | AdGuard | Container stopped, Homepage shows no AdGuard tile |
| 4 | Traefik TLS | HTTPS works on `*.ai.home` with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |
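Criterion 4 depends on which hostnames a `*.ai.home` certificate actually covers. As a minimal sketch, assuming standard single-label wildcard matching: the wildcard covers exactly one label to the left of `ai.home`, so the bare domain and nested subdomains are not covered. The hostnames used here (for example `portainer.ai.home`) are illustrative and not taken from the fleet config.

```python
def wildcard_covers(hostname: str, pattern: str = "*.ai.home") -> bool:
    """Return True if a single-label wildcard pattern covers hostname."""
    host = hostname.lower().rstrip(".")
    pat = pattern.lower().rstrip(".")
    if not pat.startswith("*."):
        # No wildcard: only an exact match counts.
        return host == pat
    suffix = pat[1:]  # e.g. ".ai.home"
    if not host.endswith(suffix):
        return False
    left = host[: -len(suffix)]
    # The wildcard stands for exactly one non-empty label.
    return bool(left) and "." not in left


if __name__ == "__main__":
    # Hypothetical hostnames, used only to illustrate the rule.
    for name in ("portainer.ai.home", "ai.home", "a.b.ai.home"):
        print(name, wildcard_covers(name))
```

If any service ends up on the bare `ai.home` name or on a nested name, the certificate would need additional subject alternative names.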

## Scope

**In scope:** Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.

**Out of scope:** Migrating services between nodes, adding new services, re-architecting network topology.

## Constraints

- No Docker or nginx proxies — bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes
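The rule that every swarm compose file must exist on every node can be checked mechanically once a file listing per node is available. The sketch below assumes the listings have already been collected (how is left open) and treats a file as required if any node has it. The file names in the example are hypothetical.

```python
def missing_compose_files(files_by_node: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each node to the compose files it lacks, omitting complete nodes."""
    # Assumption: the union of all listings is the required set.
    all_files = set().union(*files_by_node.values())
    gaps: dict[str, list[str]] = {}
    for node, files in files_by_node.items():
        missing = sorted(all_files - set(files))
        if missing:
            gaps[node] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical listings; real ones would come from each node.
    listings = {
        "MK7": {"traefik.yml", "beszel.yml"},
        "MK33": {"traefik.yml"},
    }
    print(missing_compose_files(listings))
```

An alternative would be to treat MK7's listing as the reference set, since stacks deploy only from the manager.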

## Execution Plan (Chunks)

| Chunk | Task | Estimated Time |
|-------|------|---------------|
| **A** | Discovery — scan fleet, identify what's running vs. configured | 15 min |
| **B** | AdGuard shutdown + Homepage cleanup | 10 min |
| **C** | Portainer admin reset | 10 min |
| **D** | Beszel agent deployment (all nodes) | 30 min |
| **E** | Prometheus: diagnose and fix the 5 down targets | 20 min |
| **F** | Technitium API — container + port + auth | 15 min |
| **G** | Traefik TLS, then enable Authelia | 30 min |

## Open Questions

1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
3. Should Beszel monitor Docker containers per-node or just node-level metrics?
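Question 2 affects how "0 targets down" is measured. As a rough sketch, the function below takes the JSON returned by Prometheus's `/api/v1/targets` endpoint and lists the down targets that are not on an expected-down allowlist. It assumes `instance` labels have the form `host:port` and that hosts are named like the nodes. The label values and ports in the example are hypothetical.

```python
def unexpected_down_targets(payload: dict, expected_down: set[str]) -> list[str]:
    """Return instance labels of down targets not covered by expected_down."""
    unexpected = []
    for target in payload.get("data", {}).get("activeTargets", []):
        # Only "down" counts here; "unknown" targets are ignored.
        if target.get("health") != "down":
            continue
        instance = target.get("labels", {}).get("instance", "")
        host = instance.rsplit(":", 1)[0]  # assumes "host:port"
        if host not in expected_down:
            unexpected.append(instance)
    return sorted(unexpected)


if __name__ == "__main__":
    # Hypothetical response fragment in the /api/v1/targets shape.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"instance": "mk7:9100"}, "health": "up"},
                {"labels": {"instance": "shield:9100"}, "health": "down"},
                {"labels": {"instance": "mk33:9100"}, "health": "down"},
            ]
        },
    }
    print(unexpected_down_targets(sample, {"shield", "mk44"}))
```

If Bobby confirms that Shield and MK44 are expected to be down, criterion 6 could be read as "this list is empty" rather than a literal zero.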

---

## Current Fleet State (To Be Updated by Chunk A)

| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|-------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

*Note: Populate IP/status data during Chunk A discovery.*
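To track how much of the table Chunk A still has to fill in, a small helper can report which cells are still `?`. This is only a sketch: it assumes the table keeps the pipe-delimited layout above, with the node name in the first column, and that no cell contains a literal `|`.

```python
def unpopulated_cells(table_md: str) -> dict[str, list[str]]:
    """Map each node to the column names whose cell is still '?'."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_md.strip().splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    gaps: dict[str, list[str]] = {}
    for row in body:
        missing = [col for col, cell in zip(header, row) if cell == "?"]
        if missing:
            gaps[row[0]] = missing
    return gaps


if __name__ == "__main__":
    # Three rows copied from the fleet table above.
    table = """
| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|-------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Shield | PXE Server | ? | ? | Powered off |
"""
    for node, columns in unpopulated_cells(table).items():
        print(node, columns)
```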