# Iron Legion Fleet Infrastructure Recovery — PRD

**Date:** 2026-05-27
**Author:** Artemis
**Status:** Approved / In Progress

---

## Problem Statement

Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each service fails on its own, but some failures share root causes (Docker networking, TLS, service wiring).

## Success Criteria

| # | Component | Acceptance criterion |
|---|-----------|----------------------|
| 1 | Portainer | Bobby can log in and see all stacks and containers |
| 2 | Technitium | API responds on port 5380; DNS records are queryable |
| 3 | AdGuard | Container stopped; Homepage shows no AdGuard tile |
| 4 | Traefik TLS | HTTPS works on `*.ai.home` with a valid cert |
| 5 | Beszel | Every node and every container is monitored in the dashboard |
| 6 | Prometheus | 0 targets down; alert pipeline active |

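Criterion 6 ("0 targets down") can be checked mechanically. The following is a minimal sketch, assuming the Prometheus HTTP API is reachable and returns the usual `/api/v1/targets` JSON shape. The base URL passed to `fetch_targets`, the helper names, and the sample payload (documentation-range IPs) are illustrative assumptions, not values taken from the fleet.

```python
import json
import urllib.request


def fetch_targets(base_url: str) -> dict:
    """Fetch the target list from a Prometheus server (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def down_targets(targets_payload: dict) -> list[str]:
    """Return the scrape URLs of active targets that report health 'down'."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [t.get("scrapeUrl", "<unknown>") for t in active if t.get("health") == "down"]


# Hypothetical payload shaped like a /api/v1/targets response; not real fleet data.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://192.0.2.10:9100/metrics", "health": "down"},
            {"scrapeUrl": "http://192.0.2.11:9100/metrics", "health": "up"},
        ]
    },
}

print(down_targets(sample))  # ['http://192.0.2.10:9100/metrics']
```

An empty list from `down_targets(fetch_targets(...))` would correspond to the first half of the criterion; the alert pipeline still needs a separate check.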
## Scope

**In scope:** Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.

**Out of scope:** Migrating services between nodes, adding new services, re-architecting network topology.

## Constraints

- No Docker or nginx proxies: bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes

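The rule that every swarm compose file exists on every node amounts to a set comparison. Below is a minimal sketch, assuming a per-node file inventory has already been collected by some other means (for example over SSH). The function name and the file names in the example are hypothetical; only the node names come from this document.

```python
def missing_compose_files(inventory: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each node to the compose files present elsewhere but missing on it."""
    # The union of every node's files is the set each node is expected to hold.
    all_files = set().union(*inventory.values())
    gaps = {node: all_files - files for node, files in inventory.items()}
    # Report only nodes that actually lack something.
    return {node: missing for node, missing in gaps.items() if missing}


# Hypothetical inventory; the file names are made up for illustration.
inventory = {
    "MK7": {"portainer.yml", "traefik.yml"},
    "MK33": {"portainer.yml", "traefik.yml"},
    "MK34": {"portainer.yml"},
}

print(missing_compose_files(inventory))  # {'MK34': {'traefik.yml'}}
```

An empty result means the constraint holds for the nodes that were inventoried; nodes absent from the inventory are not checked.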
## Execution Plan (Chunks)

| Chunk | Task | Estimated time |
|-------|------|----------------|
| **A** | Discovery: scan the fleet and identify what is running vs. configured | 15 min |
| **B** | AdGuard shutdown + Homepage cleanup | 10 min |
| **C** | Portainer admin reset | 10 min |
| **D** | Beszel agent deployment (all nodes) | 30 min |
| **E** | Prometheus: diagnose and fix the 5 down targets | 20 min |
| **F** | Technitium API: container, port, and auth | 15 min |
| **G** | Traefik TLS, then enable Authelia | 30 min |

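For planning purposes, the per-chunk estimates can be summed. This is a small sketch that assumes the chunks run one after another with no overlap and no review wait time, which the plan itself does not state.

```python
# Estimated minutes per chunk, copied from the execution plan table.
chunk_minutes = {"A": 15, "B": 10, "C": 10, "D": 30, "E": 20, "F": 15, "G": 30}

total = sum(chunk_minutes.values())
hours, minutes = divmod(total, 60)

print(f"{total} min total ({hours} h {minutes} min)")  # 130 min total (2 h 10 min)
```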
## Open Questions

1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
3. Should Beszel monitor Docker containers per-node or just node-level metrics?

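Whichever certificate source question 1 settles on, a `*.ai.home` wildcard covers exactly one label to the left of `ai.home`. The sketch below illustrates that matching rule in simplified form: it handles only a whole-label leading `*`, and the hostnames in the examples are hypothetical.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name such as '*.ai.home' covers hostname.

    A leading '*' stands for exactly one DNS label, so it matches neither the
    bare domain nor names with additional sub-labels.
    """
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard label must match a non-empty label; the rest must be equal.
        return h_labels[0] != "" and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


# Hypothetical service names under the document's wildcard domain.
print(wildcard_covers("*.ai.home", "portainer.ai.home"))      # True
print(wildcard_covers("*.ai.home", "ai.home"))                # False
print(wildcard_covers("*.ai.home", "grafana.mk7.ai.home"))    # False
```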
---
## Current Fleet State (To Be Updated by Chunk A)

| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|--------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

*Note: Populate IP/status data during Chunk A discovery.*
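
One way Chunk A could fill in the Status column is a plain TCP reachability probe. This is a minimal sketch, not the plan's actual discovery method: the helper names are hypothetical, only the two LAN IPs already listed in the table are used, and the port to probe is left to the caller because the document does not say which ports each node exposes.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(hosts: dict[str, str], port: int) -> dict[str, bool]:
    """Probe the same port on every host and report reachability per node."""
    return {name: is_port_open(ip, port) for name, ip in hosts.items()}


# LAN IPs taken from the table above; all other nodes are still unknown.
known_hosts = {"MK7": "192.168.7.7", "Artemis": "192.168.15.182"}
```

Calling `probe(known_hosts, 22)` would test SSH reachability, assuming SSH listens on its default port on those nodes. A closed port does not prove a node is down, so a failed probe is a prompt for a closer look, not a final status.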