CLEAN audit complete + fleet infrastructure recovery PRD

- AUDIT_REPORT.md: Hermes environment audit results (~1GB recovered)
  - 80 skills archived, 2 broken profiles removed, cron cleanup
  - ARTEMIS.md consolidated, rule deduplication completed
- PRDs/fleet-infrastructure-recovery.md: 6-item recovery plan
  - Portainer, Technitium DNS, Prometheus, Traefik TLS, Beszel, AdGuard

PRDs/fleet-infrastructure-recovery.md
@@ -0,0 +1,73 @@
# Iron Legion Fleet Infrastructure Recovery — PRD

**Date:** 2026-05-27
**Author:** Artemis
**Status:** Approved / In Progress

---

## Problem Statement

Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. The issues surface in separate services, but some share root causes (Docker networking, TLS, service wiring).

## Success Criteria

| # | Service | Acceptance Criterion |
|---|---------|----------------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | AdGuard | Container stopped, Homepage shows no AdGuard tile |
| 4 | Traefik TLS | HTTPS works on `*.ai.home` with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |

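
Row 2 of the success criteria (Technitium responding on port 5380) could be spot-checked with a plain TCP probe. The following is a minimal sketch, assuming Python 3 is available on the machine running the check. `port_open` is a hypothetical helper, not part of any tool named in this PRD, and the host in the usage lines is a placeholder. An open port only shows that something is listening, not that the Technitium API answers correctly.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host: substitute the node that runs Technitium.
    print(port_open("127.0.0.1", 5380))
```

A follow-up HTTP request against the API would still be needed to confirm that DNS records are queryable.
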
## Scope

**In scope:** Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.

**Out of scope:** Migrating services between nodes, adding new services, re-architecting network topology.

## Constraints

- No Docker or nginx proxies: bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes

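
The "compose files on ALL nodes" constraint can be audited by comparing per-node file listings. This is a minimal sketch, assuming the listings have already been collected (for example over SSH) into a dict keyed by node name. `missing_compose_files` is a hypothetical helper, and the file names in the sample are invented for illustration.

```python
def missing_compose_files(inventory: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each node to the compose files it lacks relative to the fleet-wide union."""
    # Every file seen on any node is expected on all nodes.
    all_files = set().union(*inventory.values())
    return {
        node: all_files - files
        for node, files in inventory.items()
        if all_files - files
    }


if __name__ == "__main__":
    # Sample data only: node names come from this PRD, file names are made up.
    inventory = {
        "MK7": {"traefik.yml", "portainer.yml"},
        "MK33": {"traefik.yml"},
        "MK34": {"traefik.yml", "portainer.yml"},
    }
    print(missing_compose_files(inventory))  # {'MK33': {'portainer.yml'}}
```

Comparing names alone does not catch files that exist everywhere but differ in content; comparing checksums would be a natural extension.
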
## Execution Plan (Chunks)

| Chunk | Task | Estimated Time |
|-------|------|----------------|
| **A** | Discovery: scan fleet, identify what's running vs. configured | 15 min |
| **B** | AdGuard shutdown + Homepage cleanup | 10 min |
| **C** | Portainer admin reset | 10 min |
| **D** | Beszel agent deployment (all nodes) | 30 min |
| **E** | Prometheus 5 down targets: diagnose + fix | 20 min |
| **F** | Technitium API: container + port + auth | 15 min |
| **G** | Traefik TLS → Authelia enable | 30 min |

## Open Questions

1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
3. Should Beszel monitor Docker containers per-node or just node-level metrics?

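
Question 2 can be made concrete by listing the targets Prometheus reports as down and subtracting the ones known to be offline. This is a sketch under assumptions: the payload shape follows the Prometheus HTTP API response for `/api/v1/targets` (`data.activeTargets[]` with `health` and `labels.instance`), `unexpected_down` is a hypothetical helper, and the instance labels in the sample are invented.

```python
def unexpected_down(payload: dict, expected_down: set[str]) -> list[str]:
    """Return sorted instance labels of down targets that are not in expected_down."""
    targets = payload.get("data", {}).get("activeTargets", [])
    down = {
        # Fall back to the scrape URL if a target has no instance label.
        t.get("labels", {}).get("instance", t.get("scrapeUrl", "unknown"))
        for t in targets
        if t.get("health") == "down"
    }
    return sorted(down - expected_down)


if __name__ == "__main__":
    # Invented sample mirroring the assumed response shape.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"instance": "mk33:9100"}, "health": "down"},
                {"labels": {"instance": "shield:9100"}, "health": "down"},
                {"labels": {"instance": "mk7:9100"}, "health": "up"},
            ]
        },
    }
    print(unexpected_down(sample, {"shield:9100"}))  # ['mk33:9100']
```

If the answer to question 2 is yes, the "0 targets down" criterion would in practice mean this list is empty rather than that every configured target is up.
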
---
## Current Fleet State (To Be Updated by Chunk A)

| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|--------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |

*Note: Populate IP/status data during Chunk A discovery.*

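
The Tailscale IP column could be filled from the JSON printed by `tailscale status --json`. This is a sketch, assuming that output contains a `Self` entry and a `Peer` map whose values carry `HostName`, `TailscaleIPs`, and `Online`; the field names should be verified against the installed Tailscale version. `tailscale_rows` is a hypothetical helper, and the sample reuses the one Tailscale IP already recorded in the table.

```python
def tailscale_rows(status: dict) -> dict[str, tuple[str, str]]:
    """Map hostname -> (first Tailscale IP or '?', 'Online' or 'Offline')."""
    # "Peer" may be missing or null when the node has no peers.
    entries = list((status.get("Peer") or {}).values())
    if status.get("Self"):
        entries.append(status["Self"])
    rows = {}
    for entry in entries:
        ips = entry.get("TailscaleIPs") or ["?"]
        state = "Online" if entry.get("Online") else "Offline"
        rows[entry.get("HostName", "?")] = (ips[0], state)
    return rows


if __name__ == "__main__":
    # Assumed shape; in practice this dict would come from json.loads(...)
    # applied to the output of `tailscale status --json`.
    sample = {
        "Self": {"HostName": "artemis", "TailscaleIPs": ["100.100.97.18"], "Online": True},
        "Peer": {},
    }
    print(tailscale_rows(sample))  # {'artemis': ('100.100.97.18', 'Online')}
```

Tailscale reachability is only one input to the Status column; powered-off or standby nodes such as Shield and MK44 still need a manual note.
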