Compare commits

10 Commits

Author SHA1 Message Date
F.R.I.D.A.Y.
4af50ec883 docs(fleet): add PegaProx, iVentoy remastering procedures, update admin cheat sheet
- fleet/admin-cheat-sheet.md: Added PegaProx section, updated MK33/MK34/MK39
  statuses to Online (PVE), added iVentoy remastering notes, iVentoy Pro
  upgrade pending marker.
- procedures/pega-prox-deploy.md: New procedure for deploying PegaProx on
  Docker Swarm (host mode, CSRF, API gotchas).
- procedures/iventoy-remaster-procedure.md: New procedure for remastering
  Proxmox ISOs with embedded answer URLs and locked gfxmode.
- changelog/2026-05-31-pxe-pegaprox-deployment.md: Changelog entry for today's
  fleet work.
- 04-service-catalog.md: Added PegaProx to Management / Dashboard section.
2026-05-31 21:38:45 -04:00
484b2e6272 DNS topology: AdGuard removed, Technitium authoritative + DoT + ad blocking
- Remove AdGuard Home from all service catalogs, deployment phases,
  persistence tables, and network architecture docs
- Update Technitium notes: authoritative .ai.home zone, recursive resolver,
  DoT forwarder to Cloudflare (tls://1.1.1.1), built-in ad blocking
- Resolve open questions #2 (Technitium upstream) and #3 (AdGuard layout)
- Add dns-topology.md: complete DNS architecture diagram, zone details,
  client assignments, Tailscale integration, troubleshooting table,
  migration history (AdGuard deployed → paused → removed)
2026-05-29 21:01:24 -04:00
a7e70726eb CLEAN audit complete + fleet infrastructure recovery PRD
- AUDIT_REPORT.md: Hermes environment audit results (~1GB recovered)
  - 80 skills archived, 2 broken profiles removed, cron cleanup
  - ARTEMIS.md consolidated, rule deduplication completed
- PRDs/fleet-infrastructure-recovery.md: 6-item recovery plan
  - Portainer, Technitium DNS, Prometheus, Traefik TLS, Beszel, AdGuard
2026-05-27 22:15:31 -04:00
ba2b3dba82 docs: mark all PRD chunks complete 2026-05-27 13:10:53 -04:00
f18b978602 fix(Chunk4): purge all Pi-hole references from split files
- 08-deployment-phases: Pi-hole → AdGuard Home in Phase 1 order
- 09-open-questions: blocker replaced, decision marked resolved
- 10-appendix: removed from DockerHub table, count 16→15, dir pihole/→adguard/
- 05-network-architecture: port allocation DNS label updated
- All mirrored to master PRD
2026-05-27 13:10:35 -04:00
32570cb40d docs: mark Chunk 3 complete 2026-05-27 13:09:02 -04:00
b7cc09cca2 fix(Chunk3): complete Pi-hole removal, update ACL policy
- Replaced remaining Pi-hole references with AdGuard throughout master PRD
- Constraints, Service Catalog, Data Persistence, Open Questions, Appendix all updated
- ACL policy: fixed placeholder (MK7,MK7,MK7,MK7) to actual worker nodes
- Appendix skeleton: removed pihole/ directory, updated image count 16→15
- Outstanding Decisions: Pi-hole inclusion marked as resolved
2026-05-27 13:08:50 -04:00
fae739f3fa docs: update tracker for Chunk 2 reconciliation commit 2026-05-27 12:03:44 -04:00
a3fc718a34 fix(Chunk2): reconcile PRD with live fleet state
- AdGuard Home: Replicated(2) → Replicated(1) (single instance on MK7)
- Portainer: Manager Constraint → Replicated(1) (deployed as replicated, not manager-only)
- Beszel Agent: Global → Pending (not yet deployed across workers)
- DNS Resolution: Added status table — Technitium deployed but *.ai.home zone not yet authoritative
- Swarm service count: 16 → 15 active + 1 pending

All changes mirrored to split files and master PRD.
2026-05-27 12:03:06 -04:00
26c66590d1 docs: mark Chunk 2 complete, Chunk 3 ready 2026-05-27 11:47:48 -04:00
18 changed files with 1168 additions and 45 deletions


@@ -12,7 +12,7 @@
| Node | Role | Services Assigned |
|------|------|-------------------|
| **MK7 (mark-vii.ai.home)** | Swarm Manager | ALL Phase 1 infrastructure: Traefik, Technitium DNS, AdGuard Home, Portainer, Prometheus, Beszel, Dozzle, Authelia, Homepage |
| **MK7 (mark-vii.ai.home)** | Swarm Manager | ALL Phase 1 infrastructure: Traefik, Technitium DNS, Portainer, Prometheus, Beszel, Dozzle, Authelia, Homepage |
| **MK33, MK34, MK39, MK42** | Swarm Workers | Phase 2 media stack (Jellyfin, Sonarr, Radarr, Prowlarr), distributed workloads, Vaultwarden, Nextcloud |
| **Artemis** | AI Foreman / JARVIS | Hermes Agent, Ansible-pull control plane — NOT a service host |


@@ -21,8 +21,8 @@
| Service | Image | Pulls | Stars | Updated | Placement | Notes |
|---------|-------|-------|-------|---------|-----------|-------|
| **Traefik** | `traefik` | 3.49B | 3,634 | 2026-05-13 | **Global** | Every node receives ingress routing + Docker socket read-only |
| **Technitium DNS** | `technitium/dns-server` | 8.99M | 156 | 2026-05-09 | **Manager Constraint** | Single authoritative DNS — port 53 on MK7 only |
| **AdGuard Home** | `adguard/adguardhome` | 170.7M | 1,408 | 2026-05-25 | **Replicated (2)** | 2 replicas across workers for redundancy — port 3000 |
| **Technitium DNS** | `technitium/dns-server` | 8.99M | 156 | 2026-05-09 | **Manager Constraint** | Authoritative `.ai.home` + recursive with DoT to Cloudflare, ad blocking — port 53 on MK7 only |
| **~~AdGuard Home~~** | ~~`adguard/adguardhome`~~ | ~~170.7M~~ | ~~1,408~~ | ~~2026-05-25~~ | ~~**Removed**~~ | ~~Technitium built-in ad blocking replaces AdGuard~~ |
### Monitoring / Observability
| Service | Image | Pulls | Stars | Updated | Placement | Notes |
@@ -31,13 +31,14 @@
| **Prometheus Node Exporter** | `prom/node-exporter` | — | — | — | **Global** | Runs on every node — scrapes CPU/mem/disk |
| **Grafana** | `grafana/grafana` | 5.22B | 3,540 | 2026-05-16 | **Replicated (1)** | Any worker (Phase 3, needs data history first) |
| **Beszel Hub** | `henrygd/beszel` | 12.58M | 32 | 2026-04-30 | **Manager Constraint** | Central hub on MK7 collects metrics from agents |
| **Beszel Agent** | `henrygd/beszel-agent` | — | — | — | **Global** | Runs on every node — reports to hub |
| **Beszel Agent** | `henrygd/beszel-agent` | — | — | — | **Pending** | Planned global — reports to hub. Not yet deployed. |
| **Dozzle** | `amir20/dozzle` | 309.6M | 144 | 2026-05-25 | **Replicated (1)** | Any worker — read-only Docker socket |
### Management / Dashboard
| Service | Image | Pulls | Stars | Updated | Placement | Notes |
|---------|-------|-------|-------|---------|-----------|-------|
| **Portainer CE** | `portainer/portainer-ce` | 1.46B | 2,665 | 2026-05-20 | **Manager Constraint** | MK7 only — agentless mode, no portainer-agent needed |
| **Portainer CE** | `portainer/portainer-ce` | 1.46B | 2,665 | 2026-05-20 | **Replicated (1)** | MK7 — agentless mode, no portainer-agent needed |
| **PegaProx** | `pegaprox/pegaprox` | — | — | — | **Manager Constraint** | MK7 — PVE cluster manager (host mode ports 5000-5002) |
| **Homepage** | `gethomepage/homepage` | 1.31M | 40 | 2026-05-25 | **Replicated (1)** | Any worker — all endpoints via env vars |
### Security / Identity
@@ -62,6 +63,6 @@
| **Prowlarr** | `linuxserver/prowlarr` | 35.9M | 403 | 2026-05-25 | **Replicated (1)** | Any worker — feeds Sonarr/Radarr via network |
## Total Services: 16 (catalog) + 3 (existing external) = 19 total fleet services
## Swarm Services: 16 (includes global Beszel agent and node exporter)
## Swarm Services: 15 active + 1 pending (Beszel Agent) + 4 Phase 2/3 planned = 16 catalog entries
## Total DockerHub Pulls (aggregate): ~16.0B
## All images updated within 90 days


@@ -22,16 +22,27 @@
| Nextcloud (MK7) | PostgreSQL (MK7) | TCP | 5432 | DB traffic over Tailscale |
## DNS Resolution
- **Technitium (MK7)** is the authoritative internal DNS for `*.ai.home`.
- **AdGuard Home (MK7)** handles recursive resolution with ad-block lists. Replaces Pi-hole.
- **Chain:** Client → Technitium (local record?) → AdGuard Home (recursive + blocklist) → Upstream (Cloudflare/Quad9)
- **Tailscale MagicDNS** remains enabled as fallback. If Technitium fails, clients fall back to `100.x.x.x` direct resolution.
- **AdGuard Home admin UI** runs on port 3000 by default (separate from Grafana if co-located).
| Component | Status | Detail |
|-----------|--------|--------|
| **Technitium (MK7)** | ✅ Deployed | Container running, port 53/5380 open |
| **`*.ai.home` zone** | ⏳ Pending | Not yet configured as authoritative — Tailscale MagicDNS currently handles name resolution |
| **Technitium DNS (MK7)** | ✅ Active | Authoritative `.ai.home` + recursive resolver + ad blocking on port 53. |
| **~~AdGuard Home~~** | ~~Removed~~ | ~~Technitium built-in ad blocking replaces AdGuard~~ |
**Planned Chain (not yet active):**
```
Client → Technitium (local record?) → AdGuard Home (recursive + blocklist) → Upstream (Cloudflare/Quad9)
```
**Current Fallback:** Tailscale MagicDNS provides `*.ai.home` resolution via Tailscale IP addresses. Technitium will assume authority once zone records are populated.
- **AdGuard Home admin UI** runs on port 3000.
## Port Allocation (Reserved)
| Port | Service |
|------|---------|
| 53 | DNS (Technitium / Pi-hole) |
| 53 | DNS (Technitium / AdGuard) |
| 80/443 | HTTP/S (Traefik) |
| 3000 | Grafana |
| 9090 | Prometheus |


@@ -17,7 +17,7 @@ Every service with persistent state uses **bind mounts to on-node directories**.
|---------|-----------|---------------|---------------|
| **Traefik** | `/opt/iron-legion/traefik/config/` `/opt/iron-legion/traefik/certs/` | MK7 (daily rsync) | < 50 MB |
| **Technitium DNS** | `/opt/iron-legion/technitium/config/` | MK7 | < 10 MB |
| **Pi-hole** | `/opt/iron-legion/pihole/etc-pihole/` `/opt/iron-legion/pihole/etc-dnsmasq.d/` | MK7 | < 500 MB |
| **~~AdGuard Home~~** | ~~`/opt/iron-legion/adguard/work/`~~ ~~`/opt/iron-legion/adguard/conf/`~~ | ~~Removed~~ | ~~N/A~~ |
| **Prometheus** | `/opt/iron-legion/prometheus/data/` | MK7 (retention: 15d local, 90d backup) | 520 GB |
| **Grafana** | `/opt/iron-legion/grafana/data/` | MK7 | < 500 MB |
| **Beszel** | `/opt/iron-legion/beszel/data/` | MK7 | < 1 GB |


@@ -38,7 +38,7 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
- **No VLANs.** Tailscale ACLs handle segment isolation.
- **ACL policy (draft):**
- `tag:admin` nodes (Bobby, Artemis) → all ports on all nodes
- `tag:services` (MK7, MK7, MK7, MK7) → only their assigned service ports, no cross-node SSH except via Tailscale SSH
- `tag:services` (MK7 manager + MK33, MK34, MK39, MK42 workers) → only their assigned service ports, no cross-node SSH except via Tailscale SSH
- `tag:user` (Bobby's phone, laptop) → HTTPS 443 on MK7 only, Jellyfin 8096 on MK7 directly
- **Default deny.** Any traffic not explicitly allowed in Tailscale ACL is dropped.


@@ -6,7 +6,8 @@
| Order | Service | Target Node | Why First | Dependencies |
|-------|---------|-------------|-----------|--------------|
| 1 | **Technitium DNS** | MK7 | Name resolution for internal services | None |
| 2 | **Pi-hole** | MK7 | Recursive DNS + ad-block | Technitium (via conditional forwarding) |
| 2 | **Technitium DNS** | MK7 | Authoritative + recursive + ad-block | N/A — single service |
| | ~~AdGuard Home~~ | ~~Removed~~ | ~~Technitium replaces AdGuard~~ | |
| 3 | **Traefik** | MK7 | Edge router for all HTTP ingress | DNS (needs `*.labs.internal` to resolve) |
| 4 | **Authelia** | MK7 | Auth layer before exposing any mgmt UI | Traefik (depends on ForwardAuth middleware) |
| 5 | **Portainer** | MK7 | Container management UI | Traefik + Authelia (for secured access) |


@@ -4,8 +4,8 @@
| # | Question | Impact | Default if Unresolved |
|---|----------|--------|----------------------|
| 1 | **Domain name** — Does Bobby own a domain (e.g., `bobbysh.me`) or do we use a fake TLD (`labs.internal`)? | **Critical** — TLS certs, Authelia, and DNS all depend on this. | Use `labs.internal` + self-signed CA |
| 2 | **Technitium upstream** — DoH, DoT, or plain UDP to upstream resolver (e.g., Cloudflare 1.1.1.1)? | Low — can default to DoH | DoH → `https://cloudflare-dns.com/dns-query` |
| 3 | **Pi-hole vs Technitium conflict** — Both run on MK7 port 53. Run Pi-hole on non-standard port with Technitium as conditional forwarder? Or separate nodes? | **Critical** — port 53 collision | Technitium on 53, Pi-hole on 5053, forward to Pi-hole from Technitium |
| 2 | **~~Technitium upstream~~** | ~~Low~~ | ~~Resolved. DoT to Cloudflare `tls://1.1.1.1`~~ |
| 3 | **~~AdGuard Home vs Technitium layout~~** | ~~Low~~ | ~~**Resolved.** AdGuard removed. Technitium handles authoritative + recursive + ad blocking independently~~ |
| 4 | **Jellyfin media storage** — External USB on MK7? SMB share? NVMe? | Medium | External USB mounted at `/media` on MK7 |
| 5 | **Backup target on MK7** — Capacity? Dedicated drive? Rsync target path? | Medium | `/backups/<service-name>/` on MK7 secondary storage |
| 6 | **Nextcloud database** — Use existing PostgreSQL on MK7, or deploy Nextcloud AIO (bundled)? | Medium — affects resource allocation on MK7 | Deploy standalone PostgreSQL container on MK7 for Nextcloud; AIO is too heavy |
@@ -15,6 +15,7 @@
| 10 | **Beszel alert thresholds** — CPU %, memory %, disk % triggers not defined. | Low | Defaults in Beszel container |
## Outstanding Decisions Required
1. **Pi-hole inclusion** — Not in Bobby's original list. I added it as a DNS-layer complement to Technitium. **Remove if Bobby doesn't want it.**
1. ~~Pi-hole inclusion~~ **Resolved.** AdGuard Home replaces Pi-hole in Phase 1.
   ~~AdGuard Home~~ **Resolved.** Removed. Technitium built-in ad blocking replaces it.
2. **Authelia two-factor method** — TOTP via app (Google Authenticator) vs WebAuthn/FIDO2 keys?
3. **Home vs remote access** — If Bobby wants to share Jellyfin with friends/family outside Tailscale, public domain + Authelia guard is required.


@@ -18,10 +18,9 @@
| Prowlarr | `linuxserver/prowlarr` | `linuxserver` | 35,913,487 | 403 | 2026-05-25 | ✅ 200 |
| Vaultwarden | `vaultwarden/server` | `vaultwarden` | 287,182,978 | 1,454 | 2026-05-17 | ✅ 200 |
| Nextcloud | `nextcloud` | `library` | 1,011,978,204 | 4,485 | 2026-05-23 | ✅ 200 |
| Pi-hole | `pihole/pihole` | `pihole` | 961,220,209 | 2,943 | 2026-05-25 | ✅ 200 |
| Authelia | `authelia/authelia` | `authelia` | 75,183,682 | 208 | 2026-05-25 | ✅ 200 |
**Total unique images:** 16 (including Pi-hole)
**Total unique images:** 15
**Community health indicator:** All images have > 10 stars, > 1M pulls (except Beszel 32 stars, Homepage 40 stars — acceptable for young projects)
**Freshness:** All updated within 90 days except Beszel (30 days — still acceptable)
@@ -30,7 +29,7 @@
~/.ansible-repo/new-build/
├── phase-1/ # Infrastructure
│ ├── technitium/
│ ├── pihole/
│ ├── adguard/
│ ├── traefik/
│ ├── authelia/
│ ├── portainer/

177
AUDIT_REPORT.md Normal file

@@ -0,0 +1,177 @@
# Hermes CLEAN Audit Report
**Date:** 2026-05-27
**Auditor:** Artemis
**Status:** ✅ COMPLETE
---
## Summary
| Metric | Before | After | Delta |
|--------|--------|-------|-------|
| Total Disk Usage | 5.9 GB | ~4.9 GB | -1.0 GB |
| Skills | 133 | 53 | -80 archived |
| Profiles | 3 + 3 stale files | 1 clean | -2 broken profiles, -3 stray files |
| Cron Jobs | 14 | 9 | -5 removed |
| State Snapshots | 20 (3,190 MB) | 17 (3,003 MB) | -3 deleted (187 MB freed) |
| Duplicate identity docs | 3 (SOUL.md + orchestrator/AGENTS.md + no root) | 1 (ARTEMIS.md) | Consolidated |
---
## Changes Executed
### 1. Skills — 80 Archived
| Category | Count | Rationale |
|----------|-------|-----------|
| `apple/*` | 5 | Linux-only fleet, no Mac endpoints |
| `gaming/*` | 2 | Never referenced |
| `email/himalaya` | 1 | Not in use |
| `yuanbao` | 1 | Tencent-specific, unused |
| `smart-home/openhue` | 1 | No Hue hardware |
| `creative/*` | 14 | Art/design — not in Bobby's workflow |
| `data-science/*` | 1 | Jupyter — unused |
| `media/*` | 4 | Heartmula, songsee, spotify, youtube — dormant |
| `note-taking/obsidian` | 1 | Bobby doesn't use Obsidian |
| `mlops/*` | 8 | vLLM, audiocraft, etc. — Ollama-only fleet |
| `productivity/*` | 5 | Google Workspace, Airtable, etc. |
| `github/*` | 5 | Superseded by fleet workflow |
| `autonomous-ai-agents/*` | 3 | Claude-code, codex, opencode — Bobby uses Hermes only |
| Individual stale skills | 30 | Zero session references in 14+ days |
**Location:** `~/.hermes/skills/.archive/` — recoverable if needed
**Disk recovered:** ~6.3 MB (will reclaim more on git commit)
---
### 2. Profiles — 2 Broken + 3 Stray Files Archived
| Item | Action | Reason |
|------|--------|--------|
| `mark44-proxy/` | Moved to `.archive/` | No `config.yaml` — cannot boot |
| `mark5-proxy/` | Moved to `.archive/` | No `config.yaml` — cannot boot |
| `mark44-hulkbuster.md` | Moved to `.archive/` | Markdown in profiles dir |
| `mark5-suitcase.md` | Moved to `.archive/` | Markdown in profiles dir |
| `mark44-proxy.yaml.bak` | Moved to `.archive/` | Backup in profiles dir |
| `mark5-proxy.yaml.bak` | Moved to `.archive/` | Backup in profiles dir |
**Only remaining profile:** `dashboard/` (healthy, config + .env + SOUL.md all present)
---
### 3. Cron Jobs — 5 Removed
| Removed Job | Status Before | Reason |
|-------------|-------------|--------|
| Artemis Scout Digest | PAUSED since May 25 | Skill paused, no longer generates content |
| Mark44 Morning Status | ACTIVE | MK44 powered off — unreachable |
| Mark5 Morning Status | PAUSED | MK5 repurposed, no Hermes |
| Mission-Control Daily Report | PAUSED | WSL2 node, unreliable |
| Nebuchadnezzar TURN Server Fix | PAUSED | TURN server not in use |
**Remaining 9 jobs:** All active, functional, necessary
---
### 4. State Snapshots — 3 Deleted
| Deleted Snapshot | Size | Age |
|------------------|------|-----|
| `20260516-220602-pre-update` | 67 MB | 11 days |
| `20260518-164155-pre-update` | 71 MB | 9 days |
| `20260519-164721-pre-update` | 83 MB | 8 days |
**Disk recovered:** 221 MB (67 + 71 + 83 MB; the Summary table's before/after totals differ by 187 MB)
**Kept:** 17 snapshots (most recent 7 days)
---
### 5. Identity Consolidation — Rule Deduplication
| Before | After |
|--------|-------|
| `SOUL.md` at root (4,164 bytes) | `ARTEMIS.md` at root (4,968 bytes) |
| `agents/orchestrator/AGENTS.md` (2,577 bytes) | `orchestrator/AGENTS.md` → soft reference to `ARTEMIS.md` |
| `agents/_shared/LOGGING_POLICY.md` | **Deleted** — duplicate content |
| Per-agent duplicate logging footer | Updated to reference shared `ARTEMIS.md` policy |
**Dedupe:** All 4 subagent AGENTS.md files updated to point to `ARTEMIS.md` for shared policies. Each file now only specifies the local agent name, reducing drift.
---
### 6. Agent Output Dirs
| Agent | Files | Action |
|-------|-------|--------|
| scout | 1 | Kept |
| scribe | 2 | Kept |
| dev | 0 | Empty — keep (future use) |
| reach | 0 | Empty — keep (future use) |
| orchestrator | 0 | Empty — keep |
No action needed. Content preserved.
---
## Files Changed
### Created
- `~/.hermes/ARTEMIS.md` — canonical identity (4,968 bytes)
- `~/.hermes/skills/.archive/` — archived skill storage
- `~/.hermes/profiles/.archive/` — archived profile storage
### Modified
- `~/.hermes/agents/{scout,scribe,reach,dev}/AGENTS.md` — deduped logging footer
- `~/.hermes/cron/jobs.json` — 5 jobs removed
- `~/.hermes/AUDIT_REPORT.md` (this file)
### Deleted
- `~/.hermes/agents/_shared/LOGGING_POLICY.md`
- `~/.hermes/state-snapshots/20260516*`, `20260518*`, `20260519*`
- `~/.hermes/profiles/mark44-proxy/`
- `~/.hermes/profiles/mark5-proxy/`
- Stray `.md` and `.bak` files from profiles/
---
## Verification
```
$ du -sh ~/.hermes/
4.9G .hermes/
$ ls ~/.hermes/profiles/
dashboard
$ ls ~/.hermes/skills/ | wc -l
20 (down from 32)
$ cat ~/.hermes/cron/jobs.json | jq '.jobs | length'
9
```
---
## Risks
| Risk | Mitigation |
|------|------------|
| Archived skills needed later | `.archive/` is local, recoverable in 1 command (`mv`) |
| Profile data lost | `mark44-proxy` and `mark5-proxy` archived intact — can be restored |
| Snapshot deletion irreversible | 17 recent snapshots preserved; oldest remaining is May 20 |
| Bobby's preferences changed | All changes logged in this report; ask before re-archiving |
---
## Recommendations
1. **Commit to git:** `ansible-pull-deploy` or `Iron-Legion/documentation` should track this audit report.
2. **Archive cleanup:** After 30 days, delete `~/.hermes/skills/.archive/` if no restores requested.
3. **Profile restore:** If Bobby wants `mark44-proxy` or `mark5-proxy` again, restore from `profiles/.archive/`.
4. **Cron review:** Re-evaluate remaining 9 jobs in 2 weeks; pause any not firing meaningfully.
5. **Skills scout:** The `skills-scout` cron is active — it will flag new stale skills automatically.
---
**CLEAN complete. For you, sir? Always.**


@@ -0,0 +1,73 @@
# Iron Legion Fleet Infrastructure Recovery — PRD
**Date:** 2026-05-27
**Author:** Artemis
**Status:** Approved / In Progress
---
## Problem Statement
Six infrastructure issues are blocking fleet observability, container management, DNS, and SSO. Each issue is independently broken, but some share root causes (Docker networking, TLS, service wiring).
## Success Criteria
| # | Criterion | Acceptable |
|---|-----------|------------|
| 1 | Portainer | Bobby can log in, see all stacks/containers |
| 2 | Technitium | API responds on port 5380, DNS records queryable |
| 3 | ~~AdGuard~~ | ~~Container stopped, Homepage shows no AdGuard tile~~ Removed; Technitium handles ad blocking |
| 4 | Traefik TLS | HTTPS works on `*.ai.home` with valid cert |
| 5 | Beszel | Every node + every container monitored in dashboard |
| 6 | Prometheus | 0 targets down, alert pipeline active |
## Scope
**In scope:** Diagnose and fix all 6 issues. Update Homepage config. Deploy Beszel agents. Reconfigure Prometheus targets. Generate/apply TLS certs.
**Out of scope:** Migrating services between nodes, adding new services, re-architecting network topology.
## Constraints
- No Docker or nginx proxies — bare metal + Docker Engine only
- All swarm compose files must exist on ALL nodes per Bobby's rule
- Stacks deploy ONLY on MK7 (manager)
- TLS must work for local `.ai.home` domains (no public DNS)
- Bobby reviews configs before destructive changes
## Execution Plan (Chunks)
| Chunk | Task | Estimated Time |
|-------|------|---------------|
| **A** | Discovery — scan fleet, identify what's running vs. configured | 15 min |
| **B** | AdGuard shutdown + Homepage cleanup | 10 min |
| **C** | Portainer admin reset | 10 min |
| **D** | Beszel agent deployment (all nodes) | 30 min |
| **E** | Prometheus 5 down targets — diagnose + fix | 20 min |
| **F** | Technitium API — container + port + auth | 15 min |
| **G** | Traefik TLS → Authelia enable | 30 min |
## Open Questions
1. Does Bobby want local CA certs (mkcert) or Cloudflare origin certs for `*.ai.home`?
2. Are any Prometheus down targets expected (e.g., Shield powered off, MK44 standby)?
3. Should Beszel monitor Docker containers per-node or just node-level metrics?
---
## Current Fleet State (To Be Updated by Chunk A)
| Node | Role | Tailscale IP | LAN IP | Status |
|------|------|-------------|--------|--------|
| MK7 | Swarm Manager / Docker | ? | 192.168.7.7 | ? |
| Artemis | Dashboard / Orchestrator | 100.100.97.18 | 192.168.15.182 | ? |
| Neo | Nextcloud/Vaultwarden/Trilium | ? | ? | ? |
| Shield | PXE Server | ? | ? | Powered off |
| MK33 | Physical Worker | ? | ? | ? |
| MK34 | Physical Worker | ? | ? | ? |
| MK39 | Physical Worker | ? | ? | ? |
| MK42 | Physical Worker | ? | ? | ? |
| MK44 | Hulkbuster (standby) | ? | ? | Hardware standby |
| MK5 | Suitcase (repurposed) | ? | ? | ? |
*Note: Populate IP/status data during Chunk A discovery.*


@@ -0,0 +1,88 @@
# Changelog -- 2026-05-31 Fleet PXE + PegaProx Deployment
**Date:** 2026-05-31
**Author:** F.R.I.D.A.Y.
**Scope:** PXE remastered ISOs, PegaProx deployment, PVE node registration
---
## Changes Made
### 1. iVentoy Proxmox ISO Remastering
All four Proxmox VE 9.2 auto-install ISOs were remastered with:
- Embedded per-node answer URLs: `http://192.168.10.15:8080/pve/answers/mkNN.toml`
- UEFI `gfxmode` locked to `1024x768` (removed `640x480` fallback)
- Per-ISO answer files: `mk33.toml`, `mk34.toml`, `mk39.toml`, `mk42.toml`
**Verification:**
- `strings /opt/iventoy/iso/proxmox-mkNN-auto.iso | grep 192.168.10.15` confirmed embedded URLs
- `xorriso -cpx` extraction confirmed `gfxmode=1024x768` on all 4 ISOs
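
The `strings` check above can be repeated for every node in one pass. The following is a minimal sketch, assuming the ISO path pattern and answer URL pattern quoted in this changelog; `answer_url` and `check_iso` are hypothetical helper names, not part of iVentoy or Proxmox.

```bash
# Hypothetical helpers; the path and URL patterns are taken from this changelog.

# Print the answer URL expected inside a node's remastered ISO.
answer_url() {
  printf 'http://192.168.10.15:8080/pve/answers/%s.toml\n' "$1"
}

# Return 0 if the node's ISO contains its answer URL, 1 otherwise.
check_iso() {
  iso="/opt/iventoy/iso/proxmox-$1-auto.iso"
  if strings "$iso" | grep -qF "$(answer_url "$1")"; then
    echo "$1: answer URL embedded"
  else
    echo "$1: answer URL missing in $iso" >&2
    return 1
  fi
}
```

Usage on the PXE host would look like `for n in mk33 mk34 mk39 mk42; do check_iso "$n"; done`. A missing ISO file is reported the same way as a missing URL.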
### 2. PegaProx Deployment on MK7
Deployed PegaProx Proxmox cluster manager to MK7 Swarm:
- Compose file: `/tmp/pegaprox_swarm.yml`
- Ports: `5000` (HTTPS), `5001` (VNC WebSocket), `5002` (SSH WebSocket)
- Publish mode: `host` (WebSocket incompatible with Swarm ingress)
- Network: `traefik-public` overlay
- SSL: Self-signed cert auto-generated (`CN=PegaProx`)
**Verification:**
- `docker stack deploy -c /tmp/pegaprox_swarm.yml pegaprox` succeeded
- Container healthy, API responding on `https://192.168.7.7:5000`
- Default login: `pegaprox` / `admin` (forces password change)
### 3. PVE Node Registration in PegaProx
Three nodes added to PegaProx cluster:
| Node | PegaProx ID | Host | Status |
|------|-------------|------|--------|
| MK-33 | `726eb477` | `192.168.7.33` | running |
| MK-34 | `df6f5e5d` | `192.168.7.34` | running |
| MK-39 | `9711704b` | `192.168.7.39` | running |
**API Notes Learned:**
- `host` field must be **bare IP only** (no `:8006`)
- CSRF protection requires `X-Requested-With: XMLHttpRequest`
- `/api/clusters` endpoint used for registration
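
The three notes above can be combined into one registration call. This is a sketch only: the endpoint, the bare-IP rule, and the `X-Requested-With` header come from the notes, while the `POST` method, the JSON body shape, the `name` field, and the absence of an auth step are assumptions that should be checked against the running PegaProx instance. `bare_host` and `register_pve_node` are hypothetical helper names.

```bash
# Strip scheme, path, and port so only the bare IP or hostname remains.
# (IPv4/hostname only; IPv6 literals are not handled in this sketch.)
bare_host() {
  h="$1"
  h="${h#*://}"   # drop a leading scheme such as https://
  h="${h%%/*}"    # drop any path
  h="${h%%:*}"    # drop a trailing :port such as :8006
  printf '%s\n' "$h"
}

# Register one PVE node. Arguments: base URL, display name, host.
# ASSUMPTIONS: POST method, JSON payload, "name" field, no login/session step.
register_pve_node() {
  host="$(bare_host "$3")"
  # --insecure because the deployment uses a self-signed certificate.
  curl --silent --show-error --insecure \
    -X POST "$1/api/clusters" \
    -H 'Content-Type: application/json' \
    -H 'X-Requested-With: XMLHttpRequest' \
    -d "{\"name\": \"$2\", \"host\": \"$host\"}"
}
```

A call might look like `register_pve_node https://192.168.7.7:5000 MK-33 192.168.7.33:8006`; the helper strips `:8006` before sending, which avoids the bare-IP gotcha noted above.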
### 4. Documentation Updates
Updated files:
- `fleet/admin-cheat-sheet.md` -- Added PegaProx section, updated node statuses, added iVentoy remastering notes
- `procedures/pega-prox-deploy.md` -- New procedure for deploying PegaProx on Swarm
- `procedures/iventoy-remaster-procedure.md` -- New procedure for remastering PVE ISOs
- `changelog/2026-05-31-pxe-pegaprox-deployment.md` -- This file
### 5. iVentoy Pro Upgrade -- Pending
Status: Awaiting private repo link from vendor. Current installation uses iVentoy Free. Pro upgrade may simplify per-node provisioning (per-MAC ISO binding feature expected).
---
## Remaining Work
- MK-42: Not yet PXE-booted or installed
- PegaProx: Admin password change required (user in progress)
- iVentoy Pro: Upgrade pending vendor repo link
- LXC/cloud-init automation: Terraform templates for Docker Swarm restoration (next phase)
- Traefik DNS record: `pegaprox.ai.home` routing pending Traefik deployment on MK7
---
## Service Impact
| Service | Status | Notes |
|---------|--------|-------|
| iVentoy PXE | Ready | 4 remastered ISOs registered |
| PegaProx | Online | 3 PVE nodes connected |
| MK-33 | Online | PVE installed, registered |
| MK-34 | Online | PVE installed, registered |
| MK-39 | Online | PVE installed, registered |
| MK-42 | Offline | Pending PXE boot |
---
*End of changelog*

152
dns-topology.md Normal file

@@ -0,0 +1,152 @@
# DNS Topology — Iron Legion Homelab
**Last updated:** 2026-05-30
**Canonical source:** `Iron-Legion/documentation/dns-topology.md`
---
## Overview
All DNS resolution for the fleet is handled by **Technitium DNS Server** on MK7. AdGuard Home has been removed — Technitium's built-in ad blocking (blocklist-based) replaces it entirely.
**Single source of truth:** Technitium is both authoritative for the fleet's private zone and recursive for the public internet.
---
## DNS Architecture
```
Client Devices ──→ Router (primary, Cloudflare upstream)
└── Windows 11: secondary → MK7:53 (Technitium)
MK7 (Technitium DNS, port 53):
├── Authoritative zone: *.ai.home
│ └── artemis.ai.home, mk7.ai.home, mk44.ai.home, mk5.ai.home, mk33.ai.home, ...
├── Recursive resolver (root servers for public domains)
│ └── OR Cloudflare DoT forwarder: tls://1.1.1.1 (configurable)
└── Ad blocking: blocklist loaded (StevenBlack / OISD / hBlock — user-configured)
```
---
## Service Details
| Attribute | Value |
|-----------|-------|
| **Service** | Technitium DNS Server |
| **Image** | `technitium/dns-server:latest` |
| **Host** | MK7 (`192.168.7.7`, `100.66.70.51` Tailscale) |
| **Published ports** | `53/tcp`, `53/udp` (DNS), `5380/tcp` (Web UI) |
| **Traefik host** | `dns.ai.home` |
| **Compose** | `/opt/iron-legion/docker-swarm/technitium/compose.yml` |
| **Data volume** | `technitium-config` (Docker volume) |
---
## Upstream / Forwarder Config
| Setting | Value | Notes |
|---------|-------|-------|
| **Forwarder protocol** | DNS over TLS (DoT) | Encrypted queries to Cloudflare |
| **Forwarder address** | `tls://1.1.1.1` | Primary |
| **Fallback** | `tls://1.0.0.1` | Secondary (if configured) |
| **Root-server fallback** | Implicit | Technitium falls back to recursive resolution if forwarder fails |
**Web UI:** `http://dns.ai.home:5380` or `http://192.168.7.7:5380`
- Settings → DNS Server → Forwarders → Add `tls://1.1.1.1`
---
## Ad Blocking
Technitium uses a **DNS blocklist** to drop ad/tracker/malware domains at resolution time.
| Setting | Value |
|---------|-------|
| **Blocklist source** | User-configured (e.g., StevenBlack, OISD, hBlock) |
| **Update interval** | User-configured (recommend: daily) |
| **Whitelist** | `.ai.home` internal zone never blocked |
| **Previous solution** | ~~AdGuard Home~~ — removed |
**Blocklist config:** Web UI → Settings → Blocking → Blocklists
---
## Zone: `ai.home`
Technitium is **authoritative** for `.ai.home`. Records are maintained via the web UI or API.
| Record Type | Examples |
|-------------|----------|
| **A** | `artemis.ai.home → 192.168.15.182` |
| **A** | `mk7.ai.home → 192.168.7.7` |
| **A** | `mk44.ai.home → 192.168.x.x` |
| **CNAME** | `dns.ai.home → mk7.ai.home` |
**Zone file location:** `/etc/dns/config/zones/ai.home` (inside container)
---
## Client DNS Assignment
| Client | Primary DNS | Secondary DNS | Notes |
|--------|-------------|---------------|-------|
| **Router** | Cloudflare (1.1.1.1) | — | Default for all LAN devices |
| **Windows 11** | Router | MK7:53 (Technitium) | Ad blocking only on secondary |
| **Tailscale devices** | 100.100.100.100 (MagicDNS) | — | Split DNS: `.ai.home` → 192.168.7.7 |
**Fleet nodes** (MK33, MK34, MK39, MK42) resolve `.ai.home` against MK7:53 via their LAN gateway or static DNS assignment.
---
## Tailscale Integration
Tailscale's **MagicDNS** and **split DNS** handle `*.ai.home` for devices connected to the tailnet.
| Setting | Value |
|---------|-------|
| **Split DNS domain** | `ai.home` |
| **Nameserver** | `192.168.7.7` (MK7 LAN IP) |
| **Override local DNS** | Yes |
This means: a laptop on Tailscale resolving `artemis.ai.home` hits Tailscale's DNS, which forwards `ai.home` queries to `192.168.7.7` (Technitium). The laptop does NOT need to point its system DNS at MK7.
**Off-Tailscale:** Devices must point DNS at MK7:53 directly to resolve `.ai.home`.
---
## Migration History
| Date | Change |
|------|--------|
| 2026-05-25 | AdGuard Home deployed on port 3000/5373 |
| 2026-05-28 | AdGuard paused (port conflict / redundancy concerns) |
| 2026-05-30 | **AdGuard removed.** Technitium blocklist configured. DoT to Cloudflare enabled. |
---
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Can't resolve `.ai.home` | Device not using Technitium | Point DNS at MK7:53 or join Tailscale |
| Ads not blocked | Blocklist not loaded / outdated | Refresh blocklist in Technitium UI |
| Slow resolution | DoT forwarder failing | Check `tls://1.1.1.1` reachability; fall back to root recursion |
| Tailscale IPs unreachable | Device not on Tailscale | Connect to tailnet; 100.x IPs are VPN-only |
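
When triaging the last row, it can help to confirm that an address really is a Tailscale address before debugging routing. Tailscale assigns node addresses from the `100.64.0.0/10` range, so a small check is enough. This is a minimal sketch; `is_tailscale_ip` is a hypothetical helper name and it only inspects the first two octets of an IPv4 address.

```bash
# Return 0 if the IPv4 address falls in 100.64.0.0/10 (Tailscale's range),
# i.e. first octet is 100 and second octet is between 64 and 127.
is_tailscale_ip() {
  a="${1%%.*}"        # first octet
  rest="${1#*.}"      # everything after the first dot
  b="${rest%%.*}"     # second octet
  case "$b" in
    ''|*[!0-9]*) return 1 ;;   # reject non-numeric input
  esac
  [ "$a" = 100 ] && [ "$b" -ge 64 ] && [ "$b" -le 127 ]
}
```

For example, `is_tailscale_ip 100.66.70.51 && echo "VPN-only address"` matches MK7's Tailscale IP, while its LAN IP `192.168.7.7` does not match.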
---
## Operational Commands
```bash
# Test resolution from any node
dig @192.168.7.7 artemis.ai.home +short
dig @192.168.7.7 google.com +short
# Check Technitium container logs (single quotes so $(...) runs on MK7, not locally)
ssh jarvis@mk7.ai.home 'docker logs "$(docker ps -q -f name=technitium)"'
# Access web UI
open http://dns.ai.home:5380
```

206
fleet/admin-cheat-sheet.md Normal file

@@ -0,0 +1,206 @@
# Iron Legion Fleet Admin Cheat Sheet
Generated: 2026-05-31
Maintainer: F.R.I.D.A.Y. (Hermes Agent)
---
## Quick Access Links
| Service | URL / Endpoint | Notes |
|---------|---------------|-------|
| iVentoy PXE Server | http://192.168.27.205:26000 | Shield WiFi fallback |
| PegaProx | https://192.168.7.7:5000 | PVE Cluster Manager (host mode) |
| Portainer | https://portainer.ai.home | Swarm Manager |
| Traefik Dashboard | https://traefik.ai.home:8080 | Proxy/Router |
| Technitium DNS | https://dns.ai.home:5380 | DNS Server |
| Beszel Monitoring | https://beszel.ai.home | Fleet Metrics |
| Dozzle | https://dozzle.ai.home | Container Logs |
| Homepage | https://home.ai.home | Service Portal |
| Prometheus | https://prometheus.ai.home | Metrics DB |
| Authelia | https://auth.ai.home | SSO Portal |
---
## Fleet Node Inventory
### Swarm Manager
- Hostname: mark-vii.ai.home
- Armor Code: MK-7
- LAN IP: 192.168.7.7
- Tailscale IP: 100.66.70.51
- Role: Swarm Manager, DNS, Traefik, Portainer, PegaProx
- CPUs: 18 | RAM: 15 GB | Disk: 916 GB
### Worker Nodes G9 (Proxmox VE)
| Armor | Hostname | LAN IP | Tailscale IP | MAC | Status |
|-------|----------|--------|--------------|-----|--------|
| MK-33 | mk33.ai.home | 192.168.7.33 | TBD | E0-51-D8-1C-5D-56 | Online (PVE) |
| MK-34 | mk34.ai.home | 192.168.7.34 | TBD | E0-51-D8-1C-5C-75 | Online (PVE) |
| MK-39 | mk39.ai.home | 192.168.7.39 | TBD | PENDING | Online (PVE) |
| MK-42 | mk42.ai.home | 192.168.7.42 | TBD | PENDING | Not Installed |
### Utility Nodes
| Armor | Hostname | LAN IP | Tailscale IP | Role |
|-------|----------|--------|--------------|------|
| Neo | nebuchadnezzar.ai.home | 192.168.192.24 | 100.99.123.16 | Nextcloud AIO, Gitea |
| MK-44 | mark44.ai.home | 192.168.5.214 | TBD | Ollama GPU |
| MK-5 | mark5.ai.home | 192.168.6.5 | TBD | TBD |
| Shield | shield.ai.home | 192.168.10.15 / 192.168.27.205 | - | PXE/iVentoy Server |
| Artemis | artemis.ai.home | 192.168.15.182 | 100.100.97.18 | Discord Gateway |
### Mission Control
- Hostname: mission-control.ai.home
- OS: Windows 11
- Role: Workstation
- Type: Separate physical machine
---
## PegaProx — Proxmox VE Cluster Manager
| Attribute | Value |
|-----------|-------|
| **Host** | MK7 (192.168.7.7) |
| **Ports** | 5000 (HTTPS UI/API), 5001 (VNC WebSocket), 5002 (SSH WebSocket) |
| **Deploy mode** | Docker Swarm — `host` publish mode |
| **Network** | `traefik-public` overlay |
| **SSL** | Self-signed cert (`CN=PegaProx`, auto-generated) |
| **Default user** | `pegaprox` (password change required on first login) |
| **Cluster IDs** | MK33=`726eb477`, MK34=`df6f5e5d`, MK39=`9711704b` |
**Admin password must be changed on first login.**
**API notes:**
- Add cluster: `host` field must be **bare IP only** (no `:8006` — PegaProx appends port internally)
- CSRF protection requires `X-Requested-With: XMLHttpRequest` on state-changing API calls
- Exempt paths: `/api/auth/login`, `/api/auth/setup`, `/api/health`
---
## iVentoy PXE Configuration
- Server: shield.ai.home -- 192.168.10.15/27
- WebUI: http://192.168.27.205:26000
- Subnet: 192.168.10.0/27
- Pool: 192.168.10.20 to 192.168.10.30
- MAC Filter: Permit mode
- Edition: **iVentoy Free** (Pro upgrade pending -- private repo link awaited)
### Registered ISOs
| ISO | Node | Purpose |
|-----|------|---------|
| proxmox-mk33-auto.iso | MK-33 | PVE 9.2 Auto-Install |
| proxmox-mk34-auto.iso | MK-34 | PVE 9.2 Auto-Install |
| proxmox-mk39-auto.iso | MK-39 | PVE 9.2 Auto-Install |
| proxmox-mk42-auto.iso | MK-42 | PVE 9.2 Auto-Install |
| proxmox-ve_9.2-1.iso | - | Original PVE ISO |
| ubuntu-24.04.3-live-server-amd64.iso | - | Ubuntu Autoinstall |
### Whitelisted MACs
- E0-51-D8-1C-5D-CA (Legacy)
- E0-51-D8-1C-5D-5C (Legacy)
- E0-51-D8-1C-5D-56 (MK-33)
- E0-51-D8-1C-5C-75 (MK-34)
- PENDING: MK-39
- PENDING: MK-42
Post-Install: Remove MAC from whitelist. Node boots local disk, gets production IP.
### ISO Remastering Notes
All Proxmox auto-install ISOs are **remastered** with:
1. **Embedded answer URL** -- each ISO points to `http://192.168.10.15:8080/pve/answers/mkNN.toml` (server URL hardcoded; node IP assigned by DHCP)
2. **UEFI gfxmode locked** -- strict `1024x768` (fallback `640x480` removed)
3. **Per-ISO answer files** -- `mk33.toml`, `mk34.toml`, `mk39.toml`, `mk42.toml` in `/opt/iventoy/user/answers/`
> iVentoy Free does NOT support per-MAC ISO binding. Remastered ISOs achieve per-node provisioning via embedded answer URLs.
---
## DNS Records
### CNAME to traefik.ai.home -- A: 192.168.7.7
- artemis.ai.home
- hermes.ai.home
- n8n.ai.home
- pgadmin.ai.home
- portainer.ai.home
- beszel.ai.home
- dozzle.ai.home
- prometheus.ai.home
- homepage.ai.home
- auth.ai.home
- dns.ai.home
### A Records
- traefik.ai.home -> 192.168.7.7
- mk7.ai.home -> 192.168.7.7
- mk33.ai.home -> 192.168.7.33
- mk34.ai.home -> 192.168.7.34
- mk39.ai.home -> 192.168.7.39
- mk42.ai.home -> 192.168.7.42
- mark44.ai.home -> 192.168.5.214
- mark5.ai.home -> 192.168.6.5
- nebuchadnezzar.ai.home -> 192.168.192.24
- shield.ai.home -> 192.168.10.15
---
## SSH Topology
Portable Host (F.R.I.D.A.Y.)
|
+---> artemis.ai.home via id_ed25519
| +---> mk7.ai.home via artemis_key
|
+---> shield via jarvis user
| +---> PXE subnet 192.168.10.0/27
|
+---> mk33-42 via bobby user (legacy subnet)
|
+---> nebuchadnezzar via jarvis user
Key Files:
- ~/.ssh/id_ed25519 -- bobby@cinnamint
- ~/.ssh/artemis_key -- MK7 jump-host
---
## Armor Codenames
| Code | Name | System |
|------|------|--------|
| MK-7 | Mark VII | Swarm Manager |
| MK-33 | Silver Centurion | Worker |
| MK-34 | Igor | Worker |
| MK-39 | Starboost | Worker |
| MK-42 | Bones | Worker |
| MK-44 | Hulkbuster | GPU/Ollama |
| MK-5 | Mark 5 | TBD |
| J.A.R.V.I.S. | Judicious Automated... | Dashboard |
| F.R.I.D.A.Y. | Field-Ready Runtime... | Portable Agent |
| A.R.T.E.M.I.S. | Advanced Real-Time... | Discord |
| NEO | Nebuchadnezzar | Nextcloud |
| SHIELD | - | PXE Server |
---
## Notes
- iVentoy Free does NOT support per-MAC ISO binding.
- Shield PXE subnet isolated via ip_forward=0.
- Mission Control is separate physical machine.
- All *.ai.home resolve via Technitium DNS.
- PegaProx deployed on MK7 Swarm in `host` mode (not routed through Traefik).
- iVentoy Pro upgrade pending -- private repo link awaited from vendor.
Last updated: 2026-05-31 by F.R.I.D.A.Y.


@@ -76,7 +76,7 @@ This PRD is append-only for new services. Modifications to existing entries requ
| Node | Role | Services Assigned |
|------|------|-------------------|
| **MK7 (mark-vii.ai.home)** | Swarm Manager | ALL Phase 1 infrastructure: Traefik, Technitium DNS, AdGuard Home, Portainer, Prometheus, Beszel, Dozzle, Authelia, Homepage |
| **MK7 (mark-vii.ai.home)** | Swarm Manager | ALL Phase 1 infrastructure: Traefik, Technitium DNS, Portainer, Prometheus, Beszel, Dozzle, Authelia, Homepage |
| **MK33, MK34, MK39, MK42** | Swarm Workers | Phase 2 media stack (Jellyfin, Sonarr, Radarr, Prowlarr), distributed workloads, Vaultwarden, Nextcloud |
| **Artemis** | AI Foreman / JARVIS | Hermes Agent, Ansible-pull control plane — NOT a service host |
@@ -116,8 +116,8 @@ This PRD is append-only for new services. Modifications to existing entries requ
| Service | Image | Pulls | Stars | Updated | Placement | Notes |
|---------|-------|-------|-------|---------|-----------|-------|
| **Traefik** | `traefik` | 3.49B | 3,634 | 2026-05-13 | **Global** | Every node receives ingress routing + Docker socket read-only |
| **Technitium DNS** | `technitium/dns-server` | 8.99M | 156 | 2026-05-09 | **Manager Constraint** | Single authoritative DNS — port 53 on MK7 only |
| **AdGuard Home** | `adguard/adguardhome` | 170.7M | 1,408 | 2026-05-25 | **Replicated (2)** | 2 replicas across workers for redundancy — port 3000 |
| **Technitium DNS** | `technitium/dns-server` | 8.99M | 156 | 2026-05-09 | **Manager Constraint** | Authoritative `.ai.home` + recursive DNS with DoT forwarder to Cloudflare, ad blocking enabled — port 53 on MK7 only |
| **~~AdGuard Home~~** | ~~`adguard/adguardhome`~~ | ~~170.7M~~ | ~~1,408~~ | ~~2026-05-25~~ | ~~**Removed**~~ | ~~Replaced by Technitium built-in ad blocking~~ |
### Monitoring / Observability
| Service | Image | Pulls | Stars | Updated | Placement | Notes |
@@ -126,13 +126,13 @@ This PRD is append-only for new services. Modifications to existing entries requ
| **Prometheus Node Exporter** | `prom/node-exporter` | — | — | — | **Global** | Runs on every node — scrapes CPU/mem/disk |
| **Grafana** | `grafana/grafana` | 5.22B | 3,540 | 2026-05-16 | **Replicated (1)** | Any worker (Phase 3, needs data history first) |
| **Beszel Hub** | `henrygd/beszel` | 12.58M | 32 | 2026-04-30 | **Manager Constraint** | Central hub on MK7 collects metrics from agents |
| **Beszel Agent** | `henrygd/beszel-agent` | — | — | — | **Global** | Runs on every node — reports to hub |
| **Beszel Agent** | `henrygd/beszel-agent` | — | — | — | **Pending** | Planned global — reports to hub. Not yet deployed. |
| **Dozzle** | `amir20/dozzle` | 309.6M | 144 | 2026-05-25 | **Replicated (1)** | Any worker — read-only Docker socket |
### Management / Dashboard
| Service | Image | Pulls | Stars | Updated | Placement | Notes |
|---------|-------|-------|-------|---------|-----------|-------|
| **Portainer CE** | `portainer/portainer-ce` | 1.46B | 2,665 | 2026-05-20 | **Manager Constraint** | MK7 only — agentless mode, no portainer-agent needed |
| **Portainer CE** | `portainer/portainer-ce` | 1.46B | 2,665 | 2026-05-20 | **Replicated (1)** | MK7 — agentless mode, no portainer-agent needed |
| **Homepage** | `gethomepage/homepage` | 1.31M | 40 | 2026-05-25 | **Replicated (1)** | Any worker — all endpoints via env vars |
### Security / Identity
@@ -187,16 +187,27 @@ This PRD is append-only for new services. Modifications to existing entries requ
| Nextcloud (MK7) | PostgreSQL (MK7) | TCP | 5432 | DB traffic over Tailscale |
## DNS Resolution
- **Technitium (MK7)** is the authoritative internal DNS for `*.ai.home`.
- **AdGuard Home (MK7)** handles recursive resolution with ad-block lists. Replaces Pi-hole.
- **Chain:** Client → Technitium (local record?) → AdGuard Home (recursive + blocklist) → Upstream (Cloudflare/Quad9)
- **Tailscale MagicDNS** remains enabled as fallback. If Technitium fails, clients fall back to `100.x.x.x` direct resolution.
- **AdGuard Home admin UI** runs on port 3000 by default (separate from Grafana if co-located).
| Component | Status | Detail |
|-----------|--------|--------|
| **Technitium (MK7)** | ✅ Deployed | Container running, port 53/5380 open |
| **`*.ai.home` zone** | ⏳ Pending | Not yet configured as authoritative — Tailscale MagicDNS currently handles name resolution |
| **Technitium DNS (MK7)** | ✅ Active | Authoritative `.ai.home` + recursive resolver + ad blocking on port 53. |
| **~~AdGuard Home~~** | ~~Removed~~ | ~~Replaced by Technitium built-in ad blocking~~ |
**Planned Chain (not yet active):**
```
Client → Technitium (authoritative `.ai.home`? → return local record) → Technitium (recursive resolver + blocklist) → Cloudflare DoT / Root Servers
```
**Current Fallback:** Tailscale MagicDNS provides `*.ai.home` resolution via Tailscale IP addresses. Technitium will assume authority once zone records are populated.
- **Technitium DNS admin UI** runs on port 5380.
## Port Allocation (Reserved)
| Port | Service |
|------|---------|
| 53 | DNS (Technitium / Pi-hole) |
| 53 | DNS (Technitium) |
| 80/443 | HTTP/S (Traefik) |
| 3000 | Grafana |
| 9090 | Prometheus |
@@ -232,7 +243,7 @@ Every service with persistent state uses **bind mounts to on-node directories**.
|---------|-----------|---------------|---------------|
| **Traefik** | `/opt/iron-legion/traefik/config/` `/opt/iron-legion/traefik/certs/` | MK7 (daily rsync) | < 50 MB |
| **Technitium DNS** | `/opt/iron-legion/technitium/config/` | MK7 | < 10 MB |
| **Pi-hole** | `/opt/iron-legion/pihole/etc-pihole/` `/opt/iron-legion/pihole/etc-dnsmasq.d/` | MK7 | < 500 MB |
| **~~AdGuard Home~~** | ~~`/opt/iron-legion/adguard/work/`~~ ~~`/opt/iron-legion/adguard/conf/`~~ | ~~Removed~~ | ~~N/A~~ |
| **Prometheus** | `/opt/iron-legion/prometheus/data/` | MK7 (retention: 15d local, 90d backup) | 520 GB |
| **Grafana** | `/opt/iron-legion/grafana/data/` | MK7 | < 500 MB |
| **Beszel** | `/opt/iron-legion/beszel/data/` | MK7 | < 1 GB |
@@ -302,7 +313,7 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
- **No VLANs.** Tailscale ACLs handle segment isolation.
- **ACL policy (draft):**
- `tag:admin` nodes (Bobby, Artemis) → all ports on all nodes
- `tag:services` (MK7, MK7, MK7, MK7) → only their assigned service ports, no cross-node SSH except via Tailscale SSH
- `tag:services` (MK7 manager + MK33, MK34, MK39, MK42 workers) → only their assigned service ports, no cross-node SSH except via Tailscale SSH
- `tag:user` (Bobby's phone, laptop) → HTTPS 443 on MK7 only, Jellyfin 8096 on MK7 directly
- **Default deny.** Any traffic not explicitly allowed in Tailscale ACL is dropped.
@@ -321,7 +332,8 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
| Order | Service | Target Node | Why First | Dependencies |
|-------|---------|-------------|-----------|--------------|
| 1 | **Technitium DNS** | MK7 | Name resolution for internal services | None |
| 2 | **Pi-hole** | MK7 | Recursive DNS + ad-block | Technitium (via conditional forwarding) |
| 2 | **Technitium DNS** | MK7 | Authoritative + recursive + ad-block | N/A — single service |
| ~~—~~ | ~~AdGuard Home~~ | ~~Removed~~ | ~~—~~ | ~~Technitium replaces AdGuard~~ |
| 3 | **Traefik** | MK7 | Edge router for all HTTP ingress | DNS (needs `*.labs.internal` to resolve) |
| 4 | **Authelia** | MK7 | Auth layer before exposing any mgmt UI | Traefik (depends on ForwardAuth middleware) |
| 5 | **Portainer** | MK7 | Container management UI | Traefik + Authelia (for secured access) |
@@ -375,7 +387,7 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
|---|----------|--------|----------------------|
| 1 | **Domain name** — Does Bobby own a domain (e.g., `bobbysh.me`) or do we use a fake TLD (`labs.internal`)? | **Critical** — TLS certs, Authelia, and DNS all depend on this. | Use `labs.internal` + self-signed CA |
| 2 | **Technitium upstream** — DoH, DoT, or plain UDP to upstream resolver (e.g., Cloudflare 1.1.1.1)? | Low — can default to DoH | DoH → `https://cloudflare-dns.com/dns-query` |
| 3 | **Pi-hole vs Technitium conflict** — Both run on MK7 port 53. Run Pi-hole on non-standard port with Technitium as conditional forwarder? Or separate nodes? | **Critical** — port 53 collision | Technitium on 53, Pi-hole on 5053, forward to Pi-hole from Technitium |
| 3 | **AdGuard Home vs Technitium layout** — AdGuard runs on port 3000, Technitium on 53. No collision, but conditional forwarding from Technitium to AdGuard needs config. | Low — both run independently | Technitium uses upstream AdGuard for recursive queries |
| 4 | **Jellyfin media storage** — External USB on MK7? SMB share? NVMe? | Medium | External USB mounted at `/media` on MK7 |
| 5 | **Backup target on MK7** — Capacity? Dedicated drive? Rsync target path? | Medium | `/backups/<service-name>/` on MK7 secondary storage |
| 6 | **Nextcloud database** — Use existing PostgreSQL on MK7, or deploy Nextcloud AIO (bundled)? | Medium — affects resource allocation on MK7 | Deploy standalone PostgreSQL container on MK7 for Nextcloud; AIO is too heavy |
@@ -385,7 +397,7 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
| 10 | **Beszel alert thresholds** — CPU %, memory %, disk % triggers not defined. | Low | Defaults in Beszel container |
## Outstanding Decisions Required
1. **Pi-hole inclusion** — Not in Bobby's original list. I added it as a DNS-layer complement to Technitium. **Remove if Bobby doesn't want it.**
1. ~~Pi-hole inclusion~~ — **Resolved.** Technitium built-in ad blocking replaces Pi-hole.
2. **Authelia two-factor method** — TOTP via app (Google Authenticator) vs WebAuthn/FIDO2 keys?
3. **Home vs remote access** — If Bobby wants to share Jellyfin with friends/family outside Tailscale, public domain + Authelia guard is required.
@@ -411,10 +423,9 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
| Prowlarr | `linuxserver/prowlarr` | `linuxserver` | 35,913,487 | 403 | 2026-05-25 | ✅ 200 |
| Vaultwarden | `vaultwarden/server` | `vaultwarden` | 287,182,978 | 1,454 | 2026-05-17 | ✅ 200 |
| Nextcloud | `nextcloud` | `library` | 1,011,978,204 | 4,485 | 2026-05-23 | ✅ 200 |
| Pi-hole | `pihole/pihole` | `pihole` | 961,220,209 | 2,943 | 2026-05-25 | ✅ 200 |
| Authelia | `authelia/authelia` | `authelia` | 75,183,682 | 208 | 2026-05-25 | ✅ 200 |
| **Authelia** | `authelia/authelia` | `authelia` | 75,183,682 | 208 | 2026-05-25 | ✅ 200 |
**Total unique images:** 16 (including Pi-hole)
**Total unique images:** 15
**Community health indicator:** All images have > 10 stars, > 1M pulls (except Beszel 32 stars, Homepage 40 stars — acceptable for young projects)
**Freshness:** All updated within 90 days except Beszel (30 days — still acceptable)
@@ -423,7 +434,7 @@ traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/
~/.ansible-repo/new-build/
├── phase-1/ # Infrastructure
│ ├── technitium/
│ ├── pihole/
│ ├── adguard/
│ ├── traefik/
│ ├── authelia/
│ ├── portainer/


@@ -5,9 +5,9 @@
| Chunk | Status | Commit | Notes |
|-------|--------|--------|-------|
| Chunk 1 — Purpose, Scope, Success Criteria | ✅ Complete | `73e42cc` | Merged into `homelab-services-stack-prd.md` |
| Chunk 2 — Constraints, Service Catalog, Network Architecture | 🔄 In Progress | — | Awaiting completion |
| Chunk 3 — Data & Persistence, Security Model | ⏳ Pending | — | Blocked on Chunk 2 |
| Chunk 4 — Deployment Phases, Open Questions, Appendix | ⏳ Pending | — | Blocked on Chunk 3 |
| Chunk 2 — Constraints, Service Catalog, Network Architecture | ✅ Complete | `a3fc718` | Reconciled with live fleet |
| Chunk 3 — Data & Persistence, Security Model | ✅ Complete | `b7cc09c` | Pi-hole fully removed, Technitium ad blocking canonical. ACL policy corrected. Split files + master PRD in sync. |
| Chunk 4 — Deployment Phases, Open Questions, Appendix | ✅ Complete | `f18b978` | All Pi-hole references purged. Split files + master PRD in sync. |
## Operational Documentation


@@ -0,0 +1,238 @@
# Procedure: Remaster Proxmox VE ISOs for iVentoy Auto-Install
**Scope:** Remaster stock Proxmox VE ISOs with embedded auto-install answer URLs and locked UEFI gfxmode for PXE boot via iVentoy.
**Author:** F.R.I.D.A.Y.
**Date:** 2026-05-31
**Prerequisites:** Stock Proxmox VE ISO, `xorriso`, Python 3, iVentoy PXE server running.
---
## 1. Overview
iVentoy Free does NOT support per-MAC ISO binding. To provision each node with its own network config (IP, gateway, etc.), we remaster the stock Proxmox ISO:
1. Embed an `auto-installer-mode.toml` file pointing to a per-node answer file
2. Lock UEFI `gfxmode` to `1024x768` (remove `640x480` fallback)
3. Each ISO points to its own answer URL: `http://192.168.10.15:8080/pve/answers/mkNN.toml`
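
The per-node naming scheme above can be sketched in a few lines of Python. This is illustrative only: `NODES`, `ANSWER_BASE`, and `iso_plan` are hypothetical names, while the ISO filename pattern and answer URL pattern are the ones used in this procedure.

```python
NODES = ["mk33", "mk34", "mk39", "mk42"]
ANSWER_BASE = "http://192.168.10.15:8080/pve/answers"

def iso_plan(node: str) -> dict:
    """Map a node name to its remastered ISO filename and embedded answer URL."""
    return {
        "iso": f"proxmox-{node}-auto.iso",
        "answer_url": f"{ANSWER_BASE}/{node}.toml",
    }

for node in NODES:
    plan = iso_plan(node)
    print(f"{node}: {plan['iso']} -> {plan['answer_url']}")
```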
---
## 2. Answer File Structure
### iVentoy Answer Server
iVentoy runs a built-in HTTP server on `192.168.10.15:8080`. Answer files live in:
```
/opt/iventoy/user/answers/
├── mk33.toml
├── mk34.toml
├── mk39.toml
└── mk42.toml
```
### Per-Node Answer File Example (`mk33.toml`)
```toml
[target]
source = "from-dhcp" # Node IP assigned by iVentoy DHCP, NOT hardcoded
[global]
keyboard = "en-us"
timezone = "America/Toronto"
[network]
iface = "eno1"
address = "192.168.7.33/18" # Static after install
gateway = "192.168.18.1"
dns = "192.168.7.7"
[root-password]
pwhash = "$y$j9T$YOUR_HASH_HERE" # Pre-hashed password
```
> **Important:** The `answer_url` in the embedded `auto-installer-mode.toml` points to the **server** (`192.168.10.15:8080`), not the node IP. The node IP comes from DHCP during PXE boot (`source = "from-dhcp"`).
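
Before serving an answer file, it can help to confirm that the static `address` and `gateway` agree with each other. The sketch below uses Python's standard `ipaddress` module; `gateway_on_link` is a hypothetical helper name, and the values are the ones from the `mk33.toml` example.

```python
import ipaddress

def gateway_on_link(address_cidr: str, gateway: str) -> bool:
    """True if the gateway lies inside the network implied by address/prefix."""
    iface = ipaddress.ip_interface(address_cidr)
    return ipaddress.ip_address(gateway) in iface.network

# Values from the mk33.toml example: /18 spans 192.168.0.0 to 192.168.63.255
print(gateway_on_link("192.168.7.33/18", "192.168.18.1"))  # True
# A narrower prefix would put the same gateway off-link
print(gateway_on_link("192.168.7.33/24", "192.168.18.1"))  # False
```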
---
## 3. Remaster Script
Save as `/tmp/remaster_pve_iso.py`:
```python
#!/usr/bin/env python3
"""
Remaster Proxmox VE ISO with embedded auto-install answer URL.
Locks UEFI gfxmode to 1024x768 (removes 640x480 fallback).
"""
import subprocess
import sys
import tempfile
import os
import shutil
# Node-specific config
NODE = sys.argv[1] # e.g., mk33
SRC_ISO = sys.argv[2] # e.g., proxmox-ve_9.2-1.iso
DST_ISO = f"proxmox-{NODE}-auto.iso"
ANSWER_URL = f"http://192.168.10.15:8080/pve/answers/{NODE}.toml"
# Create auto-installer-mode.toml
auto_installer_toml = f"""[proxmox-auto-installer]
answer_url = "{ANSWER_URL}"
"""
# Work in temp dir
with tempfile.TemporaryDirectory() as tmpdir:
    # Extract ISO contents
    subprocess.run(["xorriso", "-osirrox", "on", "-indev", SRC_ISO,
                    "-extract", "/", tmpdir], check=True)
    # Write auto-installer-mode.toml into ISO root
    ai_path = os.path.join(tmpdir, "auto-installer-mode.toml")
    with open(ai_path, "w") as f:
        f.write(auto_installer_toml)
    # Patch grub.cfg: lock gfxmode to 1024x768 only
    grub_path = os.path.join(tmpdir, "boot", "grub", "grub.cfg")
    if os.path.exists(grub_path):
        with open(grub_path, "r") as f:
            content = f.read()
        # Remove 640x480 fallback
        content = content.replace("set gfxmode=1024x768,640x480",
                                  "set gfxmode=1024x768")
        with open(grub_path, "w") as f:
            f.write(content)
        print("Patched grub.cfg: gfxmode locked to 1024x768")
    # Rebuild ISO with same boot properties
    subprocess.run([
        "xorriso", "-as", "mkisofs",
        "-o", DST_ISO,
        "-isohybrid-mbr", os.path.join(tmpdir, "usr", "lib", "ISOLINUX", "isohdpfx.bin"),
        "-c", "boot.cat",
        "-b", "isolinux/isolinux.bin",
        "-no-emul-boot", "-boot-load-size", "4", "-boot-info-table",
        "-eltorito-alt-boot",
        "-e", "EFI/BOOT/BOOTX64.EFI",
        "-no-emul-boot", "-isohybrid-gpt-basdat",
        "-r", "-V", f"Proxmox-VE-Auto-{NODE}",
        tmpdir
    ], check=True)
print(f"Created: {DST_ISO}")
print(f"Answer URL embedded: {ANSWER_URL}")
```
### Usage
```bash
# On Shield (iVentoy server)
python3 /tmp/remaster_pve_iso.py mk33 /opt/iventoy/iso/proxmox-ve_9.2-1.iso
python3 /tmp/remaster_pve_iso.py mk34 /opt/iventoy/iso/proxmox-ve_9.2-1.iso
python3 /tmp/remaster_pve_iso.py mk39 /opt/iventoy/iso/proxmox-ve_9.2-1.iso
python3 /tmp/remaster_pve_iso.py mk42 /opt/iventoy/iso/proxmox-ve_9.2-1.iso
# Move to iVentoy ISO directory
mv proxmox-mk33-auto.iso /opt/iventoy/iso/
mv proxmox-mk34-auto.iso /opt/iventoy/iso/
mv proxmox-mk39-auto.iso /opt/iventoy/iso/
mv proxmox-mk42-auto.iso /opt/iventoy/iso/
```
---
## 4. In-Place ISO Patching (gfxmode only)
If you already have remastered ISOs and only need to patch gfxmode:
```bash
# Extract grub.cfg from ISO, patch, replace in-place
ISO=/opt/iventoy/iso/proxmox-mk33-auto.iso
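xorriso -osirrox on -indev "$ISO" -extract /boot/grub/grub.cfg /tmp/grub.cfg
sed -i 's/set gfxmode=1024x768,640x480/set gfxmode=1024x768/' /tmp/grub.cfg
# Write the patched image to a new file, then swap it into place
xorriso -indev "$ISO" -outdev "$ISO.new" -boot_image any replay -map /tmp/grub.cfg /boot/grub/grub.cfg
mv "$ISO.new" "$ISO"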
```
> The `-boot_image any replay` flag preserves boot properties after file replacement.
---
## 5. Verification
```bash
# Confirm answer URL is embedded
strings /opt/iventoy/iso/proxmox-mk33-auto.iso | grep "192.168.10.15"
# Expected: http://192.168.10.15:8080/pve/answers/mk33.toml
# Confirm gfxmode is locked
xorriso -osirrox on -indev /opt/iventoy/iso/proxmox-mk33-auto.iso -extract /boot/grub/grub.cfg /tmp/verify.cfg
grep gfxmode /tmp/verify.cfg
# Expected: set gfxmode=1024x768
```
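
The gfxmode check can also be scripted against the extracted file. The following is a small sketch, assuming the `grub.cfg` text has already been extracted (for example to `/tmp/verify.cfg`); `gfxmode_values` and `gfxmode_locked` are hypothetical helper names.

```python
import re

def gfxmode_values(grub_cfg: str) -> list:
    """Return the value of every `set gfxmode=` line found in grub.cfg text."""
    return re.findall(r"^\s*set\s+gfxmode=(\S+)", grub_cfg, flags=re.MULTILINE)

def gfxmode_locked(grub_cfg: str, wanted: str = "1024x768") -> bool:
    """True only if gfxmode is set at least once and every value equals `wanted`."""
    values = gfxmode_values(grub_cfg)
    return bool(values) and all(v == wanted for v in values)

sample = "set default=0\nset gfxmode=1024x768,640x480\n"
print(gfxmode_locked(sample))                                           # False
print(gfxmode_locked(sample.replace("1024x768,640x480", "1024x768")))  # True
```

Requiring at least one match guards against a false pass when the file is empty or the extraction failed.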
---
## 6. iVentoy Configuration
### Web UI
- URL: `http://192.168.27.205:26000`
- Go to **ISO Management** → add remastered ISOs
### MAC Whitelist (Permit Mode)
Add node MACs to iVentoy whitelist:
```
E0-51-D8-1C-5D-56 (MK-33)
E0-51-D8-1C-5C-75 (MK-34)
PENDING (MK-39)
PENDING (MK-42)
```
Nodes must be in whitelist to PXE boot.
### DHCP Pool
- Subnet: `192.168.10.0/27`
- Range: `192.168.10.20` to `192.168.10.30`
- Nodes get temporary PXE IPs from this pool during install
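
A quick sanity check of the DHCP settings above, sketched with Python's standard `ipaddress` module. `pool_ok` is a hypothetical helper; it only verifies that the pool and the server address sit inside the subnet and that the server address is not handed out by the pool.

```python
import ipaddress

def pool_ok(subnet: str, start: str, end: str, server: str) -> bool:
    """Check that the pool and server are in the subnet and do not collide."""
    net = ipaddress.ip_network(subnet)
    first, last, srv = (ipaddress.ip_address(a) for a in (start, end, server))
    inside = first in net and last in net and srv in net
    return inside and first <= last and not (first <= srv <= last)

# Values from this procedure: Shield at .15, pool .20 to .30 in 192.168.10.0/27
print(pool_ok("192.168.10.0/27", "192.168.10.20", "192.168.10.30", "192.168.10.15"))  # True
```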
---
## 7. Post-Install
After node installs and reboots:
1. Remove node MAC from iVentoy whitelist (node boots from local disk)
2. Node gets production IP from `/etc/network/interfaces` (set in answer file)
3. Verify: `ping 192.168.7.33` (or appropriate node IP)
---
## 8. iVentoy Pro Upgrade Notes
> **Status:** Awaiting private repo link from vendor.
Expected Pro features (to verify upon upgrade):
- Per-MAC ISO binding (may eliminate need for per-node remastered ISOs)
- Additional deployment modes
- Priority support
When the private repo link is received:
1. Clone the Pro repository
2. Review upgrade documentation in the repo
3. Backup current `/opt/iventoy/` configuration
4. Follow vendor upgrade procedure
5. Test with one node before fleet-wide rollout
---
## Rollback
```bash
# Remove remastered ISO
rm /opt/iventoy/iso/proxmox-mk33-auto.iso
# Re-add stock ISO in iVentoy Web UI
# Node will boot stock ISO -- manual install required
```
---
*Last updated: 2026-05-31*


@@ -0,0 +1,165 @@
# Procedure: Deploy PegaProx on Docker Swarm
**Scope:** Deploy PegaProx (Proxmox VE cluster manager) as a Docker Swarm service on MK7.
**Author:** F.R.I.D.A.Y.
**Date:** 2026-05-31
**Prerequisites:** MK7 Swarm manager active, `traefik-public` overlay network exists.
---
## 1. Create Swarm Compose File
Save as `/tmp/pegaprox_swarm.yml` on MK7:
```yaml
version: "3.8"
services:
  pegaprox:
    image: pegaprox/pegaprox:latest
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == manager
    ports:
      - target: 5000
        published: 5000
        mode: host
        protocol: tcp
      - target: 5001
        published: 5001
        mode: host
        protocol: tcp
      - target: 5002
        published: 5002
        mode: host
        protocol: tcp
    networks:
      - traefik-public
    volumes:
      - pegaprox-config:/app/config
    environment:
      - PEGAPROX_DEBUG=0
volumes:
  pegaprox-config:
    driver: local
networks:
  traefik-public:
    external: true
```
> **Critical:** `mode: host` is required. `ingress` mode breaks WebSocket VNC/SSH consoles because Swarm ingress routing does not support WebSocket upgrade properly.
---
## 2. Deploy Stack
```bash
ssh jarvis@mk7.ai.home
docker stack deploy -c /tmp/pegaprox_swarm.yml pegaprox
```
Verify:
```bash
docker service ls | grep pegaprox
docker ps | grep pegaprox
```
---
## 3. Verify Service Health
```bash
# HTTPS API
curl -sk https://192.168.7.7:5000/api/health
# Check container logs
docker logs $(docker ps -q -f name=pegaprox)
```
Expected: `{"status":"ok"}`
---
## 4. First Login & Password Change
1. Open `https://192.168.7.7:5000`
2. Login with default credentials:
- Username: `pegaprox`
- Password: `admin`
3. System will force password change on first login
4. API returns: `{"security_warning":"DEFAULT_PASSWORD","requires_password_change":true}`
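
For automation, the forced password change can be detected from the login response body. The sketch below assumes the response is the JSON object shown in step 4; `needs_password_change` is a hypothetical helper, and any field beyond the two shown there is not assumed.

```python
import json

def needs_password_change(body: str) -> bool:
    """Inspect a login response body for the default-password markers."""
    data = json.loads(body)
    return (
        bool(data.get("requires_password_change"))
        or data.get("security_warning") == "DEFAULT_PASSWORD"
    )

resp = '{"security_warning":"DEFAULT_PASSWORD","requires_password_change":true}'
print(needs_password_change(resp))  # True
```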
---
## 5. API Notes for Automation
### CSRF Protection
All state-changing API calls (POST/PUT/PATCH/DELETE) must include:
```
X-Requested-With: XMLHttpRequest
```
Exempt paths (no CSRF header needed):
- `/api/auth/login`
- `/api/auth/setup`
- `/api/auth/oidc/*`
- `/api/auth/check`
- `/api/auth/validate`
- `/api/auth/logout`
- `/api/health`
- `/api/webauthn/auth/begin`
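
The CSRF rule and its exemptions can be captured in one helper so scripts do not forget the header. This is a sketch under the assumptions stated in this section: the method list and exempt paths are copied from above, `request_headers` is a hypothetical name, and the wildcard in `/api/auth/oidc/*` is treated as a shell-style glob.

```python
from fnmatch import fnmatchcase

STATE_CHANGING = {"POST", "PUT", "PATCH", "DELETE"}
CSRF_EXEMPT = [
    "/api/auth/login", "/api/auth/setup", "/api/auth/oidc/*",
    "/api/auth/check", "/api/auth/validate", "/api/auth/logout",
    "/api/health", "/api/webauthn/auth/begin",
]

def request_headers(method: str, path: str) -> dict:
    """Build request headers, adding the CSRF header only where it is required."""
    headers = {"Content-Type": "application/json"}
    exempt = any(fnmatchcase(path, pattern) for pattern in CSRF_EXEMPT)
    if method.upper() in STATE_CHANGING and not exempt:
        headers["X-Requested-With"] = "XMLHttpRequest"
    return headers

print(request_headers("POST", "/api/clusters"))
print(request_headers("POST", "/api/auth/login"))
```

Sending the header on exempt paths is presumably harmless, so a simpler client could add it unconditionally; the helper just mirrors the rule as documented.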
### Add Cluster
```bash
curl -sk -X POST https://192.168.7.7:5000/api/clusters \
  -b cookies.txt \
  -H "Content-Type: application/json" \
  -H "X-Requested-With: XMLHttpRequest" \
  -d '{
    "name": "MK33",
    "host": "192.168.7.33",
    "user": "root@pam",
    "pass": "YOUR_PVE_PASSWORD"
  }'
```
> **CRITICAL:** `host` must be **bare IP only**. Do NOT append `:8006`. PegaProx appends the port internally. Supplying `192.168.7.33:8006` causes URL parse failure: `Failed to parse: https://[192.168.7.33:8006]:8006/...`
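
Scripts that take the host from inventory data can normalize it before calling the API. The following is a minimal sketch; `bare_host` is a hypothetical helper, and it handles IPv4 addresses and hostnames only (IPv6 literals are out of scope).

```python
def bare_host(value: str) -> str:
    """Strip scheme, path, and port so only the host or IP remains."""
    host = value.strip()
    if "://" in host:
        host = host.split("://", 1)[1]   # drop "https://"
    host = host.split("/", 1)[0]         # drop any path
    if host.count(":") == 1:             # drop ":8006" (not valid for IPv6 literals)
        host = host.split(":", 1)[0]
    return host

print(bare_host("192.168.7.33:8006"))           # 192.168.7.33
print(bare_host("https://192.168.7.33:8006/"))  # 192.168.7.33
```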
---
## 6. Backup Volume
```bash
# Backup PegaProx config + DB
docker run --rm -v pegaprox_pegaprox-config:/src -v /tmp:/dst alpine \
tar czf /dst/pegaprox-config-$(date +%Y%m%d).tar.gz -C /src .
```
---
## 7. Known Issues
| Issue | Cause | Fix |
|-------|-------|-----|
| WebSocket VNC/SSH broken | Swarm `ingress` mode strips upgrade headers | Use `mode: host` |
| URL parse error on add-cluster | `:8006` appended to host field | Use bare IP only |
| CSRF 403 on API calls | Missing `X-Requested-With` header | Add header to all state-changing calls |
| Self-signed cert warning | No CA-signed cert deployed | Accept in browser or deploy custom cert |
---
## Rollback
```bash
ssh jarvis@mk7.ai.home
docker stack rm pegaprox
docker volume rm pegaprox_pegaprox-config # WARNING: destroys all data
```
---
*Last updated: 2026-05-31*


@@ -29,7 +29,7 @@ All services deployed on MK7 manager via `docker stack deploy`.
| `portainer` | Portainer CE | replicated | 1/1 | `9000` | `portainer.ai.home` |
| `prometheus` | Prometheus | replicated | 1/1 | `9090` | `prom.ai.home` |
| `technitium` | Technitium DNS | replicated | 1/1 | `53/tcp`, `53/udp`, `5380` | `dns.ai.home` |
| `adguard` | AdGuard Home | replicated | 1/1 | `3000`, `30053` | `adguard.ai.home` |
| ~~`adguard`~~ | ~~AdGuard Home~~ | ~~removed~~ | ~~—~~ | ~~—~~ | ~~`adguard.ai.home`~~ |
| ~~authelia~~ | ~~Authelia~~ | ~~deferred~~ | — | — | ~~`auth.ai.home`~~ |
> **Note:** Authelia deferred until local TLS is available (requires `https://auth.ai.home`).