documentation/homelab-services-stack-prd.md
jarvis 484b2e6272 DNS topology: AdGuard removed, Technitium authoritative + DoT + ad blocking
- Remove AdGuard Home from all service catalogs, deployment phases,
  persistence tables, and network architecture docs
- Update Technitium notes: authoritative .ai.home zone, recursive resolver,
  DoT forwarder to Cloudflare (tls://1.1.1.1), built-in ad blocking
- Resolve open questions #2 (Technitium upstream) and #3 (AdGuard layout)
- Add dns-topology.md: complete DNS architecture diagram, zone details,
  client assignments, Tailscale integration, troubleshooting table,
  migration history (AdGuard deployed → paused → removed)
2026-05-29 21:01:24 -04:00


Iron Legion Homelab Services Stack — Purpose & Scope

Document ID

  • PRD: homelab-services-stack-prd.md
  • Date: 2026-05-25
  • Owner: Artemis (AI Foreman, Iron Legion Labs)
  • Authority: Commander Bobby

Purpose

Central canonical reference for all Docker/Compose-based services Iron Legion Labs intends to deploy across the fleet. This document exists to:

  1. Prevent duplicate research — every service's Docker image, metadata, and deployment pattern is captured once.
  2. Guide node placement — which service runs where, and why.
  3. Serve as the source of truth for Ansible-pull manifests, compose files, and future automation.

Scope

In Scope

  • Service catalog with DockerHub-verified images (name, namespace, description, pull count, stars, last update)
  • Category assignment (Network, Monitoring, Media, Security, Management, Infrastructure)
  • Recommended target node per service
  • Deployment phase priority
  • High-level network, data, and security architecture

Out of Scope

  • Detailed compose-file YAML (deferred to per-service deployment PRDs)
  • Specific Traefik middleware configurations (deferred to network PRD)
  • GPU passthrough configs for media transcode (deferred to Mark44 workload PRD)
  • Service-specific SSO/authelia rule authoring (deferred to security PRD)

Living Document

This PRD is append-only for new services. Modifications to existing entries require Bobby sign-off. Additions follow the raw-metadata-to-summary pattern established in Section 4.


Iron Legion Homelab Services Stack — Success Criteria

Done When

  1. Every service in the catalog has a verified DockerHub image with a non-stale last-update date (≤ 90 days old at time of cataloging)
  2. Every service has an assigned target node that respects the Node Assignments Locked policy
  3. Every service has a deployment phase (1, 2, or 3) agreed by Bobby
  4. Network ingress/egress flow is documented at the service level (who talks to whom, via what port/protocol)
  5. A single docker-compose.yml skeleton exists per phase, ready for population
  6. Bobby has read and approved this PRD; any objections are captured as blockers below

Verification Methods

  • DockerHub API freshness check: last_updated field within 90 days
  • Node lock compliance: cross-reference against fleet-ops.md node assignments
  • Compose skeleton existence: ls ~/.ansible-repo/new-build/phase-*.yml
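The freshness check above can be sketched as a small pure function. This is a minimal sketch, assuming the `last_updated` value has already been fetched from the Docker Hub v2 repository endpoint; the helper names (`hub_api_url`, `parse_timestamp`, `is_fresh`) are hypothetical and not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE_DAYS = 90  # threshold from the success criteria


def hub_api_url(namespace: str, name: str) -> str:
    """Build the Docker Hub v2 repository URL; official images use 'library'."""
    return f"https://hub.docker.com/v2/repositories/{namespace}/{name}/"


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def is_fresh(last_updated: str, as_of: str, max_age_days: int = MAX_AGE_DAYS) -> bool:
    """Return True when last_updated is at most max_age_days before as_of."""
    age = parse_timestamp(as_of) - parse_timestamp(last_updated)
    return age <= timedelta(days=max_age_days)


if __name__ == "__main__":
    print(hub_api_url("library", "traefik"))
    # Beszel, using the dates recorded in this catalog.
    print(is_fresh("2026-04-30", "2026-05-25"))
```

Keeping the comparison separate from the HTTP call means the same function can be run against the dates already recorded in the catalog.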

Failure Modes

| Failure | Mitigation |
| --- | --- |
| DockerHub image stale or abandoned | Flag for alternative image research |
| Node assignment conflicts with locked policy | Escalate to Bobby immediately |
| Service dependency on another Phase 2+ service | Note in Open Questions, defer deployment |

Known Blockers

  • Authelia requires a domain + valid TLS cert. If Bobby does not want to expose services to the public internet, Traefik needs either an internal Tailscale cert or a self-signed CA.
  • Resolved: Technitium DNS upstream forwarding policy. Technitium forwards upstream queries over DoT to Cloudflare (tls://1.1.1.1).

Iron Legion Homelab Services Stack — Constraints

Hard Constraints (Non-Negotiable)

  1. Bare metal over abstraction. Direct deployments preferred. Compose files are acceptable as orchestration glue, but no Docker Swarm mode, no Kubernetes, no abstraction layers Bobby cannot ssh into and debug.
  2. No nginx. Traefik is the sole edge router. No nginx reverse proxies, no nginx sidecars.
  3. No Tailscale serve/funnel. Services bind to 0.0.0.0 on their assigned node and are reachable via Tailscale mesh IP + port. No tailscale serve, no tailscale funnel.
  4. Node assignments locked. Services do not migrate between nodes without Bobby's explicit written direction.
  5. Patch upstream source when loopback/bind restrictions block direct deployment. Do not re-architect around the constraint.

Node Assignment Policy (as of 2026-05-25)

The G9 Swarm Cluster is the ONLY deployment target. Mark5, Bones, Neo, and Mark44 are NOT part of this homelab services stack.

| Node | Role | Services Assigned |
| --- | --- | --- |
| MK7 (mark-vii.ai.home) | Swarm Manager | ALL Phase 1 infrastructure: Traefik, Technitium DNS, Portainer, Prometheus, Beszel, Dozzle, Authelia, Homepage |
| MK33, MK34, MK39, MK42 | Swarm Workers | Phase 2 media stack (Jellyfin, Sonarr, Radarr, Prowlarr), distributed workloads, Vaultwarden, Nextcloud |
| Artemis | AI Foreman / JARVIS | Hermes Agent, Ansible-pull control plane — NOT a service host |

Soft Constraints (Bobby Approval Required to Override)

  • Data residency: All persistent volumes live on-node. No NFS, no Ceph, no distributed storage unless explicitly approved.
  • Secret management: No plain-text secrets in compose files. Use .env files with file mode 0600, or Vaultwarden if a secret store is needed.
  • Backup cadence: Every service with persistent state must have a documented backup target. Default: daily rsync to MK7 secondary storage.

Environment Assumptions

  • All nodes run Debian Trixie or compatible.
  • Docker Engine (not Docker Desktop) is installed on all target nodes.
  • Tailscale is up and meshed. All inter-node traffic is over Tailscale IPs.
  • docker compose plugin (v2) available, not legacy docker-compose standalone.

Iron Legion Homelab Services Stack — Service Catalog

Verified DockerHub Metadata (as of 2026-05-25)

Swarm Placement Legend

| Placement | Swarm Behavior |
| --- | --- |
| Global | One replica on EVERY node (including manager) |
| Replicated (N) | N replicas distributed across workers by scheduler |
| Manager Constraint | Only on manager node(s) |
| Label Constraint | Only on nodes with matching node.label |

Placement Rules for 5-Node Swarm (1 manager + 4 workers)

  • MK7 = Manager (can run global services + manager-constrained services)
  • MK33, MK34, MK39, MK42 = Workers (run global services + replicated services)
  • No node labels yet. Storage nodes (e.g., media storage) will be labeled in Phase 3.
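The legend and placement rules above map onto the compose `deploy:` section roughly as follows. This is a sketch under stated assumptions: the function name `deploy_block` and its arguments are hypothetical, and pinning replicated services to workers is one interpretation of "distributed across workers", since Swarm would otherwise also schedule replicas on the manager.

```python
def deploy_block(placement: str, replicas: int = 1,
                 label_key: str = "", label_value: str = "") -> dict:
    """Translate a legend entry into a compose `deploy:` mapping."""
    if placement == "global":
        # One task per node, manager included.
        return {"mode": "global"}
    if placement == "replicated":
        # Interpretation: keep replicas off the manager, per the legend.
        return {
            "mode": "replicated",
            "replicas": replicas,
            "placement": {"constraints": ["node.role == worker"]},
        }
    if placement == "manager":
        return {
            "mode": "replicated",
            "replicas": 1,
            "placement": {"constraints": ["node.role == manager"]},
        }
    if placement == "label":
        # Requires the node to be labeled first (planned for Phase 3).
        constraint = f"node.labels.{label_key} == {label_value}"
        return {
            "mode": "replicated",
            "replicas": replicas,
            "placement": {"constraints": [constraint]},
        }
    raise ValueError(f"unknown placement: {placement}")


if __name__ == "__main__":
    print(deploy_block("manager"))
    print(deploy_block("label", label_key="storage", label_value="media"))
```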

Network Layer

| Service | Image | Pulls | Stars | Updated | Placement | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Traefik | traefik | 3.49B | 3,634 | 2026-05-13 | Global | Every node receives ingress routing + Docker socket read-only |
| Technitium DNS | technitium/dns-server | 8.99M | 156 | 2026-05-09 | Manager Constraint | Authoritative .ai.home + recursive DNS with DoT forwarder to Cloudflare, ad blocking enabled — port 53 on MK7 only |
| AdGuard Home | adguard/adguardhome | 170.7M | 1,408 | 2026-05-25 | Removed | Replaced by Technitium built-in ad blocking |

Monitoring / Observability

| Service | Image | Pulls | Stars | Updated | Placement | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Prometheus | prom/prometheus | 1.97B | 2,064 | 2026-05-25 | Manager Constraint | Central scraping server on MK7 |
| Prometheus Node Exporter | prom/node-exporter | | | | Global | Runs on every node — exposes CPU/mem/disk metrics for Prometheus to scrape |
| Grafana | grafana/grafana | 5.22B | 3,540 | 2026-05-16 | Replicated (1) | Any worker (Phase 3, needs data history first) |
| Beszel Hub | henrygd/beszel | 12.58M | 32 | 2026-04-30 | Manager Constraint | Central hub on MK7 collects metrics from agents |
| Beszel Agent | henrygd/beszel-agent | | | | Pending | Planned global — reports to hub. Not yet deployed. |
| Dozzle | amir20/dozzle | 309.6M | 144 | 2026-05-25 | Replicated (1) | Any worker — read-only Docker socket |

Management / Dashboard

| Service | Image | Pulls | Stars | Updated | Placement | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Portainer CE | portainer/portainer-ce | 1.46B | 2,665 | 2026-05-20 | Replicated (1) | MK7 — agentless mode, no portainer-agent needed |
| Homepage | gethomepage/homepage | 1.31M | 40 | 2026-05-25 | Replicated (1) | Any worker — all endpoints via env vars |

Security / Identity

| Service | Image | Pulls | Stars | Updated | Placement | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Authelia | authelia/authelia | 75.2M | 208 | 2026-05-25 | Replicated (1) | Any worker — Traefik ForwardAuth middleware |

Existing External Services (NOT in Swarm)

| Service | Location | Status | Notes |
| --- | --- | --- | --- |
| Vaultwarden | Neo (Nebuchadnezzar) | Production | Already deployed via Docker. Managed separately. |
| Nextcloud | Neo (Nebuchadnezzar) | Production | Nextcloud AIO. NOT part of G9 Swarm stack. |

These services live outside the G9 Swarm cluster. No migration planned unless Bobby explicitly requests it.

Media Stack (*arr + Jellyfin)

| Service | Image | Pulls | Stars | Updated | Placement | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Jellyfin | jellyfin/jellyfin | 370.4M | 1,535 | 2026-05-25 | Label Constraint | Nodes matching node.labels.storage == media (Phase 3) |
| Sonarr | linuxserver/sonarr | 2.34B | 2,118 | 2026-05-23 | Replicated (1) | Any worker — needs shared /downloads mount |
| Radarr | linuxserver/radarr | 2.36B | 1,791 | 2026-05-25 | Replicated (1) | Any worker — needs shared /downloads mount |
| Prowlarr | linuxserver/prowlarr | 35.9M | 403 | 2026-05-25 | Replicated (1) | Any worker — feeds Sonarr/Radarr via network |

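Total Services: 15 (active catalog entries, excluding the removed AdGuard Home) + 2 (existing external) = 17 total fleet services

Swarm Services: 15 (includes global Beszel agent and node exporter)

Total DockerHub Pulls (aggregate): ~17.7B across the 13 catalog images with recorded pull counts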

All images updated within 90 days


Iron Legion Homelab Services Stack — Network Architecture

Ingress Flow

[Internet] → [Tailscale mesh] → [MK7: Traefik] → [Target Node: Service Port]

Traefik Role

  • Single entrypoint. Every HTTP/HTTPS service routes through Traefik on MK7.
  • Tailscale-native. Traefik binds to 0.0.0.0:80 and 0.0.0.0:443. No tailscale serve.
  • Service discovery via Docker labels. Each compose service exposes labels that Traefik reads from the Docker socket on MK7.
  • Docker socket access restricted. Traefik mounts a read-only Docker socket. No other service gets socket access.
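The label-based discovery described above can be illustrated by generating the labels a service would carry. This is a minimal sketch: the helper `traefik_labels`, the entrypoint name `websecure`, and the hostname `portainer.labs.internal` are assumptions for illustration, while the label keys follow Traefik's Docker provider naming.

```python
def traefik_labels(name: str, host: str, port: int, middlewares: tuple = ()) -> dict:
    """Build the Docker labels Traefik reads to route one HTTP service."""
    labels = {
        "traefik.enable": "true",
        f"traefik.http.routers.{name}.rule": f"Host(`{host}`)",
        # Assumed entrypoint name; it must match Traefik's static config.
        f"traefik.http.routers.{name}.entrypoints": "websecure",
        f"traefik.http.services.{name}.loadbalancer.server.port": str(port),
    }
    if middlewares:
        labels[f"traefik.http.routers.{name}.middlewares"] = ",".join(middlewares)
    return labels


if __name__ == "__main__":
    # Hypothetical hostname; port 9000 is the reserved Portainer port.
    for key, value in traefik_labels(
        "portainer", "portainer.labs.internal", 9000, ("authelia@docker",)
    ).items():
        print(f"{key}={value}")
```

Generating labels from one function keeps router and service names in step, which avoids the common mismatch between a router name and its load balancer entry.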

Internal Traffic Patterns

| Source | Destination | Protocol | Port | Notes |
| --- | --- | --- | --- | --- |
| Traefik (MK7) | Any service | HTTP/HTTPS | Varies | Proxied via Tailscale IP |
| Beszel (MK7) | Any node | HTTP | Varies | Agent polls HTTP metrics endpoints (read-only) |
| Prometheus (MK7) | Any node | HTTP | 9100 (node-exporter) | Scrapes node and container metrics |
| Prowlarr (MK7) | Indexer sites | HTTPS | 443 | Outbound only |
| Sonarr/Radarr (MK7) | Prowlarr | HTTP | 9696 | Internal indexer lookup |
| Nextcloud (MK7) | PostgreSQL (MK7) | TCP | 5432 | DB traffic over Tailscale |

DNS Resolution

| Component | Status | Detail |
| --- | --- | --- |
| Technitium (MK7) | Deployed | Container running, port 53/5380 open |
| *.ai.home zone | Pending | Not yet configured as authoritative — Tailscale MagicDNS currently handles name resolution |
| Technitium DNS (MK7) | Active | Authoritative .ai.home + recursive resolver + ad blocking on port 53. |
| AdGuard Home | Removed | Replaced by Technitium built-in ad blocking |

Planned Chain (not yet active):

Client → Technitium (authoritative `.ai.home`? → return local record) → Technitium (recursive resolver + blocklist) → Cloudflare DoT / Root Servers

Current Fallback: Tailscale MagicDNS provides *.ai.home resolution via Tailscale IP addresses. Technitium will assume authority once zone records are populated.

  • Technitium DNS admin UI runs on port 5380.
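The planned chain can be read as a three-way decision per query. The sketch below models only that decision, not Technitium itself: the function `resolve`, the sample record for mark-vii.ai.home, the address 100.64.0.7, and the blocklist entry are all hypothetical.

```python
LOCAL_ZONE = "ai.home"      # authoritative zone from this PRD
UPSTREAM = "tls://1.1.1.1"  # DoT forwarder from this PRD


def resolve(name: str, zone_records: dict, blocklist: set) -> tuple:
    """Return (decision, detail) for one query, following the planned chain."""
    fqdn = name.rstrip(".").lower()
    if fqdn == LOCAL_ZONE or fqdn.endswith("." + LOCAL_ZONE):
        # Authoritative answer: never forwarded upstream.
        address = zone_records.get(fqdn)
        return ("authoritative", address) if address else ("nxdomain", None)
    if fqdn in blocklist:
        return ("blocked", None)
    return ("forward", UPSTREAM)


if __name__ == "__main__":
    records = {"mark-vii.ai.home": "100.64.0.7"}  # hypothetical address
    blocked = {"ads.example.com"}                 # hypothetical blocklist entry
    for query in ("mark-vii.ai.home", "ads.example.com", "debian.org"):
        print(query, resolve(query, records, blocked))
```

Checking the local zone before the blocklist mirrors the chain as drawn: a local name is answered from zone records and never leaves MK7.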

Port Allocation (Reserved)

| Port | Service |
| --- | --- |
| 53 | DNS (Technitium) |
| 80/443 | HTTP/S (Traefik) |
| 3000 | Grafana |
| 9090 | Prometheus |
| 9000 | Portainer |
| 8096 | Jellyfin |
| 8989 | Sonarr |
| 7878 | Radarr |
| 9696 | Prowlarr |
| 9091 | Authelia (default) |

TLS Strategy

  • Internal: Traefik generates self-signed certs for *.labs.internal. Authelia can enforce client-cert if needed.
  • External: Not applicable per no-Tailscale-funnel constraint. If Bobby later wants public access, Let's Encrypt via DNS challenge (Technitium controls the zone).

Iron Legion Homelab Services Stack — Data & Persistence

Volume Strategy

Every service with persistent state uses bind mounts to on-node directories. No named volumes, no NFS, no distributed storage.

Directory Convention

/opt/iron-legion/
├── service-name/
│   ├── data/           # Application data (databases, config, state)
│   ├── config/         # Static config files mounted read-only where possible
│   └── logs/           # Log output (optional, if not sent to stdout)
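The convention above can be created per service with a short helper. This is a sketch, not the Ansible-pull manifest: the function name `create_service_dirs` is hypothetical, and the demo writes to a temporary directory rather than /opt/iron-legion.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("data", "config", "logs")  # from the directory convention


def create_service_dirs(base: Path, service: str) -> list:
    """Create <base>/<service>/{data,config,logs} and return the paths."""
    created = []
    for sub in SUBDIRS:
        path = base / service / sub
        # exist_ok makes the helper safe to re-run on an existing layout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real /opt/iron-legion.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_service_dirs(Path(tmp), "traefik"):
            print(path.relative_to(tmp))
```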

Per-Service Persistence

| Service | Data Path | Backup Target | Size Estimate |
| --- | --- | --- | --- |
| Traefik | /opt/iron-legion/traefik/config/, /opt/iron-legion/traefik/certs/ | MK7 (daily rsync) | < 50 MB |
| Technitium DNS | /opt/iron-legion/technitium/config/ | MK7 | < 10 MB |
| AdGuard Home | /opt/iron-legion/adguard/work/, /opt/iron-legion/adguard/conf/ | Removed | N/A |
| Prometheus | /opt/iron-legion/prometheus/data/ | MK7 (retention: 15d local, 90d backup) | 5–20 GB |
| Grafana | /opt/iron-legion/grafana/data/ | MK7 | < 500 MB |
| Beszel | /opt/iron-legion/beszel/data/ | MK7 | < 1 GB |
| Portainer | /opt/iron-legion/portainer/data/ | MK7 | < 100 MB |
| Homepage | /opt/iron-legion/homepage/config/ | MK7 | < 10 MB |
| Vaultwarden | /opt/iron-legion/vaultwarden/data/ | MK7 (encrypted) | < 500 MB |
| Authelia | /opt/iron-legion/authelia/config/ | MK7 | < 10 MB |
| Jellyfin | /opt/iron-legion/jellyfin/config/, /opt/iron-legion/jellyfin/media/ | None (media too large) | < 1 GB config; media drive separate |
| Sonarr | /opt/iron-legion/sonarr/config/ | MK7 | < 1 GB |
| Radarr | /opt/iron-legion/radarr/config/ | MK7 | < 1 GB |
| Prowlarr | /opt/iron-legion/prowlarr/config/ | MK7 | < 100 MB |
| Nextcloud | /opt/iron-legion/nextcloud/data/ | MK7 (snapshots) | 10–50 GB |

Media Storage Exception

  • Jellyfin media lives on a separate mount (likely external USB/NVMe on MK7). Not backed up via rsync.
  • Sonarr/Radarr download staging to a shared /downloads bind mount, then hardlink/copy to Jellyfin media library.

Backup Tooling

  • Primary: rsync -a --delete to MK7 secondary storage daily at 03:00 local.
  • Vaultwarden: sqlite3 dump + rsync (encrypted at rest on MK7).
  • Prometheus: snapshot API → rsync (not raw WAL files).
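The Vaultwarden step above depends on taking a consistent copy of a live SQLite file before rsync runs. One way to sketch that is with Python's built-in `sqlite3` online backup API. The helper names, the `db.sqlite3` filename, and the `/backups/vaultwarden/` destination are assumptions for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def snapshot_sqlite(src: Path, dest: Path) -> None:
    """Copy a live SQLite database consistently via the online backup API."""
    source = sqlite3.connect(str(src))
    target = sqlite3.connect(str(dest))
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def rsync_command(src_dir: str, dest: str) -> list:
    """Build the `rsync -a --delete` argument list (trailing slash = contents)."""
    return ["rsync", "-a", "--delete", src_dir.rstrip("/") + "/", dest]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "db.sqlite3"  # assumed filename
        conn = sqlite3.connect(str(live))
        conn.execute("CREATE TABLE demo (value TEXT)")
        conn.execute("INSERT INTO demo VALUES ('ok')")
        conn.commit()
        conn.close()
        snapshot_sqlite(live, Path(tmp) / "backup.sqlite3")
    # Assumed destination path on MK7 secondary storage.
    print(" ".join(rsync_command("/opt/iron-legion/vaultwarden/data", "/backups/vaultwarden/")))
```

Copying the raw database file while the service is writing can capture a half-committed state. The backup API avoids that, for the same reason the Prometheus bullet uses the snapshot API rather than raw WAL files.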

Secret Management

  • .env files live in /opt/iron-legion/service-name/.env, mode 0600.
  • Compose files use ${VAR_NAME} syntax, never literal strings.
  • Vaultwarden stores shared secrets (DB passwords, API keys). Artemis holds no secrets in memory.
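The .env rules above can be sketched as a helper that writes the file with mode 0600 from the start. This is a minimal sketch for POSIX hosts: the helper names and the variable name `AUTHELIA_SESSION_SECRET` in the demo are assumptions, and the 64-byte length follows the Authelia session secret entry in the Security Model.

```python
import os
import secrets
import stat
import tempfile
from pathlib import Path


def new_session_secret() -> str:
    """Return 64 random bytes encoded as 128 hex characters."""
    return secrets.token_hex(64)


def write_env_file(path: Path, values: dict) -> None:
    """Write KEY=value lines to path, readable and writable by the owner only."""
    body = "".join(f"{key}={value}\n" for key, value in values.items())
    fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # Enforce the mode even when the file already existed with looser bits.
    os.chmod(str(path), 0o600)


def file_mode(path: Path) -> int:
    """Return the permission bits of path, e.g. 0o600."""
    return stat.S_IMODE(os.stat(str(path)).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_path = Path(tmp) / ".env"
        write_env_file(env_path, {"AUTHELIA_SESSION_SECRET": new_session_secret()})
        print(oct(file_mode(env_path)))
```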

Iron Legion Homelab Services Stack — Security Model

Authentication Layers

| Layer | Service | Scope | Notes |
| --- | --- | --- | --- |
| Edge Auth | Authelia | Traefik-secured endpoints | MFA portal, session cookies |
| App Auth | Vaultwarden | Password vault | Master password + 2FA |
| App Auth | Portainer | Container mgmt | Built-in RBAC, can integrate LDAP |
| App Auth | Nextcloud | File collaboration | Built-in, can integrate Authelia OIDC |
| OS Auth | SSH keys | Node access | Tailscale SSH + local keypairs |

Authelia Deployment Notes

  • Target node: MK7 (lightweight, sits beside Traefik)
  • Redirection URL: Set Authelia redirection_url to the base domain of services needing auth.
  • Backend storage: Uses SQLite initially. If Bobby wants HA, migrate to PostgreSQL on MK7.
  • Notification method: File-based (writes to /opt/iron-legion/authelia/notifications/) until SMTP/Discord is configured.
  • Rule granularity: Per-service access_control rules in configuration.yml. Default: one_factor for internal services, two_factor for management interfaces (Portainer, Grafana admin).

Traefik ↔ Authelia Integration

# Traefik middleware label (example)
traefik.http.routers.portainer.middlewares: authelia@docker
traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/verify?rd=https://auth.labs.internal

  • No nginx. ForwardAuth middleware talks directly to Authelia over internal Docker network.
  • Bypass list: Prometheus scrape targets, Beszel agents, Technitium DNS queries — these are internal metrics/DNS, no auth required.
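The rule granularity and bypass list described in this section can be sketched as a policy lookup that emits an Authelia-style `access_control` mapping. The service identifiers, the `labs.internal` suffix, and the helper names are hypothetical; `bypass`, `one_factor`, and `two_factor` are Authelia policy names.

```python
MANAGEMENT = {"portainer", "grafana"}    # two_factor per the deployment notes
BYPASS = {"prometheus", "beszel-agent"}  # hypothetical IDs for the bypass list
DEFAULT_POLICY = "one_factor"            # default for internal services


def policy_for(service: str) -> str:
    """Return the policy for one service; bypass wins over everything else."""
    if service in BYPASS:
        return "bypass"
    if service in MANAGEMENT:
        return "two_factor"
    return DEFAULT_POLICY


def access_control(domain_suffix: str, services: list) -> dict:
    """Build an access_control mapping with rules only for non-default policies."""
    rules = [
        {"domain": f"{service}.{domain_suffix}", "policy": policy_for(service)}
        for service in services
        if policy_for(service) != DEFAULT_POLICY
    ]
    return {"default_policy": DEFAULT_POLICY, "rules": rules}


if __name__ == "__main__":
    config = access_control("labs.internal", ["portainer", "homepage", "prometheus"])
    for rule in config["rules"]:
        print(rule["domain"], rule["policy"])
```

Emitting rules only for services that differ from the default keeps configuration.yml short and makes every exception explicit.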

Secret Handling

| Secret Type | Storage Method | Rotation Trigger |
| --- | --- | --- |
| Authelia session secret | .env file, 64-byte random hex | On any Authelia config reload |
| Vaultwarden admin token | .env file, 48-byte random | Only on compromise |
| DB passwords (Nextcloud ↔ PostgreSQL) | .env files on both nodes | On any DB migration or rebuild |
| Tailscale auth keys | Vaultwarden secure note | On key expiry or node rebuild |
| API keys (indexers, Cloudflare) | Vaultwarden secure note | On key rotation by provider |

Network Segmentation

  • No VLANs. Tailscale ACLs handle segment isolation.
  • ACL policy (draft):
    • tag:admin nodes (Bobby, Artemis) → all ports on all nodes
    • tag:services (MK7 manager + MK33, MK34, MK39, MK42 workers) → only their assigned service ports, no cross-node SSH except via Tailscale SSH
    • tag:user (Bobby's phone, laptop) → HTTPS 443 on MK7 only, Jellyfin 8096 on MK7 directly
  • Default deny. Any traffic not explicitly allowed in Tailscale ACL is dropped.
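Part of the draft above can be expressed in Tailscale's policy file format. This is a sketch only: the MK7 address 100.64.0.7 is a placeholder, `autogroup:admin` as tag owner is an assumption, and the tag:services rule is left out because the per-service port list is not yet defined in this PRD.

```python
import json

MK7_TAILSCALE_IP = "100.64.0.7"  # hypothetical placeholder address


def draft_policy(mk7_ip: str) -> dict:
    """Build a policy covering the tag:admin and tag:user rules from the draft."""
    tags = ["tag:admin", "tag:services", "tag:user"]
    return {
        # Tags must be declared before they can appear in a rule.
        "tagOwners": {tag: ["autogroup:admin"] for tag in tags},
        "hosts": {"mk7": mk7_ip},
        "acls": [
            # Admin devices reach every port on every node.
            {"action": "accept", "src": ["tag:admin"], "dst": ["*:*"]},
            # User devices reach HTTPS and Jellyfin on MK7 only.
            {"action": "accept", "src": ["tag:user"], "dst": ["mk7:443", "mk7:8096"]},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(draft_policy(MK7_TAILSCALE_IP), indent=2))
```

No explicit deny rule is needed: anything not matched by an accept rule is dropped, which is the default-deny behavior the draft relies on.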

Monitoring for Security Events

  • Dozzle provides real-time log viewing but is NOT a SIEM.
  • Promtail/Loki not yet in catalog. If Bobby wants log aggregation + alerting, add to Phase 3.
  • Beszel alerts on anomalous CPU/memory — use as coarse intrusion detection proxy.

Iron Legion Homelab Services Stack — Deployment Phases

Phase 1: Infrastructure (Critical Path)

Goal: Get DNS, proxy, and basic monitoring alive. Everything else depends on this.

| Order | Service | Target Node | Why First | Dependencies |
| --- | --- | --- | --- | --- |
| 1 | Technitium DNS | MK7 | Name resolution for internal services | None |
| 2 | Technitium DNS | MK7 | Authoritative + recursive + ad-block | N/A — single service |
| | AdGuard Home | Removed | Technitium replaces AdGuard | |
| 3 | Traefik | MK7 | Edge router for all HTTP ingress | DNS (needs *.labs.internal to resolve) |
| 4 | Authelia | MK7 | Auth layer before exposing any mgmt UI | Traefik (depends on ForwardAuth middleware) |
| 5 | Portainer | MK7 | Container management UI | Traefik + Authelia (for secured access) |
| 6 | Prometheus | MK7 | Metrics collection baseline | None (scrape targets added in Phase 2) |
| 7 | Beszel | MK7 | Fleet resource overview | None (agents installed per-node) |
| 8 | Dozzle | MK7 | Real-time log viewing | None |

Phase 1 milestone: All nodes report healthy in Beszel. Portainer accessible via auth portal. DNS resolves.


Phase 2: Media & File Collaboration

Goal: Self-hosted media acquisition and file sync.

| Order | Service | Target Node | Why Now | Dependencies |
| --- | --- | --- | --- | --- |
| 9 | Jellyfin | MK7 | Media playback (GPU transcode if MK7 has dGPU) | None (file ingest later) |
| 10 | Sonarr | MK7 | TV management | Jellyfin (pushes organized files) |
| 11 | Radarr | MK7 | Movie management | Jellyfin (pushes organized files) |
| 12 | Prowlarr | MK7 | Indexer aggregation | Sonarr + Radarr (feeds them) |
| 13 | Nextcloud | MK7 | File sync/collaboration | PostgreSQL (on MK7) |
| 14 | Vaultwarden | MK7 | Password management | None (standalone) |

Phase 2 milestone: Media acquisition pipeline works end-to-end. Nextcloud syncs. Vaultwarden stores secrets.


Phase 3: Polish & Expansion

Goal: Dashboards, advanced monitoring, nice-to-haves.

| Order | Service | Target Node | Why Deferred | Dependencies |
| --- | --- | --- | --- | --- |
| 15 | Grafana | MK7 | Dashboards need metrics to be interesting | Prometheus (needs data history) |
| 16 | Homepage | MK7 | Custom dashboard for everything | All Phase 1+2 services (needs endpoints) |
| | Promtail + Loki | TBD | Centralized logging | Only if Dozzle is insufficient |
| | Uptime-Kuma | TBD | External uptime monitoring | Only if Beszel alerting is insufficient |

Phase 3 milestone: Single-pane dashboard (Homepage) shows all services. Alerts route to Discord or email.

Deployment Cadence

  • One service per session. No mass deployments. Validate each before proceeding.
  • Rollback plan: docker compose down + mv /opt/iron-legion/service{,-failed-$(date +%s)}. Snapshot taken before each compose up.
  • Bobby approval required before Phase 2 begins. Phase 1 success must be demonstrated.
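The rollback step above amounts to stopping the stack and moving its directory aside under a timestamped name. A minimal sketch, assuming hypothetical helper names; the demo renames a temporary directory and only prints the compose command rather than running it.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional


def compose_down_command() -> list:
    """Argument list for stopping the stack; run it from the compose directory."""
    return ["docker", "compose", "down"]


def quarantine(service_dir: Path, now: Optional[float] = None) -> Path:
    """Rename <service> to <service>-failed-<epoch> and return the new path."""
    stamp = int(time.time() if now is None else now)
    target = service_dir.with_name(f"{service_dir.name}-failed-{stamp}")
    service_dir.rename(target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        failed = Path(tmp) / "dozzle"
        failed.mkdir()
        print(" ".join(compose_down_command()))
        print(quarantine(failed).name)
```

Renaming rather than deleting keeps the failed state available for inspection, matching the `mv` in the rollback plan.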

Iron Legion Homelab Services Stack — Open Questions & Blockers

Blocker Status

| # | Question | Impact | Default if Unresolved |
| --- | --- | --- | --- |
| 1 | Domain name — Does Bobby own a domain (e.g., bobbysh.me) or do we use a fake TLD (labs.internal)? | Critical — TLS certs, Authelia, and DNS all depend on this. | Use labs.internal + self-signed CA |
| 2 | Technitium upstream — DoH, DoT, or plain UDP to upstream resolver (e.g., Cloudflare 1.1.1.1)? | Resolved | DoT forwarder to Cloudflare (tls://1.1.1.1) |
| 3 | AdGuard Home vs Technitium layout | Resolved | AdGuard Home removed; Technitium built-in ad blocking replaces it |
| 4 | Jellyfin media storage — External USB on MK7? SMB share? NVMe? | Medium | External USB mounted at /media on MK7 |
| 5 | Backup target on MK7 — Capacity? Dedicated drive? Rsync target path? | Medium | /backups/<service-name>/ on MK7 secondary storage |
| 6 | Nextcloud database — Use existing PostgreSQL on MK7, or deploy Nextcloud AIO (bundled)? | Medium — affects resource allocation on MK7 | Deploy standalone PostgreSQL container on MK7; Nextcloud AIO is too heavy |
| 7 | GPU on MK7 — NVIDIA driver runtime for Jellyfin transcode? | Low — falls back to CPU transcode | Use jellyfin/jellyfin with NVIDIA_VISIBLE_DEVICES env if available |
| 8 | Notification routing — Discord webhook? SMTP? File only? | Low — default file works | File notifications in /opt/iron-legion/authelia/notifications/ |
| 9 | Tailscale ACL policy — Draft exists in Section 7. Bobby must review and apply in Tailscale admin console. | Low | Stay permissive until Bobby approves |
| 10 | Beszel alert thresholds — CPU %, memory %, disk % triggers not defined. | Low | Defaults in Beszel container |

Outstanding Decisions Required

  1. Pi-hole inclusion: Resolved. Technitium built-in ad blocking replaces Pi-hole.
  2. Authelia two-factor method — TOTP via app (Google Authenticator) vs WebAuthn/FIDO2 keys?
  3. Home vs remote access — If Bobby wants to share Jellyfin with friends/family outside Tailscale, public domain + Authelia guard is required.

Appendix A — Raw DockerHub Metadata Table

Full API response data captured 2026-05-25T16:45:00Z.

| Service | Full Image | Namespace | Pulls | Stars | Last Updated | API Status |
| --- | --- | --- | --- | --- | --- | --- |
| Traefik | traefik | library | 3,490,588,071 | 3,634 | 2026-05-13 | 200 |
| Technitium DNS | technitium/dns-server | technitium | 8,989,831 | 156 | 2026-05-09 | 200 |
| Homepage | gethomepage/homepage | gethomepage | 1,305,710 | 40 | 2026-05-25 | 200 |
| Beszel | henrygd/beszel | henrygd | 12,578,135 | 32 | 2026-04-30 | 200 |
| Dozzle | amir20/dozzle | amir20 | 309,561,399 | 144 | 2026-05-25 | 200 |
| Grafana | grafana/grafana | grafana | 5,220,434,031 | 3,540 | 2026-05-16 | 200 |
| Prometheus | prom/prometheus | prom | 1,966,043,381 | 2,064 | 2026-05-25 | 200 |
| Portainer CE | portainer/portainer-ce | portainer | 1,464,874,500 | 2,665 | 2026-05-20 | 200 |
| Jellyfin | jellyfin/jellyfin | jellyfin | 370,358,966 | 1,535 | 2026-05-25 | 200 |
| Sonarr | linuxserver/sonarr | linuxserver | 2,339,638,307 | 2,118 | 2026-05-23 | 200 |
| Radarr | linuxserver/radarr | linuxserver | 2,359,097,569 | 1,791 | 2026-05-25 | 200 |
| Prowlarr | linuxserver/prowlarr | linuxserver | 35,913,487 | 403 | 2026-05-25 | 200 |
| Vaultwarden | vaultwarden/server | vaultwarden | 287,182,978 | 1,454 | 2026-05-17 | 200 |
| Nextcloud | nextcloud | library | 1,011,978,204 | 4,485 | 2026-05-23 | 200 |
| Authelia | authelia/authelia | authelia | 75,183,682 | 208 | 2026-05-25 | 200 |

  • Total unique images: 15
  • Community health indicator: All images have > 10 stars and > 1M pulls. Beszel (32 stars) and Homepage (40 stars) have the lowest star counts, which is acceptable for young projects.
  • Freshness: All images were updated within 90 days of capture. The oldest is Beszel (2026-04-30), still well inside the threshold.
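The aggregate pull figures can be recomputed directly from the Appendix A rows. The sketch below copies the pull counts from the table; treating Vaultwarden and Nextcloud as the external images follows the "Existing External Services" section, and the variable names are illustrative only.

```python
# Pull counts copied from Appendix A (captured 2026-05-25).
PULLS = {
    "Traefik": 3_490_588_071,
    "Technitium DNS": 8_989_831,
    "Homepage": 1_305_710,
    "Beszel": 12_578_135,
    "Dozzle": 309_561_399,
    "Grafana": 5_220_434_031,
    "Prometheus": 1_966_043_381,
    "Portainer CE": 1_464_874_500,
    "Jellyfin": 370_358_966,
    "Sonarr": 2_339_638_307,
    "Radarr": 2_359_097_569,
    "Prowlarr": 35_913_487,
    "Vaultwarden": 287_182_978,
    "Nextcloud": 1_011_978_204,
    "Authelia": 75_183_682,
}
EXTERNAL = {"Vaultwarden", "Nextcloud"}  # hosted on Neo, outside the Swarm

total_all = sum(PULLS.values())
total_swarm = sum(count for name, count in PULLS.items() if name not in EXTERNAL)

if __name__ == "__main__":
    print(f"all images:   {total_all / 1e9:.2f}B")
    print(f"swarm images: {total_swarm / 1e9:.2f}B")
```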

Appendix B — Compose Skeleton Directory Map

~/.ansible-repo/new-build/
├── phase-1/                    # Infrastructure
│   ├── technitium/
│   ├── traefik/
│   ├── authelia/
│   ├── portainer/
│   ├── prometheus/
│   ├── beszel/
│   └── dozzle/
├── phase-2/                    # Media + Files
│   ├── jellyfin/
│   ├── sonarr/
│   ├── radarr/
│   ├── prowlarr/
│   ├── nextcloud/
│   └── vaultwarden/
└── phase-3/                    # Dashboards + Polish
    ├── grafana/
    ├── homepage/
    └── loki/                   # Optional

Skeleton not yet created. Deferred until Bobby approves PRD.