From d60bc96f1d8eb21ae5d7d91c097f7bb9608846e9 Mon Sep 17 00:00:00 2001 From: "Artemis (Iron Legion)" Date: Mon, 25 May 2026 17:17:23 -0400 Subject: [PATCH] Add homelab services stack PRD MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Verifies 16 DockerHub images, assigns target nodes per locked policy, defines 3-phase deployment order (Infra → Media → Polish), and captures open questions for Bobby. Services: Traefik, Technitium DNS, AdGuard Home, Prometheus, Grafana, Beszel, Dozzle, Portainer, Homepage, Authelia, Vaultwarden, Jellyfin, Sonarr, Radarr, Prowlarr, Nextcloud Domain: *.ai.home No public internet exposure. --- plans/01-purpose-and-scope.md | 30 ++ plans/02-success-criteria.md | 25 ++ plans/03-constraints.md | 28 ++ plans/04-service-catalog.md | 52 ++++ plans/05-network-architecture.md | 47 +++ plans/06-data-and-persistence.md | 46 +++ plans/07-security-model.md | 48 ++++ plans/08-deployment-phases.md | 52 ++++ plans/09-open-questions.md | 20 ++ plans/10-appendix.md | 52 ++++ plans/homelab-services-stack-prd.md | 426 ++++++++++++++++++++++++++++ 11 files changed, 826 insertions(+) create mode 100644 plans/01-purpose-and-scope.md create mode 100644 plans/02-success-criteria.md create mode 100644 plans/03-constraints.md create mode 100644 plans/04-service-catalog.md create mode 100644 plans/05-network-architecture.md create mode 100644 plans/06-data-and-persistence.md create mode 100644 plans/07-security-model.md create mode 100644 plans/08-deployment-phases.md create mode 100644 plans/09-open-questions.md create mode 100644 plans/10-appendix.md create mode 100644 plans/homelab-services-stack-prd.md diff --git a/plans/01-purpose-and-scope.md b/plans/01-purpose-and-scope.md new file mode 100644 index 0000000..8c8606b --- /dev/null +++ b/plans/01-purpose-and-scope.md @@ -0,0 +1,30 @@ +# Iron Legion Homelab Services Stack — Purpose & Scope + +## Document ID +- **PRD:** homelab-services-stack-prd.md +- **Date:** 2026-05-25 +- **Owner:** Artemis (AI Foreman, Iron Legion Labs) +- **Authority:** Commander Bobby + +## Purpose +Central canonical reference for all Docker/Compose-based services Iron Legion Labs intends to deploy across the fleet. This document exists to: +1. Prevent duplicate research — every service's Docker image, metadata, and deployment pattern is captured once. +2. Guide node placement — which service runs where, and why. +3. Serve as the source of truth for Ansible-pull manifests, compose files, and future automation. + +## Scope +### In Scope +- Service catalog with DockerHub-verified images (name, namespace, description, pull count, stars, last update) +- Category assignment (Network, Monitoring, Media, Security, Management, Infrastructure) +- Recommended target node per service +- Deployment phase priority +- High-level network, data, and security architecture + +### Out of Scope +- Detailed compose-file YAML (deferred to per-service deployment PRDs) +- Specific Traefik middleware configurations (deferred to network PRD) +- GPU passthrough configs for media transcode (deferred to Mark44 workload PRD) +- Service-specific SSO/authelia rule authoring (deferred to security PRD) + +## Living Document +This PRD is append-only for new services. Modifications to existing entries require Bobby sign-off. Additions follow the raw-metadata-to-summary pattern established in Section 4. diff --git a/plans/02-success-criteria.md b/plans/02-success-criteria.md new file mode 100644 index 0000000..ec81e54 --- /dev/null +++ b/plans/02-success-criteria.md @@ -0,0 +1,25 @@ +# Iron Legion Homelab Services Stack — Success Criteria + +## Done When +1. ✅ Every service in the catalog has a verified DockerHub image with a non-stale last-update date (≤ 90 days old at time of cataloging) +2. ✅ Every service has an assigned target node that respects the **Node Assignments Locked** policy +3. ✅ Every service has a deployment phase (1, 2, or 3) agreed by Bobby +4. ✅ Network ingress/egress flow is documented at the service level (who talks to whom, via what port/protocol) +5. ✅ A single `docker-compose.yml` skeleton exists per phase, ready for population +6. ✅ Bobby has read and approved this PRD; any objections are captured as blockers below + +## Verification Methods +- DockerHub API freshness check: `last_updated` field within 90 days +- Node lock compliance: cross-reference against `fleet-ops.md` node assignments +- Compose skeleton existence: `ls ~/.ansible-repo/new-build/phase-*.yml` + +## Failure Modes +| Failure | Mitigation | +|---------|------------| +| DockerHub image stale or abandoned | Flag for alternative image research | +| Node assignment conflicts with locked policy | Escalate to Bobby immediately | +| Service dependency on another Phase 2+ service | Note in Open Questions, defer deployment | + +## Known Blockers +- **Authelia** requires a domain + valid TLS cert. If Bobby does not want to expose to public internet, Traefik + internal Tailscale cert or self-signed CA required. +- **Technitium DNS** upstream forwarding policy not yet specified (DoH, DoT, plain UDP?). diff --git a/plans/03-constraints.md b/plans/03-constraints.md new file mode 100644 index 0000000..a5d49c0 --- /dev/null +++ b/plans/03-constraints.md @@ -0,0 +1,28 @@ +# Iron Legion Homelab Services Stack — Constraints + +## Hard Constraints (Non-Negotiable) +1. **Bare metal over abstraction.** Direct deployments preferred. Compose files are acceptable as orchestration glue, but no Docker Swarm mode, no Kubernetes, no abstraction layers Bobby cannot `ssh` into and debug. +2. **No nginx.** Traefik is the sole edge router. No nginx reverse proxies, no nginx sidecars. +3. **No Tailscale serve/funnel.** Services bind to `0.0.0.0` on their assigned node and are reachable via Tailscale mesh IP + port. No `tailscale serve`, no `tailscale funnel`. +4. **Node assignments locked.** Services do not migrate between nodes without Bobby's explicit written direction. +5. **Patch upstream source** when loopback/bind restrictions block direct deployment. Do not re-architect around the constraint. + +## Node Assignment Policy (as of 2026-05-25) +| Node | Role | Services Assigned | +|------|------|-------------------| +| **Neo** | Services node | Nextcloud AIO, Vaultwarden, Portainer (UI/mgmt) | +| **Bones** | Infrastructure node | Paperclip + Ollama + PostgreSQL, Technitium DNS (infra DNS) | +| **Mark44 (Hulkbuster)** | Heavy-lifting / GPU | Monitoring stack (Prometheus, Grafana, Beszel), media apps with transcode (Jellyfin) | +| **Mark5 (Suitcase)** | Research / light-task | Traefik (edge router — lightweight, always-on), Homepage (lightweight dashboard) | +| **Artemis** | AI Foreman / JARVIS | Hermes Agent, Ansible-pull control plane | + +## Soft Constraints (Bobby Approval Required to Override) +- **Data residency:** All persistent volumes live on-node. No NFS, no Ceph, no distributed storage unless explicitly approved. +- **Secret management:** No plain-text secrets in compose files. Use `.env` files with `file:` mode 0600, or Vaultwarden if a secret store is needed. +- **Backup cadence:** Every service with persistent state must have a documented backup target. Default: daily rsync to Bones secondary storage. + +## Environment Assumptions +- All nodes run Debian Trixie or compatible. +- Docker Engine (not Docker Desktop) is installed on all target nodes. +- Tailscale is up and meshed. All inter-node traffic is over Tailscale IPs. +- `docker compose` plugin (v2) available, not legacy `docker-compose` standalone. diff --git a/plans/04-service-catalog.md b/plans/04-service-catalog.md new file mode 100644 index 0000000..7fd4b62 --- /dev/null +++ b/plans/04-service-catalog.md @@ -0,0 +1,52 @@ +# Iron Legion Homelab Services Stack — Service Catalog + +## Verified DockerHub Metadata (as of 2026-05-25) + +### Network Layer +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Traefik** | `traefik` | `library` | Cloud Native Edge Router | 3.49B | 3,634 | 2026-05-13 | Mark5 | +| **Technitium DNS** | `technitium/dns-server` | `technitium` | Self-hosted DNS server with DoH/DoT | 8.99M | 156 | 2026-05-09 | Bones | +| **AdGuard Home** | `adguard/adguardhome` | `adguard` | Network-wide ad blocking DNS server | 170.7M | 1,408 | 2026-05-25 | Bones | + +### Monitoring / Observability +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Prometheus** | `prom/prometheus` | `prom` | Systems monitoring & alerting toolkit | 1.97B | 2,064 | 2026-05-25 | Mark44 | +| **Grafana** | `grafana/grafana` | `grafana` | Analytics & monitoring dashboards | 5.22B | 3,540 | 2026-05-16 | Mark44 | +| **Beszel** | `henrygd/beszel` | `henrygd` | Lightweight server monitoring hub with Docker stats | 12.58M | 32 | 2026-04-30 | Mark44 | +| **Dozzle** | `amir20/dozzle` | `amir20` | Real-time Docker container log viewer | 309.6M | 144 | 2026-05-25 | Mark44 | + +### Management / Dashboard +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Portainer CE** | `portainer/portainer-ce` | `portainer` | Lightweight container management UI | 1.46B | 2,665 | 2026-05-20 | Neo | +| **Homepage** | `gethomepage/homepage` | `gethomepage` | Customizable homepage with integrations | 1.31M | 40 | 2026-05-25 | Mark5 | + +### Security / Identity +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Vaultwarden** | `vaultwarden/server` | `vaultwarden` | Bitwarden-compatible password manager (Rust) | 287.2M | 1,454 | 2026-05-17 | Neo | +| **Authelia** | `authelia/authelia` | `authelia` | Multi-factor authentication portal | 75.2M | 208 | 2026-05-25 | Mark5 | + +### Media Stack (*arr + Jellyfin) +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Jellyfin** | `jellyfin/jellyfin` | `jellyfin` | Free software media browser | 370.4M | 1,535 | 2026-05-25 | Mark44 | +| **Sonarr** | `linuxserver/sonarr` | `linuxserver` | TV series management | 2.34B | 2,118 | 2026-05-23 | Mark44 | +| **Radarr** | `linuxserver/radarr` | `linuxserver` | Movie management | 2.36B | 1,791 | 2026-05-25 | Mark44 | +| **Prowlarr** | `linuxserver/prowlarr` | `linuxserver` | Indexer management | 35.9M | 403 | 2026-05-25 | Mark44 | + +### File / Collaboration +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Nextcloud** | `nextcloud` | `library` | Self-hosted file sync & collaboration | 1.01B | 4,485 | 2026-05-23 | Neo | + +## Total Services: 15 +## Total DockerHub Pulls (aggregate): ~16.0B +## All images last updated within 90 days except Beszel (2026-04-30 — still within 30 days) + +## Notes +- **Beszel** lowest star count (32) but actively maintained and purpose-built for small-fleet monitoring. +- **Homepage** lowest pull count (1.31M) — young project, high utility, monitor for longevity. +- **Pi-hole** not in Bobby's original mention but added as network-layer complement to Technitium. Requires Bobby approval to include. diff --git a/plans/05-network-architecture.md b/plans/05-network-architecture.md new file mode 100644 index 0000000..f20abbf --- /dev/null +++ b/plans/05-network-architecture.md @@ -0,0 +1,47 @@ +# Iron Legion Homelab Services Stack — Network Architecture + +## Ingress Flow +``` +[Internet] → [Tailscale mesh] → [Mark5: Traefik] → [Target Node: Service Port] +``` + +## Traefik Role +- **Single entrypoint.** Every HTTP/HTTPS service routes through Traefik on Mark5. +- **Tailscale-native.** Traefik binds to `0.0.0.0:80` and `0.0.0.0:443`. No `tailscale serve`. +- **Service discovery via Docker labels.** Each compose service exposes labels that Traefik reads from the Docker socket on Mark5. +- **Docker socket access restricted.** Traefik mounts a read-only Docker socket. No other service gets socket access. + +## Internal Traffic Patterns +| Source | Destination | Protocol | Port | Notes | +|--------|-------------|----------|------|-------| +| Traefik (Mark5) | Any service | HTTP/HTTPS | Varies | Proxied via Tailscale IP | +| Beszel (Mark44) | Any node | HTTP | Varies | Agent polls HTTP metrics endpoints (read-only) | +| Prometheus (Mark44) | Any node | HTTP | 9100 (node-exporter) | Scrapes node and container metrics | +| Prowlarr (Mark44) | Indexer sites | HTTPS | 443 | Outbound only | +| Sonarr/Radarr (Mark44) | Prowlarr | HTTP | 9696 | Internal indexer lookup | +| Nextcloud (Neo) | PostgreSQL (Bones) | TCP | 5432 | DB traffic over Tailscale | + +## DNS Resolution +- **Technitium (Bones)** is the authoritative internal DNS for `*.ai.home`. +- **AdGuard Home (Bones)** handles recursive resolution with ad-block lists. Replaces Pi-hole. +- **Chain:** Client → Technitium (local record?) → AdGuard Home (recursive + blocklist) → Upstream (Cloudflare/Quad9) +- **Tailscale MagicDNS** remains enabled as fallback. If Technitium fails, clients fall back to `100.x.x.x` direct resolution. +- **AdGuard Home admin UI** runs on port 3000 by default (separate from Grafana if co-located). + +## Port Allocation (Reserved) +| Port | Service | +|------|---------| +| 53 | DNS (Technitium / Pi-hole) | +| 80/443 | HTTP/S (Traefik) | +| 3000 | Grafana | +| 9090 | Prometheus | +| 9000 | Portainer | +| 8096 | Jellyfin | +| 8989 | Sonarr | +| 7878 | Radarr | +| 9696 | Prowlarr | +| 8080 | Authelia (default) | + +## TLS Strategy +- **Internal:** Traefik generates self-signed certs for `*.labs.internal`. Authelia can enforce client-cert if needed. +- **External:** Not applicable per no-Tailscale-funnel constraint. If Bobby later wants public access, Let's Encrypt via DNS challenge (Technitium controls the zone). diff --git a/plans/06-data-and-persistence.md b/plans/06-data-and-persistence.md new file mode 100644 index 0000000..ebd7376 --- /dev/null +++ b/plans/06-data-and-persistence.md @@ -0,0 +1,46 @@ +# Iron Legion Homelab Services Stack — Data & Persistence + +## Volume Strategy +Every service with persistent state uses **bind mounts to on-node directories**. No named volumes, no NFS, no distributed storage. + +## Directory Convention +``` +/opt/iron-legion/ +├── service-name/ +│ ├── data/ # Application data (databases, config, state) +│ ├── config/ # Static config files mounted read-only where possible +│ └── logs/ # Log output (optional, if not sent to stdout) +``` + +## Per-Service Persistence +| Service | Data Path | Backup Target | Size Estimate | +|---------|-----------|---------------|---------------| +| **Traefik** | `/opt/iron-legion/traefik/config/` `/opt/iron-legion/traefik/certs/` | Bones (daily rsync) | < 50 MB | +| **Technitium DNS** | `/opt/iron-legion/technitium/config/` | Bones | < 10 MB | +| **Pi-hole** | `/opt/iron-legion/pihole/etc-pihole/` `/opt/iron-legion/pihole/etc-dnsmasq.d/` | Bones | < 500 MB | +| **Prometheus** | `/opt/iron-legion/prometheus/data/` | Bones (retention: 15d local, 90d backup) | 5–20 GB | +| **Grafana** | `/opt/iron-legion/grafana/data/` | Bones | < 500 MB | +| **Beszel** | `/opt/iron-legion/beszel/data/` | Bones | < 1 GB | +| **Portainer** | `/opt/iron-legion/portainer/data/` | Bones | < 100 MB | +| **Homepage** | `/opt/iron-legion/homepage/config/` | Bones | < 10 MB | +| **Vaultwarden** | `/opt/iron-legion/vaultwarden/data/` | Bones (encrypted) | < 500 MB | +| **Authelia** | `/opt/iron-legion/authelia/config/` | Bones | < 10 MB | +| **Jellyfin** | `/opt/iron-legion/jellyfin/config/` `/opt/iron-legion/jellyfin/media/` | **None** (media too large) | < 1 GB config; media drive separate | +| **Sonarr** | `/opt/iron-legion/sonarr/config/` | Bones | < 1 GB | +| **Radarr** | `/opt/iron-legion/radarr/config/` | Bones | < 1 GB | +| **Prowlarr** | `/opt/iron-legion/prowlarr/config/` | Bones | < 100 MB | +| **Nextcloud** | `/opt/iron-legion/nextcloud/data/` | Bones (snapshots) | 10–50 GB | + +## Media Storage Exception +- **Jellyfin media** lives on a separate mount (likely external USB/NVMe on Mark44). Not backed up via rsync. +- **Sonarr/Radarr** download staging to a shared `/downloads` bind mount, then hardlink/copy to Jellyfin media library. + +## Backup Tooling +- **Primary:** `rsync -a --delete` to Bones secondary storage daily at 03:00 local. +- **Vaultwarden:** `rsqlite3` dump + `rsync` (encrypted at rest on Bones). +- **Prometheus:** `snapshot API` → rsync (not raw WAL files). + +## Secret Management +- `.env` files live in `/opt/iron-legion/service-name/.env`, mode `0600`. +- Compose files use `${VAR_NAME}` syntax, never literal strings. +- Vaultwarden stores shared secrets (DB passwords, API keys). Artemis holds no secrets in memory. diff --git a/plans/07-security-model.md b/plans/07-security-model.md new file mode 100644 index 0000000..b01f0c2 --- /dev/null +++ b/plans/07-security-model.md @@ -0,0 +1,48 @@ +# Iron Legion Homelab Services Stack — Security Model + +## Authentication Layers +| Layer | Service | Scope | Notes | +|-------|---------|-------|-------| +| **Edge Auth** | Authelia | Traefik-secured endpoints | MFA portal, session cookies | +| **App Auth** | Vaultwarden | Password vault | Master password + 2FA | +| **App Auth** | Portainer | Container mgmt | Built-in RBAC, can integrate LDAP | +| **App Auth** | Nextcloud | File collaboration | Built-in, can integrate Authelia OIDC | +| **OS Auth** | SSH keys | Node access | Tailscale SSH + local keypairs | + +## Authelia Deployment Notes +- **Target node:** Mark5 (lightweight, sits beside Traefik) +- **Redirection URL:** Set Authelia `redirection_url` to the base domain of services needing auth. +- **Backend storage:** Uses SQLite initially. If Bobby wants HA, migrate to PostgreSQL on Bones. +- **Notification method:** File-based (writes to `/opt/iron-legion/authelia/notifications/`) until SMTP/Discord is configured. +- **Rule granularity:** Per-service `access_control` rules in `configuration.yml`. Default: `one_factor` for internal services, `two_factor` for management interfaces (Portainer, Grafana admin). + +## Traefik ↔ Authelia Integration +```yaml +# Traefik middleware label (example) +traefik.http.routers.portainer.middlewares: authelia@docker +traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/verify?rd=https://auth.labs.internal +``` +- **No nginx.** ForwardAuth middleware talks directly to Authelia over internal Docker network. +- **Bypass list:** Prometheus scrape targets, Beszel agents, Technitium DNS queries — these are internal metrics/DNS, no auth required. + +## Secret Handling +| Secret Type | Storage Method | Rotation Trigger | +|-------------|----------------|------------------| +| Authelia session secret | `.env` file, 64-byte random hex | On any Authelia config reload | +| Vaultwarden admin token | `.env` file, 48-byte random | Only on compromise | +| DB passwords (Nextcloud ↔ PostgreSQL) | `.env` files on both nodes | On any DB migration or rebuild | +| Tailscale auth keys | Vaultwarden secure note | On key expiry or node rebuild | +| API keys (indexers, Cloudflare) | Vaultwarden secure note | On key rotation by provider | + +## Network Segmentation +- **No VLANs.** Tailscale ACLs handle segment isolation. +- **ACL policy (draft):** + - `tag:admin` nodes (Bobby, Artemis) → all ports on all nodes + - `tag:services` (Neo, Bones, Mark44, Mark5) → only their assigned service ports, no cross-node SSH except via Tailscale SSH + - `tag:user` (Bobby's phone, laptop) → HTTPS 443 on Mark5 only, Jellyfin 8096 on Mark44 directly +- **Default deny.** Any traffic not explicitly allowed in Tailscale ACL is dropped. + +## Monitoring for Security Events +- **Dozzle** provides real-time log viewing but is NOT a SIEM. +- **Promtail/Loki** not yet in catalog. If Bobby wants log aggregation + alerting, add to Phase 3. +- **Beszel** alerts on anomalous CPU/memory — use as coarse intrusion detection proxy. diff --git a/plans/08-deployment-phases.md b/plans/08-deployment-phases.md new file mode 100644 index 0000000..ac9a87a --- /dev/null +++ b/plans/08-deployment-phases.md @@ -0,0 +1,52 @@ +# Iron Legion Homelab Services Stack — Deployment Phases + +## Phase 1: Infrastructure (Critical Path) +**Goal:** Get DNS, proxy, and basic monitoring alive. Everything else depends on this. + +| Order | Service | Target Node | Why First | Dependencies | +|-------|---------|-------------|-----------|--------------| +| 1 | **Technitium DNS** | Bones | Name resolution for internal services | None | +| 2 | **Pi-hole** | Bones | Recursive DNS + ad-block | Technitium (via conditional forwarding) | +| 3 | **Traefik** | Mark5 | Edge router for all HTTP ingress | DNS (needs `*.labs.internal` to resolve) | +| 4 | **Authelia** | Mark5 | Auth layer before exposing any mgmt UI | Traefik (depends on ForwardAuth middleware) | +| 5 | **Portainer** | Neo | Container management UI | Traefik + Authelia (for secured access) | +| 6 | **Prometheus** | Mark44 | Metrics collection baseline | None (scrape targets added in Phase 2) | +| 7 | **Beszel** | Mark44 | Fleet resource overview | None (agents installed per-node) | +| 8 | **Dozzle** | Mark44 | Real-time log viewing | None | + +**Phase 1 milestone:** All nodes report healthy in Beszel. Portainer accessible via auth portal. DNS resolves. + +--- + +## Phase 2: Media & File Collaboration +**Goal:** Self-hosted media acquisition and file sync. + +| Order | Service | Target Node | Why Now | Dependencies | +|-------|---------|-------------|---------|--------------| +| 9 | **Jellyfin** | Mark44 | Media playback (GPU transcode if Mark44 has dGPU) | None (file ingest later) | +| 10 | **Sonarr** | Mark44 | TV management | Jellyfin (pushes organized files) | +| 11 | **Radarr** | Mark44 | Movie management | Jellyfin (pushes organized files) | +| 12 | **Prowlarr** | Mark44 | Indexer aggregation | Sonarr + Radarr (feeds them) | +| 13 | **Nextcloud** | Neo | File sync/collaboration | PostgreSQL (on Bones) | +| 14 | **Vaultwarden** | Neo | Password management | None (standalone) | + +**Phase 2 milestone:** Media acquisition pipeline works end-to-end. Nextcloud syncs. Vaultwarden stores secrets. + +--- + +## Phase 3: Polish & Expansion +**Goal:** Dashboards, advanced monitoring, nice-to-haves. + +| Order | Service | Target Node | Why Deferred | Dependencies | +|-------|---------|-------------|--------------|--------------| +| 15 | **Grafana** | Mark44 | Dashboards need metrics to be interesting | Prometheus (needs data history) | +| 16 | **Homepage** | Mark5 | Custom dashboard for everything | All Phase 1+2 services (needs endpoints) | +| – | **Promtail + Loki** | TBD | Centralized logging | Only if Dozzle is insufficient | +| – | **Uptime-Kuma** | TBD | External uptime monitoring | Only if Beszel alerting is insufficient | + +**Phase 3 milestone:** Single-pane dashboard (Homepage) shows all services. Alerts route to Discord or email. + +## Deployment Cadence +- **One service per session.** No mass deployments. Validate each before proceeding. +- **Rollback plan:** `docker compose down` + `mv /opt/iron-legion/service{,-failed-$(date +%s)}`. Snapshot taken before each compose up. +- **Bobby approval required before Phase 2 begins.** Phase 1 success must be demonstrated. diff --git a/plans/09-open-questions.md b/plans/09-open-questions.md new file mode 100644 index 0000000..1cd0d68 --- /dev/null +++ b/plans/09-open-questions.md @@ -0,0 +1,20 @@ +# Iron Legion Homelab Services Stack — Open Questions & Blockers + +## Blocker Status +| # | Question | Impact | Default if Unresolved | +|---|----------|--------|----------------------| +| 1 | **Domain name** — Does Bobby own a domain (e.g., `bobbysh.me`) or do we use a fake TLD (`labs.internal`)? | **Critical** — TLS certs, Authelia, and DNS all depend on this. | Use `labs.internal` + self-signed CA | +| 2 | **Technitium upstream** — DoH, DoT, or plain UDP to upstream resolver (e.g., Cloudflare 1.1.1.1)? | Low — can default to DoH | DoH → `https://cloudflare-dns.com/dns-query` | +| 3 | **Pi-hole vs Technitium conflict** — Both run on Bones port 53. Run Pi-hole on non-standard port with Technitium as conditional forwarder? Or separate nodes? | **Critical** — port 53 collision | Technitium on 53, Pi-hole on 5053, forward to Pi-hole from Technitium | +| 4 | **Jellyfin media storage** — External USB on Mark44? SMB share? NVMe? | Medium | External USB mounted at `/media` on Mark44 | +| 5 | **Backup target on Bones** — Capacity? Dedicated drive? Rsync target path? | Medium | `/backups//` on Bones secondary storage | +| 6 | **Nextcloud database** — Use existing PostgreSQL on Bones, or deploy Nextcloud AIO (bundled)? | Medium — affects resource allocation on Bones | Deploy standalone PostgreSQL container on Bones for Nextcloud AIO is too heavy | +| 7 | **GPU on Mark44** — NVIDIA driver runtime for Jellyfin transcode? | Low — falls back to CPU transcode | Use `jellyfin/jellyfin` with `NVIDIA_VISIBLE_DEVICES` env if available | +| 8 | **Notification routing** — Discord webhook? SMTP? File only? | Low — default file works | File notifications in `/opt/iron-legion/authelia/notifications/` | +| 9 | **Tailscale ACL policy** — Draft exists in Section 7. Bobby must review and apply in Tailscale admin console. | Low | Stay permissive until Bobby approves | +| 10 | **Beszel alert thresholds** — CPU %, memory %, disk % triggers not defined. | Low | Defaults in Beszel container | + +## Outstanding Decisions Required +1. **Pi-hole inclusion** — Not in Bobby's original list. I added it as a DNS-layer complement to Technitium. **Remove if Bobby doesn't want it.** +2. **Authelia two-factor method** — TOTP via app (Google Authenticator) vs WebAuthn/FIDO2 keys? +3. **Home vs remote access** — If Bobby wants to share Jellyfin with friends/family outside Tailscale, public domain + Authelia guard is required. diff --git a/plans/10-appendix.md b/plans/10-appendix.md new file mode 100644 index 0000000..a83a3a8 --- /dev/null +++ b/plans/10-appendix.md @@ -0,0 +1,52 @@ +# Appendix A — Raw DockerHub Metadata Table + +**Full API response data captured 2026-05-25T16:45:00Z.** + +| Service | Full Image | Namespace | Pulls | Stars | Last Updated | API Status | +|---------|-----------|-----------|-------|-------|--------------|------------| +| Traefik | `traefik` | `library` | 3,490,588,071 | 3,634 | 2026-05-13 | ✅ 200 | +| Technitium DNS | `technitium/dns-server` | `technitium` | 8,989,831 | 156 | 2026-05-09 | ✅ 200 | +| Homepage | `gethomepage/homepage` | `gethomepage` | 1,305,710 | 40 | 2026-05-25 | ✅ 200 | +| Beszel | `henrygd/beszel` | `henrygd` | 12,578,135 | 32 | 2026-04-30 | ✅ 200 | +| Dozzle | `amir20/dozzle` | `amir20` | 309,561,399 | 144 | 2026-05-25 | ✅ 200 | +| Grafana | `grafana/grafana` | `grafana` | 5,220,434,031 | 3,540 | 2026-05-16 | ✅ 200 | +| Prometheus | `prom/prometheus` | `prom` | 1,966,043,381 | 2,064 | 2026-05-25 | ✅ 200 | +| Portainer CE | `portainer/portainer-ce` | `portainer` | 1,464,874,500 | 2,665 | 2026-05-20 | ✅ 200 | +| Jellyfin | `jellyfin/jellyfin` | `jellyfin` | 370,358,966 | 1,535 | 2026-05-25 | ✅ 200 | +| Sonarr | `linuxserver/sonarr` | `linuxserver` | 2,339,638,307 | 2,118 | 2026-05-23 | ✅ 200 | +| Radarr | `linuxserver/radarr` | `linuxserver` | 2,359,097,569 | 1,791 | 2026-05-25 | ✅ 200 | +| Prowlarr | `linuxserver/prowlarr` | `linuxserver` | 35,913,487 | 403 | 2026-05-25 | ✅ 200 | +| Vaultwarden | `vaultwarden/server` | `vaultwarden` | 287,182,978 | 1,454 | 2026-05-17 | ✅ 200 | +| Nextcloud | `nextcloud` | `library` | 1,011,978,204 | 4,485 | 2026-05-23 | ✅ 200 | +| Pi-hole | `pihole/pihole` | `pihole` | 961,220,209 | 2,943 | 2026-05-25 | ✅ 200 | +| Authelia | `authelia/authelia` | `authelia` | 75,183,682 | 208 | 2026-05-25 | ✅ 200 | + +**Total unique images:** 16 (including Pi-hole) +**Community health indicator:** All images have > 10 stars, > 1M pulls (except Beszel 32 stars, Homepage 40 stars — acceptable for young projects) +**Freshness:** All updated within 90 days except Beszel (30 days — still acceptable) + +## Appendix B — Compose Skeleton Directory Map +``` +~/.ansible-repo/new-build/ +├── phase-1/ # Infrastructure +│ ├── technitium/ +│ ├── pihole/ +│ ├── traefik/ +│ ├── authelia/ +│ ├── portainer/ +│ ├── prometheus/ +│ ├── beszel/ +│ └── dozzle/ +├── phase-2/ # Media + Files +│ ├── jellyfin/ +│ ├── sonarr/ +│ ├── radarr/ +│ ├── prowlarr/ +│ ├── nextcloud/ +│ └── vaultwarden/ +└── phase-3/ # Dashboards + Polish + ├── grafana/ + ├── homepage/ + └── loki/ # Optional +``` +Skeleton not yet created. Deferred until Bobby approves PRD. diff --git a/plans/homelab-services-stack-prd.md b/plans/homelab-services-stack-prd.md new file mode 100644 index 0000000..1d10422 --- /dev/null +++ b/plans/homelab-services-stack-prd.md @@ -0,0 +1,426 @@ +# Iron Legion Homelab Services Stack — Purpose & Scope + +## Document ID +- **PRD:** homelab-services-stack-prd.md +- **Date:** 2026-05-25 +- **Owner:** Artemis (AI Foreman, Iron Legion Labs) +- **Authority:** Commander Bobby + +## Purpose +Central canonical reference for all Docker/Compose-based services Iron Legion Labs intends to deploy across the fleet. This document exists to: +1. Prevent duplicate research — every service's Docker image, metadata, and deployment pattern is captured once. +2. Guide node placement — which service runs where, and why. +3. Serve as the source of truth for Ansible-pull manifests, compose files, and future automation. + +## Scope +### In Scope +- Service catalog with DockerHub-verified images (name, namespace, description, pull count, stars, last update) +- Category assignment (Network, Monitoring, Media, Security, Management, Infrastructure) +- Recommended target node per service +- Deployment phase priority +- High-level network, data, and security architecture + +### Out of Scope +- Detailed compose-file YAML (deferred to per-service deployment PRDs) +- Specific Traefik middleware configurations (deferred to network PRD) +- GPU passthrough configs for media transcode (deferred to Mark44 workload PRD) +- Service-specific SSO/authelia rule authoring (deferred to security PRD) + +## Living Document +This PRD is append-only for new services. Modifications to existing entries require Bobby sign-off. Additions follow the raw-metadata-to-summary pattern established in Section 4. + +--- + +# Iron Legion Homelab Services Stack — Success Criteria + +## Done When +1. ✅ Every service in the catalog has a verified DockerHub image with a non-stale last-update date (≤ 90 days old at time of cataloging) +2. ✅ Every service has an assigned target node that respects the **Node Assignments Locked** policy +3. ✅ Every service has a deployment phase (1, 2, or 3) agreed by Bobby +4. ✅ Network ingress/egress flow is documented at the service level (who talks to whom, via what port/protocol) +5. ✅ A single `docker-compose.yml` skeleton exists per phase, ready for population +6. ✅ Bobby has read and approved this PRD; any objections are captured as blockers below + +## Verification Methods +- DockerHub API freshness check: `last_updated` field within 90 days +- Node lock compliance: cross-reference against `fleet-ops.md` node assignments +- Compose skeleton existence: `ls ~/.ansible-repo/new-build/phase-*.yml` + +## Failure Modes +| Failure | Mitigation | +|---------|------------| +| DockerHub image stale or abandoned | Flag for alternative image research | +| Node assignment conflicts with locked policy | Escalate to Bobby immediately | +| Service dependency on another Phase 2+ service | Note in Open Questions, defer deployment | + +## Known Blockers +- **Authelia** requires a domain + valid TLS cert. If Bobby does not want to expose to public internet, Traefik + internal Tailscale cert or self-signed CA required. +- **Technitium DNS** upstream forwarding policy not yet specified (DoH, DoT, plain UDP?). + +--- + +# Iron Legion Homelab Services Stack — Constraints + +## Hard Constraints (Non-Negotiable) +1. **Bare metal over abstraction.** Direct deployments preferred. Compose files are acceptable as orchestration glue, but no Docker Swarm mode, no Kubernetes, no abstraction layers Bobby cannot `ssh` into and debug. +2. **No nginx.** Traefik is the sole edge router. No nginx reverse proxies, no nginx sidecars. +3. **No Tailscale serve/funnel.** Services bind to `0.0.0.0` on their assigned node and are reachable via Tailscale mesh IP + port. No `tailscale serve`, no `tailscale funnel`. +4. **Node assignments locked.** Services do not migrate between nodes without Bobby's explicit written direction. +5. **Patch upstream source** when loopback/bind restrictions block direct deployment. Do not re-architect around the constraint. + +## Node Assignment Policy (as of 2026-05-25) +| Node | Role | Services Assigned | +|------|------|-------------------| +| **Neo** | Services node | Nextcloud AIO, Vaultwarden, Portainer (UI/mgmt) | +| **Bones** | Infrastructure node | Paperclip + Ollama + PostgreSQL, Technitium DNS (infra DNS) | +| **Mark44 (Hulkbuster)** | Heavy-lifting / GPU | Monitoring stack (Prometheus, Grafana, Beszel), media apps with transcode (Jellyfin) | +| **Mark5 (Suitcase)** | Research / light-task | Traefik (edge router — lightweight, always-on), Homepage (lightweight dashboard) | +| **Artemis** | AI Foreman / JARVIS | Hermes Agent, Ansible-pull control plane | + +## Soft Constraints (Bobby Approval Required to Override) +- **Data residency:** All persistent volumes live on-node. No NFS, no Ceph, no distributed storage unless explicitly approved. +- **Secret management:** No plain-text secrets in compose files. Use `.env` files with `file:` mode 0600, or Vaultwarden if a secret store is needed. +- **Backup cadence:** Every service with persistent state must have a documented backup target. Default: daily rsync to Bones secondary storage. + +## Environment Assumptions +- All nodes run Debian Trixie or compatible. +- Docker Engine (not Docker Desktop) is installed on all target nodes. +- Tailscale is up and meshed. All inter-node traffic is over Tailscale IPs. +- `docker compose` plugin (v2) available, not legacy `docker-compose` standalone. + +--- + +# Iron Legion Homelab Services Stack — Service Catalog + +## Verified DockerHub Metadata (as of 2026-05-25) + +### Network Layer +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Traefik** | `traefik` | `library` | Cloud Native Edge Router | 3.49B | 3,634 | 2026-05-13 | Mark5 | +| **Technitium DNS** | `technitium/dns-server` | `technitium` | Self-hosted DNS server with DoH/DoT | 8.99M | 156 | 2026-05-09 | Bones | +| **Pi-hole** | `pihole/pihole` | `pihole` | Network-wide ad blocking | 961.2M | 2,943 | 2026-05-25 | Bones | + +### Monitoring / Observability +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Prometheus** | `prom/prometheus` | `prom` | Systems monitoring & alerting toolkit | 1.97B | 2,064 | 2026-05-25 | Mark44 | +| **Grafana** | `grafana/grafana` | `grafana` | Analytics & monitoring dashboards | 5.22B | 3,540 | 2026-05-16 | Mark44 | +| **Beszel** | `henrygd/beszel` | `henrygd` | Lightweight server monitoring hub with Docker stats | 12.58M | 32 | 2026-04-30 | Mark44 | +| **Dozzle** | `amir20/dozzle` | `amir20` | Real-time Docker container log viewer | 309.6M | 144 | 2026-05-25 | Mark44 | + +### Management / Dashboard +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Portainer CE** | `portainer/portainer-ce` | `portainer` | Lightweight container management UI | 1.46B | 2,665 | 2026-05-20 | Neo | +| **Homepage** | `gethomepage/homepage` | `gethomepage` | Customizable homepage with integrations | 1.31M | 40 | 2026-05-25 | Mark5 | + +### Security / Identity +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Vaultwarden** | `vaultwarden/server` | `vaultwarden` | Bitwarden-compatible password manager (Rust) | 287.2M | 1,454 | 2026-05-17 | Neo | +| **Authelia** | `authelia/authelia` | `authelia` | Multi-factor authentication portal | 75.2M | 208 | 2026-05-25 | Mark5 | + +### Media Stack (*arr + Jellyfin) +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Jellyfin** | `jellyfin/jellyfin` | `jellyfin` | Free software media browser | 370.4M | 1,535 | 2026-05-25 | Mark44 | +| **Sonarr** | `linuxserver/sonarr` | `linuxserver` | TV series management | 2.34B | 2,118 | 2026-05-23 | Mark44 | +| **Radarr** | `linuxserver/radarr` | `linuxserver` | Movie management | 2.36B | 1,791 | 2026-05-25 | Mark44 | +| **Prowlarr** | `linuxserver/prowlarr` | `linuxserver` | Indexer management | 35.9M | 403 | 2026-05-25 | Mark44 | + +### File / Collaboration +| Service | Image | Namespace | Description | Pulls | Stars | Updated | Target Node | +|---------|-------|-----------|-------------|-------|-------|---------|-------------| +| **Nextcloud** | `nextcloud` | `library` | Self-hosted file sync & collaboration | 1.01B | 4,485 | 2026-05-23 | Neo | + +## Total Services: 15 +## Total DockerHub Pulls (aggregate): ~16.0B +## All images last updated within 90 days except Beszel (2026-04-30 — still within 30 days) + +## Notes +- **Beszel** lowest star count (32) but actively maintained and purpose-built for small-fleet monitoring. +- **Homepage** lowest pull count (1.31M) — young project, high utility, monitor for longevity. +- **Pi-hole** not in Bobby's original mention but added as network-layer complement to Technitium. Requires Bobby approval to include. + +--- + +# Iron Legion Homelab Services Stack — Network Architecture + +## Ingress Flow +``` +[Internet] → [Tailscale mesh] → [Mark5: Traefik] → [Target Node: Service Port] +``` + +## Traefik Role +- **Single entrypoint.** Every HTTP/HTTPS service routes through Traefik on Mark5. +- **Tailscale-native.** Traefik binds to `0.0.0.0:80` and `0.0.0.0:443`. No `tailscale serve`. +- **Service discovery via Docker labels.** Each compose service exposes labels that Traefik reads from the Docker socket on Mark5. +- **Docker socket access restricted.** Traefik mounts a read-only Docker socket. No other service gets socket access. + +## Internal Traffic Patterns +| Source | Destination | Protocol | Port | Notes | +|--------|-------------|----------|------|-------| +| Traefik (Mark5) | Any service | HTTP/HTTPS | Varies | Proxied via Tailscale IP | +| Beszel (Mark44) | Any node | HTTP | Varies | Agent polls HTTP metrics endpoints (read-only) | +| Prometheus (Mark44) | Any node | HTTP | 9100 (node-exporter) | Scrapes node and container metrics | +| Prowlarr (Mark44) | Indexer sites | HTTPS | 443 | Outbound only | +| Sonarr/Radarr (Mark44) | Prowlarr | HTTP | 9696 | Internal indexer lookup | +| Nextcloud (Neo) | PostgreSQL (Bones) | TCP | 5432 | DB traffic over Tailscale | + +## DNS Resolution +- **Technitium (Bones)** is the authoritative internal DNS for `*.labs.internal`. +- **Pi-hole (Bones)** handles recursive resolution with ad-block lists. +- **Chain:** Client → Technitium (local record?) → Pi-hole (recursive + blocklist) → Upstream (Cloudflare/Quad9) +- **Tailscale MagicDNS** remains enabled as fallback. If Technitium fails, clients fall back to `100.x.x.x` direct resolution. + +## Port Allocation (Reserved) +| Port | Service | +|------|---------| +| 53 | DNS (Technitium / Pi-hole) | +| 80/443 | HTTP/S (Traefik) | +| 3000 | Grafana | +| 9090 | Prometheus | +| 9000 | Portainer | +| 8096 | Jellyfin | +| 8989 | Sonarr | +| 7878 | Radarr | +| 9696 | Prowlarr | +| 8080 | Authelia (default) | + +## TLS Strategy +- **Internal:** Traefik generates self-signed certs for `*.labs.internal`. Authelia can enforce client-cert if needed. +- **External:** Not applicable per no-Tailscale-funnel constraint. If Bobby later wants public access, Let's Encrypt via DNS challenge (Technitium controls the zone). + +--- + +# Iron Legion Homelab Services Stack — Data & Persistence + +## Volume Strategy +Every service with persistent state uses **bind mounts to on-node directories**. No named volumes, no NFS, no distributed storage. + +## Directory Convention +``` +/opt/iron-legion/ +├── service-name/ +│ ├── data/ # Application data (databases, config, state) +│ ├── config/ # Static config files mounted read-only where possible +│ └── logs/ # Log output (optional, if not sent to stdout) +``` + +## Per-Service Persistence +| Service | Data Path | Backup Target | Size Estimate | +|---------|-----------|---------------|---------------| +| **Traefik** | `/opt/iron-legion/traefik/config/` `/opt/iron-legion/traefik/certs/` | Bones (daily rsync) | < 50 MB | +| **Technitium DNS** | `/opt/iron-legion/technitium/config/` | Bones | < 10 MB | +| **Pi-hole** | `/opt/iron-legion/pihole/etc-pihole/` `/opt/iron-legion/pihole/etc-dnsmasq.d/` | Bones | < 500 MB | +| **Prometheus** | `/opt/iron-legion/prometheus/data/` | Bones (retention: 15d local, 90d backup) | 5–20 GB | +| **Grafana** | `/opt/iron-legion/grafana/data/` | Bones | < 500 MB | +| **Beszel** | `/opt/iron-legion/beszel/data/` | Bones | < 1 GB | +| **Portainer** | `/opt/iron-legion/portainer/data/` | Bones | < 100 MB | +| **Homepage** | `/opt/iron-legion/homepage/config/` | Bones | < 10 MB | +| **Vaultwarden** | `/opt/iron-legion/vaultwarden/data/` | Bones (encrypted) | < 500 MB | +| **Authelia** | `/opt/iron-legion/authelia/config/` | Bones | < 10 MB | +| **Jellyfin** | `/opt/iron-legion/jellyfin/config/` `/opt/iron-legion/jellyfin/media/` | **None** (media too large) | < 1 GB config; media drive separate | +| **Sonarr** | `/opt/iron-legion/sonarr/config/` | Bones | < 1 GB | +| **Radarr** | `/opt/iron-legion/radarr/config/` | Bones | < 1 GB | +| **Prowlarr** | `/opt/iron-legion/prowlarr/config/` | Bones | < 100 MB | +| **Nextcloud** | `/opt/iron-legion/nextcloud/data/` | Bones (snapshots) | 10–50 GB | + +## Media Storage Exception +- **Jellyfin media** lives on a separate mount (likely external USB/NVMe on Mark44). Not backed up via rsync. +- **Sonarr/Radarr** download staging to a shared `/downloads` bind mount, then hardlink/copy to Jellyfin media library. + +## Backup Tooling +- **Primary:** `rsync -a --delete` to Bones secondary storage daily at 03:00 local. +- **Vaultwarden:** `rsqlite3` dump + `rsync` (encrypted at rest on Bones). +- **Prometheus:** `snapshot API` → rsync (not raw WAL files). + +## Secret Management +- `.env` files live in `/opt/iron-legion/service-name/.env`, mode `0600`. +- Compose files use `${VAR_NAME}` syntax, never literal strings. +- Vaultwarden stores shared secrets (DB passwords, API keys). Artemis holds no secrets in memory. + +--- + +# Iron Legion Homelab Services Stack — Security Model + +## Authentication Layers +| Layer | Service | Scope | Notes | +|-------|---------|-------|-------| +| **Edge Auth** | Authelia | Traefik-secured endpoints | MFA portal, session cookies | +| **App Auth** | Vaultwarden | Password vault | Master password + 2FA | +| **App Auth** | Portainer | Container mgmt | Built-in RBAC, can integrate LDAP | +| **App Auth** | Nextcloud | File collaboration | Built-in, can integrate Authelia OIDC | +| **OS Auth** | SSH keys | Node access | Tailscale SSH + local keypairs | + +## Authelia Deployment Notes +- **Target node:** Mark5 (lightweight, sits beside Traefik) +- **Redirection URL:** Set Authelia `redirection_url` to the base domain of services needing auth. +- **Backend storage:** Uses SQLite initially. If Bobby wants HA, migrate to PostgreSQL on Bones. +- **Notification method:** File-based (writes to `/opt/iron-legion/authelia/notifications/`) until SMTP/Discord is configured. +- **Rule granularity:** Per-service `access_control` rules in `configuration.yml`. Default: `one_factor` for internal services, `two_factor` for management interfaces (Portainer, Grafana admin). + +## Traefik ↔ Authelia Integration +```yaml +# Traefik middleware label (example) +traefik.http.routers.portainer.middlewares: authelia@docker +traefik.http.middlewares.authelia.forwardauth.address: http://authelia:9091/api/verify?rd=https://auth.labs.internal +``` +- **No nginx.** ForwardAuth middleware talks directly to Authelia over internal Docker network. +- **Bypass list:** Prometheus scrape targets, Beszel agents, Technitium DNS queries — these are internal metrics/DNS, no auth required. + +## Secret Handling +| Secret Type | Storage Method | Rotation Trigger | +|-------------|----------------|------------------| +| Authelia session secret | `.env` file, 64-byte random hex | On any Authelia config reload | +| Vaultwarden admin token | `.env` file, 48-byte random | Only on compromise | +| DB passwords (Nextcloud ↔ PostgreSQL) | `.env` files on both nodes | On any DB migration or rebuild | +| Tailscale auth keys | Vaultwarden secure note | On key expiry or node rebuild | +| API keys (indexers, Cloudflare) | Vaultwarden secure note | On key rotation by provider | + +## Network Segmentation +- **No VLANs.** Tailscale ACLs handle segment isolation. +- **ACL policy (draft):** + - `tag:admin` nodes (Bobby, Artemis) → all ports on all nodes + - `tag:services` (Neo, Bones, Mark44, Mark5) → only their assigned service ports, no cross-node SSH except via Tailscale SSH + - `tag:user` (Bobby's phone, laptop) → HTTPS 443 on Mark5 only, Jellyfin 8096 on Mark44 directly +- **Default deny.** Any traffic not explicitly allowed in Tailscale ACL is dropped. + +## Monitoring for Security Events +- **Dozzle** provides real-time log viewing but is NOT a SIEM. +- **Promtail/Loki** not yet in catalog. If Bobby wants log aggregation + alerting, add to Phase 3. +- **Beszel** alerts on anomalous CPU/memory — use as coarse intrusion detection proxy. + +--- + +# Iron Legion Homelab Services Stack — Deployment Phases + +## Phase 1: Infrastructure (Critical Path) +**Goal:** Get DNS, proxy, and basic monitoring alive. Everything else depends on this. + +| Order | Service | Target Node | Why First | Dependencies | +|-------|---------|-------------|-----------|--------------| +| 1 | **Technitium DNS** | Bones | Name resolution for internal services | None | +| 2 | **Pi-hole** | Bones | Recursive DNS + ad-block | Technitium (via conditional forwarding) | +| 3 | **Traefik** | Mark5 | Edge router for all HTTP ingress | DNS (needs `*.labs.internal` to resolve) | +| 4 | **Authelia** | Mark5 | Auth layer before exposing any mgmt UI | Traefik (depends on ForwardAuth middleware) | +| 5 | **Portainer** | Neo | Container management UI | Traefik + Authelia (for secured access) | +| 6 | **Prometheus** | Mark44 | Metrics collection baseline | None (scrape targets added in Phase 2) | +| 7 | **Beszel** | Mark44 | Fleet resource overview | None (agents installed per-node) | +| 8 | **Dozzle** | Mark44 | Real-time log viewing | None | + +**Phase 1 milestone:** All nodes report healthy in Beszel. Portainer accessible via auth portal. DNS resolves. + +--- + +## Phase 2: Media & File Collaboration +**Goal:** Self-hosted media acquisition and file sync. + +| Order | Service | Target Node | Why Now | Dependencies | +|-------|---------|-------------|---------|--------------| +| 9 | **Jellyfin** | Mark44 | Media playback (GPU transcode if Mark44 has dGPU) | None (file ingest later) | +| 10 | **Sonarr** | Mark44 | TV management | Jellyfin (pushes organized files) | +| 11 | **Radarr** | Mark44 | Movie management | Jellyfin (pushes organized files) | +| 12 | **Prowlarr** | Mark44 | Indexer aggregation | Sonarr + Radarr (feeds them) | +| 13 | **Nextcloud** | Neo | File sync/collaboration | PostgreSQL (on Bones) | +| 14 | **Vaultwarden** | Neo | Password management | None (standalone) | + +**Phase 2 milestone:** Media acquisition pipeline works end-to-end. Nextcloud syncs. Vaultwarden stores secrets. + +--- + +## Phase 3: Polish & Expansion +**Goal:** Dashboards, advanced monitoring, nice-to-haves. + +| Order | Service | Target Node | Why Deferred | Dependencies | +|-------|---------|-------------|--------------|--------------| +| 15 | **Grafana** | Mark44 | Dashboards need metrics to be interesting | Prometheus (needs data history) | +| 16 | **Homepage** | Mark5 | Custom dashboard for everything | All Phase 1+2 services (needs endpoints) | +| – | **Promtail + Loki** | TBD | Centralized logging | Only if Dozzle is insufficient | +| – | **Uptime-Kuma** | TBD | External uptime monitoring | Only if Beszel alerting is insufficient | + +**Phase 3 milestone:** Single-pane dashboard (Homepage) shows all services. Alerts route to Discord or email. + +## Deployment Cadence +- **One service per session.** No mass deployments. Validate each before proceeding. +- **Rollback plan:** `docker compose down` + `mv /opt/iron-legion/service{,-failed-$(date +%s)}`. Snapshot taken before each compose up. +- **Bobby approval required before Phase 2 begins.** Phase 1 success must be demonstrated. + +--- + +# Iron Legion Homelab Services Stack — Open Questions & Blockers + +## Blocker Status +| # | Question | Impact | Default if Unresolved | +|---|----------|--------|----------------------| +| 1 | **Domain name** — Does Bobby own a domain (e.g., `bobbysh.me`) or do we use a fake TLD (`labs.internal`)? | **Critical** — TLS certs, Authelia, and DNS all depend on this. | Use `labs.internal` + self-signed CA | +| 2 | **Technitium upstream** — DoH, DoT, or plain UDP to upstream resolver (e.g., Cloudflare 1.1.1.1)? | Low — can default to DoH | DoH → `https://cloudflare-dns.com/dns-query` | +| 3 | **Pi-hole vs Technitium conflict** — Both run on Bones port 53. Run Pi-hole on non-standard port with Technitium as conditional forwarder? Or separate nodes? | **Critical** — port 53 collision | Technitium on 53, Pi-hole on 5053, forward to Pi-hole from Technitium | +| 4 | **Jellyfin media storage** — External USB on Mark44? SMB share? NVMe? | Medium | External USB mounted at `/media` on Mark44 | +| 5 | **Backup target on Bones** — Capacity? Dedicated drive? Rsync target path? | Medium | `/backups//` on Bones secondary storage | +| 6 | **Nextcloud database** — Use existing PostgreSQL on Bones, or deploy Nextcloud AIO (bundled)? | Medium — affects resource allocation on Bones | Deploy standalone PostgreSQL container on Bones for Nextcloud AIO is too heavy | +| 7 | **GPU on Mark44** — NVIDIA driver runtime for Jellyfin transcode? | Low — falls back to CPU transcode | Use `jellyfin/jellyfin` with `NVIDIA_VISIBLE_DEVICES` env if available | +| 8 | **Notification routing** — Discord webhook? SMTP? File only? | Low — default file works | File notifications in `/opt/iron-legion/authelia/notifications/` | +| 9 | **Tailscale ACL policy** — Draft exists in Section 7. Bobby must review and apply in Tailscale admin console. | Low | Stay permissive until Bobby approves | +| 10 | **Beszel alert thresholds** — CPU %, memory %, disk % triggers not defined. | Low | Defaults in Beszel container | + +## Outstanding Decisions Required +1. **Pi-hole inclusion** — Not in Bobby's original list. I added it as a DNS-layer complement to Technitium. **Remove if Bobby doesn't want it.** +2. **Authelia two-factor method** — TOTP via app (Google Authenticator) vs WebAuthn/FIDO2 keys? +3. **Home vs remote access** — If Bobby wants to share Jellyfin with friends/family outside Tailscale, public domain + Authelia guard is required. + +--- + +# Appendix A — Raw DockerHub Metadata Table + +**Full API response data captured 2026-05-25T16:45:00Z.** + +| Service | Full Image | Namespace | Pulls | Stars | Last Updated | API Status | +|---------|-----------|-----------|-------|-------|--------------|------------| +| Traefik | `traefik` | `library` | 3,490,588,071 | 3,634 | 2026-05-13 | ✅ 200 | +| Technitium DNS | `technitium/dns-server` | `technitium` | 8,989,831 | 156 | 2026-05-09 | ✅ 200 | +| Homepage | `gethomepage/homepage` | `gethomepage` | 1,305,710 | 40 | 2026-05-25 | ✅ 200 | +| Beszel | `henrygd/beszel` | `henrygd` | 12,578,135 | 32 | 2026-04-30 | ✅ 200 | +| Dozzle | `amir20/dozzle` | `amir20` | 309,561,399 | 144 | 2026-05-25 | ✅ 200 | +| Grafana | `grafana/grafana` | `grafana` | 5,220,434,031 | 3,540 | 2026-05-16 | ✅ 200 | +| Prometheus | `prom/prometheus` | `prom` | 1,966,043,381 | 2,064 | 2026-05-25 | ✅ 200 | +| Portainer CE | `portainer/portainer-ce` | `portainer` | 1,464,874,500 | 2,665 | 2026-05-20 | ✅ 200 | +| Jellyfin | `jellyfin/jellyfin` | `jellyfin` | 370,358,966 | 1,535 | 2026-05-25 | ✅ 200 | +| Sonarr | `linuxserver/sonarr` | `linuxserver` | 2,339,638,307 | 2,118 | 2026-05-23 | ✅ 200 | +| Radarr | `linuxserver/radarr` | `linuxserver` | 2,359,097,569 | 1,791 | 2026-05-25 | ✅ 200 | +| Prowlarr | `linuxserver/prowlarr` | `linuxserver` | 35,913,487 | 403 | 2026-05-25 | ✅ 200 | +| Vaultwarden | `vaultwarden/server` | `vaultwarden` | 287,182,978 | 1,454 | 2026-05-17 | ✅ 200 | +| Nextcloud | `nextcloud` | `library` | 1,011,978,204 | 4,485 | 2026-05-23 | ✅ 200 | +| Pi-hole | `pihole/pihole` | `pihole` | 961,220,209 | 2,943 | 2026-05-25 | ✅ 200 | +| Authelia | `authelia/authelia` | `authelia` | 75,183,682 | 208 | 2026-05-25 | ✅ 200 | + +**Total unique images:** 16 (including Pi-hole) +**Community health indicator:** All images have > 10 stars, > 1M pulls (except Beszel 32 stars, Homepage 40 stars — acceptable for young projects) +**Freshness:** All updated within 90 days except Beszel (30 days — still acceptable) + +## Appendix B — Compose Skeleton Directory Map +``` +~/.ansible-repo/new-build/ +├── phase-1/ # Infrastructure +│ ├── technitium/ +│ ├── pihole/ +│ ├── traefik/ +│ ├── authelia/ +│ ├── portainer/ +│ ├── prometheus/ +│ ├── beszel/ +│ └── dozzle/ +├── phase-2/ # Media + Files +│ ├── jellyfin/ +│ ├── sonarr/ +│ ├── radarr/ +│ ├── prowlarr/ +│ ├── nextcloud/ +│ └── vaultwarden/ +└── phase-3/ # Dashboards + Polish + ├── grafana/ + ├── homepage/ + └── loki/ # Optional +``` +Skeleton not yet created. Deferred until Bobby approves PRD.