diff --git a/reports/mk7-service-restoration-report.md b/reports/mk7-service-restoration-report.md new file mode 100644 index 0000000..9ba679d --- /dev/null +++ b/reports/mk7-service-restoration-report.md @@ -0,0 +1,149 @@ +# MK7 Service Restoration Report + +**Date:** 2026-06-01 +**Author:** F.R.I.D.A.Y. +**Status:** All services restored online + +--- + +## Problem + +MK7 (Swarm Manager, 192.168.7.7) had all Docker Swarm stacks stopped after physical relocation. Only the `pegaprox` stack remained running from a previous manual deployment. Primary services (Traefik, Technitium, Portainer, n8n, Homepage, Beszel, Dozzle, Authelia, Prometheus, node-exporter) were all offline. + +--- + +## Root Causes + +1. **Primary cause:** MK7 was physically relocated; Docker Swarm services were intentionally stopped during the migration and never restarted. +2. **Secondary cause (Authelia failure):** When services were redeployed, Authelia crashed due to an NTP clock synchronization failure. `systemd-timesyncd` was pointing to the stale NTP server `192.168.128.33` (Shield PXE DHCP drift), causing certificate validity checks to fail. +3. **Network config drift:** `/etc/systemd/timesyncd.conf.d/` contained a cloud-init NTP config pointing to the wrong IP. 
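Root causes 2 and 3 both trace back to a drop-in file naming a server that no longer exists. The following is a minimal sketch of how such files could be listed before anything is deleted. It assumes the drop-ins use the standard `NTP=` / `FallbackNTP=` keys; the function name `find_stale_ntp_dropins` is hypothetical and was not part of the original restoration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original runbook).
# Prints every *.conf file in <dropin_dir> whose NTP= or FallbackNTP= line
# names <server>. Prints nothing if no drop-in references it.
# Usage: find_stale_ntp_dropins <dropin_dir> <server>
find_stale_ntp_dropins() {
  local dir="$1" server="$2" f
  for f in "$dir"/*.conf; do
    [ -e "$f" ] || continue   # glob matched nothing
    # First keep only the NTP-related keys, then look for the server as a
    # literal whole word so 192.168.128.33 does not match 192.168.128.330.
    if grep -E '^[[:space:]]*(Fallback)?NTP=' "$f" | grep -qwF -- "$server"; then
      printf '%s\n' "$f"
    fi
  done
}

# Example call on MK7 (paths and IP taken from the report):
#   find_stale_ntp_dropins /etc/systemd/timesyncd.conf.d 192.168.128.33
```

Listing the matches first keeps the cleanup narrower than a blanket `*.conf` delete, which may matter if other drop-ins are added to that directory later.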
+ +--- + +## Actions Taken + +### Phase 1: Service Redeployment + +Located the compose files at `/opt/iron-legion/docker-swarm/` and deployed each stack individually: + +```bash +# Deployed stacks +docker stack deploy -c traefik/compose.yml traefik +docker stack deploy -c portainer/compose.yml portainer +docker stack deploy -c technitium/compose.yml technitium +docker stack deploy -c homepage/compose.yml homepage +docker stack deploy -c n8n/n8n-stack.yml n8n +docker stack deploy -c beszel/compose.yml beszel +docker stack deploy -c dozzle/compose.yml dozzle +docker stack deploy -c authelia/compose.yml authelia +docker stack deploy -c prometheus/compose.yml prometheus +docker stack deploy -c node-exporter/compose.yml node-exporter +``` + +All stacks converged successfully. + +### Phase 2: NTP / Authelia Fix + +**Problem identified:** Authelia container logs showed: +``` +error="the system clock is not synchronized accurately enough with the configured NTP server" provider=ntp +``` + +**Investigation:** +```bash +systemctl status systemd-timesyncd +# Status: "Connecting to time server 192.168.128.33:123" +``` + +**Fix applied:** +```bash +# Removed stale cloud-init NTP config (root-owned, so sudo is required) +sudo rm -f /etc/systemd/timesyncd.conf.d/*.conf + +# Reset timesyncd to default (uses pool.ntp.org fallbacks) +echo '[Time]' | sudo tee /etc/systemd/timesyncd.conf +sudo systemctl restart systemd-timesyncd + +# Verified sync +timedatectl status | grep "System clock synchronized: yes" +``` + +**Result:** `System clock synchronized: yes` — Authelia restarted successfully. + +### Phase 3: MK-42 Worker Node Reintegration + +**Discovery:** MK-42 (192.168.0.196) was online and had Docker installed, but Swarm was inactive. 
+ +**Action:** +```bash +# On MK-42 +ssh jarvis@192.168.0.196 +docker swarm leave --force # Not in swarm, just confirming +docker swarm join --token SWMTKN-1-5po7nh34gige4jj7psqyc2pe8puf66yvpzvq3o4suy2kzqa5om-7tobwwhz2tvmo7wmg5yk7m5jd 192.168.7.7:2377 +``` + +**Result:** MK-42 joined Swarm as a worker node. Now available for workload scheduling. + +--- + +## Final Service Status + +| Stack | Service | Status | Replicas | Notes | +|-------|---------|--------|----------|-------| +| traefik | traefik | ✅ Running | 1/1 | Global mode on manager, healthy | +| portainer | portainer | ✅ Running | 1/1 | Replicated on manager | +| technitium | technitium | ✅ Running | 1/1 | Ports 53/5380 exposed (host mode) | +| homepage | homepage | ✅ Running | 1/1 | Replicated on manager | +| n8n | postgres | ✅ Running | 1/1 | Healthy | +| n8n | pgadmin | ✅ Running | 1/1 | — | +| n8n | n8n | ✅ Running | 1/1 | Healthy | +| beszel | beszel-hub | ✅ Running | 1/1 | Port 8090 exposed | +| dozzle | dozzle | ✅ Running | 1/1 | Port 8081 exposed | +| authelia | authelia | ✅ Running | 1/1 | After NTP fix | +| prometheus | prometheus | ✅ Running | 1/1 | — | +| node-exporter | node-exporter | ✅ Running | 1/1 | Global mode | +| pegaprox | pegaprox | ✅ Running | 1/1 | Already running (unchanged) | + +**Swarm nodes:** +| ID | Hostname | Status | Availability | Manager | +|----|----------|--------|--------------|---------| +| x6xr2s6... | mark-vii.ai.home | Ready | Active | Leader | +| x46ce7y... | mk-42 | Ready | Active | — (Worker) | + +--- + +## Health Checks Verified + +```bash +❯ curl -s http://localhost:8080/ping → OK (Traefik) +❯ curl -s http://localhost:9000/api/status → {"Version":"2.39.2",...} (Portainer) +❯ curl -s http://localhost:5380 → Technitium HTML (DNS UI) +❯ curl -s http://localhost:8090 → Beszel HTML +❯ curl -s http://localhost:5678/healthz → OK (n8n) +❯ curl -s http://localhost:8081/api/health → OK (Dozzle) +``` + +All services responding on expected ports. 
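The manual `curl` checks above could be wrapped in a small loop so they are repeatable after the next relocation. This is only a sketch: the function name `check_endpoints` and the `PROBE` override are assumptions introduced here for illustration, while the URLs in the usage comment are the ones listed in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original runbook).
# Reads "name url" pairs from stdin, probes each URL, prints one PASS/FAIL
# line per endpoint, and returns non-zero if any probe failed.
# PROBE may be set to another command for dry runs; the default is curl,
# where -f turns HTTP errors (>= 400) into a non-zero exit status.
check_endpoints() {
  local probe="${PROBE:-curl -fsS -o /dev/null --max-time 5}"
  local name url failed=0
  while read -r name url; do
    [ -n "$name" ] || continue          # skip blank lines
    # $probe is intentionally unquoted so its flags split into words.
    # stdin is redirected so the probe cannot swallow the endpoint list.
    if $probe "$url" </dev/null; then
      printf 'PASS %s %s\n' "$name" "$url"
    else
      printf 'FAIL %s %s\n' "$name" "$url"
      failed=1
    fi
  done
  return "$failed"
}

# Example call on MK7 (endpoints taken from the report):
#   check_endpoints <<'EOF'
#   traefik    http://localhost:8080/ping
#   portainer  http://localhost:9000/api/status
#   technitium http://localhost:5380
#   beszel     http://localhost:8090
#   n8n        http://localhost:5678/healthz
#   dozzle     http://localhost:8081/api/health
#   EOF
```

The non-zero return code makes the check usable as the last step of `deploy.sh` or a cron job, if that is wanted.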
+ +--- + +## File Changes on MK7 + +| File | Action | Reason | +|------|--------|--------| +| `/etc/systemd/timesyncd.conf.d/*.conf` | Deleted | Stale cloud-init NTP config pointing to wrong IP | +| `/etc/systemd/timesyncd.conf` | Reset to `[Time]` only | Restore default NTP behavior | +| `/opt/iron-legion/docker-swarm/deploy.sh` | Modified | Removed reference to missing `adguard` stack (not deployed) | + +--- + +## Notes for Future Operations + +1. **NTP drift on relocated nodes:** Always verify `timedatectl status` after moving hardware. Cloud-init may inject stale NTP configs. +2. **AdGuard removed:** `deploy.sh` previously referenced an `adguard` stack that no longer exists (AdGuard was removed in favor of Technitium's built-in blocking). The script was updated to skip it. +3. **MK-42 as Swarm worker:** MK-42 is now available for container scheduling but has not been labeled for specific workloads. If you want PVE services on it, consider deploying a VM first or using it as a bare Swarm worker. +4. **No Tailscale on MK-42:** As requested, MK-42 joins via LAN IP only. No Tailscale client is installed. + +--- + +*Last updated: 2026-06-01 by F.R.I.D.A.Y.* diff --git a/reports/netbird-evaluation-report.md b/reports/netbird-evaluation-report.md new file mode 100644 index 0000000..4e9fa31 --- /dev/null +++ b/reports/netbird-evaluation-report.md @@ -0,0 +1,344 @@ +# Netbird Self-Hosted Control Plane — Evaluation Report + +**Author:** F.R.I.D.A.Y. (Hermes Agent) +**Date:** 2026-05-31 +**Status:** Draft — for Commander review before deployment +**Scope:** Evaluate the Netbird self-hosted control plane as a potential replacement or complement to Tailscale mesh networking for the Iron Legion fleet. + +--- + +## Executive Summary + +Netbird is an open-source, WireGuard-based mesh VPN that provides peer-to-peer connectivity with a centralized management plane. As of v0.71.4 (May 2026), it now offers **two deployment models** for self-hosting: + +1. 
**Quickstart (single-container, recommended for new deployments)** — Combined management + signal + relay in one `netbird-server` container with embedded Dex IdP. ~5-minute setup via `getting-started.sh` with built-in Traefik and automatic TLS. +2. **Advanced (multi-container, legacy but supported)** — Separate services (management, signal, coturn, relay, dashboard) configured via `management.json` and `docker-compose.yml`. + +**Key finding:** Netbird now supports running **behind an existing reverse proxy** (Traefik, Nginx, Caddy) as a first-class deployment option. This is significant for the Iron Legion because MK7 already runs Traefik for `*.ai.home` services — we can integrate Netbird without adding a new public-facing edge. + +--- + +## What Netbird Offers (vs. Tailscale) + +| Feature | Tailscale | Netbird | +|---------|-----------|---------| +| Underlay protocol | WireGuard | WireGuard | +| Control plane | Tailscale Co. cloud | **Self-hostable** | +| NAT traversal | DERP relays (cloud-hosted) | Self-hosted Coturn + Relay | +| Identity provider | Tailscale accounts / SSO via Auth0, etc. | **Embedded Dex** / Any OIDC IdP | +| Network routes | ✅ | ✅ | +| DNS split-brain | MagicDNS | Network-wide DNS | +| Reverse proxy / funnel | Tailscale Funnel (public) | **Built-in reverse proxy via Netbird Proxy** | +| Access controls | ACL policies | **Group + peer policies** | +| Linux clients | ✅ | ✅ | +| Windows | ✅ | ✅ | +| Mobile (iOS/Android) | ✅ | ✅ | +| Browser client | ❌ | ✅ | +| Open-source | Client only | **Fully open-source** | + +**For the Iron Legion:** The primary advantage of Netbird is **full ownership of the control plane**. Tailscale depends on Tailscale Inc. infrastructure for coordination and DERP relays; Netbird brings both under our control. 
+ +--- + +## Architecture Overview + +### Quickstart (v0.29+, Recommended) + +``` +[Public Internet] + | + +-- TCP 80/443 --> Traefik (built-in or external) + | | + | +-- Dashboard UI (web) + | +-- Management API (gRPC over HTTPS) + | +-- Signal (gRPC over HTTPS, HTTP/2 ALPN) + | +-- Relay (WebSocket over HTTPS) + | + +-- UDP 3478 --> Coturn (STUN/TURN) + | + +-- UDP 49152-65535 --> TURN relay ports (legacy) +``` + +**Combined server container** (`netbird-server`) consolidates: +- Management Service — peer orchestration, ACLs, routes, DNS +- Signal Service — WebRTC signaling for direct WireGuard connections +- Relay Service — WebSocket relay for fallback when direct p2p fails +- Embedded Dex — built-in identity provider (local users + external OIDC) +- Dashboard — web management UI + +**New in v0.29:** Management and Signal share port 443 via HTTP/2 ALPN. Previously required separate ports (33073 for management gRPC, 10000 for signal gRPC, 33080 for relay). + +### Advanced (legacy multi-container) + +- `management` — API server + dashboard +- `signal` — WebRTC signaling +- `relay` — WebSocket fallback relay +- `coturn` — TURN/STUN server +- `dashboard` — React UI +- External IdP required (or Dex deployed separately) + +**Iron Legion recommendation:** Use the **Quickstart model** unless there's a hard requirement for a separate IdP (Authelia, Keycloak, etc.) that cannot run alongside the embedded Dex. + +--- + +## Deployment Options for Iron Legion + +### Option A: Docker Swarm on MK7 (Recommended for Low Friction) + +Deploy Netbird as a Docker Swarm stack on MK7, using the **existing Traefik** as the reverse proxy. 
+ +**Pros:** +- Already running Swarm + Traefik on MK7 +- No new VM or LXC to provision +- Can share `traefik-public` network +- Traefik handles TLS certs via internal CA or Let's Encrypt + +**Cons:** +- MK7 is already the Swarm manager + DNS + proxy — adding mesh control plane means more load on the same node +- If MK7 goes down, both the mesh *and* the Web UI/proxy go down + +**Port mapping on MK7:** +| Port | Protocol | Service | +|------|----------|---------| +| 80 | TCP | HTTP (redirect + ACME challenge) | +| 443 | TCP | HTTPS (Dashboard, Management, Signal, Relay) | +| 3478 | UDP | Coturn STUN/TURN | + +> Note: v0.29+ consolidated ports reduce firewall complexity. If all clients run v0.29+, only need 80/443 + 3478. Legacy clients need 33073, 10000, 33080, and UDP 49152-65535. + +### Option B: Dedicated LXC on Proxmox (Recommended for Resilience) + +Deploy Netbird control plane as an LXC container on one of the Proxmox nodes (MK33/34/39/42), with port forwards via `iptables` or host networking. + +**Pros:** +- Isolated from Docker Swarm failures +- Can colocate with MK7 for low latency but separate failure domain +- Easier backups via Proxmox scheduled snapshot + +**Cons:** +- Requires provisioning an LXC first +- Need to forward UDP 3478 + TCP 443 from host to container + +**Recommended node:** MK39 (Gemini) — currently underutilized, stable node. + +### Option C: PVE VM (Heavy, Overkill) + +Full VM on Proxmox — unnecessary overhead for a coordination server. + +**Verdict:** Option B (LXC on MK39) for resilience, or Option A (Swarm on MK7) if simplicity is preferred. 
+ +--- + +## Reverse Proxy Integration + +The `getting-started.sh` script supports **6 reverse proxy modes**: + +| Option | Reverse Proxy | Iron Legion Fit | +|--------|-------------|------------------| +| `[0]` | Built-in Traefik (new container) | Works but redundant — we already have Traefik | +| `[1]` | External Traefik (labels only) | **Best fit for Option A** — generates Docker labels for existing Traefik | +| `[2]` | Nginx (config template) | Not needed — already running Traefik | +| `[3]` | Nginx Proxy Manager | Not needed | +| `[4]` | External Caddy | Not needed | +| `[5]` | Other/Manual | Fallback if Traefik ALPN doesn't work | + +**Iron Legion choice:** Option `[1]` — "Existing Traefik" labels. This generates: +- `traefik.enable=true` +- `traefik.http.routers.netbird-.rule=Host(...)` +- `traefik.http.services.netbird-.loadbalancer.server.port=...` +- Labels for each endpoint: Dashboard (443), Management gRPC (443), Signal gRPC (443), Relay WebSocket (443) + +### Required Traefik EntryPoints + +Already configured on MK7 Traefik: +- `web` (:80) — redirect to HTTPS +- `websecure` (:443) — HTTPS + gRPC via HTTP/2 +- `traefik-dashboard` (:8080) — dashboard + +**No new entrypoints needed.** All Netbird services multiplex over 443 via HTTP/2 ALPN. + +--- + +## DNS Requirements + +Netbird needs two DNS records: + +| Type | Record | Points To | +|------|--------|-----------| +| A | `netbird.ai.home` | MK7 (192.168.7.7) or MK39 LXC IP | +| CNAME | `*.netbird.ai.home` | `netbird.ai.home` | + +The wildcard is required for Netbird Proxy — each exposed internal service gets a subdomain (e.g., `service.netbird.ai.home`). + +**Technitium DNS update:** Add: +- `netbird.ai.home` → A → 192.168.7.7 (or LXC IP if Option B) +- `*.netbird.ai.home` → CNAME → `netbird.ai.home` + +> Note: Netbird clients on the mesh resolve `*.netbird.selfhosted` internally. The `ai.home` DNS is only needed for the dashboard web UI and proxy subdomains. 
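Once the two records are added in Technitium, they could be verified with a short check like the one below. It is a sketch under stated assumptions: the function name `check_netbird_dns`, the probe label `wildcard-test`, and the `RESOLVE` override are hypothetical; `dig +short` is the only external command used.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# Verifies that <base> and an arbitrary name under *.<base> both resolve to
# <expected_ip>. RESOLVE may be set to another resolver command for testing.
# Usage: check_netbird_dns <base> <expected_ip>
check_netbird_dns() {
  local base="$1" expected="$2" resolve="${RESOLVE:-dig +short}"
  local name answer rc=0
  # "wildcard-test" is an arbitrary label; any unused name exercises the wildcard.
  for name in "$base" "wildcard-test.$base"; do
    # For a CNAME, dig +short prints the target first and the address after it,
    # so the last line is the address in both cases.
    answer="$($resolve "$name" | tail -n 1)"
    if [ "$answer" = "$expected" ]; then
      printf 'OK   %s -> %s\n' "$name" "$answer"
    else
      printf 'FAIL %s -> %s (expected %s)\n' "$name" "${answer:-none}" "$expected"
      rc=1
    fi
  done
  return "$rc"
}

# Example call for Option A (values taken from the report):
#   check_netbird_dns netbird.ai.home 192.168.7.7
```

For Option B the second argument would be the LXC address instead of MK7's.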
+ +--- + +## Authentication Strategy + +Netbird Quickstart includes an **embedded Dex** identity provider with local user management. This is sufficient for Iron Legion's current needs. + +**Two paths:** + +### Path 1: Embedded Dex Only (Recommended for Review) +- Local user accounts created via Netbird Dashboard +- No dependence on external IdP +- Username/password or personal access tokens +- Can migrate to external IdP later without re-enrolling devices + +### Path 2: Integrate with Existing Authelia (Future) +- Authelia on MK7 supports OIDC (added in v4.38+) +- Netbird can authenticate against Authelia as the IdP +- Single sign-on across all fleet services +- More complex setup — save for Phase 2 + +**Recommendation:** Start with Path 1 (embedded Dex). It's fully functional, requires zero extra infrastructure, and can be migrated to Authelia OIDC later. + +--- + +## Tailscale Coexistence + +Netbird and Tailscale **can run simultaneously** on the same nodes because they use different WireGuard interfaces and port ranges: +- Tailscale: UDP 41641 (WireGuard), port 443/TCP (DERP) +- Netbird: UDP 51820 (WireGuard), UDP 3478 (TURN), TCP 443 (management/signal) + +**Potential conflicts:** +- Both want UDP high-ports for NAT traversal — OS assigns ephemeral ports, typically fine +- Both manipulate iptables/routing tables — could interfere with default routes +- DNS resolution: Tailscale MagicDNS vs. Netbird DNS — whichever binds `/etc/resolv.conf` last wins + +**Recommended coexistence strategy:** +- Primary mesh: Tailscale (currently working, MagicDNS configured for `ai.home`) +- Secondary / evaluation: Netbird on a subset of nodes +- Use Netbird for specific access-control use cases (e.g., expose certain services via Netbird Proxy) +- Do NOT set Netbird as default route unless Tailscale is decommissioned + +--- + +## Netbird Proxy — Replacing Traefik? 
+ +**Commander question:** "Run alongside possibly replace Traefik as the reverse proxy" + +**Answer:** Netbird Proxy is NOT a reverse proxy replacement for Traefik. It solves a **different problem**: + +- **Traefik** (existing on MK7): Routes `*.ai.home` traffic *within* the LAN/WAN to Docker containers. It handles HTTP/HTTPS ingress for services like Portainer, PegaProx, Technitium, etc. +- **Netbird Proxy**: Exposes internal Netbird mesh services *to the public internet* via subdomain routing, secured by Netbird's access policies. Think of it as a Tailscale Funnel equivalent. + +**Example:** +- `prometheus.internal.ai.home` is only reachable inside the LAN → Traefik routes to Prometheus +- `prometheus.netbird.ai.home` could be exposed to a remote user's laptop via Netbird Proxy with per-user ACLs + +**Verdict:** Keep Traefik. Netbird Proxy complements it for selective external exposure; it does not replace it. + +--- + +## Resource Requirements + +### Quickstart (single container) +| Resource | Min | Recommended | +|----------|-----|-------------| +| CPU | 1 core | 2 cores | +| RAM | 2 GB | 4 GB | +| Disk | 10 GB | 20 GB | +| Network | Public IP + DNS | Same | + +### Advanced (multi-container) +| Resource | Min | Recommended | +|----------|-----|-------------| +| CPU | 2 cores | 4 cores | +| RAM | 4 GB | 8 GB | +| Disk | 20 GB | 40 GB | +| Network | Same | Same | + +**Iron Legion:** Both MK7 (18 cores, 15 GB RAM) and a Proxmox LXC (easily provisioned with 4 GB RAM, 2 cores) comfortably meet these requirements. 
+ +--- + +## Deployment Effort Estimate + +| Phase | Task | Time | Notes | +|-------|------|------|-------| +| P0 | Review this report | — | Commander decision point | +| P1 | Add DNS records to Technitium | 15 min | `netbird.ai.home` + wildcard | +| P2 | Deploy Netbird (Quickstart Option A or B) | 30 min | Run `getting-started.sh`, select option [1] or [0] | +| P3 | Create first admin user via `/setup` | 5 min | Web browser | +| P4 | Install Netbird client on test nodes | 20 min | 2-3 nodes for validation | +| P5 | Configure network routes + ACLs | 45 min | Mirror Tailscale access patterns | +| P6 | Evaluate coexistence vs. Tailscale replacement | Ongoing | 1-2 week trial period | + +**Total hands-on time (if approved):** ~2 hours (+ evaluation period). + +--- + +## Known Issues / Gotchas + +1. **ALPN / HTTP/2 requirement:** Netbird v0.29+ consolidated ports require HTTP/2 + ALPN on the reverse proxy. Traefik supports this natively. Nginx requires an explicit `http2` directive on `listen`. + +2. **Legacy clients:** If any Iron Legion device runs an older Netbird client (< v0.29), you'll need the legacy ports (33073, 10000, 33080, UDP 49152-65535). All fleet devices should use the latest client. + +3. **Coturn on cloud VMs:** Oracle Cloud and Hetzner require firewall rules for UDP 3478 beyond just the VM level. Not applicable for LAN but noted for future cloud expansion. + +4. **First user setup:** The `/setup` page is **only accessible when zero users exist**. After first admin creation, it redirects to `/login`. To create additional admins, use Dashboard → Settings → Identity Providers or API with PAT. + +5. **NTP dependency:** Authelia failed on MK7 due to unsynchronized clock (see MK7 restoration report). Netbird's management service also checks certificate validity — ensure NTP sync on the host. + +6. **Wildcard DNS for Proxy:** If enabling Netbird Proxy, the wildcard CNAME is mandatory. Without it, exposed service subdomains won't resolve. 
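Gotcha 5 could be enforced as a gate before running `getting-started.sh`. The sketch below reuses the same `System clock synchronized: yes` line that the MK7 restoration verified with `grep`; the function name `clock_is_synced` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# Reads `timedatectl status` output on stdin and returns 0 only if it
# reports a synchronized system clock.
clock_is_synced() {
  grep -Eq '^[[:space:]]*System clock synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Example gate before deployment:
#   timedatectl status | clock_is_synced || {
#     echo "Clock not synchronized; fix NTP before deploying Netbird" >&2
#     exit 1
#   }
```

Reading from stdin rather than calling `timedatectl` inside the function keeps the check usable over SSH, for example by piping the output collected from each test node.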
+ +--- + +## Recommendations + +### Immediate (Pre-Deployment) +1. ✅ Commander reviews this report +2. ✅ Decide Option A (Swarm on MK7) vs. Option B (LXC on MK39) +3. ✅ If Option A: verify Traefik HTTP/2 ALPN is active + +### Short-Term (If Approved) +1. Deploy Netbird Quickstart with embedded Dex +2. Add `netbird.ai.home` + wildcard to Technitium DNS +3. Install clients on 2-3 test nodes (Cinnamint, Artemis, MK42) +4. Mirror one Tailscale route in Netbird for comparison + +### Long-Term (Evaluation After 2 Weeks) +1. Compare latency/connection reliability vs. Tailscale +2. Evaluate Netbird Proxy for selective external access +3. Decide: coexist, replace Tailscale, or decommission Netbird +4. If replacing: migrate MagicDNS zones to Netbird DNS, update all `.ai.home` client configs + +--- + +## References + +- Netbird Docs (Self-Hosted Quickstart): https://docs.netbird.io/selfhosted/selfhosted-quickstart +- Netbird Docs (Advanced Guide): https://docs.netbird.io/selfhosted/selfhosted-guide +- GitHub (infrastructure files): https://github.com/netbirdio/netbird/tree/v0.71.4/infrastructure_files +- Quickstart install script: `curl -fsSL https://github.com/netbirdio/netbird/releases/latest/download/getting-started.sh | bash` +- Reverse Proxy Configuration: https://docs.netbird.io/selfhosted/reverse-proxy +- Upgrade / Migration Guide: https://docs.netbird.io/selfhosted/maintenance + +--- + +## Appendix: Netbird vs Tailscale Detailed Comparison + +| Aspect | Tailscale | Netbird Self-Hosted | +|--------|-----------|---------------------| +| Control plane ownership | ❌ Tailscale Inc. 
| ✅ Fully owned | +| Relay ownership | ❌ Tailscale DERP | ✅ Self-hosted Coturn | +| Cost | Free tier limited; enterprise paid | Free; unlimited | +| Identity | External IdP or Tailscale | Embedded Dex or any OIDC | +| Web dashboard | ✅ | ✅ (self-hosted) | +| API | ✅ | ✅ (REST + gRPC) | +| SCIM provisioning | ❌ (manual) | ✅ (Enterprise) | +| Network segmentation / ACLs | Yes (JSON ACL) | Yes (groups + policies) | +| Exit nodes | ✅ | ✅ | +| Subnet routers | ✅ | ✅ | +| Browser client | ❌ | ✅ (WebRTC-based) | +| Mobile NAT busting | DERP | TURN + direct p2p | + +--- + +*Report generated 2026-05-31 by F.R.I.D.A.Y. — awaiting Commander review.*