documentation/reports/netbird-evaluation-report.md
F.R.I.D.A.Y. 3da2689e4d Add fleet operational reports
- mk7-service-restoration-report.md: Restored Swarm stacks after relocation, fixed NTP drift, rejoined MK-42 as worker
- netbird-evaluation-report.md: Full evaluation of self-hosted Netbird control plane for tailscale coexistence/replacement

Author: F.R.I.D.A.Y.
2026-06-01 07:45:13 -04:00

# Netbird Self-Hosted Control Plane — Evaluation Report
**Author:** F.R.I.D.A.Y. (Hermes Agent)
**Date:** 2026-05-31
**Status:** Draft — for Commander review before deployment
**Scope:** Evaluate Netbird self-hosted control plane as a potential replacement or complement to Tailscale mesh networking for the Iron Legion fleet.
---
## Executive Summary
Netbird is an open-source, WireGuard-based mesh VPN that provides peer-to-peer connectivity with a centralized management plane. As of v0.71.4 (May 2026), it now offers **two deployment models** for self-hosting:
1. **Quickstart (single-container, recommended for new deployments)** — Combined management + signal + relay in one `netbird-server` container with embedded Dex IdP. ~5-minute setup via `getting-started.sh` with built-in Traefik and automatic TLS.
2. **Advanced (multi-container, legacy but supported)** — Separate services (management, signal, coturn, relay, dashboard) configured via `management.json` and `docker-compose.yml`.
**Key finding:** Netbird now supports running **behind an existing reverse proxy** (Traefik, Nginx, Caddy) as a first-class deployment option. This is significant for the Iron Legion because MK7 already runs Traefik for `*.ai.home` services — we can integrate Netbird without adding a new public-facing edge.
---
## What Netbird Offers (vs. Tailscale)
| Feature | Tailscale | Netbird |
|---------|-----------|---------|
| Underlay protocol | WireGuard | WireGuard |
| Control plane | Tailscale Inc. cloud | **Self-hostable** |
| NAT traversal | DERP relays (cloud-hosted) | Self-hosted Coturn + Relay |
| Identity provider | Tailscale accounts / SSO via Auth0, etc. | **Embedded Dex** / Any OIDC IdP |
| Network routes | ✅ | ✅ |
| DNS split-brain | MagicDNS | Network-wide DNS |
| Reverse proxy / funnel | Tailscale Funnel (public) | **Built-in reverse proxy via Netbird Proxy** |
| Access controls | ACL policies | **Group + peer policies** |
| Linux clients | ✅ | ✅ |
| Windows | ✅ | ✅ |
| Mobile (iOS/Android) | ✅ | ✅ |
| Browser client | ❌ | ✅ |
| Open-source | Client only | **Fully open-source** |
**For the Iron Legion:** The primary advantage of Netbird is **full ownership of the control plane**. Tailscale depends on Tailscale Inc. infrastructure for coordination and DERP relays; Netbird brings both under our control.
---
## Architecture Overview
### Quickstart (v0.29+, Recommended)
```
[Public Internet]
       |
       +-- TCP 80/443 --> Traefik (built-in or external)
       |                      |
       |                      +-- Dashboard UI (web)
       |                      +-- Management API (gRPC over HTTPS)
       |                      +-- Signal (gRPC over HTTPS, HTTP/2 ALPN)
       |                      +-- Relay (WebSocket over HTTPS)
       |
       +-- UDP 3478 --> Coturn (STUN/TURN)
       |
       +-- UDP 49152-65535 --> TURN relay ports (legacy)
```
**Combined server container** (`netbird-server`) consolidates:
- Management Service — peer orchestration, ACLs, routes, DNS
- Signal Service — WebRTC signaling for direct WireGuard connections
- Relay Service — WebSocket relay for fallback when direct p2p fails
- Embedded Dex — built-in identity provider (local users + external OIDC)
- Dashboard — web management UI
**New in v0.29:** Management and Signal share port 443 via HTTP/2 ALPN. Previously required separate ports (33073 for management gRPC, 10000 for signal gRPC, 33080 for relay).
### Advanced (legacy multi-container)
- `management` — API server + dashboard
- `signal` — WebRTC signaling
- `relay` — WebSocket fallback relay
- `coturn` — TURN/STUN server
- `dashboard` — React UI
- External IdP required (or Dex deployed separately)
**Iron Legion recommendation:** Use the **Quickstart model** unless there's a hard requirement for a separate IdP (Authelia, Keycloak, etc.) that cannot run alongside the embedded Dex.
---
## Deployment Options for Iron Legion
### Option A: Docker Swarm on MK7 (Recommended for Low Friction)
Deploy Netbird as a Docker Swarm stack on MK7, using the **existing Traefik** as the reverse proxy.
**Pros:**
- Already running Swarm + Traefik on MK7
- No new VM or LXC to provision
- Can share `traefik-public` network
- Traefik handles TLS certs via internal CA or Let's Encrypt
**Cons:**
- MK7 is already the Swarm manager + DNS + proxy — adding mesh control plane means more load on the same node
- If MK7 goes down, both the mesh control plane *and* the web UI/proxy go down
**Port mapping on MK7:**
| Port | Protocol | Service |
|------|----------|---------|
| 80 | TCP | HTTP (redirect + ACME challenge) |
| 443 | TCP | HTTPS (Dashboard, Management, Signal, Relay) |
| 3478 | UDP | Coturn STUN/TURN |
> Note: Port consolidation in v0.29+ reduces firewall complexity. If all clients run v0.29+, only TCP 80/443 and UDP 3478 are needed. Legacy clients additionally need 33073, 10000, 33080, and UDP 49152-65535.
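
The port rule in the note above can be sketched as a small helper that decides which ports to open from the client versions in the fleet. This is a minimal sketch, not part of Netbird: `required_ports` and `parse_version` are hypothetical names, and treating 33073, 10000, and 33080 as TCP is an assumption based on their gRPC/WebSocket roles described earlier.

```python
from typing import Iterable

# Port sets taken from the note above. Treating the three legacy service
# ports as TCP is an assumption (gRPC and WebSocket run over TCP).
CONSOLIDATED_PORTS = [("tcp", "80"), ("tcp", "443"), ("udp", "3478")]
LEGACY_PORTS = [
    ("tcp", "33073"),        # management gRPC
    ("tcp", "10000"),        # signal gRPC
    ("tcp", "33080"),        # relay
    ("udp", "49152-65535"),  # TURN relay range
]


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v0.71.4' or '0.29' into a comparable tuple such as (0, 71, 4)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def required_ports(client_versions: Iterable[str]) -> list[tuple[str, str]]:
    """Return the ports to open, adding legacy ports if any client is < v0.29."""
    ports = list(CONSOLIDATED_PORTS)
    if any(parse_version(v) < (0, 29) for v in client_versions):
        ports += LEGACY_PORTS
    return ports


for proto, port in required_ports(["v0.71.4", "0.28.9"]):
    print(f"{proto.upper()} {port}")
```

With only v0.29+ clients in the list, the helper returns just the three consolidated entries.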
### Option B: Dedicated LXC on Proxmox (Recommended for Resilience)
Deploy Netbird control plane as an LXC container on one of the Proxmox nodes (MK33/34/39/42), with port forwards via `iptables` or host networking.
**Pros:**
- Isolated from Docker Swarm failures
- Can colocate with MK7 for low latency but separate failure domain
- Easier backups via Proxmox scheduled snapshot
**Cons:**
- Requires provisioning an LXC first
- Need to forward UDP 3478 + TCP 443 from host to container
**Recommended node:** MK39 (Gemini) — currently underutilized, stable node.
### Option C: PVE VM (Heavy, Overkill)
Full VM on Proxmox — unnecessary overhead for a coordination server.
**Verdict:** Option B (LXC on MK39) for resilience, or Option A (Swarm on MK7) if simplicity is preferred.
---
## Reverse Proxy Integration
The `getting-started.sh` script supports **6 reverse proxy modes**:
| Option | Reverse Proxy | Iron Legion Fit |
|--------|-------------|------------------|
| `[0]` | Built-in Traefik (new container) | Works but redundant — we already have Traefik |
| `[1]` | External Traefik (labels only) | **Best fit for Option A** — generates Docker labels for existing Traefik |
| `[2]` | Nginx (config template) | Not needed — already running Traefik |
| `[3]` | Nginx Proxy Manager | Not needed |
| `[4]` | External Caddy | Not needed |
| `[5]` | Other/Manual | Fallback if Traefik ALPN doesn't work |
**Iron Legion choice:** Option `[1]` — "Existing Traefik" labels. This generates:
- `traefik.enable=true`
- `traefik.http.routers.netbird-<service>.rule=Host(...)`
- `traefik.http.services.netbird-<service>.loadbalancer.server.port=...`
- Labels for each endpoint: Dashboard (443), Management gRPC (443), Signal gRPC (443), Relay WebSocket (443)
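
To make the label pattern above concrete, the following sketch builds the three label keys listed for one service. It only illustrates the naming scheme: `traefik_labels` is a hypothetical helper, the service name `dashboard` and container port `8080` are placeholders, and the labels actually emitted by `getting-started.sh` may differ.

```python
def traefik_labels(service: str, host: str, port: int) -> dict[str, str]:
    """Build the three Traefik labels listed above for one Netbird service."""
    name = f"netbird-{service}"
    return {
        "traefik.enable": "true",
        f"traefik.http.routers.{name}.rule": f"Host(`{host}`)",
        f"traefik.http.services.{name}.loadbalancer.server.port": str(port),
    }


def as_compose_lines(labels: dict[str, str]) -> list[str]:
    """Render labels as list entries suitable for a compose `labels:` block."""
    return [f'- "{key}={value}"' for key, value in labels.items()]


# Placeholder values: the real container port comes from the generated config.
labels = traefik_labels("dashboard", "netbird.ai.home", 8080)
print("\n".join(as_compose_lines(labels)))
```

In Swarm mode Traefik reads service-level labels, so for Option A these entries would typically sit under `deploy.labels` in the stack file rather than at container level.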
### Required Traefik EntryPoints
Already configured on MK7 Traefik:
- `web` (:80) — redirect to HTTPS
- `websecure` (:443) — HTTPS + gRPC via HTTP/2
- `traefik-dashboard` (:8080) — dashboard
**No new entrypoints needed.** All Netbird services multiplex over 443 via HTTP/2 ALPN.
---
## DNS Requirements
Netbird needs two DNS records:
| Type | Record | Points To |
|------|--------|-----------|
| A | `netbird.ai.home` | MK7 (192.168.7.7) or MK39 LXC IP |
| CNAME | `*.netbird.ai.home` | `netbird.ai.home` |
The wildcard is required for Netbird Proxy — each exposed internal service gets a subdomain (e.g., `service.netbird.ai.home`).
**Technitium DNS update:** Add:
- `netbird.ai.home` → A → 192.168.7.7 (or LXC IP if Option B)
- `*.netbird.ai.home` → CNAME → `netbird.ai.home`
> Note: Netbird clients on the mesh resolve `*.netbird.selfhosted` internally. The `ai.home` DNS is only needed for the dashboard web UI and proxy subdomains.
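
Once the records are in Technitium, a quick check that both the base name and a wildcard subdomain resolve to the expected address might look like the sketch below. `check_dns` is a hypothetical helper; the resolver is injectable so the logic can be exercised without touching live DNS, and `service.netbird.ai.home` is just the example subdomain from above.

```python
import socket
from typing import Callable

Resolver = Callable[[str], str]


def check_dns(
    expected_ip: str,
    names: list[str],
    resolve: Resolver = socket.gethostbyname,
) -> dict[str, bool]:
    """Map each name to True if it resolves to expected_ip, else False."""
    results: dict[str, bool] = {}
    for name in names:
        try:
            results[name] = resolve(name) == expected_ip
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = False
    return results


# Live usage on a node that uses Technitium as its resolver (Option A address):
#   check_dns("192.168.7.7", ["netbird.ai.home", "service.netbird.ai.home"])
```

Any name mapped to `False` either does not resolve or points somewhere other than the chosen host.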
---
## Authentication Strategy
Netbird Quickstart includes an **embedded Dex** identity provider with local user management. This is sufficient for Iron Legion's current needs.
**Two paths:**
### Path 1: Embedded Dex Only (Recommended for Review)
- Local user accounts created via Netbird Dashboard
- No dependence on external IdP
- Username/password or personal access tokens
- Can migrate to external IdP later without re-enrolling devices
### Path 2: Integrate with Existing Authelia (Future)
- Authelia on MK7 supports OIDC (added in v4.38+)
- Netbird can authenticate against Authelia as the IdP
- Single sign-on across all fleet services
- More complex setup — save for Phase 2
**Recommendation:** Start with Path 1 (embedded Dex). It's fully functional, requires zero extra infrastructure, and can be migrated to Authelia OIDC later.
---
## Tailscale Coexistence
Netbird and Tailscale **can run simultaneously** on the same nodes because they use different WireGuard interfaces and port ranges:
- Tailscale: UDP 41641 (WireGuard), TCP 443 (DERP)
- Netbird: UDP 51820 (WireGuard), UDP 3478 (TURN), TCP 443 (management/signal)
**Potential conflicts:**
- Both want UDP high-ports for NAT traversal — OS assigns ephemeral ports, typically fine
- Both manipulate iptables/routing tables — could interfere with default routes
- DNS resolution: Tailscale MagicDNS and Netbird DNS both manage `/etc/resolv.conf`, so whichever writes it last wins
**Recommended coexistence strategy:**
- Primary mesh: Tailscale (currently working, MagicDNS configured for `ai.home`)
- Secondary / evaluation: Netbird on a subset of nodes
- Use Netbird for specific access-control use cases (e.g., expose certain services via Netbird Proxy)
- Do NOT set Netbird as default route unless Tailscale is decommissioned
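
As a sanity check on the port lists above, the overlap between the two clients can be computed directly. This is a minimal sketch using only the ports named in this section; `overlapping_ports` is a hypothetical helper.

```python
# Ports as listed in this section, keyed by (protocol, port).
TAILSCALE = {
    ("udp", 41641): "WireGuard",
    ("tcp", 443): "DERP",
}
NETBIRD = {
    ("udp", 51820): "WireGuard",
    ("udp", 3478): "TURN",
    ("tcp", 443): "management/signal",
}


def overlapping_ports(a: dict, b: dict) -> list[tuple[str, int]]:
    """Return the (protocol, port) pairs that appear in both mappings."""
    return sorted(a.keys() & b.keys())


for proto, port in overlapping_ports(TAILSCALE, NETBIRD):
    key = (proto, port)
    print(f"{proto.upper()} {port}: Tailscale {TAILSCALE[key]} / Netbird {NETBIRD[key]}")
```

The only shared entry is TCP 443. On client nodes both uses are outbound connections to a remote endpoint, so no local bind conflict is expected; the overlap would matter only on a host that also listens on 443, such as the node running the Netbird control plane.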
---
## Netbird Proxy — Replacing Traefik?
**Commander question:** "Run alongside possibly replace Traefik as the reverse proxy"
**Answer:** Netbird Proxy is NOT a reverse proxy replacement for Traefik. It solves a **different problem**:
- **Traefik** (existing on MK7): Routes `*.ai.home` traffic *within* the LAN/WAN to Docker containers. It handles HTTP/HTTPS ingress for services like Portainer, PegaProx, Technitium, etc.
- **Netbird Proxy**: Exposes internal Netbird mesh services *to the public internet* via subdomain routing, secured by Netbird's access policies. Think of it as a Tailscale Funnel equivalent.
**Example:**
- `prometheus.internal.ai.home` is only reachable inside the LAN → Traefik routes to Prometheus
- `prometheus.netbird.ai.home` could be exposed to a remote user's laptop via Netbird Proxy with per-user ACLs
**Verdict:** Keep Traefik. Netbird Proxy complements it for selective external exposure rather than replacing it.
---
## Resource Requirements
### Quickstart (single container)
| Resource | Min | Recommended |
|----------|-----|-------------|
| CPU | 1 core | 2 cores |
| RAM | 2 GB | 4 GB |
| Disk | 10 GB | 20 GB |
| Network | Public IP + DNS | Same |
### Advanced (multi-container)
| Resource | Min | Recommended |
|----------|-----|-------------|
| CPU | 2 cores | 4 cores |
| RAM | 4 GB | 8 GB |
| Disk | 20 GB | 40 GB |
| Network | Same | Same |
**Iron Legion:** MK7 (18 cores, 15 GB RAM) exceeds both sets of requirements, and a Proxmox LXC provisioned with 2 cores and 4 GB RAM matches the Quickstart recommendation.
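
The sizing tables can be turned into a simple fit check. The sketch below encodes the table values and tests a candidate LXC against them; `Spec` and `meets` are hypothetical names, and the 20 GB disk for the LXC is an assumption, since only its CPU and RAM are stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    cpu_cores: int
    ram_gb: int
    disk_gb: int


# Values copied from the two tables above.
QUICKSTART_MIN = Spec(cpu_cores=1, ram_gb=2, disk_gb=10)
QUICKSTART_REC = Spec(cpu_cores=2, ram_gb=4, disk_gb=20)
ADVANCED_MIN = Spec(cpu_cores=2, ram_gb=4, disk_gb=20)
ADVANCED_REC = Spec(cpu_cores=4, ram_gb=8, disk_gb=40)


def meets(node: Spec, required: Spec) -> bool:
    """True if the node has at least the required CPU, RAM, and disk."""
    return (
        node.cpu_cores >= required.cpu_cores
        and node.ram_gb >= required.ram_gb
        and node.disk_gb >= required.disk_gb
    )


# Candidate LXC: 2 cores / 4 GB RAM as stated; 20 GB disk is assumed.
lxc = Spec(cpu_cores=2, ram_gb=4, disk_gb=20)
print("Quickstart recommended:", meets(lxc, QUICKSTART_REC))
print("Advanced recommended:", meets(lxc, ADVANCED_REC))
```

Under these assumptions the LXC satisfies the Quickstart recommendation but would need more CPU, RAM, and disk for the Advanced recommendation.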
---
## Deployment Effort Estimate
| Phase | Task | Time | Notes |
|-------|------|------|-------|
| P0 | Review this report | — | Commander decision point |
| P1 | Add DNS records to Technitium | 15 min | `netbird.ai.home` + wildcard |
| P2 | Deploy Netbird (Quickstart Option A or B) | 30 min | Run `getting-started.sh`, select option [1] or [0] |
| P3 | Create first admin user via `/setup` | 5 min | Web browser |
| P4 | Install Netbird client on test nodes | 20 min | 2-3 nodes for validation |
| P5 | Configure network routes + ACLs | 45 min | Mirror Tailscale access patterns |
| P6 | Evaluate coexistence vs. Tailscale replacement | Ongoing | 1-2 week trial period |
**Total hands-on time (if approved):** ~2 hours (+ evaluation period).
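
The total can be checked by summing the per-phase estimates from the table (P0 and P6 have no fixed duration and are left out):

```python
# Hands-on minutes per phase, copied from the table above.
PHASE_MINUTES = {"P1": 15, "P2": 30, "P3": 5, "P4": 20, "P5": 45}

total_minutes = sum(PHASE_MINUTES.values())
hours, minutes = divmod(total_minutes, 60)
print(f"{total_minutes} min = {hours} h {minutes} min")  # 115 min = 1 h 55 min
```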
---
## Known Issues / Gotchas
1. **ALPN / HTTP/2 requirement:** Netbird v0.29+ consolidated ports require HTTP/2 + ALPN on the reverse proxy. Traefik supports this natively. Nginx requires HTTP/2 to be enabled explicitly (the `http2` parameter on `listen`, or the separate `http2` directive in newer releases).
2. **Legacy clients:** Any Iron Legion device running an older Netbird client (< v0.29) needs the legacy ports (33073, 10000, 33080, UDP 49152-65535). All fleet devices should run the latest client.
3. **Coturn on cloud VMs:** Oracle Cloud and Hetzner require firewall rules for UDP 3478 at the provider level, not just on the VM itself. Not applicable on the LAN, but noted for future cloud expansion.
4. **First user setup:** The `/setup` page is **only accessible when zero users exist**. After first admin creation, it redirects to `/login`. To create additional admins, use Dashboard → Settings → Identity Providers or API with PAT.
5. **NTP dependency:** Authelia failed on MK7 due to unsynchronized clock (see MK7 restoration report). Netbird's management service also checks certificate validity — ensure NTP sync on the host.
6. **Wildcard DNS for Proxy:** If enabling Netbird Proxy, the wildcard CNAME is mandatory. Without it, exposed service subdomains won't resolve.
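
The NTP gotcha (item 5) comes down to a validity-window comparison against the host clock. The sketch below is purely illustrative and is not how Netbird performs the check: the issue time, the 90-day lifetime, and the one-hour drift are made-up values chosen to show why a slow clock rejects a freshly issued certificate.

```python
from datetime import datetime, timedelta, timezone


def cert_valid_at(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """Return True if `now` falls inside the certificate validity window."""
    return not_before <= now <= not_after


# Hypothetical certificate issued at 12:00 UTC with a 90-day lifetime.
issued = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

true_now = issued + timedelta(minutes=5)       # actual time, 5 min after issue
drifted_now = true_now - timedelta(hours=1)    # host clock running 1 h slow

print(cert_valid_at(issued, expires, true_now))     # True
print(cert_valid_at(issued, expires, drifted_now))  # False
```

From the drifted host's point of view the certificate is not yet valid, which is the kind of failure a clock sync before deployment avoids.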
---
## Recommendations
### Immediate (Pre-Deployment)
1. ✅ Commander reviews this report
2. ✅ Decide Option A (Swarm on MK7) vs. Option B (LXC on MK39)
3. ✅ If Option A: verify Traefik HTTP/2 ALPN is active
### Short-Term (If Approved)
1. Deploy Netbird Quickstart with embedded Dex
2. Add `netbird.ai.home` + wildcard to Technitium DNS
3. Install clients on 2-3 test nodes (Cinnamint, Artemis, MK42)
4. Mirror one Tailscale route in Netbird for comparison
### Long-Term (Evaluation After 2 Weeks)
1. Compare latency/connection reliability vs. Tailscale
2. Evaluate Netbird Proxy for selective external access
3. Decide: coexist, replace Tailscale, or decommission Netbird
4. If replacing: migrate MagicDNS zones to Netbird DNS, update all `.ai.home` client configs
---
## References
- Netbird Docs (Self-Hosted Quickstart): https://docs.netbird.io/selfhosted/selfhosted-quickstart
- Netbird Docs (Advanced Guide): https://docs.netbird.io/selfhosted/selfhosted-guide
- GitHub (infrastructure files): https://github.com/netbirdio/netbird/tree/v0.71.4/infrastructure_files
- Quickstart install script: `curl -fsSL https://github.com/netbirdio/netbird/releases/latest/download/getting-started.sh | bash`
- Reverse Proxy Configuration: https://docs.netbird.io/selfhosted/reverse-proxy
- Upgrade / Migration Guide: https://docs.netbird.io/selfhosted/maintenance
---
## Appendix: Netbird vs Tailscale Detailed Comparison
| Aspect | Tailscale | Netbird Self-Hosted |
|--------|-----------|---------------------|
| Control plane ownership | ❌ Tailscale Inc. | ✅ Fully owned |
| Relay ownership | ❌ Tailscale DERP | ✅ Self-hosted Coturn + Relay |
| Cost | Free tier limited; enterprise paid | Free; unlimited |
| Identity | External IdP or Tailscale | Embedded Dex or any OIDC |
| Web dashboard | ✅ | ✅ (self-hosted) |
| API | ✅ | ✅ (REST + gRPC) |
| SCIM provisioning | ❌ (manual) | ✅ (Enterprise) |
| Network segmentation / ACLs | Yes (JSON ACL) | Yes (groups + policies) |
| Exit nodes | ✅ | ✅ |
| Subnet routers | ✅ | ✅ |
| Browser client | ❌ | ✅ (WebRTC-based) |
| Mobile NAT busting | DERP | TURN + direct p2p |
---
*Report generated 2026-05-31 by F.R.I.D.A.Y. — awaiting Commander review.*