F.R.I.D.A.Y. 3da2689e4d Add fleet operational reports

- mk7-service-restoration-report.md: Restored Swarm stacks after relocation, fixed NTP drift, rejoined MK-42 as worker
- netbird-evaluation-report.md: Full evaluation of self-hosted Netbird control plane for tailscale coexistence/replacement

Author: F.R.I.D.A.Y.

2026-06-01 07:45:13 -04:00

MK7 Service Restoration Report

Date: 2026-06-01
Author: F.R.I.D.A.Y.
Status: All services restored online

Problem

After its physical relocation, MK7 (Swarm manager, 192.168.7.7) had every Docker Swarm stack stopped except pegaprox, which was still running from a previous manual deployment. The primary services (Traefik, Technitium, Portainer, n8n, Homepage, Beszel, Dozzle, Authelia, Prometheus, node-exporter) were all offline.

Root Causes

Primary cause: MK7 was physically relocated. Its Docker Swarm services were intentionally stopped during the migration and never restarted.
Secondary cause (Authelia failure): When the services were redeployed, Authelia crashed because of an NTP clock synchronization failure. systemd-timesyncd was pointing to a stale NTP server, 192.168.128.33 (Shield PXE DHCP drift), causing certificate validity checks to fail.
Network config drift: /etc/systemd/timesyncd.conf.d/ contained a cloud-init NTP config pointing to the wrong IP.

Actions Taken

Phase 1: Service Redeployment

Located compose files at /opt/iron-legion/docker-swarm/ and individually deployed all stacks:

# Deployed stacks
docker stack deploy -c traefik/compose.yml traefik
docker stack deploy -c portainer/compose.yml portainer
docker stack deploy -c technitium/compose.yml technitium
docker stack deploy -c homepage/compose.yml homepage
docker stack deploy -c n8n/n8n-stack.yml n8n
docker stack deploy -c beszel/compose.yml beszel
docker stack deploy -c dozzle/compose.yml dozzle
docker stack deploy -c authelia/compose.yml authelia
docker stack deploy -c prometheus/compose.yml prometheus
docker stack deploy -c node-exporter/compose.yml node-exporter

All stacks converged successfully.
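
The ten deploy commands above could also be driven by a loop. The following is a minimal sketch, not the contents of the existing deploy.sh: the function names `compose_file_for` and `deploy_all` and the `APPLY` switch are hypothetical, and it assumes absolute paths under /opt/iron-legion/docker-swarm. It prints the commands by default and only runs them when `APPLY=1` is set.

```shell
#!/bin/sh
# Sketch: deploy every stack named in the report in one pass.
# Dry run by default (prints the commands); set APPLY=1 to execute them.
set -eu

STACK_ROOT="/opt/iron-legion/docker-swarm"  # compose file location from the report
STACKS="traefik portainer technitium homepage n8n beszel dozzle authelia prometheus node-exporter"

# Map a stack name to its compose file, relative to STACK_ROOT.
# n8n is the only stack above whose file is not named compose.yml.
compose_file_for() {
  case "$1" in
    n8n) echo "n8n/n8n-stack.yml" ;;
    *)   echo "$1/compose.yml" ;;
  esac
}

# Deploy (or, in a dry run, print the deploy command for) each stack in order.
deploy_all() {
  for stack in $STACKS; do
    file="$STACK_ROOT/$(compose_file_for "$stack")"
    if [ "${APPLY:-0}" = "1" ]; then
      docker stack deploy -c "$file" "$stack"
    else
      echo "docker stack deploy -c $file $stack"
    fi
  done
}

deploy_all
```

Keeping the stack list in one variable makes the deployment order explicit and leaves a single place to edit when a stack is added or retired.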

Phase 2: NTP / Authelia Fix

Problem identified: Authelia container logs showed:

error="the system clock is not synchronized accurately enough with the configured NTP server" provider=ntp

Investigation:

systemctl status systemd-timesyncd
# Status: "Connecting to time server 192.168.128.33:123"
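
To trace where a stale server address comes from, it can help to search the timesyncd drop-in directory for it. The sketch below is one way to do that; the helper name `find_ntp_refs` is hypothetical, and the directory and address are the ones named in this report.

```shell
#!/bin/sh
# Sketch: list the config files that mention a given NTP server address.
set -eu

# $1 = directory to search, $2 = literal address to look for.
# Prints matching file paths; prints nothing if the directory is missing
# or no file matches.
find_ntp_refs() {
  grep -rlF -- "$2" "$1" 2>/dev/null || true
}

# Address and directory are the ones named in the report.
find_ntp_refs /etc/systemd/timesyncd.conf.d 192.168.128.33
```

The `-F` flag treats the address as a literal string, so the dots are not interpreted as regex wildcards.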

Fix applied:

# Removed stale cloud-init NTP config
sudo rm -f /etc/systemd/timesyncd.conf.d/*.conf

# Reset timesyncd to its defaults (built-in fallback NTP servers; the exact pool depends on the distribution)
echo '[Time]' | sudo tee /etc/systemd/timesyncd.conf
sudo systemctl restart systemd-timesyncd

# Verified sync
timedatectl status | grep "System clock synchronized: yes"

Result: timedatectl reported "System clock synchronized: yes", and Authelia restarted successfully.
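
For scripted checks, the synchronization state can be read as a single property rather than grepped from human-readable output. This is a minimal sketch assuming a systemd version whose `timedatectl` supports the `show` subcommand; the helper names `is_synced` and `sync_state` are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether systemd considers the clock NTP-synchronized.
set -eu

# Succeeds only when the value passed in is exactly "yes".
is_synced() {
  [ "$1" = "yes" ]
}

# NTPSynchronized is a property exposed by `timedatectl show`.
# Falls back to "unknown" where timedatectl is missing or fails.
sync_state() {
  timedatectl show --property=NTPSynchronized --value 2>/dev/null || echo "unknown"
}

state="$(sync_state)"
if is_synced "$state"; then
  echo "clock synchronized"
else
  echo "clock NOT synchronized (state: $state)" >&2
fi
```

A check like this could run before the Authelia stack is deployed, so that a clock problem surfaces ahead of the container crash.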

Phase 3: MK-42 Worker Node Reintegration

Discovery: MK-42 (192.168.0.196) was online and had Docker installed but Swarm was inactive.

Action:

# On MK-42
ssh jarvis@192.168.0.196
docker swarm leave --force  # Clears any stale Swarm state; returns a harmless error if the node is not in a swarm
docker swarm join --token SWMTKN-1-5po7nh34gige4jj7psqyc2pe8puf66yvpzvq3o4suy2kzqa5om-7tobwwhz2tvmo7wmg5yk7m5jd 192.168.7.7:2377

Result: MK-42 joined the Swarm as a worker node and is now available for workload scheduling.
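
The join can be confirmed from the manager by checking the node list. The sketch below assumes it runs on MK7; `node_is_ready` is a hypothetical helper that parses the `--format` output of `docker node ls`.

```shell
#!/bin/sh
# Sketch: confirm that a node is Ready and Active in `docker node ls` output.
set -eu

# Reads "hostname status availability" lines on stdin.
# Succeeds only if the host named in $1 is listed as Ready and Active.
node_is_ready() {
  awk -v host="$1" '
    $1 == host && $2 == "Ready" && $3 == "Active" { found = 1 }
    END { exit !found }
  '
}

# Must run on a manager; the hostname mk-42 is taken from the report.
if command -v docker >/dev/null 2>&1; then
  if docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' 2>/dev/null \
      | node_is_ready mk-42; then
    echo "mk-42 is Ready and Active"
  else
    echo "mk-42 is not Ready/Active (or this host is not a Swarm manager)" >&2
  fi
fi
```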

Final Service Status

Stack	Service	Status	Replicas	Notes
traefik	traefik	✅ Running	1/1	Global mode on manager, healthy
portainer	portainer	✅ Running	1/1	Replicated on manager
technitium	technitium	✅ Running	1/1	Ports 53/5380 exposed (host mode)
homepage	homepage	✅ Running	1/1	Replicated on manager
n8n	postgres	✅ Running	1/1	Healthy
n8n	pgadmin	✅ Running	1/1	—
n8n	n8n	✅ Running	1/1	Healthy
beszel	beszel-hub	✅ Running	1/1	Port 8090 exposed
dozzle	dozzle	✅ Running	1/1	Port 8081 exposed
authelia	authelia	✅ Running	1/1	After NTP fix
prometheus	prometheus	✅ Running	1/1	—
node-exporter	node-exporter	✅ Running	1/1	Global mode
pegaprox	pegaprox	✅ Running	1/1	Already running (unchanged)

Swarm nodes:

ID	Hostname	Status	Availability	Manager
x6xr2s6...	mark-vii.ai.home	Ready	Active	Leader
x46ce7y...	mk-42	Ready	Active	— (Worker)

Health Checks Verified

❯ curl -s http://localhost:8080/ping        → OK (Traefik)
❯ curl -s http://localhost:9000/api/status  → {"Version":"2.39.2",...} (Portainer)
❯ curl -s http://localhost:5380             → Technitium HTML (DNS UI)
❯ curl -s http://localhost:8090             → Beszel HTML
❯ curl -s http://localhost:5678/healthz     → OK (n8n)
❯ curl -s http://localhost:8081/api/health  → OK (Dozzle)

All services responding on expected ports.
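
The manual checks above could be repeated with a small loop. This is a sketch only: the ports and paths are copied from the list above, the helper names `endpoints` and `probe` are hypothetical, and a probe counts as passing on any non-error HTTP response rather than on a specific body.

```shell
#!/bin/sh
# Sketch: probe the health endpoints listed above and print OK/FAIL per service.
set -eu

# One "name url" pair per line; ports and paths are copied from the report.
endpoints() {
  cat <<'EOF'
traefik http://localhost:8080/ping
portainer http://localhost:9000/api/status
technitium http://localhost:5380
beszel http://localhost:8090
n8n http://localhost:5678/healthz
dozzle http://localhost:8081/api/health
EOF
}

# -f makes curl fail on HTTP errors (4xx/5xx); --max-time bounds each probe.
probe() {
  curl -fsS --max-time 3 -o /dev/null "$1" 2>/dev/null
}

if command -v curl >/dev/null 2>&1; then
  endpoints | while read -r name url; do
    if probe "$url"; then echo "OK   $name"; else echo "FAIL $name"; fi
  done
fi
```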

File Changes on MK7

File	Action	Reason
`/etc/systemd/timesyncd.conf.d/*.conf`	Deleted	Stale cloud-init NTP config pointing to wrong IP
`/etc/systemd/timesyncd.conf`	Reset to `[Time]` only	Restore default NTP behavior
`/opt/iron-legion/docker-swarm/deploy.sh`	Modified	Removed reference to missing `adguard` stack (not deployed)
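
One way a deploy script can tolerate a retired stack such as adguard is to filter its stack list against the directories that exist. The sketch below illustrates that idea under stated assumptions; it is not the actual change made to deploy.sh, whose contents are not shown in this report, and `existing_stacks` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: keep only the stacks whose directory exists under a root path.
set -eu

# $1 = root directory; remaining arguments = stack names.
# Prints the names that have a directory; warns on stderr about the rest.
existing_stacks() {
  root="$1"
  shift
  for stack in "$@"; do
    if [ -d "$root/$stack" ]; then
      echo "$stack"
    else
      echo "skipping $stack: no directory under $root" >&2
    fi
  done
}

# Root path is from the report; on a host without it every name is skipped.
existing_stacks /opt/iron-legion/docker-swarm traefik adguard technitium
```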

Notes for Future Operations

NTP drift on relocated nodes: Always verify timedatectl status after moving hardware. Cloud-init may inject stale NTP configs.
AdGuard removed: deploy.sh previously referenced an adguard stack that no longer exists (AdGuard was removed in favor of Technitium's built-in blocking). The script was updated to skip it.
MK-42 as Swarm worker: MK-42 is now available for container scheduling but has not been labeled for specific workloads. If you want PVE services on it, consider deploying a VM first or using it as a bare Swarm worker.
No Tailscale on MK-42: As requested, MK-42 joins via its LAN IP only. No Tailscale client is installed.
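
If MK-42 is later dedicated to particular workloads, a node label is the usual way to steer scheduling. The following is a minimal sketch; the label `tier=general` and the `label_node` helper are hypothetical and do not appear in the current setup. It prints the command by default and runs it only when `APPLY=1` is set on a manager.

```shell
#!/bin/sh
# Sketch: attach a scheduling label to a Swarm node.
# Dry run by default (prints the command); set APPLY=1 to execute it on a manager.
set -eu

# $1 = node hostname, $2 = label key, $3 = label value.
label_node() {
  if [ "${APPLY:-0}" = "1" ]; then
    docker node update --label-add "$2=$3" "$1"
  else
    echo "docker node update --label-add $2=$3 $1"
  fi
}

# "tier=general" is a hypothetical label, not one used in the report.
label_node mk-42 tier general
```

A service can then be pinned to labeled nodes with a placement constraint such as `node.labels.tier == general` under `deploy.placement.constraints` in its compose file.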

Last updated: 2026-06-01 by F.R.I.D.A.Y.
