Files
documentation/reports/mk7-service-restoration-report.md
F.R.I.D.A.Y. 3da2689e4d Add fleet operational reports
- mk7-service-restoration-report.md: Restored Swarm stacks after relocation, fixed NTP drift, rejoined MK-42 as worker
- netbird-evaluation-report.md: Full evaluation of self-hosted Netbird control plane for tailscale coexistence/replacement

Author: F.R.I.D.A.Y.
2026-06-01 07:45:13 -04:00

150 lines
5.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# MK7 Service Restoration Report
**Date:** 2026-06-01
**Author:** F.R.I.D.A.Y.
**Status:** All services restored online
---
## Problem
MK7 (Swarm Manager, 192.168.7.7) had all Docker Swarm stacks stopped after physical relocation. Only `pegaprox` stack remained running from a previous manual deployment. Primary services (Traefik, Technitium, Portainer, n8n, Homepage, Beszel, Dozzle, Authelia, Prometheus, node-exporter) were all offline.
---
## Root Causes
1. **Primary cause:** MK7 was physically relocated, Docker Swarm services were intentionally stopped during migration and never restarted.
2. **Secondary cause (Authelia failure):** When services were redeployed, Authelia crashed due to NTP clock synchronization failure. `systemd-timesyncd` was pointing to stale NTP server `192.168.128.33` (Shield PXE DHCP drift), causing certificate validity checks to fail.
3. **Network config drift:** `/etc/systemd/timesyncd.conf.d/` contained a cloud-init NTP config pointing to the wrong IP.
---
## Actions Taken
### Phase 1: Service Redeployment
Located compose files at `/opt/iron-legion/docker-swarm/` and individually deployed all stacks:
```bash
# Deployed stacks
docker stack deploy -c traefik/compose.yml traefik
docker stack deploy -c portainer/compose.yml portainer
docker stack deploy -c technitium/compose.yml technitium
docker stack deploy -c homepage/compose.yml homepage
docker stack deploy -c n8n/n8n-stack.yml n8n
docker stack deploy -c beszel/compose.yml beszel
docker stack deploy -c dozzle/compose.yml dozzle
docker stack deploy -c authelia/compose.yml authelia
docker stack deploy -c prometheus/compose.yml prometheus
docker stack deploy -c node-exporter/compose.yml node-exporter
```
All stacks converged successfully.
### Phase 2: NTP / Authelia Fix
**Problem identified:** Authelia container logs showed:
```
error="the system clock is not synchronized accurately enough with the configured NTP server" provider=ntp
```
**Investigation:**
```bash
systemctl status systemd-timesyncd
# Status: "Connecting to time server 192.168.128.33:123"
```
**Fix applied:**
```bash
# Removed stale cloud-init NTP config
rm -f /etc/systemd/timesyncd.conf.d/*.conf
# Reset timesyncd to default (uses pool.ntp.org fallbacks)
echo '[Time]' | sudo tee /etc/systemd/timesyncd.conf
sudo systemctl restart systemd-timesyncd
# Verified sync
timedatectl status | grep "System clock synchronized: yes"
```
**Result:** `System clock synchronized: yes` — Authelia restarted successfully.
### Phase 3: MK-42 Worker Node Reintegration
**Discovery:** MK-42 (192.168.0.196) was online and had Docker installed but Swarm was inactive.
**Action:**
```bash
# On MK-42
ssh jarvis@192.168.0.196
docker swarm leave --force # Not in swarm, just confirming
docker swarm join --token SWMTKN-1-5po7nh34gige4jj7psqyc2pe8puf66yvpzvq3o4suy2kzqa5om-7tobwwhz2tvmo7wmg5yk7m5jd 192.168.7.7:2377
```
**Result:** MK-42 joined Swarm as a worker node. Now available for workload scheduling.
---
## Final Service Status
| Stack | Service | Status | Replicas | Notes |
|-------|---------|--------|----------|-------|
| traefik | traefik | ✅ Running | 1/1 | Global mode on manager, healthy |
| portainer | portainer | ✅ Running | 1/1 | Replicated on manager |
| technitium | technitium | ✅ Running | 1/1 | Ports 53/5380 exposed (host mode) |
| homepage | homepage | ✅ Running | 1/1 | Replicated on manager |
| n8n | postgres | ✅ Running | 1/1 | Healthy |
| n8n | pgadmin | ✅ Running | 1/1 | — |
| n8n | n8n | ✅ Running | 1/1 | Healthy |
| beszel | beszel-hub | ✅ Running | 1/1 | Port 8090 exposed |
| dozzle | dozzle | ✅ Running | 1/1 | Port 8081 exposed |
| authelia | authelia | ✅ Running | 1/1 | After NTP fix |
| prometheus | prometheus | ✅ Running | 1/1 | — |
| node-exporter | node-exporter | ✅ Running | 1/1 | Global mode |
| pegaprox | pegaprox | ✅ Running | 1/1 | Already running (unchanged) |
**Swarm nodes:**
| ID | Hostname | Status | Availability | Manager |
|----|----------|--------|--------------|---------|
| x6xr2s6... | mark-vii.ai.home | Ready | Active | Leader |
| x46ce7y... | mk-42 | Ready | Active | — (Worker) |
---
## Health Checks Verified
```bash
curl -s http://localhost:8080/ping → OK (Traefik)
curl -s http://localhost:9000/api/status → {"Version":"2.39.2",...} (Portainer)
curl -s http://localhost:5380 → Technitium HTML (DNS UI)
curl -s http://localhost:8090 → Beszel HTML
curl -s http://localhost:5678/healthz → OK (n8n)
curl -s http://localhost:8081/api/health → OK (Dozzle)
```
All services responding on expected ports.
---
## File Changes on MK7
| File | Action | Reason |
|------|--------|--------|
| `/etc/systemd/timesyncd.conf.d/*.conf` | Deleted | Stale cloud-init NTP config pointing to wrong IP |
| `/etc/systemd/timesyncd.conf` | Reset to `[Time]` only | Restore default NTP behavior |
| `/opt/iron-legion/docker-swarm/deploy.sh` | Modified | Removed reference to missing `adguard` stack (not deployed) |
---
## Notes for Future Operations
1. **NTP drift on relocated nodes:** Always verify `timedatectl status` after moving hardware. Cloud-init may inject stale NTP configs.
2. **AdGuard removed:** The `deploy.sh` previously referenced an `adguard` stack that no longer exists (AdGuard was removed in favor of Technitium's built-in blocking). The script was updated to skip it.
3. **MK-42 as Swarm worker:** MK-42 is now available for container scheduling but has not been labeled for specific workloads. If you want PVE services on it, consider deploying a VM first or using it as a bare Swarm worker.
4. **No Tailscale on MK-42:** As requested, MK-42 joins via LAN IP only. No Tailscale client installed.
---
*Last updated: 2026-06-01 by F.R.I.D.A.Y.*