Add fleet operational reports
- mk7-service-restoration-report.md: Restored Swarm stacks after relocation, fixed NTP drift, rejoined MK-42 as worker
- netbird-evaluation-report.md: Full evaluation of self-hosted Netbird control plane for Tailscale coexistence/replacement

Author: F.R.I.D.A.Y.
149
reports/mk7-service-restoration-report.md
Normal file
@@ -0,0 +1,149 @@
# MK7 Service Restoration Report

**Date:** 2026-06-01
**Author:** F.R.I.D.A.Y.
**Status:** All services restored and online

---

## Problem

MK7 (Swarm manager, 192.168.7.7) had all of its Docker Swarm stacks stopped after a physical relocation. Only the `pegaprox` stack was still running, left over from a previous manual deployment. The primary services (Traefik, Technitium, Portainer, n8n, Homepage, Beszel, Dozzle, Authelia, Prometheus, node-exporter) were all offline.

---

## Root Causes

1. **Primary cause:** MK7 was physically relocated. The Docker Swarm services were intentionally stopped for the move and never restarted.
2. **Secondary cause (Authelia failure):** When the services were redeployed, Authelia crashed because of an NTP clock synchronization failure. `systemd-timesyncd` was pointing to the stale NTP server `192.168.128.33` (Shield PXE DHCP drift), so the system clock was not being synchronized and Authelia's startup NTP check failed.
3. **Network config drift:** `/etc/systemd/timesyncd.conf.d/` contained a cloud-init NTP config pointing to the wrong IP.
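
To make the third root cause easier to spot on other nodes, here is a minimal sketch that lists every `NTP=` or `FallbackNTP=` assignment found in a timesyncd drop-in directory. The function name `list_ntp_settings` is hypothetical and not part of the original tooling; the default directory is the one named in this report.

```bash
# List every NTP= / FallbackNTP= assignment under a timesyncd config directory.
# Usage: list_ntp_settings [DIR]   (DIR defaults to /etc/systemd/timesyncd.conf.d)
# NOTE: hypothetical helper, written for illustration only.
list_ntp_settings() {
    dir="${1:-/etc/systemd/timesyncd.conf.d}"
    # A missing directory simply means there are no drop-ins to report.
    [ -d "$dir" ] || return 0
    # -H prints the file name so the offending drop-in is easy to find.
    grep -HE '^[[:space:]]*(NTP|FallbackNTP)=' "$dir"/*.conf 2>/dev/null || true
}
```

Any line in the output that names a server unreachable from the node's current network is a candidate for the kind of drift described above.
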

---

## Actions Taken

### Phase 1: Service Redeployment

The compose files are located at `/opt/iron-legion/docker-swarm/`. Each stack was deployed individually from that directory:

```bash
# Deploy each stack (paths are relative to the compose directory)
cd /opt/iron-legion/docker-swarm
docker stack deploy -c traefik/compose.yml traefik
docker stack deploy -c portainer/compose.yml portainer
docker stack deploy -c technitium/compose.yml technitium
docker stack deploy -c homepage/compose.yml homepage
docker stack deploy -c n8n/n8n-stack.yml n8n
docker stack deploy -c beszel/compose.yml beszel
docker stack deploy -c dozzle/compose.yml dozzle
docker stack deploy -c authelia/compose.yml authelia
docker stack deploy -c prometheus/compose.yml prometheus
docker stack deploy -c node-exporter/compose.yml node-exporter
```

All stacks converged successfully.
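
One way to confirm convergence without reading the service list by eye is to compare running and desired replica counts. The sketch below assumes input lines of the form `NAME RUNNING/DESIRED`, which `docker service ls --format '{{.Name}} {{.Replicas}}'` produces; the function name `unconverged_services` is hypothetical.

```bash
# Read "NAME RUNNING/DESIRED" lines on stdin and print the services whose
# running replica count differs from the desired count.
# NOTE: hypothetical helper, written for illustration only.
unconverged_services() {
    awk '{
        split($2, r, "/")
        if (r[1] != r[2]) { print $1 " " $2 }
    }'
}

# Intended use on the manager:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | unconverged_services
```

No output means every listed service has reached its desired replica count.
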

### Phase 2: NTP / Authelia Fix

**Problem identified:** Authelia container logs showed:

```
error="the system clock is not synchronized accurately enough with the configured NTP server" provider=ntp
```

**Investigation:**

```bash
systemctl status systemd-timesyncd
# Status: "Connecting to time server 192.168.128.33:123"
```

**Fix applied:**

```bash
# Remove the stale cloud-init NTP config
sudo rm -f /etc/systemd/timesyncd.conf.d/*.conf

# Reset timesyncd to defaults (with no NTP= set, it uses the compiled-in fallback servers)
echo '[Time]' | sudo tee /etc/systemd/timesyncd.conf
sudo systemctl restart systemd-timesyncd

# Verify sync
timedatectl status | grep "System clock synchronized: yes"
```

**Result:** `timedatectl` reported `System clock synchronized: yes`, and Authelia restarted successfully.
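
Because `systemd-timesyncd` can take a moment to synchronize after a restart, a one-off `grep` may run too early. The following is a minimal sketch that polls `timedatectl show` until the clock is reported as synchronized; the function name `wait_for_ntp_sync` and the 60-second default are assumptions for illustration.

```bash
# Poll until systemd reports the clock as NTP-synchronized, or give up.
# Usage: wait_for_ntp_sync [TIMEOUT_SECONDS]   (default 60)
# NOTE: hypothetical helper, written for illustration only.
wait_for_ntp_sync() {
    timeout="${1:-60}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # NTPSynchronized is "yes" once the system clock is reported as synchronized.
        if [ "$(timedatectl show -p NTPSynchronized --value)" = "yes" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A possible use is `wait_for_ntp_sync 120 && docker service update --force authelia_authelia`, where the service name `authelia_authelia` is assumed from the usual `<stack>_<service>` naming and should be checked against `docker service ls`.
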

### Phase 3: MK-42 Worker Node Reintegration

**Discovery:** MK-42 (192.168.0.196) was online and had Docker installed, but Swarm was inactive.

**Action:**

```bash
# On MK-42
ssh jarvis@192.168.0.196
docker swarm leave --force  # Node was not in a swarm; run only to clear any stale state
docker swarm join --token SWMTKN-1-5po7nh34gige4jj7psqyc2pe8puf66yvpzvq3o4suy2kzqa5om-7tobwwhz2tvmo7wmg5yk7m5jd 192.168.7.7:2377
```

**Result:** MK-42 joined the Swarm as a worker node and is now available for workload scheduling.
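
For future rejoins, the join can be guarded by the node's current Swarm state, and the worker token can be read from the manager at run time so that it does not need to be hard-coded. This is a minimal sketch under those assumptions; the function name `join_swarm_if_inactive` is hypothetical.

```bash
# Join the Swarm only when this node is not already part of one.
# Usage: join_swarm_if_inactive TOKEN MANAGER_ADDR
# NOTE: hypothetical helper, written for illustration only.
join_swarm_if_inactive() {
    token="$1"
    manager="$2"
    # LocalNodeState is "inactive" when the node belongs to no swarm.
    state="$(docker info --format '{{.Swarm.LocalNodeState}}')"
    if [ "$state" = "inactive" ]; then
        docker swarm join --token "$token" "$manager"
    else
        echo "swarm state is '$state'; not joining" >&2
        return 1
    fi
}

# On the manager, the current worker token can be printed with:
#   docker swarm join-token -q worker
```

On MK-42 this would be called as `join_swarm_if_inactive "$TOKEN" 192.168.7.7:2377`, with `TOKEN` holding the value printed on the manager.
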

---

## Final Service Status

| Stack | Service | Status | Replicas | Notes |
|-------|---------|--------|----------|-------|
| traefik | traefik | ✅ Running | 1/1 | Global mode on manager, healthy |
| portainer | portainer | ✅ Running | 1/1 | Replicated on manager |
| technitium | technitium | ✅ Running | 1/1 | Ports 53/5380 exposed (host mode) |
| homepage | homepage | ✅ Running | 1/1 | Replicated on manager |
| n8n | postgres | ✅ Running | 1/1 | Healthy |
| n8n | pgadmin | ✅ Running | 1/1 | - |
| n8n | n8n | ✅ Running | 1/1 | Healthy |
| beszel | beszel-hub | ✅ Running | 1/1 | Port 8090 exposed |
| dozzle | dozzle | ✅ Running | 1/1 | Port 8081 exposed |
| authelia | authelia | ✅ Running | 1/1 | After NTP fix |
| prometheus | prometheus | ✅ Running | 1/1 | - |
| node-exporter | node-exporter | ✅ Running | 1/1 | Global mode |
| pegaprox | pegaprox | ✅ Running | 1/1 | Already running (unchanged) |

**Swarm nodes:**

| ID | Hostname | Status | Availability | Manager Status |
|----|----------|--------|--------------|----------------|
| x6xr2s6... | mark-vii.ai.home | Ready | Active | Leader |
| x46ce7y... | mk-42 | Ready | Active | - (Worker) |

---

## Health Checks Verified

```bash
curl -s http://localhost:8080/ping          # OK (Traefik)
curl -s http://localhost:9000/api/status    # {"Version":"2.39.2",...} (Portainer)
curl -s http://localhost:5380               # Technitium HTML (DNS UI)
curl -s http://localhost:8090               # Beszel HTML
curl -s http://localhost:5678/healthz       # OK (n8n)
curl -s http://localhost:8081/api/health    # OK (Dozzle)
```

All services responded on their expected ports.
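
The same probes can be run as one pass/fail loop, which is easier to repeat after the next relocation. This is a minimal sketch: the function name `check_endpoints` and the 5-second per-request limit are assumptions, while the URLs in the usage comment are the ones checked in this report.

```bash
# Read "NAME URL" pairs on stdin and print one ok/FAIL line per endpoint.
# Returns non-zero if any endpoint fails.
# NOTE: hypothetical helper, written for illustration only.
check_endpoints() {
    failed=0
    while read -r name url; do
        [ -n "$name" ] || continue
        # -f makes curl exit non-zero on HTTP errors (status >= 400);
        # --max-time bounds each probe so one dead service cannot stall the loop.
        if curl -fsS -o /dev/null --max-time 5 "$url"; then
            echo "ok   $name"
        else
            echo "FAIL $name"
            failed=1
        fi
    done
    return "$failed"
}

# Example invocation:
#   check_endpoints <<'EOF'
#   traefik    http://localhost:8080/ping
#   portainer  http://localhost:9000/api/status
#   technitium http://localhost:5380
#   beszel     http://localhost:8090
#   n8n        http://localhost:5678/healthz
#   dozzle     http://localhost:8081/api/health
#   EOF
```

Unlike the plain `curl -s` calls, this only checks that each endpoint answers without an HTTP error; it does not inspect the response body.
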

---

## File Changes on MK7

| File | Action | Reason |
|------|--------|--------|
| `/etc/systemd/timesyncd.conf.d/*.conf` | Deleted | Stale cloud-init NTP config pointing to wrong IP |
| `/etc/systemd/timesyncd.conf` | Reset to `[Time]` only | Restore default NTP behavior |
| `/opt/iron-legion/docker-swarm/deploy.sh` | Modified | Removed reference to missing `adguard` stack (not deployed) |
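
The contents of `deploy.sh` are not shown in this report. As an illustration only, a deploy loop can tolerate a missing stack instead of failing on it. The sketch below assumes each stack lives at `<base>/<stack>/compose.yml`, which matches most of the stacks above but not `n8n` (which uses `n8n-stack.yml`); the function name `deploy_stacks` is hypothetical.

```bash
# Deploy each named stack found under BASE_DIR, skipping any without a compose file.
# Usage: deploy_stacks BASE_DIR STACK...
# NOTE: hypothetical helper; this is not the actual deploy.sh.
deploy_stacks() {
    base="$1"
    shift
    for stack in "$@"; do
        file="$base/$stack/compose.yml"
        if [ -f "$file" ]; then
            docker stack deploy -c "$file" "$stack"
        else
            # A stack that has been retired is reported, not treated as fatal.
            echo "skipping $stack: $file not found" >&2
        fi
    done
}
```

For example, `deploy_stacks /opt/iron-legion/docker-swarm traefik portainer adguard` would deploy the first two stacks and print a skip message for `adguard`.
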

---

## Notes for Future Operations

1. **NTP drift on relocated nodes:** Always verify `timedatectl status` after moving hardware. Cloud-init may inject stale NTP configs.
2. **AdGuard removed:** `deploy.sh` previously referenced an `adguard` stack that no longer exists (AdGuard was removed in favor of Technitium's built-in blocking). The reference has been removed from the script.
3. **MK-42 as Swarm worker:** MK-42 is now available for container scheduling but has not been labeled for specific workloads. If you want PVE services on it, consider deploying a VM first; otherwise, use it as a bare Swarm worker.
4. **No Tailscale on MK-42:** As requested, MK-42 joins via its LAN IP only. No Tailscale client is installed.
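
If MK-42 is later dedicated to specific workloads (note 3), node labels are the usual Swarm mechanism for that. This is a minimal sketch; the function name `label_node` and the label `role=worker-general` are made up for illustration and are not part of the current setup.

```bash
# Attach a placement label to a Swarm node (run on a manager).
# Usage: label_node NODE KEY VALUE
# NOTE: hypothetical helper and label names, written for illustration only.
label_node() {
    docker node update --label-add "$2=$3" "$1"
}

# Example:
#   label_node mk-42 role worker-general
# A service could then be pinned to labeled nodes with a constraint such as:
#   docker service update --constraint-add 'node.labels.role == worker-general' SERVICE
```
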

---

*Last updated: 2026-06-01 by F.R.I.D.A.Y.*