F.R.I.D.A.Y. 3dd46ca963 PVE cluster formation: MK33/MK34/MK39 as pve-swarm. NFS active. HA groups configured. N150 corrected.

2026-06-04 20:59:11 -04:00

PVE 3-Node HA Cluster for Iron Legion

Status: Draft | Author: Artemis | Date: 2026-06-04

1. Objective

Configure MK33, MK34, and MK39 as a Proxmox VE 3-node cluster with shared NFS storage from TrueNAS. Enable manual migration of guests between nodes (live migration for VMs, restart-mode migration for LXCs), and optionally automatic HA failover for critical workloads.

2. Current State

Node	CPU	RAM	Storage	Role
MK33 (Silver Centurion)	Intel N150 4c/4t	16GB	Local SSD	PVE HA
MK34 (Southpaw)	Intel N150 4c/4t	16GB	Local SSD	PVE HA
MK39 (Gemini)	Intel N150 4c/4t	16GB	Local SSD	PVE HA (spare)
TrueNAS SCALE	4c	11GB	HDD pool	NFS server

All nodes on 192.168.0.0/18. TrueNAS at 192.168.16.254.
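
As a quick sanity check on the addressing above, the sketch below tests whether each address used in this document falls inside 192.168.0.0/18. The function names `ip_to_int` and `in_subnet` are illustrative helpers, not part of any Proxmox tooling.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    (
        IFS=.
        # Unquoted on purpose: split the address on dots.
        set -- $1
        echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
    )
}

# in_subnet ADDRESS NETWORK PREFIX_LEN: succeeds if ADDRESS is in NETWORK/PREFIX_LEN.
in_subnet() {
    _ip=$(ip_to_int "$1")
    _net=$(ip_to_int "$2")
    _mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( _ip & _mask )) -eq $(( _net & _mask )) ]
}

# Addresses taken from this document: three PVE nodes and TrueNAS.
for addr in 192.168.7.33 192.168.7.34 192.168.7.39 192.168.16.254; do
    if in_subnet "$addr" 192.168.0.0 18; then
        echo "$addr is inside 192.168.0.0/18"
    else
        echo "$addr is OUTSIDE 192.168.0.0/18"
    fi
done
```

A /18 spans 192.168.0.0 through 192.168.63.255, so the node addresses and the TrueNAS address all sit in the same subnet.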

3. Architecture

3.1 Cluster Model: Proxmox 3-Node Cluster (No Ceph)

MK33 (192.168.7.33) ──┐
                       ├─ Corosync Ring ── Shared NFS (TrueNAS)
MK34 (192.168.7.34) ──┤
                       │
MK39 (192.168.7.39) ──┘

Quorum: 3-node cluster = 2 votes needed for quorum. If one node dies, remaining 2 form quorum.
Shared storage: TrueNAS NFSv4.2 export /mnt/Ice/Backup
HA manager: Proxmox HA services (pve-ha-crm, pve-ha-lrm) for automatic restart
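
The quorum rule above is plain majority arithmetic: with one vote per node, quorum is floor(n/2) + 1. A minimal sketch, assuming every node carries exactly one vote (the helper names are illustrative):

```shell
#!/bin/sh
# Votes needed for a majority when each of N nodes has one vote.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while the rest still hold quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3; do
    echo "$n nodes: quorum=$(quorum_votes "$n"), tolerated failures=$(tolerated_failures "$n")"
done
```

For 3 nodes this gives a quorum of 2 with one tolerated failure; a plain 2-node cluster tolerates none, which is why it would need an extra witness vote.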

3.2 Storage Flow

Build on local disk → Test workload → Shutdown → Move disk to NFS → Restart on NFS
                                          ↓
If node fails: HA manager detects → Restarts VM/LXC on surviving node (same NFS disk)

3.3 Workload Planning

Type	Count per node	Resources each
VM (general)	1	4 vCPU, 4096 MB RAM
LXC (lightweight)	5–10	1 vCPU, 512–1024 MB RAM

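Total vCPUs per node (estimated): 9–14 (1 VM × 4 vCPU + 5–10 LXCs × 1 vCPU). The N150 is only 4c/4t, so this is overcommitted; LXCs share cores opportunistically via cgroups.
Total RAM per node: VM 4GB + 5×1GB LXCs = ~9GB allocated, leaving ~7GB headroom out of 16GB. At the upper bound of 10×1GB LXCs, allocation rises to ~14GB and headroom drops to ~2GB.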
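
The per-node budget can be recomputed for any LXC count with a short sketch like the one below. It uses only the figures from the table (16GB per node, one 4 vCPU / 4096 MB VM); it does not subtract memory used by the PVE host itself, which is an additional cost not quantified in this document.

```shell
#!/bin/sh
NODE_RAM_MB=16384   # 16GB per node, from the Current State table
VM_RAM_MB=4096      # one general-purpose VM per node
VM_VCPUS=4

# allocated_mb LXC_COUNT LXC_RAM_MB
allocated_mb() {
    echo $(( VM_RAM_MB + $1 * $2 ))
}

# headroom_mb LXC_COUNT LXC_RAM_MB
headroom_mb() {
    echo $(( NODE_RAM_MB - $(allocated_mb "$1" "$2") ))
}

# total_vcpus LXC_COUNT (each LXC is planned with 1 vCPU)
total_vcpus() {
    echo $(( VM_VCPUS + $1 ))
}

for scenario in "5 1024" "10 512" "10 1024"; do
    # Unquoted on purpose: split "count ram" into $1 and $2.
    set -- $scenario
    echo "$1 LXCs x $2 MB: $(total_vcpus "$1") vCPUs, $(allocated_mb "$1" "$2") MB allocated, $(headroom_mb "$1" "$2") MB headroom"
done
```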

4. Pros vs Cons

4.1 3-Node Cluster (Recommended)

Pros:

Unified web UI for all 3 nodes from any one node
Live migration of VMs between nodes (near-zero downtime); LXCs migrate in restart mode with a brief stop/start
Automatic HA failover for critical VMs/LXCs
Quorum maintained with 2 of 3 nodes online
Shared NFS storage means VMs are portable across nodes

Cons:

Corosync ring traffic adds minor network overhead
If 2 nodes fail simultaneously, quorum is lost and the cluster stops
HA failover is restart (brief downtime), not live migration
The N150 CPU is modest: 3 VMs + 15 LXCs across the cluster is tight but workable

4.2 Standalone Nodes (Current)

Pros:

Simple, no cluster complexity
Node failure doesn't affect others
No Corosync network overhead

Cons:

No live migration — moving a VM requires export/import
No automatic failover — manual intervention if node dies
3 separate web UIs to manage

5. Implementation Plan

Phase 1: Cluster Formation

Add all 3 nodes to /etc/hosts on each node (or DNS via Technitium)
On MK33: pvecm create iron-legion
On MK34/MK39: pvecm add 192.168.7.33
Verify: pvecm status shows 3 nodes, quorum 2/3
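
The Phase 1 steps can be collected into one small script, sketched below. It defaults to a dry run that only prints the commands. The short hostnames `mk33`, `mk34`, `mk39` are assumed from the node names used in the HA group definition in Phase 3, and the `ROLE` and `DRY_RUN` variables are illustrative conventions, not Proxmox settings.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 on a real PVE node to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Lines to append to /etc/hosts on every node (hostnames are assumed).
hosts_entries() {
    cat <<'EOF'
192.168.7.33 mk33
192.168.7.34 mk34
192.168.7.39 mk39
EOF
}

# form_cluster ROLE: "first" on MK33, "join" on MK34/MK39, "verify" anywhere.
form_cluster() {
    case "$1" in
        first)  run pvecm create iron-legion ;;
        join)   run pvecm add 192.168.7.33 ;;
        verify) run pvecm status ;;
        *)      echo "usage: form_cluster first|join|verify" >&2; return 1 ;;
    esac
}

hosts_entries
form_cluster "${ROLE:-verify}"
```

Running it with `ROLE=first` on MK33 and `ROLE=join` on the other two nodes mirrors the order of the steps above; `pvecm add` prompts for the root password of the existing node when executed for real.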

Phase 2: NFS Storage Setup

Ensure TrueNAS exports /mnt/Ice/Backup with:
- NFSv4.2
- maproot or mapall to root (PVE nodes need root access)
- ACL allows 192.168.0.0/18
On PVE Datacenter → Storage → Add → NFS:
- ID: truenas-backup
- Server: 192.168.16.254
- Export: /mnt/Ice/Backup
- Content: images,rootdir
Verify storage shows on all 3 nodes
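
The same storage definition can also be created from the command line with `pvesm` instead of the web UI. The sketch below is a dry run by default and uses only the values listed above; `vers=4.2` is passed as an NFS mount option on the assumption that the TrueNAS export really offers NFSv4.2.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 on a real PVE node to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

STORAGE_ID=truenas-backup
NFS_SERVER=192.168.16.254
NFS_EXPORT=/mnt/Ice/Backup

# Storage definitions are cluster-wide, so this runs once on any node.
add_nfs_storage() {
    run pvesm add nfs "$STORAGE_ID" \
        --server "$NFS_SERVER" \
        --export "$NFS_EXPORT" \
        --content images,rootdir \
        --options vers=4.2
}

# Run on each node to confirm the storage is active there.
verify_nfs_storage() {
    run pvesm status --storage "$STORAGE_ID"
}

add_nfs_storage
verify_nfs_storage
```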

Phase 3: HA Configuration

Proxmox HA → Add groups:
- critical: nodes mk33,mk34,mk39 (any node)
- local-only: single-node constraint for local-disk VMs
For each VM/LXC on NFS storage:
- Datacenter → HA → Add → Select VM → Group critical → Start on any
Start fencing daemon if IPMI/watchdog available (optional for N150)
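
A command-line equivalent of the `critical` group and a single HA resource might look like the sketch below, assuming a PVE release where HA groups are available through `ha-manager`. The guest ID `100` is a hypothetical placeholder, and the `local-only` group is left out because the plan does not yet name its node.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 on a real PVE node to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Group that allows a guest to run on any of the three nodes.
create_critical_group() {
    run ha-manager groupadd critical --nodes mk33,mk34,mk39
}

# protect_guest SID: SID is "vm:<id>" for a VM or "ct:<id>" for an LXC.
protect_guest() {
    run ha-manager add "$1" --group critical --state started
}

create_critical_group
protect_guest "vm:100"   # hypothetical VMID, replace with a real guest on NFS
run ha-manager status
```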

Phase 4: Workload Migration Testing

Build a test LXC on local storage
Migrate disk to NFS: Move disk → target truenas-backup
Verify LXC starts from NFS
Test migration: right-click → Migrate → select target node (LXCs migrate in restart mode; live migration applies to VMs)
Test HA failover: power off source node, verify restart on surviving node
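
For the test LXC, the disk move and migration steps could be scripted roughly as follows. This is a dry-run sketch under stated assumptions: the container ID `200` and the target node `mk34` are hypothetical, the root volume is assumed to be named `rootfs`, and the container is shut down before its volume is moved, matching the storage flow in section 3.2.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 on a real PVE node to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

CTID="${CTID:-200}"        # hypothetical test container ID
TARGET="${TARGET:-mk34}"   # hypothetical target node

# Shut down, move the root volume to NFS, then start from NFS.
move_ct_to_nfs() {
    run pct shutdown "$CTID"
    run pct move-volume "$CTID" rootfs truenas-backup
    run pct start "$CTID"
}

# Containers migrate in restart mode: stop, relocate, start on the target.
migrate_ct() {
    run pct migrate "$CTID" "$TARGET" --restart
}

move_ct_to_nfs
migrate_ct
```

The HA failover test itself (powering off the source node) is deliberately not scripted here, since it is a physical step.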

6. Open Questions

Do we need HA fencing? (IPMI is not available on the N150; watchdog only)
Should we reserve one node as "management" and only run LXCs on two?
What's the Tailscale story — do we bind Corosync to LAN only or also Tailscale?

7. Decision Points

Decision	Option A	Option B
Cluster type	3-node with quorum (recommended)	2-node + witness (not recommended)
HA level	Manual migration only	Full HA with auto-restart
Storage	NFS only (current)	Add local Ceph later
Resource reserve	1 node mostly idle	Distribute evenly

Awaiting Commander Bobby review and approval.
