documentation/PRD Drafts/ARCHIVED-terraform-proxmox-lxc-automation.md
F.R.I.D.A.Y. c1bb49d51a Terraform LXC PRD: promote validated draft to PRDs, archive stale F.R.I.D.A.Y. draft
- terraform-lxc-deployment.md → PRDs/ (validated, tested, canonical)
- terraform-proxmox-lxc-automation.md → ARCHIVED- (superseded by live POC)
- Matches Phase 1 POC results from terraform-pve repo
2026-06-04 22:58:19 -04:00

# PRD: Terraform LXC Automation for Proxmox VE 9.2
**Status:** Draft — Pending Commander Bobby Review
**Author:** F.R.I.D.A.Y.
**Date:** 2026-06-01
**Provider:** `bpg/proxmox` (actively maintained, 11M+ downloads)
**Target:** Proxmox VE 9.2 / Debian Trixie
---
## 1. Purpose & Scope
This PRD defines the architecture, configuration patterns, and operational workflow for automating LXC container lifecycle management on Proxmox VE 9.2 clusters using Terraform and the actively maintained `bpg/proxmox` provider.
**In scope:**
- Terraform provider configuration and authentication
- LXC resource definitions (`proxmox_virtual_environment_container`)
- Cloud-init / template-based provisioning
- Network configuration (static IP, DHCP, bridge)
- Storage allocation (rootfs on any PVE backend)
- State management and CI/CD integration patterns
**Out of scope:**
- VM (QEMU/KVM) provisioning
- PVE cluster topology changes
- Backup/restore automation (separate PRD)
---
## 2. Success Criteria
| # | Criterion | How Verified |
|---|-----------|-------------|
| 1 | A single `terraform apply` creates a working LXC with SSH access | `ssh root@<lxc-ip>` succeeds |
| 2 | LXCs are provisioned from official cloud-image templates | Template downloaded via `proxmox_virtual_environment_download_file` |
| 3 | Network is configurable per-LXC (DHCP or static CIDR) | `ip addr` inside container matches TF config |
| 4 | Rootfs lives on user-selected storage (not hardcoded to `local-lvm`) | `pvesm status` shows volume on target datastore |
| 5 | State is stored remotely (S3-compatible or Terraform Cloud) | `terraform state list` works from any machine |
| 6 | Destroy and recreate is idempotent | `terraform destroy && terraform apply` yields identical result |
---
## 3. Provider Selection
### Why `bpg/proxmox` (not `telmate/proxmox`)
| Provider | Maintenance | Downloads | LXC Support | Notes |
|----------|-------------|-----------|-------------|-------|
| `bpg/proxmox` | ✅ Active (v0.108.0, June 2026) | 11.8M+ | Full | Community-tier, comprehensive docs, supports PVE 9.x |
| `telmate/proxmox` | ❌ Stale (last release ~2023) | Legacy | Partial | Deprecated; lacks PVE 9.x features |
**Decision:** Use `bpg/proxmox` exclusively. The `telmate` provider is unmaintained and incompatible with PVE 9.2 API changes.
**Provider block (minimum):**
```hcl
terraform {
required_providers {
proxmox = {
source = "bpg/proxmox"
version = "~> 0.108"
}
}
}
provider "proxmox" {
endpoint = "https://192.168.7.33:8006/"
username = "root@pam"
password = var.proxmox_password # or PROXMOX_VE_PASSWORD env var
insecure = true # self-signed TLS
}
```
---
## 4. Authentication Matrix
| Method | Use Case | Config | Security |
|--------|----------|--------|----------|
| **API Token** | Production, CI/CD | `api_token = "root@pam!mytoken=abc123…"` | Highest — revocable, fine-grained |
| **Username/Password** | Development, one-offs | `username = "root@pam"`, `password = "…"` | Medium — password in env |
| **Auth Ticket** | TOTP-enabled accounts | Pre-authenticate, pass ticket | High — short-lived |
**Recommendation for Iron Legion:**
- **Development:** Use `PROXMOX_VE_PASSWORD` environment variable
- **CI/CD (future):** Create a PVE API token bound to the built-in `PVEVMAdmin` role or a narrower custom role, and store it in CI secrets
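
The token-based row of the matrix could look roughly like the sketch below. The variable name `proxmox_api_token` is an assumption made for illustration; `var.proxmox_endpoint` is the variable defined in `variables.tf` in Section 5.

```hcl
# Hypothetical variable; the value has the form "user@realm!tokenid=secret".
variable "proxmox_api_token" {
  description = "PVE API token used instead of username/password"
  type        = string
  sensitive   = true # keeps the value out of plan and apply output
}

provider "proxmox" {
  endpoint  = var.proxmox_endpoint
  api_token = var.proxmox_api_token # or omit this and export PROXMOX_VE_API_TOKEN
  insecure  = true                  # self-signed TLS, as in the minimum provider block
}
```

Because the token is revocable independently of the `root@pam` password, rotating it should not require touching any `.tf` file, only the CI secret or the environment variable.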
---
## 5. Sample Project Structure
```
terraform-proxmox-lxc/
├── README.md
├── main.tf # Provider + backend config
├── variables.tf # Input variables
├── terraform.tfvars.example # Sample values (copy to terraform.tfvars, which is gitignored)
├── outputs.tf # Useful outputs (IPs, IDs)
├── versions.tf # Required providers + TF version
├── modules/
│ └── lxc/
│ ├── main.tf # proxmox_virtual_environment_container resource
│ ├── variables.tf # Module inputs
│ └── outputs.tf # Module outputs
├── environments/
│ ├── dev/
│ │ ├── main.tf # Calls modules with dev vars
│ │ └── terraform.tfvars
│ └── prod/
│ ├── main.tf
│ └── terraform.tfvars
└── templates/
└── ubuntu-25.04-cloudimg.yaml # Cloud-init user-data (optional)
```
### Key Files
#### `versions.tf`
```hcl
terraform {
required_version = ">= 1.6.0" # use_path_style and skip_requesting_account_id need the 1.6+ S3 backend
required_providers {
proxmox = {
source = "bpg/proxmox"
version = "~> 0.108"
}
random = {
source = "hashicorp/random"
version = "~> 3.6"
}
tls = {
source = "hashicorp/tls"
version = "~> 4.0"
}
}
# Remote state — S3-compatible (Minio, Garage, AWS S3)
backend "s3" {
bucket = "iron-legion-terraform"
key = "proxmox-lxc/terraform.tfstate"
region = "us-east-1"
endpoint = "https://s3.nb.bobbysh.me"
use_path_style = true
# Skip AWS-specific validations for self-hosted S3
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
}
}
```
#### `variables.tf`
```hcl
variable "proxmox_endpoint" {
description = "PVE API URL"
type = string
default = "https://192.168.7.33:8006/"
}
variable "proxmox_node" {
description = "Target PVE node name"
type = string
default = "mk33"
}
variable "ssh_public_key" {
description = "SSH public key for root access"
type = string
}
variable "lxc_configs" {
description = "Map of LXC configurations"
type = map(object({
vm_id = number
hostname = string
cores = optional(number, 2)
memory = optional(number, 2048)
disk_size = optional(number, 8)
datastore_id = optional(string, "local-lvm")
ip_address = optional(string, "dhcp")
gateway = optional(string, null)
template_url = optional(string, "https://mirrors.servercentral.com/ubuntu-cloud-images/releases/25.04/release/ubuntu-25.04-server-cloudimg-amd64-root.tar.xz")
features = optional(object({
nesting = optional(bool, true)
fuse = optional(bool, false)
keyctl = optional(bool, false)
}), {})
}))
}
```
#### `modules/lxc/main.tf`
```hcl
resource "proxmox_virtual_environment_download_file" "lxc_template" {
for_each = var.lxc_configs
content_type = "vztmpl"
datastore_id = "local"
node_name = var.proxmox_node
url = each.value.template_url
file_name = "${each.key}-template.tar.xz"
overwrite = false
}
resource "proxmox_virtual_environment_container" "lxc" {
for_each = var.lxc_configs
node_name = var.proxmox_node
vm_id = each.value.vm_id
description = "Managed by Terraform — ${each.key}"
unprivileged = true
features {
nesting = each.value.features.nesting
fuse = each.value.features.fuse
keyctl = each.value.features.keyctl
}
cpu {
cores = each.value.cores
units = 1024
}
memory {
dedicated = each.value.memory
swap = 0
}
disk {
datastore_id = each.value.datastore_id
size = each.value.disk_size
}
initialization {
hostname = each.value.hostname
ip_config {
ipv4 {
address = each.value.ip_address
gateway = each.value.gateway
}
}
user_account {
keys = [var.ssh_public_key]
password = random_password.lxc_root[each.key].result
}
}
network_interface {
name = "veth0"
bridge = "vmbr0"
}
operating_system {
template_file_id = proxmox_virtual_environment_download_file.lxc_template[each.key].id
type = "ubuntu"
}
startup {
order = "3"
up_delay = "60"
down_delay = "60"
}
depends_on = [proxmox_virtual_environment_download_file.lxc_template]
}
resource "random_password" "lxc_root" {
for_each = var.lxc_configs
length = 16
special = true
override_special = "_%@"
}
```
#### `modules/lxc/variables.tf`
```hcl
variable "proxmox_node" {
type = string
}
variable "ssh_public_key" {
type = string
}
variable "lxc_configs" {
type = map(object({
vm_id = number
hostname = string
cores = optional(number, 2)
memory = optional(number, 2048)
disk_size = optional(number, 8)
datastore_id = optional(string, "local-lvm")
ip_address = optional(string, "dhcp")
gateway = optional(string, null)
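template_url = optional(string, "https://mirrors.servercentral.com/ubuntu-cloud-images/releases/25.04/release/ubuntu-25.04-server-cloudimg-amd64-root.tar.xz") # default needed: callers such as environments/dev omit it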
features = optional(object({
nesting = optional(bool, true)
fuse = optional(bool, false)
keyctl = optional(bool, false)
}), {})
}))
}
```
#### `modules/lxc/outputs.tf`
```hcl
output "lxc_ids" {
description = "Map of LXC names to VM IDs"
value = { for k, v in proxmox_virtual_environment_container.lxc : k => v.vm_id }
}
output "lxc_ips" {
description = "Map of LXC names to IPv4 addresses"
value = { for k, v in proxmox_virtual_environment_container.lxc : k => v.ipv4 }
}
output "lxc_passwords" {
description = "Map of LXC names to root passwords (sensitive)"
value = { for k, v in random_password.lxc_root : k => v.result }
sensitive = true
}
```
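
One file the tree in this section does not list is a provider requirement for the module itself. Terraform resolves provider sources per module, and a child module without a `required_providers` entry assumes the `hashicorp/` namespace, so `proxmox_*` resources would be looked up under `hashicorp/proxmox`. A minimal sketch, assuming a hypothetical `modules/lxc/versions.tf`:

```hcl
# modules/lxc/versions.tf (hypothetical file, not part of the tree above)
terraform {
  required_providers {
    # Without this entry Terraform would search for hashicorp/proxmox.
    proxmox = {
      source = "bpg/proxmox"
    }
    # Used by the random_password resource in modules/lxc/main.tf.
    random = {
      source = "hashicorp/random"
    }
  }
}
```

Version constraints are deliberately left out here so that the root `versions.tf` stays the single place where versions are pinned; adding a loose constraint in the module is an equally reasonable choice.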
#### `environments/dev/main.tf`
```hcl
module "dev_lxcs" {
source = "../../modules/lxc"
proxmox_node = "mk33"
ssh_public_key = file("~/.ssh/id_ed25519.pub")
lxc_configs = {
"dev-nextcloud" = {
vm_id = 2100
hostname = "dev-nextcloud"
cores = 4
memory = 4096
disk_size = 16
datastore_id = "local-zfs"
ip_address = "192.168.7.100/24"
gateway = "192.168.7.1"
}
"dev-vaultwarden" = {
vm_id = 2101
hostname = "dev-vaultwarden"
cores = 2
memory = 2048
disk_size = 8
datastore_id = "local-zfs"
ip_address = "192.168.7.101/24"
gateway = "192.168.7.1"
}
}
}
```
---
## 6. Resource Reference — `proxmox_virtual_environment_container`
### Critical Arguments
| Block | Key | Required | Default | Description |
|-------|-----|----------|---------|-------------|
| — | `node_name` | ✅ | — | PVE node to create on |
| — | `vm_id` | ✅ | — | Unique numeric ID (100–999999999) |
| — | `unprivileged` | ❌ | `false` | Run as unprivileged container (the module above sets `true` explicitly) |
| `features` | `nesting` | ❌ | `false` | Enable nested containers (needed for Docker-in-LXC) |
| `features` | `fuse` | ❌ | `false` | Enable FUSE mounts |
| `cpu` | `cores` | ❌ | `1` | vCPU cores |
| `memory` | `dedicated` | ❌ | `512` | RAM in MB |
| `disk` | `datastore_id` | ❌ | `local` | Storage pool for rootfs |
| `disk` | `size` | ❌ | `4` | Rootfs size in GB |
| `initialization` | `hostname` | ✅ | — | DNS-compatible hostname |
| `initialization.ip_config.ipv4` | `address` | ✅ | — | CIDR or `dhcp` |
| `initialization.ip_config.ipv4` | `gateway` | ❌ | — | Required for static IP |
| `initialization.user_account` | `keys` | ❌ | — | SSH authorized_keys |
| `network_interface` | `name` | ✅ | — | `veth0` |
| `network_interface` | `bridge` | ❌ | `vmbr0` | Bridge to attach |
| `operating_system` | `template_file_id` | ✅ | — | Downloaded template or `local:vztmpl/…` |
| `operating_system` | `type` | ❌ | `unmanaged` | `ubuntu`, `debian`, `alpine`, etc. |
### Important Notes
- **Template download** uses `proxmox_virtual_environment_download_file` — caches template per-node, avoids re-download
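- **Initialization** happens through the `initialization` block, which PVE applies as native container settings (hostname, IP configuration, SSH keys, root password); LXC containers do not use cloud-init, so no cloud-init drive is needed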
- **Nesting = true** is required for any LXC running Docker or systemd-nspawn
- **Datastore** is backend-agnostic: `local-lvm`, `local-zfs`, `tank-zfs`, `ceph-rbd`, NFS, etc. all work
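
The `address` and `gateway` rows in the table above imply two rules that Terraform can check before any API call is made. The sketch below uses only built-in variable validation; the object type is trimmed to the fields the checks read, and the error wording is illustrative.

```hcl
variable "lxc_configs" {
  description = "Map of LXC configurations (trimmed to the fields checked here)"
  type = map(object({
    vm_id      = number
    hostname   = string
    ip_address = optional(string, "dhcp")
    gateway    = optional(string, null)
  }))

  # Each address must be the literal "dhcp" or a parseable CIDR.
  # cidrhost() also accepts IPv6 prefixes, so this does not enforce IPv4.
  validation {
    condition = alltrue([
      for c in values(var.lxc_configs) :
      c.ip_address == "dhcp" || can(cidrhost(c.ip_address, 0))
    ])
    error_message = "Each ip_address must be \"dhcp\" or a CIDR such as 192.168.7.100/24."
  }

  # A static address needs a gateway; DHCP entries may leave it null.
  validation {
    condition = alltrue([
      for c in values(var.lxc_configs) :
      c.ip_address == "dhcp" || c.gateway != null
    ])
    error_message = "Set gateway for every entry that uses a static ip_address."
  }
}
```

With these in place, a typo such as `192.168.7.100` without a prefix length should fail at `terraform plan` rather than midway through container creation.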
---
## 7. Data Sources
Use data sources to query existing infrastructure without managing it:
```hcl
data "proxmox_virtual_environment_datastores" "available" {
node_name = var.proxmox_node
}
data "proxmox_virtual_environment_nodes" "cluster" {}
data "proxmox_virtual_environment_container" "existing" {
node_name = var.proxmox_node # or specify target node explicitly
vm_id = 2001
}
```
**Common use cases:**
- Validate a datastore exists before creating a disk
- Read an existing LXC's IP to populate a DNS record (Technitium)
- List nodes for multi-node placement logic
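
As a sketch of the node-listing use case, the output below filters the cluster's nodes down to those reporting online. It assumes the data source exposes `names` and `online` as parallel lists; confirm both attribute names against the documentation for the pinned provider version before relying on them.

```hcl
# Same data source as in the block above, repeated so the sketch is
# self-contained; declare it only once per module.
data "proxmox_virtual_environment_nodes" "cluster" {}

# Assumption: names[i] and online[i] describe the same node.
output "online_nodes" {
  description = "Names of cluster nodes that currently report as online"
  value = [
    for i, name in data.proxmox_virtual_environment_nodes.cluster.names :
    name if data.proxmox_virtual_environment_nodes.cluster.online[i]
  ]
}
```

A placement rule could then pick from `online_nodes` instead of hardcoding `mk33`, though any such selection logic is outside what this draft specifies.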
---
## 8. State Management
### Recommended: S3-Compatible Backend
Iron Legion already runs self-hosted services. A Garage or Minio instance on a fleet storage node (e.g., Neo) can serve as the Terraform state backend:
```hcl
terraform {
backend "s3" {
bucket = "iron-legion-terraform"
key = "proxmox-lxc/dev.tfstate"
region = "us-east-1"
endpoint = "https://s3.nb.bobbysh.me"
use_path_style = true
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
}
}
```
### State Locking (Critical for Team Use)
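Concurrent applies against the same state can corrupt it, so enable locking. Terraform 1.10 and later can lock natively inside the bucket via `use_lockfile = true`, provided the S3 implementation supports conditional writes; older setups rely on a DynamoDB-compatible table via `dynamodb_table`. If neither option works with the chosen backend, wrap `terraform apply` in a CI pipeline that serializes runs.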
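
A minimal sketch of native S3 locking, assuming Terraform 1.10 or later and an S3 implementation that honors conditional writes. Whether the Garage or Minio release in use does so is an assumption to verify before adopting this.

```hcl
terraform {
  required_version = ">= 1.10.0" # use_lockfile is not available in earlier releases

  backend "s3" {
    bucket = "iron-legion-terraform"
    key    = "proxmox-lxc/dev.tfstate"
    region = "us-east-1"

    # Terraform 1.6+ spelling of the custom endpoint setting.
    endpoints = {
      s3 = "https://s3.nb.bobbysh.me"
    }
    use_path_style = true

    # Native locking: Terraform writes "<key>.tflock" next to the state
    # object and removes it when the run finishes.
    use_lockfile = true

    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
  }
}
```

If a run is interrupted and the lock object is left behind, `terraform force-unlock <lock-id>` releases it.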
---
## Optional: Atlantis Web UI for Terraform PR Automation
### What Atlantis Is
Atlantis is a self-hosted web application that listens for webhook events from Git repositories and runs `terraform plan` / `terraform apply` automatically inside PR/MR workflows. It posts plan output back to the PR as comments, enforces approval gates, and locks workspaces to prevent concurrent applies.
### Can Atlantis Manage LXC Resources via `bpg/proxmox`?
**Yes.** Atlantis is a Terraform orchestration layer, not a provider. It supports any Terraform provider including `bpg/proxmox`. The workflow is:
1. Developer opens a PR adding/modifying `.tf` files defining LXC containers
2. Atlantis receives the webhook and runs `terraform plan` in an isolated directory
3. Plan output posted as a PR comment — team reviews before approval
4. After approval (or `atlantis apply` comment), Atlantis runs `terraform apply`
### Atlantis Docker Compose (Self-Hosted)
```yaml
services:
atlantis:
image: ghcr.io/runatlantis/atlantis:latest
ports:
- "4141:4141"
volumes:
- ${HOME}/.ssh:/home/atlantis/.ssh:ro # Git SSH key
- /var/run/docker.sock:/var/run/docker.sock:ro # if using Docker TF provider
- atlantis-data:/home/atlantis/.atlantis
environment:
ATLANTIS_GH_USER: "iron-legion-bot" # or ATLANTIS_GITLAB_USER / ATLANTIS_GITEA_USER
ATLANTIS_GH_TOKEN: "${ATLANTIS_GH_TOKEN}" # personal access token
ATLANTIS_REPO_ALLOWLIST: "github.com/Iron-Legion/*"
ATLANTIS_GH_WEBHOOK_SECRET: "${WEBHOOK_SECRET}"
# For Gitea:
# ATLANTIS_GITEA_USER: "iron-legion-bot"
# ATLANTIS_GITEA_TOKEN: "${GITEA_TOKEN}"
# ATLANTIS_GITEA_WEBHOOK_SECRET: "${WEBHOOK_SECRET}"
command: server
restart: unless-stopped
# Optional: Redis for distributed locking in multi-replica setups
# redis:
# image: redis:8-alpine
# volumes:
# - redis-data:/data
# restart: always
volumes:
atlantis-data:
driver: local
```
### Key Features
- **Plan Comments:** Every PR gets an auto-generated `terraform plan` comment
- **Apply Locking:** One apply at a time per workspace; concurrent PRs queue
- **Policy Checks:** Integrate OPA (Open Policy Agent) or custom scripts to block non-compliant changes
- **Custom Workflows:** Define per-repo or per-directory workflows (e.g., plan-only for dev, auto-apply for staging)
- **Self-Hosted SCM:** Native webhook support for GitHub, GitLab, Bitbucket, **and Gitea**
### Resource Footprint
- Atlantis container: ~100–200 MB RAM, minimal CPU
- Optional Redis: ~20 MB RAM
- Total: fits comfortably on any Iron Legion node (MK7, MK33–42, Neo)
### Gitea Integration Notes
- Atlantis supports Gitea via the `--gitea-user`, `--gitea-token`, `--gitea-webhook-secret` flags
- Must expose Atlantis endpoint to Gitea (Tailscale funnel, reverse proxy, or LAN if Gitea is in-network)
- Webhook URL: `http://atlantis-host:4141/events`
---
## 9. Operational Workflow
### Day 0 — Bootstrap
```bash
# 1. Clone repo
git clone ssh://git@100.99.123.16:2222/Iron-Legion/terraform-proxmox-lxc.git
cd terraform-proxmox-lxc/environments/dev
# 2. Set credentials
export PROXMOX_VE_PASSWORD="your-pve-password"
# OR for API token:
export PROXMOX_VE_API_TOKEN="root@pam!mytoken=abc123"
# 3. Initialize
terraform init
# 4. Plan
terraform plan -out=tfplan
# 5. Apply
terraform apply tfplan
```
### Day N — Add a Container
1. Add entry to `lxc_configs` map in `environments/dev/main.tf`
2. `terraform plan`: review the diff, and check separately for VM ID collisions, IP conflicts, and storage capacity, since the plan itself does not detect them
3. `terraform apply`
4. Verify: `ssh root@<new-ip>`
### Day N — Destroy a Container
1. Remove entry from `lxc_configs` map
2. `terraform apply` — resource destroyed
3. Or: `terraform destroy -target='module.dev_lxcs.proxmox_virtual_environment_container.lxc["dev-nextcloud"]'`
---
## 10. Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| VM ID collision | Medium | High | Maintain a fleet-wide VM ID registry; use `proxmox_virtual_environment_vms` data source to check |
| IP overlap with DHCP pool | Medium | High | Reserve static IPs in Technitium DNS; use `dns` data source to verify |
| Template download fails (slow mirror) | Low | Medium | Pre-seed templates on PVE nodes; use `pvesm` to verify before `apply` |
| State file corruption | Low | Critical | S3 versioning + periodic `terraform state pull` backups |
| Privilege escalation via privileged LXC | Low | High | Default `unprivileged = true`; explicit override required |
| Provider breaking change | Medium | Medium | Pin provider version `~> 0.108`; test upgrades in dev environment first |
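
Part of the VM ID collision risk can be caught before any API call with plain variable validation. The sketch below only detects duplicates inside a single `lxc_configs` map and IDs outside the 100–999999999 range; collisions with containers managed elsewhere still need the fleet-wide registry. The object type is trimmed to the fields the checks read.

```hcl
variable "lxc_configs" {
  description = "Map of LXC configurations (trimmed to the fields checked here)"
  type = map(object({
    vm_id    = number
    hostname = string
  }))

  # Duplicate IDs collapse under distinct(), making the list shorter than the map.
  validation {
    condition = (
      length(distinct([for c in values(var.lxc_configs) : c.vm_id]))
      == length(var.lxc_configs)
    )
    error_message = "Each entry in lxc_configs must use a unique vm_id."
  }

  # Range taken from the vm_id row in Section 6.
  validation {
    condition = alltrue([
      for c in values(var.lxc_configs) :
      c.vm_id >= 100 && c.vm_id <= 999999999
    ])
    error_message = "Every vm_id must fall between 100 and 999999999."
  }
}
```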
---
## 11. Open Questions
1. **Do we pre-create cloud-image templates on each PVE node, or let Terraform download per-node?**
- Per-node: slower first deploy, but self-contained
- Pre-seeded: faster, requires manual `pvesm` or Ansible step
2. **Should LXCs register themselves in Technitium DNS via Terraform, or rely on DHCP + DNS integration?**
- Terraform can call a `dns_a_record` module (if Technitium provider exists)
- Or: use PVE's built-in DHCP + DNSMASQ if configured
3. **CI/CD pipeline: GitHub Actions runner, or local Gitea Actions on the fleet SCM host?**
- Gitea Actions keeps secrets in-network
- GitHub Actions requires Tailscale funnel or external exposure
4. **Do we want a dedicated LXC "Terraform runner" inside the cluster, or run from Artemis/operator workstation?**
- In-cluster runner: always has LAN access to PVE API
- External: requires Tailscale or VPN for API reachability
---
## 12. Appendix
### A. Provider Documentation Links
- **Registry:** https://registry.terraform.io/providers/bpg/proxmox/latest
- **GitHub:** https://github.com/bpg/terraform-provider-proxmox
- **LXC Resource Docs:** https://registry.terraform.io/providers/bpg/proxmox/latest/docs/resources/virtual_environment_container
- **Download File Resource:** https://registry.terraform.io/providers/bpg/proxmox/latest/docs/resources/virtual_environment_download_file
### B. Useful PVE CLI Commands (for verification)
```bash
# List containers on a node
pct list
# List templates
pvesm list local --content vztmpl
# Check datastore usage
pvesm status
# Enter a container
pct enter <vm_id>
```
### C. Terraform Commands Reference
```bash
terraform init # Download providers, configure backend
terraform validate # Syntax check
terraform plan # Preview changes
terraform apply # Execute changes
terraform destroy # Tear down everything
terraform state list # Show managed resources
terraform state show <addr> # Show one resource's attributes
terraform output # Display output values
terraform fmt -recursive # Format all .tf files
```
---
*End of PRD. Ready for Commander Bobby review and approval.*