Building an AI Inference Platform on Your Own Hardware with Terraform


I was paying roughly $150 a month across OpenAI and Anthropic API subscriptions when it occurred to me that a perfectly capable NVIDIA GPU was sitting in my homelab, criminally underutilised. It had been doing basically nothing for months. Not even mining crypto. Just… existing. Consuming standby power like a very expensive nightlight.

Before I continue, I need to mention something: frontier models like GPT-4o and Claude Opus 4.6 are brilliant. For complex reasoning, architecture reviews, and the kind of nuanced problem solving that makes you question whether you’re even needed anymore, cloud APIs are impossible to beat. But for the bread and butter stuff — quick code questions, documentation drafts, summarising meeting notes, brainstorming over a coffee — you’re paying per token for work that a quantised open source model handles just fine.

So I did what any reasonable DevOps engineer would do: I automated the entire thing with Terraform, Packer, and Ansible. What came out the other side is a repeatable, GPU-accelerated AI inference platform running on Proxmox that I can tear down and rebuild in under 20 minutes.

Why Self-Host AI Inference?

Before we get into the weeds, let’s talk about why you’d bother. Cloud AI APIs are just plain convenient. You sign up, you get an API key, you start prompting. No hardware, no drivers, no CUDA headaches. So why complicate things?

Privacy and data sovereignty. This is the big one. I’ve worked places that flat out refused to send proprietary data to external APIs. When you self host, your data never leaves your network. Full stop. No terms of service to read, no “we may use your data to improve our models” clauses to worry about, no third party jurisdiction questions. Your data, your network, your rules. For anyone handling sensitive client information, IP, or regulated data, this isn’t a nice to have — it’s a hard requirement. Even for personal use, there’s something deeply satisfying about knowing your conversations aren’t being watched.

Cost at volume. Let’s do some quick maths. GPT-4o output tokens cost roughly US$10 per million. Claude Opus 4.6 sits at US$75 per million output tokens. If you’re a heavy user pushing 2-3 million tokens a day across development, consulting work, and automation pipelines, you’re looking at serious monthly bills depending on the model mix. A used RTX 3090 costs around $900-1,000 AUD. The maths can start working in your favour pretty quickly — but only at volume. More on that in the cost section.
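To put rough numbers on that, here’s a quick back-of-the-envelope using the per-token prices above. These are the illustrative figures quoted in this post, not live pricing, so rerun it with current rates before drawing conclusions:

```python
# Rough monthly API spend from per-million-output-token pricing (USD).
# Prices are the illustrative figures quoted above.
PRICE_PER_MILLION = {
    "gpt-4o": 10.0,       # ~US$10 per 1M output tokens
    "claude-opus": 75.0,  # ~US$75 per 1M output tokens
}

def monthly_cost(model: str, tokens_per_day: float, days: int = 30) -> float:
    """Monthly spend in USD for a given daily output-token volume."""
    return PRICE_PER_MILLION[model] * tokens_per_day / 1_000_000 * days

# A heavy user pushing 2.5M output tokens a day:
print(f"GPT-4o:      ${monthly_cost('gpt-4o', 2_500_000):,.0f}/month")       # $750/month
print(f"Claude Opus: ${monthly_cost('claude-opus', 2_500_000):,.0f}/month")  # $5,625/month
```

Even a blended mix of the two lands well into the hundreds per month at that volume, which is exactly where the hardware maths later in this post starts to matter.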

Latency. For development workflows — code completion, inline suggestions, quick questions while debugging — network round trips to an API add noticeable delay. Local inference on a decent GPU gives you sub-second responses for smaller models. It feels instant in a way that cloud APIs simply don’t.

Control. No rate limits. No surprise model deprecations. No “we’ve updated our pricing” emails. You pick the model, you run it when you want, and it stays exactly where you put it.

Being upfront: self hosted AI is a complement to cloud AI, not a replacement. Open source models have come a remarkable distance, but for complex multi step reasoning and the latest capability, cloud models like Claude Opus 4.6 are still in a different league. The value of self hosting is privacy, availability, and cost savings on high volume routine tasks.

The Architecture

Here’s the stack at a high level:

  1. Proxmox VE — The hypervisor running on bare metal, managing VMs and GPU passthrough
  2. NVIDIA GPU — Passed through from the host to a dedicated inference VM via VFIO
  3. Ollama — The inference engine, running as a native systemd service with direct GPU access
  4. Open WebUI — A slick web interface for chatting with models, managing conversations, and uploading documents

The IaC layer sits on top of this:

  • Packer builds an immutable VM template with Ubuntu 24.04, NVIDIA drivers, and Docker pre-installed
  • Terraform provisions the VM from that template, configuring CPU, memory, network, and GPU passthrough
  • Ansible handles the application layer — configuring Ollama’s systemd service, deploying Open WebUI, and pulling your preferred models

Why Proxmox over bare metal? Isolation. I run other services on this box — monitoring, DNS, a Kubernetes cluster for testing. GPU passthrough lets me dedicate the GPU to the inference VM without affecting anything else. Plus, VM snapshots mean I can experiment with driver versions or model configurations and roll back in seconds if something goes sideways.

The Tech Stack

GPU Passthrough: The Foundation

GPU passthrough is what makes this whole thing work. You basically tell Proxmox to hand the physical GPU directly to a VM, bypassing the hypervisor entirely. The VM sees the GPU as if it were running on bare metal, and you get 95-99% of native performance.

The catch? It requires some host level configuration that you’ll want to get right the first time. I can’t pretend I nailed it on the first attempt — there’s a special kind of joy in rebooting a headless server for the fourth time because you forgot to blacklist a kernel module.

IOMMU and BIOS Setup

First, your hardware needs to support IOMMU (Input/Output Memory Management Unit). On Intel, this is called VT-d. On AMD, it’s AMD-Vi. You’ll need to enable it in your BIOS — the exact menu location varies by motherboard manufacturer, because apparently standardising BIOS menus would be too convenient!

While you’re in the BIOS, enable Above 4G Decoding and Resizable BAR if available. These let the firmware map the GPU’s large memory regions above the 4GB boundary, which modern cards and VFIO passthrough depend on.

Host Kernel Configuration

On the Proxmox host, you need to do three things: enable IOMMU in the bootloader, blacklist the NVIDIA drivers so the host doesn’t claim the GPU, and load the VFIO modules.

# Enable IOMMU in GRUB (Intel CPU)
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"/' /etc/default/grub

# For AMD CPUs, use amd_iommu=on instead
# GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"

# Update GRUB
update-grub

# Blacklist NVIDIA drivers on the host
cat <<EOF >> /etc/modprobe.d/blacklist-nvidia.conf
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
EOF

# Load VFIO modules
cat <<EOF >> /etc/modules
vfio
vfio_iommu_type1
vfio_pci
EOF

# Bind the GPU to vfio-pci (replace with your GPU's vendor:device IDs;
# NVIDIA's vendor ID is always 10de)
# Find your IDs with: lspci -nn | grep -i nvidia
echo "options vfio-pci ids=10de:xxxx,10de:yyyy" > /etc/modprobe.d/vfio.conf

# Rebuild initramfs and reboot
update-initramfs -u -k all
reboot

After rebooting, verify IOMMU is working:

# Should show IOMMU groups with your GPU listed
find /sys/kernel/iommu_groups/ -type l | sort -V

# Verify VFIO has claimed the GPU
lspci -nnk -s 01:00 | grep "Kernel driver in use"
# Should show: Kernel driver in use: vfio-pci

Check your IOMMU groups before proceeding. The GPU and its audio device need to be in their own group for clean passthrough. If they share a group with other devices (common on consumer motherboards), you may need ACS override patches or a motherboard with better IOMMU isolation. The Proxmox wiki has a solid guide on PCI passthrough that covers edge cases.
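If you’d rather not eyeball the lspci output, a tiny helper can assemble the vfio.conf line for you. The sample output below is hypothetical (a made-up GPU plus its audio function); run lspci -nn on your own host:

```python
import re

def vfio_ids(lspci_output: str) -> str:
    """Extract the [vendor:device] IDs from `lspci -nn` lines and build the
    options string for /etc/modprobe.d/vfio.conf. Class codes like [0300]
    have no colon, so the regex skips them."""
    ids = re.findall(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", lspci_output)
    return "options vfio-pci ids=" + ",".join(ids)

# Hypothetical `lspci -nn | grep -i nvidia` output:
sample = (
    "01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684]\n"
    "01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba]\n"
)
print(vfio_ids(sample))
# options vfio-pci ids=10de:2684,10de:22ba
```

Both functions of the GPU (video and audio) need to go in the same ids list, or passthrough will fail in confusing ways.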

Building the VM Template with Packer

With GPU passthrough configured on the host, we need a VM template. I use Packer to build an immutable base image with everything pre-installed: Ubuntu 24.04 LTS, NVIDIA drivers, Docker, and the NVIDIA Container Toolkit.

Why Packer? Because I want the template to be reproducible. If I need to rebuild it in six months — maybe for a driver update or an OS upgrade — I run one command and get an identical result. No manual clicking through Proxmox’s web UI.

The proxmox-iso builder creates a VM from an ISO, runs your provisioning scripts, then converts it to a Proxmox template with cloud-init support.

# packer/gpu-template.pkr.hcl

packer {
  required_plugins {
    proxmox = {
      version = ">= 1.2.2"
      source  = "github.com/hashicorp/proxmox"
    }
  }
}

source "proxmox-iso" "ubuntu-gpu" {
  proxmox_url              = var.proxmox_url
  username                 = var.proxmox_user
  token                    = var.proxmox_token
  node                     = var.proxmox_node
  insecure_skip_tls_verify = true

  iso_file                 = var.iso_file
  iso_checksum             = "none"
  unmount_iso              = true

  vm_name                  = var.template_name
  template_description     = "Ubuntu 24.04 LTS with NVIDIA drivers, Docker, and NVIDIA Container Toolkit"

  os           = "l26"
  machine      = "q35"
  bios         = "ovmf"
  cpu_type     = "host"
  cores        = var.cores
  memory       = var.memory
  qemu_agent   = true

  scsi_controller = "virtio-scsi-single"

  disks {
    disk_size    = var.disk_size
    storage_pool = var.storage_pool
    type         = "scsi"
  }

  network_adapters {
    bridge = var.network_bridge
    model  = "virtio"
  }

  cloud_init              = true
  cloud_init_storage_pool = var.storage_pool

  boot_command = [
    "<wait3>e<wait1>",
    "<down><down><down><end>",
    " autoinstall ds=nocloud-net",
    "<F10>"
  ]

  ssh_username = "packer"
  ssh_password = var.ssh_password
  ssh_timeout  = "30m"
}

build {
  sources = ["source.proxmox-iso.ubuntu-gpu"]

  provisioner "shell" {
    scripts = [
      "scripts/01-base-packages.sh",
      "scripts/02-nvidia-drivers.sh",
      "scripts/03-docker.sh",
      "scripts/04-nvidia-container-toolkit.sh",
      "scripts/05-cleanup.sh",
    ]
    execute_command = "chmod +x {{ .Path }}; sudo {{ .Path }}"
  }
}

The provisioning scripts are straightforward but worth a note on driver versions: RTX 50 series GPUs (5080, 5090) require the 570+ driver branch (nvidia-driver-570-server), while RTX 40 series and earlier work with the 550 branch (nvidia-driver-550-server). The driver script uses a variable so you can set the right branch for your card — just change one line rather than rebuilding the whole template. The container toolkit script adds NVIDIA’s apt repository and installs nvidia-container-toolkit so Docker containers can access the GPU. Nothing exotic — just automating what you’d otherwise do manually.

Provisioning with Terraform

With the Packer template ready, Terraform clones it into a running VM with the GPU attached.

I’m using the BPG Proxmox provider (currently at v0.97.0). If you’ve been using the Telmate provider, it’s worth switching — BPG is far more actively maintained, has better documentation, and supports features like hardware mappings and SDN that Telmate never got around to.

# terraform/main.tf

terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = ">= 0.97.0"
    }
  }
}

provider "proxmox" {
  endpoint  = var.proxmox_url
  api_token = "${var.proxmox_api_token_id}=${var.proxmox_api_token_secret}"
  insecure  = true

  ssh {
    agent = true
  }
}

# GPU hardware mapping — abstracts PCI device behind a named resource
resource "proxmox_virtual_environment_hardware_mapping_pci" "gpu" {
  name    = "nvidia-gpu"
  comment = "NVIDIA GPU for AI inference"

  map = [
    {
      id           = var.gpu_vendor_device_id  # e.g., "10de:2b85"
      iommu_group  = var.gpu_iommu_group       # e.g., 14
      node         = var.proxmox_node
      path         = var.gpu_pci_path           # e.g., "0000:01:00.0"
      subsystem_id = var.gpu_subsystem_id       # e.g., "1462:5127"
    },
  ]
}

resource "proxmox_virtual_environment_vm" "ollama" {
  name      = var.vm_name
  node_name = var.proxmox_node
  tags      = ["ai", "gpu", "inference"]

  clone {
    vm_id = var.template_vm_id
  }

  cpu {
    cores = var.cpu_cores
    type  = "host"  # Required for GPU passthrough
  }

  memory {
    dedicated = var.memory_mb
    floating  = 0  # Disable ballooning for consistent GPU performance
  }

  # GPU passthrough via hardware mapping (works with API tokens)
  hostpci {
    device  = "hostpci0"
    mapping = proxmox_virtual_environment_hardware_mapping_pci.gpu.name
    pcie    = true
    rombar  = true
  }

  initialization {
    ip_config {
      ipv4 {
        address = var.vm_ip
        gateway = var.gateway
      }
    }
    dns {
      servers = var.dns_servers
    }
    user_account {
      keys     = [var.ssh_public_key]
      username = var.vm_user
    }
  }

  network_device {
    bridge = var.network_bridge
    model  = "virtio"
  }
}

output "vm_ip" {
  value = var.vm_ip
}

The provider uses API tokens instead of username/password — they’re more secure, don’t require 2FA bypass, and can be scoped to specific permissions. GPU passthrough uses a hardware mapping resource rather than a raw PCI ID, which is fully IaC managed and compatible with API token auth. The API token needs Mapping.Use permission on the mapping path. Find your GPU’s vendor ID, IOMMU group, and subsystem ID with lspci -nn and find /sys/kernel/iommu_groups/ -type l before writing the Terraform.

Configuration with Ansible

Terraform handles the infrastructure. Ansible handles everything that runs on top of it. This separation of concerns means I can reprovision the VM without touching the application configuration, and vice versa.

Ollama runs as a native systemd service rather than inside Docker. This gives better GPU performance (no container runtime overhead), simpler model management (no docker exec needed), and more direct control over GPU memory allocation. Open WebUI still runs in Docker, connecting to Ollama on the host via host.docker.internal.

# /etc/systemd/system/ollama.service.d/override.conf (templated by Ansible)
[Service]
Environment="OLLAMA_MODELS=/opt/models"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_GPU_MEMORY_FRACTION=0.85"
Environment="OLLAMA_KV_CACHE_TYPE=f16"
Environment="OLLAMA_KEEP_ALIVE=10m"

A few of those environment variables are worth calling out: OLLAMA_FLASH_ATTENTION enables flash attention for a significant speed boost on modern GPUs. OLLAMA_GPU_MEMORY_FRACTION caps VRAM usage at 85%, preventing OOM crashes if other processes need the GPU. OLLAMA_KEEP_ALIVE keeps the model loaded in VRAM between requests, avoiding 5-10 second cold starts.

Open WebUI connects to the host’s Ollama service rather than a Docker network peer:

# docker-compose.yml (templated by Ansible onto the VM)
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_API_BASE_URL=http://host.docker.internal:11434
      - WEBUI_AUTH=true
    volumes:
      - open_webui_data:/app/backend/data
    ports:
      - '3000:8080'
    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: unless-stopped

volumes:
  open_webui_data:

Once everything is running, Ansible pulls the initial models so they’re ready to go immediately:

# ansible/tasks/pull-models.yml
- name: Pull default inference models
  ansible.builtin.command:
    cmd: 'ollama pull {{ item }}'
  loop:
    # Adjust based on your GPU's VRAM capacity
    - glm-4.7-flash # 15GB VRAM - excellent MoE general assistant
    - deepseek-r1:32b # 20GB VRAM - deep reasoning (needs 24GB+ GPU)
    - qwen3:8b # 5GB VRAM  - fast responses, lightweight tasks
  changed_when: true
  timeout: 600 # Large models take a while to download

The whole playbook takes about 15 minutes to run on a fresh VM — most of that is model download time. After that, you’ve got a fully functional AI chat interface accessible from anywhere on your local network.

Model Selection: What Actually Matters in 2026

The open source model landscape in early 2026 is impressive but also genuinely confusing. New models drop weekly with names that read like firmware version numbers, and keeping track of what’s actually good versus what’s just new requires more attention than I’d like.

Here’s the key development you need to know about: Mixture-of-Experts (MoE) architecture. Think of it like a team of specialists rather than one generalist. A MoE model contains multiple “expert” sub-networks, and a routing layer decides which experts handle each token — so only a fraction of the model’s total parameters are active at any given time. Older models like Qwen 2.5 Coder 32B were “dense” — all 32 billion parameters fired on every single token. That meant you needed enormous VRAM and got glacial speeds on consumer hardware. MoE models are a completely different game. A model like GLM-4.7-Flash has 30 billion total parameters but the router only activates about 3 billion per token. The result? It runs at 120-220 tokens per second on 16GB of VRAM. The kind of speed that makes local inference feel instant.
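If it helps to see the idea in code, here’s a toy sketch of top-k expert routing. The expert count, parameter sizes, and scores are all made up to mimic a “30B total, ~3B active” model; a real router is a small learned network that runs per layer, not per model:

```python
# Toy mixture-of-experts routing: per token, a router scores every expert
# and only the top-k actually run, so active params << total params.
from typing import List

N_EXPERTS = 16
PARAMS_PER_EXPERT = 1.9e9  # made-up sizes mimicking a "30B total, ~3B active" model
SHARED_PARAMS = 0.5e9      # embeddings/attention shared by every token
TOP_K = 2

def route(scores: List[float], k: int = TOP_K) -> List[int]:
    """Pick the indices of the k highest-scoring experts for this token."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

total_params = SHARED_PARAMS + N_EXPERTS * PARAMS_PER_EXPERT
active_params = SHARED_PARAMS + TOP_K * PARAMS_PER_EXPERT

print(f"total:  {total_params / 1e9:.1f}B")   # all of it sits in VRAM
print(f"active: {active_params / 1e9:.1f}B")  # but only this much computes per token

scores = [0.1] * N_EXPERTS
scores[3], scores[7] = 0.9, 0.8
print(route(scores))  # [3, 7] — only these two experts fire for this token
```

The catch this sketch also makes visible: every expert still has to live in VRAM, which is why a 30B MoE model needs roughly 30B-class memory but runs at 3B-class speed.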

For Conversations and Research (Open WebUI)

This is where self hosting really shines. Open WebUI gives you a ChatGPT-style interface for research, brainstorming, document analysis, and general conversation — with zero data leaving your network.

| Model | Architecture | VRAM (Q4) | Speed | Best For |
|-------|--------------|-----------|-------|----------|
| GLM-4.7-Flash | 30B MoE (3B active) | ~15GB | 120-220 t/s | General assistant, research, writing |
| DeepSeek R1 32B | 32B dense | ~20GB | ~55 t/s | Reasoning, mathematics, complex analysis |
| Qwen 3 8B | 8B dense | ~5GB | ~120 t/s | Quick tasks, summarisation, code snippets |

GLM-4.7-Flash is the standout here. It’s fast enough that responses start appearing before you’ve finished reaching for your coffee, and the quality is genuinely impressive for everyday tasks. DeepSeek R1 is the one you reach for when you need actual reasoning and complex analysis. It’s slower, but when you need a model to actually think, it delivers.

For Agentic Coding

Here’s where things get interesting — and where I need to be honest. Since January 2026, Ollama has supported the Anthropic Messages API, which means you can point Claude Code directly at your local Ollama instance. Same agentic workflow, same tool use, running on a local model.

The reality? It works, but with caveats.

| Model | SWE-bench | VRAM | Speed | Honest Assessment |
|-------|-----------|------|-------|-------------------|
| Qwen3-Coder 30B-A3B | 69.6% | ~19GB (Q4) | 80-150 t/s | Strong. Needs a 32GB GPU. |
| GLM-4.7-Flash | 59.2% | ~15GB (Q4) | 120-220 t/s | Good for simpler tasks. Fits 16GB. |
| Claude Opus 4.6 (cloud) | 80.8% | N/A | Cloud latency | The benchmark to beat. |

For context, SWE-bench measures how well a model can autonomously fix real world GitHub issues. At 59-70%, local models handle most straightforward coding tasks — writing functions, refactoring, infrastructure scripts. But for complex multi file architectural changes? Claude Opus 4.6 at 80.8% running a sophisticated refactor is still in a genuinely different league. I’ve recently started using GLM-4.7-Flash with Claude Code for simpler agentic tasks, and it’s promising — not perfect, but promising. The occasional malformed tool call keeps you on your toes.

A gotcha with local models and agentic tool calling: smaller models occasionally hallucinate malformed XML tags (like </function>) at the end of responses, causing stream processing errors. Do not add stop sequences to fix this — they match against the raw token stream before Ollama’s parser can extract tool calls, which breaks tool calling entirely. Instead, use presence_penalty in a custom Modelfile (1.5 works well) to reduce the hallucination.

A quick note on quantisation: Q4_K_M and Q5_K_M are the sweet spots for most use cases. Q4 reduces a model to roughly 4 bits per weight, cutting VRAM requirements by ~75% compared to full precision, with a bit of quality loss. Q5 gives slightly better quality at the cost of more VRAM. Ollama handles all of this transparently — you specify the quantisation in the model tag and Ollama downloads the right variant.
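A back-of-the-envelope way to see why: weight storage is roughly parameters times bits per weight. This sketch counts weights only and ignores KV cache and runtime overhead, so treat the results as a floor rather than a VRAM budget:

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a quantised model.
    Weights only — add headroom for KV cache and activations."""
    return params_billion * bits_per_weight / 8

# A 30B-parameter model at different precisions:
print(approx_weights_gb(30, 4))   # Q4:   15.0 GB — matches the ~15GB figures above
print(approx_weights_gb(30, 5))   # Q5:   18.75 GB
print(approx_weights_gb(30, 16))  # FP16: 60.0 GB — 4x the Q4 footprint
```

That 15GB-versus-60GB gap is the ~75% reduction mentioned above, and it’s the entire reason 30B-class models fit on consumer cards at all.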

One thing worth knowing: ollama pull defaults are often not optimal. Custom Modelfiles let you tune the context window (num_ctx — the default 2-4K is usually too small for real work), temperature (lower for coding, higher for creative tasks), and system prompts. If you’re using local models seriously, you’ll want to create Modelfiles for your common use cases rather than running stock configurations.
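As a sketch, a Modelfile tuned for coding might look something like this (the base model name and parameter values are illustrative, not recommendations; presence_penalty is the tool-calling mitigation mentioned above):

```
# Modelfile — example tuning for coding work (values are illustrative)
FROM glm-4.7-flash

PARAMETER num_ctx 16384          # the stock 2-4K default is too small for real work
PARAMETER temperature 0.2        # lower for coding, higher for creative tasks
PARAMETER presence_penalty 1.5   # reduces hallucinated closing tags in tool calls

SYSTEM You are a concise coding assistant. Prefer minimal, working examples.
```

Build it with ollama create coding-assistant -f Modelfile, then run it like any other local model.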

If you tried self hosting a year ago and bounced off it, it’s worth another look. Back then, 16GB of VRAM meant 7B parameter models that struggled with anything beyond simple prompts. The shift to MoE architectures changed the equation completely: 30B-class models now run faster on the same hardware while being dramatically more capable.

Performance and Cost Reality Check

Let’s be honest about the numbers.

Hardware Costs (AUD, February 2026)

The GPU is the limiting factor. Everything else is secondary. Here’s what it looks like by VRAM tier:

| VRAM Tier | Example GPUs | Price Range (AUD) | What You Can Run |
|-----------|--------------|-------------------|------------------|
| 16GB | RTX 5080, RTX 4080 | $1,200-1,800 | GLM-4.7-Flash, Qwen 3 8B, small MoE models |
| 24GB | RTX 4090, RTX 3090 (used) | $900-2,800 | All of the above + DeepSeek R1 32B, larger quantised models |
| 32GB | RTX 5090 | $5,400-5,700 | Everything including Qwen3-Coder 30B at Q5 for agentic coding |

For a complete system if you’re building from scratch:

| Component | Budget | Mid-Range | High-End |
|-----------|--------|-----------|----------|
| GPU | RTX 3090 used (~$1,000) | RTX 5080 (~$1,500) | RTX 5090 (~$5,500) |
| CPU + Motherboard | i5 + B660 (~$400) | Ryzen 7 + B650 (~$700) | Ryzen 9 + X670E (~$1,200) |
| RAM (64GB DDR4/5) | ~$200 | ~$300 | ~$400 |
| Total | ~$1,600 | ~$2,500 | ~$7,100 |

If you already have a homelab or server, you’re likely only buying the GPU — and a used RTX 3090 with 24GB of VRAM is arguably the best value in the entire stack right now.

Running Costs

NVIDIA GPUs draw anywhere from 160W to 575W under inference load, depending on the card. At Australian electricity rates of roughly $0.33/kWh (the national average — and yes, if you’re in South Australia like me, that number makes you wince a bit harder), here’s what it actually costs:

| GPU | TDP | Cost/Hour Under Load | 8 hrs/Day, Monthly |
|-----|-----|----------------------|--------------------|
| RTX 4060 Ti | ~160W | ~$0.05 | ~$13 |
| RTX 5080 | ~300W | ~$0.10 | ~$24 |
| RTX 4090 | ~350W | ~$0.12 | ~$28 |
| RTX 5090 | ~575W | ~$0.19 | ~$46 |

Not nothing, but a fraction of API costs at any serious volume. And unlike your electricity retailer, at least Ollama doesn’t send you emails about “exciting new pricing structures.”
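For the curious, the running-cost figures above are just watts times hours times tariff. A quick sketch you can rerun with your own electricity rate:

```python
def monthly_power_cost(watts: float, hours_per_day: float = 8,
                       rate_per_kwh: float = 0.33, days: int = 30) -> float:
    """Electricity cost (AUD) for a GPU running under load.
    Defaults match this post: 8 hrs/day at ~$0.33/kWh."""
    return watts / 1000 * hours_per_day * days * rate_per_kwh

for gpu, tdp in [("RTX 4060 Ti", 160), ("RTX 5080", 300),
                 ("RTX 4090", 350), ("RTX 5090", 575)]:
    print(f"{gpu}: ~${monthly_power_cost(tdp):.0f}/month")
```

Idle draw is much lower than TDP, so these are worst-case numbers for a GPU that’s actually working for its living.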

The Break-Even Calculation

If you’re spending $100+ per month on AI API calls, self hosting pays for itself within 12-18 months even with the budget setup. If you’re spending $300+, you’re looking at 4-6 months.
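Those payback estimates are easy to sanity-check. A minimal sketch using this post’s budget-build figure and the power numbers from the running-costs section, and assuming (optimistically) that local inference replaces the API spend entirely:

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_power: float) -> float:
    """Months until self-hosting hardware pays for itself. Assumes local
    inference fully replaces the API spend — an optimistic simplification."""
    monthly_saving = monthly_api_spend - monthly_power
    if monthly_saving <= 0:
        return float("inf")  # at this usage level you never break even
    return hardware_cost / monthly_saving

# Budget build (~$1,600 AUD) at ~$13/month in electricity:
print(round(breakeven_months(1600, 100, 13), 1))  # ~18.4 months at $100/month API spend
print(round(breakeven_months(1600, 300, 13), 1))  # ~5.6 months at $300/month
```

Note the `inf` branch: if your API bill is smaller than your power bill, the hardware never pays for itself, which is the caveat below in code form.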

But here’s the caveat: if you’re running the occasional ChatGPT query and a handful of API calls per day, self hosting is probably overkill. The economics only make sense at volume, for privacy sensitive workloads, or if you genuinely enjoy tinkering with infrastructure (no judgement — I clearly do).

Performance Limitations

A single consumer GPU caps out at roughly 4 concurrent inference requests before response times degrade noticeably. This is a single user or small team setup, not a production service for hundreds of users. For that, you need proper inference infrastructure — multiple GPUs, load balancing, and purpose built serving frameworks like vLLM or TensorRT-LLM.

When to Self-Host vs Use Cloud APIs

I’ve landed on a hybrid approach, and I think most people will too.

Self-host when:

  • Privacy and data sovereignty make external APIs a non-starter
  • You want an always on AI assistant that doesn’t depend on internet connectivity
  • You’re doing high volume routine tasks (documentation, summarisation, research)
  • You want fine tuning control over your models
  • You already have homelab infrastructure and enjoy the tinkering
  • You’re building local RAG pipelines over proprietary documents

Use cloud APIs when:

  • You need genuine frontier capability (complex multi-step reasoning, novel problem solving)
  • You’re doing serious agentic coding that requires high reliability across long chains of actions
  • Uptime guarantees matter more than cost
  • You need the absolute latest models the day they drop

The honest reality is that open source models have closed the gap dramatically. GLM-4.7-Flash and DeepSeek R1 are capable for a huge range of tasks. But there are still things where frontier cloud models like Claude Opus 4.6 and GPT-5.2 genuinely outperform anything you can run locally — and pretending otherwise would be doing you a disservice. Complex architecture reviews, nuanced technical writing, tricky multi file refactors — these still go to cloud APIs in my workflow.

For my day to day, most of my casual AI usage — research, brainstorming, document summarisation, quick code questions — now runs locally through Open WebUI. It’s always there, it’s fast, and nothing leaves my network.


If you’ve got a spare GPU and a Proxmox node, you can have this running in an afternoon. And if it ever breaks? The IaC means you can tear it down and rebuild from scratch in under 20 minutes. That’s the real superpower here — not the AI itself, but the confidence that what you built is repeatable. Well, that and the smug satisfaction of running nvidia-smi and seeing your GPU actually paying its rent.

Want to dig deeper? Check out the Ollama documentation for model management and API reference, the Open WebUI docs for configuration options, the BPG Proxmox provider on the Terraform Registry, and the Proxmox PCI Passthrough wiki for hardware specific troubleshooting.

Last updated: 22 February 2026