GPU Acceleration

Run Ollama with GPU acceleration for faster LLM inference

By default, PiSovereign runs Ollama inside a Docker container using CPU-only inference. With GPU acceleration, inference speed improves dramatically — especially for larger models like qwen2.5:14b or qwen2.5:32b.

Platform Overview

| Platform | GPU Access | Method |
|---|---|---|
| macOS (Apple Silicon / Intel) | Metal | Native Ollama (hybrid mode) |
| Linux + NVIDIA GPU | CUDA | Compose override file |
| Linux + AMD GPU | ROCm | Manual compose override |
| Raspberry Pi + Hailo | NPU | See Hardware Setup |

macOS — Native Ollama with Metal GPU

Docker Desktop on macOS runs containers inside a Linux VM and cannot pass through the Metal GPU. To use GPU acceleration, run Ollama natively on the host and point PiSovereign’s Docker container at it.

1. Install Ollama

brew install ollama

2. Start Ollama

ollama serve

Ollama will listen on http://localhost:11434 and automatically use Metal for GPU-accelerated inference on Apple Silicon (M1/M2/M3/M4). On Intel Macs, Ollama falls back to CPU inference.

3. Pull the inference model

# Default model (recommended for 16 GB+ RAM)
ollama pull qwen2.5:14b

# Embedding model (required)
ollama pull nomic-embed-text

4. Configure Docker environment

Edit docker/.env and set:

OLLAMA_BASE_URL=http://host.docker.internal:11434

This tells the PiSovereign container to connect to the native Ollama instance via Docker’s host.docker.internal bridge (already configured in compose.yml via extra_hosts).
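The relevant wiring looks roughly like this (a sketch, not the literal file — the service name is illustrative, and compose.yml is the source of truth):

```yaml
services:
  pisovereign:   # illustrative service name — see compose.yml for the real one
    environment:
      - OLLAMA_BASE_URL=${OLLAMA_BASE_URL}
    extra_hosts:
      # Maps host.docker.internal to the Docker host's gateway IP,
      # letting the container reach services running natively on the host
      - "host.docker.internal:host-gateway"
```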

5. Start PiSovereign

# From the repository root
just docker-up

# Or directly
cd docker && docker compose up -d

Note: The Ollama Docker container will still start but is unused. It runs idle with minimal resource consumption. The PiSovereign container connects to native Ollama via the configured OLLAMA_BASE_URL.
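If you would rather the unused container not start at all, one option (an optional tweak, not part of the stock setup) is a local override that scales the service to zero replicas:

```yaml
# docker/compose.override.yml — optional: skip starting the bundled
# Ollama container when a native Ollama instance is used instead
services:
  ollama:
    deploy:
      replicas: 0
```

If other services (such as ollama-init) declare a dependency on ollama in compose.yml, they may need the same treatment.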

Verify GPU is active

# Check Ollama is using the GPU (run after a model has been loaded)
ollama ps
# The PROCESSOR column should show "100% GPU"

# Test inference
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Hello",
  "stream": false
}'

Linux — NVIDIA GPU

On Linux with an NVIDIA GPU, Ollama runs inside Docker with full GPU passthrough via the NVIDIA Container Toolkit.

1. Install NVIDIA Container Toolkit

# Add the NVIDIA repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

2. Verify GPU is visible to Docker

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

This should display your GPU model, driver version, and CUDA version.

3. Start with GPU override

# From the repository root
just docker-up-gpu

# Or directly
cd docker && docker compose -f compose.yml -f compose.gpu-nvidia.yml up -d

This merges compose.gpu-nvidia.yml into the Ollama service, adding NVIDIA GPU device reservations and higher resource limits. The same ollama service is used — only the resource configuration is overridden.
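The override's GPU reservation is roughly of this shape (a sketch based on the standard Docker Compose GPU syntax — see docker/compose.gpu-nvidia.yml for the actual values):

```yaml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            # Request all available NVIDIA GPUs for the container
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 24G
          cpus: "8.0"
```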

4. Verify GPU inference

# Check GPU layers are loaded
docker compose -f compose.yml -f compose.gpu-nvidia.yml exec ollama ollama ps
# The PROCESSOR column should show "100% GPU" (or a CPU/GPU split)

# Check NVIDIA GPU usage
docker compose -f compose.yml -f compose.gpu-nvidia.yml exec ollama nvidia-smi

GPU Resource Limits

The GPU override file (compose.gpu-nvidia.yml) configures higher resource limits than CPU-only:

| Setting | CPU-only | GPU (NVIDIA) |
|---|---|---|
| Memory limit | 12 GB | 24 GB |
| CPU limit | 4.0 | 8.0 |
| Parallel requests | 1 | 2 |
| Loaded models | 1 | 2 |

Adjust these in docker/compose.gpu-nvidia.yml to match your hardware.


Linux — AMD GPU (ROCm)

AMD GPU support requires the ROCm-specific Ollama image and device mappings. This is not provided as a built-in profile due to the different base image, but can be configured manually:

1. Install ROCm drivers

Follow the AMD ROCm installation guide.

2. Create a compose override

Create docker/compose.override.yml:

services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    deploy:
      resources:
        limits:
          memory: 24G
          cpus: "8.0"
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_FLASH_ATTENTION=1

3. Start services

cd docker && docker compose up -d

Docker Compose automatically merges compose.yml with compose.override.yml.


Model Configuration

The inference model is configurable via the OLLAMA_MODEL environment variable in docker/.env. The ollama-init container pulls this model on first start.

| VRAM / RAM | Model | Parameter |
|---|---|---|
| 8 GB | qwen2.5:7b | OLLAMA_MODEL=qwen2.5:7b |
| 16 GB | qwen2.5:14b | OLLAMA_MODEL=qwen2.5:14b (default) |
| 24 GB+ | qwen2.5:32b | OLLAMA_MODEL=qwen2.5:32b |
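For provisioning scripts, the sizing table can be expressed as a small helper (a sketch using the thresholds above; tune them to your hardware):

```shell
#!/bin/sh
# Pick an OLLAMA_MODEL value from available memory in GB,
# following the sizing table above.
pick_model() {
  mem_gb="$1"
  if [ "$mem_gb" -ge 24 ]; then
    echo "qwen2.5:32b"
  elif [ "$mem_gb" -ge 16 ]; then
    echo "qwen2.5:14b"
  else
    echo "qwen2.5:7b"
  fi
}

pick_model 16   # prints qwen2.5:14b
```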

To change the model:

# Edit docker/.env
OLLAMA_MODEL=qwen2.5:32b

# Restart ollama-init to pull the new model
cd docker && docker compose restart ollama-init

# Or pull manually
just docker-model-pull qwen2.5:32b

The embedding model (nomic-embed-text) is always pulled regardless of the OLLAMA_MODEL setting.


Troubleshooting

macOS: Ollama not reachable from Docker

# Verify Ollama is running
curl http://localhost:11434/api/tags

# Verify Docker can reach the host
docker run --rm --add-host=host.docker.internal:host-gateway \
  curlimages/curl curl -s http://host.docker.internal:11434/api/tags

# Check .env is correct
grep OLLAMA_BASE_URL docker/.env
# Should show: OLLAMA_BASE_URL=http://host.docker.internal:11434

NVIDIA: GPU not visible in container

# Check NVIDIA driver is loaded
nvidia-smi

# Check Container Toolkit is installed
nvidia-ctk --version

# Check Docker runtime
docker info | grep -i nvidia

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Model download fails

# Check ollama-init logs
docker compose logs ollama-init

# Pull manually
docker compose exec ollama ollama pull qwen2.5:14b

# Or via Justfile
just docker-model-pull qwen2.5:14b

Performance is slow despite GPU

# Verify GPU layers are being used
ollama ps
# The PROCESSOR column should show "100% GPU", not "100% CPU";
# a split like "25%/75% CPU/GPU" means the model is spilling to system RAM

# Check if model fits in VRAM — if it spills to RAM, inference slows down
# Reduce model size if VRAM is insufficient
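As a rough rule of thumb (an approximation based on the published qwen2.5 q4 download sizes, not an exact figure), the 4-bit quantized weights come to roughly two-thirds of a gigabyte per billion parameters, before KV-cache overhead:

```shell
#!/bin/sh
# Rough weight-size estimate for a 4-bit quantized model:
# ~2/3 GB per billion parameters (integer math, approximation only).
# The KV cache adds more on top at long context lengths.
estimate_gb() {
  params_b="$1"
  echo $(( params_b * 2 / 3 ))
}

estimate_gb 14   # prints 9  — close to qwen2.5:14b's ~9 GB download
estimate_gb 32   # prints 21 — needs a 24 GB card to stay fully in VRAM
```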