# GPU Acceleration

Run Ollama with GPU acceleration for faster LLM inference.
By default, PiSovereign runs Ollama inside a Docker container using CPU-only
inference. With GPU acceleration, inference speed improves dramatically,
especially for larger models like `qwen2.5:14b` or `qwen2.5:32b`.
## Platform Overview
| Platform | GPU Access | Method |
|---|---|---|
| macOS (Apple Silicon / Intel) | Metal | Native Ollama (hybrid mode) |
| Linux + NVIDIA GPU | CUDA | Compose override file |
| Linux + AMD GPU | ROCm | Manual compose override |
| Raspberry Pi + Hailo | NPU | See Hardware Setup |
## macOS — Native Ollama with Metal GPU

Docker Desktop on macOS runs containers inside a Linux VM and cannot pass through the Metal GPU. To use GPU acceleration, run Ollama natively on the host and point PiSovereign's Docker container at it.
### 1. Install Ollama

```bash
brew install ollama
```
### 2. Start Ollama

```bash
ollama serve
```
Ollama will listen on `http://localhost:11434` and automatically use Metal for
GPU-accelerated inference on Apple Silicon (M1/M2/M3/M4) or Intel Macs.
### 3. Pull the inference model

```bash
# Default model (recommended for 16 GB+ RAM)
ollama pull qwen2.5:14b

# Embedding model (required)
ollama pull nomic-embed-text
```
### 4. Configure Docker environment

Edit `docker/.env` and set:

```bash
OLLAMA_BASE_URL=http://host.docker.internal:11434
```
This tells the PiSovereign container to connect to the native Ollama instance
via Docker's `host.docker.internal` bridge (already configured in
`compose.yml` via `extra_hosts`).
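For reference, the `extra_hosts` entry in `compose.yml` looks roughly like the sketch below; the `pisovereign` service name is an assumption here, so check the actual file for the real service definition:

```yaml
services:
  pisovereign:   # assumed service name; see compose.yml for the real one
    extra_hosts:
      # Maps host.docker.internal to the host's gateway IP so the container
      # can reach a natively running Ollama on Linux hosts; Docker Desktop
      # on macOS provides this name automatically.
      - "host.docker.internal:host-gateway"
```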
### 5. Start PiSovereign

```bash
# From the repository root
just docker-up

# Or directly
cd docker && docker compose up -d
```
Note: the Ollama Docker container still starts but is unused; it idles with minimal resource consumption. The PiSovereign container connects to native Ollama via the configured `OLLAMA_BASE_URL`.
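If you prefer not to start the idle container at all, one option is to assign it a Compose profile in a local override. This is a sketch, and it assumes no other service in `compose.yml` has a hard `depends_on` on the `ollama` service:

```yaml
# docker/compose.override.yml (sketch)
services:
  ollama:
    # A service with a profile only starts when that profile is activated,
    # so a plain `docker compose up -d` will skip this container.
    profiles: ["container-ollama"]
```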
### Verify GPU is active

```bash
# Check Ollama is using the GPU
ollama ps
# The PROCESSOR column should show "100% GPU"

# Test inference
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Hello",
  "stream": false
}'
```
## Linux — NVIDIA GPU

On Linux with an NVIDIA GPU, Ollama runs inside Docker with full GPU passthrough via the NVIDIA Container Toolkit.
### 1. Install NVIDIA Container Toolkit

```bash
# Add the NVIDIA repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
### 2. Verify GPU is visible to Docker

```bash
docker run --rm --gpus all ubuntu nvidia-smi
```

This should display your GPU model, driver version, and CUDA version.
### 3. Start with GPU override

```bash
# From the repository root
just docker-up-gpu

# Or directly
cd docker && docker compose -f compose.yml -f compose.gpu-nvidia.yml up -d
```
This merges `compose.gpu-nvidia.yml` into the Ollama service, adding NVIDIA GPU
device reservations and higher resource limits. The same `ollama` service is
used; only the resource configuration is overridden.
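The GPU section of such an override typically uses the standard Compose device reservation shown in this sketch; see `docker/compose.gpu-nvidia.yml` for the actual values:

```yaml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            # Reserve all NVIDIA GPUs for the Ollama container
            - driver: nvidia
              count: all
              capabilities: [gpu]
```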
### 4. Verify GPU inference

```bash
# Check the model is running on the GPU
docker compose -f compose.yml -f compose.gpu-nvidia.yml exec ollama ollama ps
# The PROCESSOR column should show "100% GPU"

# Check NVIDIA GPU usage
docker compose -f compose.yml -f compose.gpu-nvidia.yml exec ollama nvidia-smi
```
### GPU Resource Limits

The GPU override file (`compose.gpu-nvidia.yml`) configures higher resource
limits than the CPU-only defaults:
| Setting | CPU-only | GPU (NVIDIA) |
|---|---|---|
| Memory limit | 12 GB | 24 GB |
| CPU limit | 4.0 | 8.0 |
| Parallel requests | 1 | 2 |
| Loaded models | 1 | 2 |
Adjust these in `docker/compose.gpu-nvidia.yml` to match your hardware.
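For example, on a smaller GPU you might lower the limits along these lines (the values below are illustrative, not recommendations):

```yaml
services:
  ollama:
    deploy:
      resources:
        limits:
          memory: 16G   # illustrative: cap container memory
          cpus: "6.0"   # illustrative: cap CPU allocation
    environment:
      - OLLAMA_NUM_PARALLEL=1   # fewer concurrent requests on less VRAM
```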
## Linux — AMD GPU (ROCm)

AMD GPU support requires the ROCm-specific Ollama image (`ollama/ollama:rocm`) and device mappings. Because this uses a different base image, it is not shipped as a built-in override file, but it can be configured manually:
### 1. Install ROCm drivers

Follow the AMD ROCm installation guide.
### 2. Create a compose override

Create `docker/compose.override.yml`:

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    deploy:
      resources:
        limits:
          memory: 24G
          cpus: "8.0"
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_FLASH_ATTENTION=1
```
### 3. Start services

```bash
cd docker && docker compose up -d
```

Docker Compose automatically merges `compose.yml` with `compose.override.yml`.
## Model Configuration

The inference model is configurable via the `OLLAMA_MODEL` environment variable
in `docker/.env`. The `ollama-init` container pulls this model on first start.
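Conceptually, the init service amounts to something like the sketch below; the actual entrypoint and service definition in the repository may well differ:

```yaml
services:
  ollama-init:
    image: ollama/ollama
    depends_on:
      - ollama
    environment:
      # Point the ollama CLI at the ollama service rather than localhost
      - OLLAMA_HOST=http://ollama:11434
    # OLLAMA_MODEL is interpolated by Compose from docker/.env,
    # falling back to the default model if unset
    entrypoint: >
      sh -c 'ollama pull "${OLLAMA_MODEL:-qwen2.5:14b}"
             && ollama pull nomic-embed-text'
```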
### Recommended models by VRAM / RAM
| VRAM / RAM | Model | Setting |
|---|---|---|
| 8 GB | qwen2.5:7b | `OLLAMA_MODEL=qwen2.5:7b` |
| 16 GB | qwen2.5:14b | `OLLAMA_MODEL=qwen2.5:14b` (default) |
| 24 GB+ | qwen2.5:32b | `OLLAMA_MODEL=qwen2.5:32b` |
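As a rough rule of thumb behind this table: a 4-bit quantized model needs about half a byte per parameter, plus roughly 20% overhead for the KV cache and runtime. The helper below (illustrative only, not part of PiSovereign) makes that arithmetic explicit:

```python
def estimate_vram_gb(params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameters * bits/8 bytes, scaled by runtime overhead."""
    return params_billions * bits / 8 * overhead

# Rough fits for the table above: about 4.2, 8.4, and 19.2 GB respectively
for size in (7, 14, 32):
    print(f"qwen2.5:{size}b needs roughly {estimate_vram_gb(size):.1f} GB")
```

By this estimate, `qwen2.5:14b` fits comfortably in 16 GB while `qwen2.5:32b` needs the 24 GB tier.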
To change the model:

```bash
# Edit docker/.env
OLLAMA_MODEL=qwen2.5:32b

# Restart ollama-init to pull the new model
cd docker && docker compose restart ollama-init

# Or pull manually
just docker-model-pull qwen2.5:32b
```
The embedding model (`nomic-embed-text`) is always pulled regardless of the
`OLLAMA_MODEL` setting.
## Troubleshooting
### macOS: Ollama not reachable from Docker

```bash
# Verify Ollama is running
curl http://localhost:11434/api/tags

# Verify Docker can reach the host (the image's entrypoint is already curl)
docker run --rm --add-host=host.docker.internal:host-gateway \
  curlimages/curl -s http://host.docker.internal:11434/api/tags

# Check .env is correct
grep OLLAMA_BASE_URL docker/.env
# Should show: OLLAMA_BASE_URL=http://host.docker.internal:11434
```
### NVIDIA: GPU not visible in container

```bash
# Check NVIDIA driver is loaded
nvidia-smi

# Check Container Toolkit is installed
nvidia-ctk --version

# Check Docker runtime
docker info | grep -i nvidia

# Test GPU access
docker run --rm --gpus all ubuntu nvidia-smi
```
### Model download fails

```bash
# Check ollama-init logs
docker compose logs ollama-init

# Pull manually
docker compose exec ollama ollama pull qwen2.5:14b

# Or via Justfile
just docker-model-pull qwen2.5:14b
```
### Performance is slow despite GPU

```bash
# Verify the model is loaded on the GPU
ollama ps
# The PROCESSOR column should show "100% GPU", not "100% CPU"

# If the model does not fit in VRAM it spills to system RAM and inference
# slows down; switch to a smaller model if VRAM is insufficient
```