# llama.cpp OpenAI-Compatible Wrapper
This project wraps the existing llama.cpp TrueNAS app with OpenAI-compatible endpoints and a model management UI.
At build time, the wrapper reads deployment details from `AGENTS.md` into `app/agents_config.json`.
## Current Agents-Derived Details
- llama.cpp image: `ghcr.io/ggml-org/llama.cpp:server-cuda`
- Host port: `8071` -> container port `8080`
- Model mount: `/mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models` -> `/models`
- Network: `ix-llamacpp_default`
- Container name: `ix-llamacpp-llamacpp-1`
- GPUs: 2x NVIDIA RTX 5060 Ti (from AGENTS snapshot)
Regenerate the derived config after updating `AGENTS.md`:

```bash
python app/agents_parser.py --agents AGENTS.md --out app/agents_config.json
```
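For orientation, downstream code can read the derived snapshot along the lines of this minimal sketch (the field names are assumptions for illustration; check what `agents_parser.py` actually emits):

```python
import json
from pathlib import Path

# Load the AGENTS.md-derived snapshot. Field names below are assumed for
# illustration; inspect app/agents_config.json for the real keys.
cfg = json.loads(Path("app/agents_config.json").read_text())

container = cfg.get("container_name", "ix-llamacpp-llamacpp-1")
base_url = cfg.get("base_url", f"http://{container}:8080")
model_dir = cfg.get("model_dir", "/models")
print(base_url, model_dir)
```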
## Running Locally
```bash
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
python -m app.run
```
Defaults:

- API: `PORT_A=9093`
- UI: `PORT_B=9094`
- Base URL: `LLAMACPP_BASE_URL` (defaults to container name or localhost based on agents config)
- Model dir: `MODEL_DIR=/models`
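With these defaults, any OpenAI-style client can hit the API port. A quick stdlib-only sanity check, assuming the wrapper mirrors the standard `/v1/chat/completions` route (the model name is a placeholder; use a file that exists under `MODEL_DIR`):

```python
import json
import urllib.request

# Minimal chat completion against the wrapper's OpenAI-compatible API.
req = urllib.request.Request(
    "http://localhost:9093/v1/chat/completions",
    data=json.dumps({
        "model": "your-model.gguf",  # placeholder: pick a model from /models
        "messages": [{"role": "user", "content": "Say hello"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```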
## Docker (TrueNAS)

Example (join existing llama.cpp network and mount models):
```bash
docker run --rm -p 9093:9093 -p 9094:9094 \
  --network ix-llamacpp_default \
  -v /mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models:/models \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e LLAMACPP_RESTART_METHOD=docker \
  -e LLAMACPP_RESTART_COMMAND=ix-llamacpp-llamacpp-1 \
  -e LLAMACPP_TARGET_CONTAINER=ix-llamacpp-llamacpp-1 \
  -e TRUENAS_WS_URL=ws://192.168.1.2/websocket \
  -e TRUENAS_API_KEY=YOUR_KEY \
  -e TRUENAS_API_USER=YOUR_USER \
  -e TRUENAS_APP_NAME=llamacpp \
  -e LLAMACPP_BASE_URL=http://ix-llamacpp-llamacpp-1:8080 \
  -e PORT_A=9093 -e PORT_B=9094 \
  llama-cpp-openai-wrapper:latest
```
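Once the container is up, a short smoke test confirms both ports respond (`/v1/models` is the standard OpenAI listing route and is assumed here to be served by the wrapper):

```python
import urllib.request

# Probe the API and UI ports published by the docker run above.
for url in ("http://localhost:9093/v1/models", "http://localhost:9094/"):
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(url, "->", resp.status)
```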
## Model Hot-Swap / Restart Hooks
This wrapper does not modify llama.cpp by default. To enable hot-swap/restart for new models or model selection, provide one of the restart methods below:
```bash
LLAMACPP_RESTART_METHOD=http
LLAMACPP_RESTART_URL=http://host-or-helper/restart
```

or

```bash
LLAMACPP_RESTART_METHOD=shell
LLAMACPP_RESTART_COMMAND="/usr/local/bin/your-restart-script --arg"
```

or (requires mounting the Docker socket):

```bash
LLAMACPP_RESTART_METHOD=docker
LLAMACPP_RESTART_COMMAND=ix-llamacpp-llamacpp-1
```
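Conceptually the dispatch looks like the sketch below; this is a simplified illustration, not the wrapper's actual code (the `docker` branch, for instance, could equally use the Docker SDK over the mounted socket):

```python
import os
import shlex
import subprocess
import urllib.request

def restart_llamacpp() -> None:
    """Simplified sketch of dispatching on LLAMACPP_RESTART_METHOD."""
    method = os.environ.get("LLAMACPP_RESTART_METHOD")
    if method == "http":
        # Ask an external helper service to perform the restart.
        req = urllib.request.Request(os.environ["LLAMACPP_RESTART_URL"], method="POST")
        urllib.request.urlopen(req)
    elif method == "shell":
        # Run an arbitrary restart script or command line.
        subprocess.run(shlex.split(os.environ["LLAMACPP_RESTART_COMMAND"]), check=True)
    elif method == "docker":
        # Restart the named container via the mounted Docker socket.
        subprocess.run(["docker", "restart", os.environ["LLAMACPP_RESTART_COMMAND"]], check=True)
    else:
        raise ValueError(f"unknown restart method: {method!r}")
```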
## Model switching via TrueNAS middleware (P0)
Provide TrueNAS API credentials so the wrapper can update the llama.cpp app command when a new model is selected:
```bash
TRUENAS_WS_URL=ws://192.168.1.2/websocket
TRUENAS_API_KEY=YOUR_KEY
TRUENAS_API_USER=YOUR_USER
TRUENAS_APP_NAME=llamacpp
TRUENAS_VERIFY_SSL=false
```
The wrapper preserves the existing flags in the compose command and updates only `--model`, optionally adding missing GPU split flags from the `LLAMACPP_*` variables if they are not already set.
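In effect the update is a targeted rewrite of the command list; a minimal sketch of that behavior (the real implementation goes through the TrueNAS websocket API rather than a plain list):

```python
def set_model(command: list[str], model_path: str) -> list[str]:
    """Replace only the --model argument, leaving all other flags intact."""
    updated = list(command)
    if "--model" in updated:
        updated[updated.index("--model") + 1] = model_path
    else:
        updated += ["--model", model_path]
    return updated

# Only the model path changes; the GPU split flags survive untouched.
cmd = ["llama-server", "--model", "/models/old.gguf", "--tensor-split", "0.5,0.5"]
print(set_model(cmd, "/models/new.gguf"))
# ['llama-server', '--model', '/models/new.gguf', '--tensor-split', '0.5,0.5']
```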
Optional arguments passed to restart handlers:
```bash
LLAMACPP_DEVICES=0,1
LLAMACPP_TENSOR_SPLIT=0.5,0.5
LLAMACPP_SPLIT_MODE=layer
LLAMACPP_N_GPU_LAYERS=999
LLAMACPP_CTX_SIZE=8192
LLAMACPP_BATCH_SIZE=1024
LLAMACPP_UBATCH_SIZE=256
LLAMACPP_CACHE_TYPE_K=q4_0
LLAMACPP_CACHE_TYPE_V=q4_0
LLAMACPP_FLASH_ATTN=on
```
You can also pass arbitrary llama.cpp flags (space-separated) via:
```bash
LLAMACPP_EXTRA_ARGS="--mlock --no-mmap --rope-scaling linear"
```
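These variables correspond to `llama-server` flags roughly as in the sketch below; the mapping is an assumption based on current llama.cpp option names, so verify against your build:

```python
import os

# Assumed env-var-to-flag mapping; confirm names with `llama-server --help`.
ENV_TO_FLAG = {
    "LLAMACPP_DEVICES": "--device",
    "LLAMACPP_TENSOR_SPLIT": "--tensor-split",
    "LLAMACPP_SPLIT_MODE": "--split-mode",
    "LLAMACPP_N_GPU_LAYERS": "--n-gpu-layers",
    "LLAMACPP_CTX_SIZE": "--ctx-size",
    "LLAMACPP_BATCH_SIZE": "--batch-size",
    "LLAMACPP_UBATCH_SIZE": "--ubatch-size",
    "LLAMACPP_CACHE_TYPE_K": "--cache-type-k",
    "LLAMACPP_CACHE_TYPE_V": "--cache-type-v",
    "LLAMACPP_FLASH_ATTN": "--flash-attn",
}

def extra_server_args() -> list[str]:
    """Collect llama-server args from whichever LLAMACPP_* vars are set."""
    args: list[str] = []
    for env, flag in ENV_TO_FLAG.items():
        value = os.environ.get(env)
        if value:
            args += [flag, value]
    args += os.environ.get("LLAMACPP_EXTRA_ARGS", "").split()
    return args
```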
## Model Manager UI

Open `http://HOST:PORT_B/`.
Features:
- List existing models
- Download models via URL
- Live progress + cancel
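The download flow can in principle be scripted against the UI's backing API; the endpoint path below is hypothetical and shown only to illustrate the shape of such a call, so check the app's routes for the real one:

```python
import json
import urllib.request

# Hypothetical endpoint: "/api/download" is a placeholder, not a documented
# route. Inspect the UI's network requests for the actual path and payload.
req = urllib.request.Request(
    "http://localhost:9094/api/download",
    data=json.dumps({"url": "https://example.com/model.gguf"}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```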
## Testing
Tests are parameterized with 100+ cases per endpoint.
```bash
pytest -q
```
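The parameterized shape looks roughly like this illustrative case (not copied from the actual suite; `client` is assumed to be an HTTP test-client fixture such as FastAPI's `TestClient`):

```python
import pytest

# Illustrative only: the real suite generates 100+ cases per endpoint.
@pytest.mark.parametrize(
    "payload,expected_status",
    [
        ({"model": "m.gguf", "messages": [{"role": "user", "content": "hi"}]}, 200),
        ({"messages": []}, 422),  # missing model
        ({}, 422),                # empty body
    ],
)
def test_chat_completions(client, payload, expected_status):
    resp = client.post("/v1/chat/completions", json=payload)
    assert resp.status_code == expected_status
```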
## llama.cpp flags reference
Flag documentation is scraped from the upstream docs into `reports/llamacpp_docs.md` and `reports/llamacpp_flags.txt`. To refresh:

```powershell
pwsh scripts/update_llamacpp_flags.ps1
```