llama.cpp Wrapper Notes

Last updated: 2026-01-04

Purpose

An OpenAI-compatible wrapper around the existing llamacpp app, adding a model manager UI, model switching, and parameter management via the TrueNAS middleware.

Deployed Image

  • rushabhtechie/llamacpp-wrapper-rushg-d:20260104-112221

Ports (current)

  • API (pinned): http://192.168.1.2:9093
  • UI (pinned): http://192.168.1.2:9094
  • llama.cpp native: http://192.168.1.2:8071

Key Behaviors

  • Model switching uses TrueNAS middleware app.update to update --model (see the argument-rewrite sketch after this list).
  • --device flag is explicitly removed because it crashes llama.cpp on this host.
  • UI shows active model and supports switching with verification prompt.
  • UI auto-refreshes on download progress and on llama.cpp model changes (SSE).
  • UI allows editing llama.cpp command parameters (ctx-size, temp, top-k/p, etc.).
  • UI supports dark theme toggle (persisted in localStorage).
  • UI streams llama.cpp logs via Docker socket fallback when TrueNAS log APIs are unavailable.
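
The two command-line behaviors above imply a small argument rewrite before the app is updated. A minimal sketch, assuming the llama.cpp command is available as a flat argv list (the function name and list handling are illustrative, not the wrapper's actual code):

```python
def rewrite_llamacpp_args(args: list[str], new_model_path: str) -> list[str]:
    """Return a copy of the llama.cpp argv with --model pointing at the new
    GGUF file and any --device flag (plus its value) stripped out."""
    out: list[str] = []
    skip_next = False
    for arg in args:
        if skip_next:            # value of a flag we dropped or replaced
            skip_next = False
            continue
        if arg == "--device":    # removed because it crashes llama.cpp on this host
            skip_next = True
            continue
        if arg == "--model":     # swap in the newly selected model path
            out.extend(["--model", new_model_path])
            skip_next = True
            continue
        out.append(arg)
    if "--model" not in out:
        out.extend(["--model", new_model_path])
    return out
```

The rewritten command is then pushed through the TrueNAS middleware app.update call; the exact payload shape depends on the app's configuration schema and is not reproduced here.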

Tools Support (n8n/OpenWebUI)

  • Incoming tools in flat format ({type,name,parameters}) are normalized to OpenAI format ({type:"function", function:{...}}) before proxying to llama.cpp (see the sketch after this list).
  • Legacy functions payloads are normalized into tools.
  • tool_choice is normalized to OpenAI format as well.
  • return_format=json is supported (falls back to JSON-only system prompt if llama.cpp rejects response_format).
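
A sketch of that normalization, assuming requests arrive as plain JSON dicts; the helper names are illustrative, not the wrapper's actual functions:

```python
def normalize_tools(body: dict) -> dict:
    """Coerce flat-format tools and legacy `functions` entries into
    OpenAI-style `tools` before proxying the request to llama.cpp."""

    def to_openai(tool: dict) -> dict:
        if "function" in tool:                      # already OpenAI format
            return tool
        return {                                    # flat {type,name,parameters}
            "type": "function",
            "function": {
                "name": tool.get("name"),
                "description": tool.get("description", ""),
                "parameters": tool.get("parameters", {}),
            },
        }

    tools = [to_openai(t) for t in body.get("tools", [])]
    tools += [to_openai(f) for f in body.pop("functions", [])]  # legacy payloads
    if tools:
        body["tools"] = tools

    # tool_choice given as a bare name or a flat dict is wrapped as well.
    choice = body.get("tool_choice")
    if isinstance(choice, str) and choice not in ("auto", "none", "required"):
        body["tool_choice"] = {"type": "function", "function": {"name": choice}}
    elif isinstance(choice, dict) and "function" not in choice and "name" in choice:
        body["tool_choice"] = {"type": "function", "function": {"name": choice["name"]}}
    return body
```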

Model Resolution

  • Exact string match only (with optional explicit alias mapping).
  • Requests that do not exactly match a listed model return 404 (see the resolution sketch below).
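
A sketch of the lookup, assuming the listed model ids come from /v1/models; the alias entry shown is purely illustrative:

```python
# Optional explicit alias map; the entry below is a made-up example.
ALIASES = {"tinyllama": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"}

def resolve_model(requested: str, listed_models: set[str]) -> str | None:
    """Exact-match resolution only; the caller turns None into an HTTP 404."""
    candidate = ALIASES.get(requested, requested)
    return candidate if candidate in listed_models else None
```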

Parameters UI

  • Endpoint: GET /ui/api/llamacpp-config (active model + params + extra args)
  • Endpoint: POST /ui/api/llamacpp-config (updates command flags + extra args); see the usage sketch below.
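
A hedged usage sketch against these endpoints; the URL paths are from the notes above, but the request/response field names are assumptions rather than the documented schema:

```python
import requests

UI_BASE = "http://192.168.1.2:9094"

# Read the active model plus current llama.cpp flags and extra args.
config = requests.get(f"{UI_BASE}/ui/api/llamacpp-config", timeout=30).json()
print(config)

# Push an updated config back; the payload keys below are illustrative only.
update = {"params": {"ctx-size": 8192, "temp": 0.7}, "extra_args": []}
resp = requests.post(f"{UI_BASE}/ui/api/llamacpp-config", json=update, timeout=120)
resp.raise_for_status()
```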

Model Switch UI

  • Endpoint: POST /ui/api/switch-model with { "model_id": "..." }
  • Verifies the switch by sending a minimal prompt (see the usage sketch below).
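
A usage sketch of the switch-and-verify flow; the /ui/api/switch-model payload is from the notes above, and the verification prompt simply mirrors what the wrapper does internally (the model id is a placeholder and must exactly match /v1/models):

```python
import requests

WRAPPER_BASE = "http://192.168.1.2:9093"
UI_BASE = "http://192.168.1.2:9094"
MODEL_ID = "Qwen2.5-7B-Instruct-Q4_K_M.gguf"   # placeholder; use an exact listed id

# Ask the UI backend to switch the active model.
resp = requests.post(f"{UI_BASE}/ui/api/switch-model",
                     json={"model_id": MODEL_ID}, timeout=300)
resp.raise_for_status()

# Confirm the switch with a minimal prompt through the OpenAI-compatible API.
chat = requests.post(
    f"{WRAPPER_BASE}/v1/chat/completions",
    json={
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": "Say ok."}],
        "max_tokens": 8,
    },
    timeout=300,
)
print(chat.json()["choices"][0]["message"]["content"])
```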

Tests

  • Remote functional tests: tests/test_remote_wrapper.py (chat/responses/tools/JSON mode, model switch, logs, multi-GPU flags).
  • UI checks: tests/test_ui.py (UI elements, assets, theme toggle wiring).
  • Run with env vars (a pre-flight check using them is sketched after this list):
    • WRAPPER_BASE=http://192.168.1.2:9093
    • UI_BASE=http://192.168.1.2:9094
    • TRUENAS_WS_URL=wss://192.168.1.2/websocket
    • TRUENAS_API_KEY=...
    • MODEL_REQUEST=<exact model id from /v1/models>
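
A minimal pre-flight check that uses the same environment variables before a full test run; this snippet is illustrative and not part of the test suite:

```python
import os
import requests

wrapper_base = os.environ["WRAPPER_BASE"]   # e.g. http://192.168.1.2:9093
ui_base = os.environ["UI_BASE"]             # e.g. http://192.168.1.2:9094

# The wrapper is OpenAI-compatible, so /v1/models lists the available ids.
listed = requests.get(f"{wrapper_base}/v1/models", timeout=30).json()
ids = [m["id"] for m in listed.get("data", [])]
print("models:", ids)

# MODEL_REQUEST must be an exact id from the list above.
requested = os.environ.get("MODEL_REQUEST")
if requested and requested not in ids:
    raise SystemExit(f"{requested!r} is not listed by /v1/models")

print("UI reachable:", requests.get(ui_base, timeout=30).status_code == 200)
```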

Runtime Validation (2026-01-04)

  • Fixed a llama.cpp init failure by setting --flash-attn on (required when KV cache quantization is enabled).
  • Confirmed TinyLlama loads and answers prompts with return_format=json.
  • Switched via UI to Qwen2.5-7B-Instruct-Q4_K_M.gguf and validated prompt success.
  • Expect a transient 503 "Loading model" response during warmup; retry after the load completes (see the retry sketch after this list).
  • Verified switching to yarn-llama-2-13b-64k.Q4_K_M.gguf from the wrapper; a tool-enabled chat request completed after the load (~107 s).
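
A sketch of a warmup-tolerant request loop, assuming the 503 surfaces as a plain HTTP status on chat requests (the polling interval and overall timeout are arbitrary):

```python
import time
import requests

def chat_when_ready(base: str, payload: dict, timeout_s: float = 300.0) -> dict:
    """Retry a chat completion while the wrapper reports 503 (model loading)."""
    deadline = time.monotonic() + timeout_s
    while True:
        resp = requests.post(f"{base}/v1/chat/completions", json=payload, timeout=180)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        if time.monotonic() > deadline:
            raise TimeoutError("model still loading after the warmup window")
        time.sleep(5)   # large models (e.g. the 13B switch above) can take minutes
```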