# llama.cpp Wrapper Notes

Last updated: 2026-01-04

## Purpose

OpenAI-compatible wrapper for the existing `llamacpp` app with a model manager UI, model switching, and parameter management via TrueNAS middleware.

## Deployed Image

- `rushabhtechie/llamacpp-wrapper-rushg-d:20260104-112221`

## Ports (current)

- API (pinned): `http://192.168.1.2:9093`
- UI (pinned): `http://192.168.1.2:9094`
- llama.cpp native: `http://192.168.1.2:8071`

## Key Behaviors

- Model switching uses TrueNAS middleware `app.update` to update `--model`.
- The `--device` flag is explicitly removed because it crashes llama.cpp on this host.
- UI shows the active model and supports switching with a verification prompt.
- UI auto-refreshes on download progress and on llama.cpp model changes (SSE).
- UI allows editing llama.cpp command parameters (ctx-size, temp, top-k/p, etc.).
- UI supports a dark theme toggle (persisted in localStorage).
- UI streams llama.cpp logs via a Docker socket fallback when TrueNAS log APIs are unavailable.

## Tools Support (n8n/OpenWebUI)

- Incoming `tools` in flat format (`{type,name,parameters}`) are normalized to OpenAI format (`{type:"function", function:{...}}`) before proxying to llama.cpp (see the normalization sketch under Example Sketches below).
- Legacy `functions` payloads are normalized into `tools`.
- `tool_choice` is normalized to OpenAI format as well.
- `return_format=json` is supported (falls back to a JSON-only system prompt if llama.cpp rejects `response_format`).

## Model Resolution

- Exact string match only (with optional explicit alias mapping).
- Requests that do not exactly match a listed model return `404`.

## Parameters UI

- Endpoint: `GET /ui/api/llamacpp-config` (active model + params + extra args)
- Endpoint: `POST /ui/api/llamacpp-config` (updates command flags + extra args)

## Model Switch UI

- Endpoint: `POST /ui/api/switch-model` with `{ "model_id": "..." }`
- Verifies the switch by sending a minimal prompt.

## Tests

- Remote functional tests: `tests/test_remote_wrapper.py` (chat/responses/tools/JSON mode, model switch, logs, multi-GPU flags).
- UI checks: `tests/test_ui.py` (UI elements, assets, theme toggle wiring).
- Run with env vars:
  - `WRAPPER_BASE=http://192.168.1.2:9093`
  - `UI_BASE=http://192.168.1.2:9094`
  - `TRUENAS_WS_URL=wss://192.168.1.2/websocket`
  - `TRUENAS_API_KEY=...`
  - `MODEL_REQUEST=`

## Runtime Validation (2026-01-04)

- Fixed a llama.cpp init failure by enabling `--flash-attn on` (required with KV cache quantization).
- Confirmed TinyLlama loads and answers prompts with `return_format=json`.
- Switched via the UI to `Qwen2.5-7B-Instruct-Q4_K_M.gguf` and validated prompt success.
- Expect a transient `503 Loading model` during warmup; retry after the load completes (see the retry sketch under Example Sketches below).
- Verified switching to `yarn-llama-2-13b-64k.Q4_K_M.gguf` from the wrapper; a tool-enabled chat request completed after the load (~107 s).
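## Example Sketches

The sketches below illustrate behaviors described in the sections above. They are minimal Python examples, not the wrapper's actual implementation; any function names, payload fields, or endpoints beyond those documented above are assumptions.

### Tools normalization (flat → OpenAI format)

A minimal sketch of the normalization described under Tools Support: flat `{type,name,parameters}` tool entries and legacy `functions` payloads are rewritten into the OpenAI `{type: "function", function: {...}}` shape before the request is proxied to llama.cpp. The helper name and the handling of edge cases are assumptions.

```python
from typing import Any


def normalize_tools(payload: dict[str, Any]) -> dict[str, Any]:
    """Rewrite flat/legacy tool definitions into OpenAI tools format (sketch)."""
    # Legacy `functions` payloads become `tools` entries.
    if "functions" in payload and "tools" not in payload:
        payload["tools"] = [
            {"type": "function", "function": fn} for fn in payload.pop("functions")
        ]

    normalized = []
    for tool in payload.get("tools", []):
        if "function" in tool:
            # Already in OpenAI format; pass through unchanged.
            normalized.append(tool)
        else:
            # Flat format: {type, name, parameters, description?}.
            normalized.append({
                "type": "function",
                "function": {
                    "name": tool.get("name"),
                    "description": tool.get("description", ""),
                    "parameters": tool.get("parameters", {}),
                },
            })
    if normalized:
        payload["tools"] = normalized

    # Flat `tool_choice` values like {"name": "..."} get the same treatment.
    choice = payload.get("tool_choice")
    if isinstance(choice, dict) and "function" not in choice and "name" in choice:
        payload["tool_choice"] = {
            "type": "function",
            "function": {"name": choice["name"]},
        }
    return payload
```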
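### Reading and updating llama.cpp parameters

Reading and updating the llama.cpp command parameters goes through the two `/ui/api/llamacpp-config` endpoints listed under Parameters UI. The sketch assumes the POST body mirrors what GET returns (a `params` mapping of command flags plus an `extra_args` list); those field names are assumptions, so check the actual GET response before posting.

```python
import requests

UI_BASE = "http://192.168.1.2:9094"

# Fetch the active model, current command flags, and extra args.
config = requests.get(f"{UI_BASE}/ui/api/llamacpp-config", timeout=30).json()
print(config)

# Update a couple of flags. The payload shape (params / extra_args) is an
# assumption based on the GET response; adjust to match the real schema.
update = {
    "params": {"ctx-size": 8192, "temp": 0.7},
    "extra_args": ["--flash-attn", "on"],
}
resp = requests.post(f"{UI_BASE}/ui/api/llamacpp-config", json=update, timeout=60)
resp.raise_for_status()
```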
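### Switching models from a script

Switching models outside the UI uses the documented `POST /ui/api/switch-model` endpoint with a `model_id` payload; the wrapper then verifies the switch with a minimal prompt on the server side. A sketch, assuming the UI base URL from the Ports section and a JSON response body:

```python
import requests

UI_BASE = "http://192.168.1.2:9094"

resp = requests.post(
    f"{UI_BASE}/ui/api/switch-model",
    json={"model_id": "Qwen2.5-7B-Instruct-Q4_K_M.gguf"},
    # Switching large models can take minutes (~107 s observed for a 13B model).
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```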
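### Retrying through the `503 Loading model` warmup window

While a newly selected model warms up, the API returns a transient `503 Loading model`. A minimal retry loop, assuming the wrapper exposes the standard OpenAI-compatible `/v1/chat/completions` path on the API port; note that the model name must exactly match a listed model (or an explicit alias) or the wrapper returns `404`.

```python
import time

import requests

API_BASE = "http://192.168.1.2:9093"

payload = {
    "model": "Qwen2.5-7B-Instruct-Q4_K_M.gguf",  # exact match required, else 404
    "messages": [{"role": "user", "content": "Reply with OK."}],
}

for attempt in range(30):
    resp = requests.post(f"{API_BASE}/v1/chat/completions", json=payload, timeout=120)
    if resp.status_code == 503:
        # Model is still loading; back off and retry.
        time.sleep(10)
        continue
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
    break
```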