llama.cpp Wrapper Notes
Last updated: 2026-01-04
Purpose
OpenAI-compatible wrapper for the existing llamacpp app with a model manager UI,
model switching, and parameter management via TrueNAS middleware.
Deployed Image
rushabhtechie/llamacpp-wrapper-rushg-d:20260104-112221
Ports (current)
- API (pinned): http://192.168.1.2:9093
- UI (pinned): http://192.168.1.2:9094
- llama.cpp native: http://192.168.1.2:8071
Key Behaviors
- Model switching uses TrueNAS middleware `app.update` to update `--model`; the `--device` flag is explicitly removed because it crashes llama.cpp on this host (see the sketch after this list).
- UI shows the active model and supports switching with a verification prompt.
- UI auto-refreshes on download progress and on llama.cpp model changes (SSE).
- UI allows editing llama.cpp command parameters (ctx-size, temp, top-k/p, etc.).
- UI supports dark theme toggle (persisted in localStorage).
- UI streams llama.cpp logs via Docker socket fallback when TrueNAS log APIs are unavailable.
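A hedged sketch of the switch flow over the TrueNAS middleware websocket. The DDP-style handshake and the `auth.login_with_api_key` / `app.update` method names are TrueNAS SCALE API surface, but the app name and the config payload shape are assumptions, not the wrapper's actual code:

```python
# Sketch only: APP_NAME and new_config below are hypothetical.
import asyncio
import json
import websockets

TRUENAS_WS_URL = "wss://192.168.1.2/websocket"
APP_NAME = "llamacpp"  # hypothetical app name on this host

async def call(ws, msg_id: str, method: str, params: list):
    # Send a middleware method call and wait for the matching reply.
    await ws.send(json.dumps(
        {"id": msg_id, "msg": "method", "method": method, "params": params}))
    while True:
        reply = json.loads(await ws.recv())
        if reply.get("id") == msg_id:
            return reply.get("result")

async def switch_model(model_path: str, api_key: str):
    async with websockets.connect(TRUENAS_WS_URL) as ws:
        # Middleware expects a DDP-style connect frame before any calls.
        await ws.send(json.dumps(
            {"msg": "connect", "version": "1", "support": ["1"]}))
        await ws.recv()
        await call(ws, "1", "auth.login_with_api_key", [api_key])
        # Assumed payload: rebuild the llama.cpp command with the new
        # --model and with --device removed (it crashes llama.cpp here).
        new_config = {"command_args": ["--model", model_path]}
        await call(ws, "2", "app.update", [APP_NAME, new_config])

asyncio.run(switch_model("/models/tinyllama.gguf", "<TRUENAS_API_KEY>"))
```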
Tools Support (n8n/OpenWebUI)
- Incoming `tools` in flat format (`{type,name,parameters}`) are normalized to OpenAI format (`{type:"function", function:{...}}`) before proxying to llama.cpp (sketch below).
- Legacy `functions` payloads are normalized into `tools`.
- `tool_choice` is normalized to OpenAI format as well.
- `return_format=json` is supported (falls back to a JSON-only system prompt if llama.cpp rejects `response_format`).
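A minimal sketch of the normalization described above; the function name and exact field handling are illustrative, not the wrapper's code:

```python
def normalize_tools(body: dict) -> dict:
    """Normalize flat and legacy tool payloads to OpenAI format (sketch)."""
    tools = list(body.get("tools") or [])
    # Legacy OpenAI "functions" field -> modern "tools" entries.
    for fn in body.pop("functions", None) or []:
        tools.append({"type": "function", "function": fn})
    normalized = []
    for tool in tools:
        if "function" in tool:
            normalized.append(tool)  # already OpenAI format
        else:
            # Flat {type,name,parameters} -> {"type":"function","function":{...}}
            normalized.append({
                "type": "function",
                "function": {
                    "name": tool["name"],
                    "parameters": tool.get("parameters", {}),
                },
            })
    if normalized:
        body["tools"] = normalized
    return body
```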
Model Resolution
- Exact string match only (with optional explicit alias mapping).
- Requests that do not exactly match a listed model return 404 (sketch below).
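A minimal sketch of this resolution rule; the alias map and error handling here are illustrative only:

```python
ALIASES = {"qwen": "Qwen2.5-7B-Instruct-Q4_K_M.gguf"}  # hypothetical alias

def resolve_model(requested: str, listed: set[str]) -> str:
    # Apply explicit alias mapping, then require an exact match.
    requested = ALIASES.get(requested, requested)
    if requested not in listed:
        raise LookupError(f"model {requested!r} not found")  # -> HTTP 404
    return requested
```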
Parameters UI
- Endpoint: `GET /ui/api/llamacpp-config` (returns active model + params + extra args)
- Endpoint: `POST /ui/api/llamacpp-config` (updates command flags + extra args; example below)
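A hedged usage example against these endpoints with `requests`, assuming they are served from the pinned UI port; the JSON field names in the POST body are assumptions about the payload shape:

```python
import requests

UI_BASE = "http://192.168.1.2:9094"

# Read the active model plus current llama.cpp params and extra args.
cfg = requests.get(f"{UI_BASE}/ui/api/llamacpp-config", timeout=10).json()
print(cfg)

# Update command flags; "params" / "extra_args" are assumed field names.
requests.post(f"{UI_BASE}/ui/api/llamacpp-config", json={
    "params": {"ctx-size": 8192, "temp": 0.7, "top-k": 40, "top-p": 0.9},
    "extra_args": ["--flash-attn", "on"],
}, timeout=10).raise_for_status()
```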
Model Switch UI
- Endpoint:
POST /ui/api/switch-modelwith{ "model_id": "..." } - Verifies switch by sending a minimal prompt.
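A sketch of a switch followed by the same minimal-prompt verification the wrapper performs; the model id is one listed in this document, and the timeouts are illustrative:

```python
import requests

UI_BASE = "http://192.168.1.2:9094"   # assuming /ui/api lives on the UI port
API_BASE = "http://192.168.1.2:9093"

# Trigger the switch.
requests.post(f"{UI_BASE}/ui/api/switch-model",
              json={"model_id": "Qwen2.5-7B-Instruct-Q4_K_M.gguf"},
              timeout=30).raise_for_status()

# Verify with a minimal prompt, mirroring the wrapper's own check.
r = requests.post(f"{API_BASE}/v1/chat/completions", json={
    "model": "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 4,
}, timeout=300)
print(r.status_code, r.json())
```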
Tests
- Remote functional tests: `tests/test_remote_wrapper.py` (chat/responses/tools/JSON mode, model switch, logs, multi-GPU flags).
- UI checks: `tests/test_ui.py` (UI elements, assets, theme toggle wiring).
- Run with env vars (consumed roughly as sketched below): `WRAPPER_BASE=http://192.168.1.2:9093` `UI_BASE=http://192.168.1.2:9094` `TRUENAS_WS_URL=wss://192.168.1.2/websocket` `TRUENAS_API_KEY=...` `MODEL_REQUEST=<exact model id from /v1/models>`
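A sketch of how the tests presumably consume these variables; the defaults shown are the pinned endpoints above, and none of this is confirmed test code:

```python
import os

WRAPPER_BASE = os.environ.get("WRAPPER_BASE", "http://192.168.1.2:9093")
UI_BASE = os.environ.get("UI_BASE", "http://192.168.1.2:9094")
TRUENAS_WS_URL = os.environ.get("TRUENAS_WS_URL", "wss://192.168.1.2/websocket")
TRUENAS_API_KEY = os.environ["TRUENAS_API_KEY"]   # required, no safe default
MODEL_REQUEST = os.environ["MODEL_REQUEST"]       # exact id from /v1/models
```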
Runtime Validation (2026-01-04)
- Fixed llama.cpp init failure by enabling `--flash-attn on` (required with KV cache quantization).
- Confirmed TinyLlama loads and answers prompts with `return_format=json`.
- Switched via UI to `Qwen2.5-7B-Instruct-Q4_K_M.gguf` and validated prompt success.
- Expect transient `503 Loading model` responses during warmup; retry after the load completes (retry sketch below).
- Verified `yarn-llama-2-13b-64k.Q4_K_M.gguf` model switch from the wrapper; a tool-enabled chat request completed after load (took ~107 s).
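Given the transient 503 during warmup, clients should retry; a minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` endpoint on the pinned API port:

```python
import time
import requests

API_BASE = "http://192.168.1.2:9093"

def chat_with_retry(payload: dict, retries: int = 30, delay: float = 5.0):
    # Retry while llama.cpp reports HTTP 503 ("Loading model") during warmup;
    # large models (e.g. the 13B switch above, ~107 s) need patient clients.
    for _ in range(retries):
        r = requests.post(f"{API_BASE}/v1/chat/completions",
                          json=payload, timeout=300)
        if r.status_code != 503:
            r.raise_for_status()
            return r.json()
        time.sleep(delay)
    raise TimeoutError("model still loading after retries")
```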