llama.cpp Wrapper Notes

Last updated: 2026-01-04

Purpose

An OpenAI-compatible wrapper around the existing llamacpp app, adding a model manager UI, model switching, and parameter management via the TrueNAS middleware.

Deployed Image

  • rushabhtechie/llamacpp-wrapper-rushg-d:20260104-112221

Ports (current)

  • API (pinned): http://192.168.1.2:9093
  • UI (pinned): http://192.168.1.2:9094
  • llama.cpp native: http://192.168.1.2:8071

Key Behaviors

  • Model switching uses TrueNAS middleware app.update to update --model (see the argument-rewrite sketch after this list).
  • --device flag is explicitly removed because it crashes llama.cpp on this host.
  • UI shows active model and supports switching with verification prompt.
  • UI auto-refreshes on download progress and on llama.cpp model changes (SSE).
  • UI allows editing llama.cpp command parameters (ctx-size, temp, top-k/p, etc.).
  • UI supports dark theme toggle (persisted in localStorage).
  • UI streams llama.cpp logs via Docker socket fallback when TrueNAS log APIs are unavailable.
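
The two command-line behaviors above imply a small argument rewrite before the app is updated. A minimal sketch, assuming the llama.cpp command is available as a flat argv list (the function name and list handling are illustrative, not the wrapper's actual code):

```python
def rewrite_llamacpp_args(args: list[str], new_model_path: str) -> list[str]:
    """Return a copy of the llama.cpp argv with --model pointing at the new
    GGUF file and any --device flag (plus its value) stripped out."""
    out: list[str] = []
    skip_next = False
    for arg in args:
        if skip_next:            # value of a flag we dropped or replaced
            skip_next = False
            continue
        if arg == "--device":    # removed because it crashes llama.cpp on this host
            skip_next = True
            continue
        if arg == "--model":     # swap in the newly selected model path
            out.extend(["--model", new_model_path])
            skip_next = True
            continue
        out.append(arg)
    if "--model" not in out:
        out.extend(["--model", new_model_path])
    return out
```

The rewritten command is then pushed through the TrueNAS middleware app.update call; the exact payload shape depends on the app's configuration schema and is not reproduced here.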

Tools Support (n8n/OpenWebUI)

  • Incoming tools in flat format ({type,name,parameters}) are normalized to OpenAI format ({type:"function", function:{...}}) before proxying to llama.cpp (see the sketch after this list).
  • Legacy functions payloads are normalized into tools.
  • tool_choice is normalized to OpenAI format as well.
  • return_format=json is supported (falls back to JSON-only system prompt if llama.cpp rejects response_format).
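
A sketch of that normalization, assuming requests arrive as plain JSON dicts; the helper names are illustrative, not the wrapper's actual functions:

```python
def normalize_tools(body: dict) -> dict:
    """Coerce flat-format tools and legacy `functions` entries into
    OpenAI-style `tools` before proxying the request to llama.cpp."""

    def to_openai(tool: dict) -> dict:
        if "function" in tool:                      # already OpenAI format
            return tool
        return {                                    # flat {type,name,parameters}
            "type": "function",
            "function": {
                "name": tool.get("name"),
                "description": tool.get("description", ""),
                "parameters": tool.get("parameters", {}),
            },
        }

    tools = [to_openai(t) for t in body.get("tools", [])]
    tools += [to_openai(f) for f in body.pop("functions", [])]  # legacy payloads
    if tools:
        body["tools"] = tools

    # tool_choice given as a bare name or a flat dict is wrapped as well.
    choice = body.get("tool_choice")
    if isinstance(choice, str) and choice not in ("auto", "none", "required"):
        body["tool_choice"] = {"type": "function", "function": {"name": choice}}
    elif isinstance(choice, dict) and "function" not in choice and "name" in choice:
        body["tool_choice"] = {"type": "function", "function": {"name": choice["name"]}}
    return body
```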

Model Resolution

  • Exact string match only (with optional explicit alias mapping).
  • Requests that do not exactly match a listed model return 404 (see the resolution sketch below).
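
A sketch of the lookup, assuming the listed model ids come from /v1/models; the alias entry shown is purely illustrative:

```python
# Optional explicit alias map; the entry below is a made-up example.
ALIASES = {"tinyllama": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"}

def resolve_model(requested: str, listed_models: set[str]) -> str | None:
    """Exact-match resolution only; the caller turns None into an HTTP 404."""
    candidate = ALIASES.get(requested, requested)
    return candidate if candidate in listed_models else None
```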

Parameters UI

  • Endpoint: GET /ui/api/llamacpp-config (active model + params + extra args)
  • Endpoint: POST /ui/api/llamacpp-config (updates command flags + extra args); see the usage sketch below.
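
A hedged usage sketch against these endpoints; the URL paths are from the notes above, but the request/response field names are assumptions rather than the documented schema:

```python
import requests

UI_BASE = "http://192.168.1.2:9094"

# Read the active model plus current llama.cpp flags and extra args.
config = requests.get(f"{UI_BASE}/ui/api/llamacpp-config", timeout=30).json()
print(config)

# Push an updated config back; the payload keys below are illustrative only.
update = {"params": {"ctx-size": 8192, "temp": 0.7}, "extra_args": []}
resp = requests.post(f"{UI_BASE}/ui/api/llamacpp-config", json=update, timeout=120)
resp.raise_for_status()
```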

Model Switch UI

  • Endpoint: POST /ui/api/switch-model with { "model_id": "..." }
  • Verifies the switch by sending a minimal prompt (see the usage sketch below).
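
A usage sketch of the switch-and-verify flow; the /ui/api/switch-model payload is from the notes above, and the verification prompt simply mirrors what the wrapper does internally (the model id is a placeholder and must exactly match /v1/models):

```python
import requests

WRAPPER_BASE = "http://192.168.1.2:9093"
UI_BASE = "http://192.168.1.2:9094"
MODEL_ID = "Qwen2.5-7B-Instruct-Q4_K_M.gguf"   # placeholder; use an exact listed id

# Ask the UI backend to switch the active model.
resp = requests.post(f"{UI_BASE}/ui/api/switch-model",
                     json={"model_id": MODEL_ID}, timeout=300)
resp.raise_for_status()

# Confirm the switch with a minimal prompt through the OpenAI-compatible API.
chat = requests.post(
    f"{WRAPPER_BASE}/v1/chat/completions",
    json={
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": "Say ok."}],
        "max_tokens": 8,
    },
    timeout=300,
)
print(chat.json()["choices"][0]["message"]["content"])
```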

Tests

  • Remote functional tests: tests/test_remote_wrapper.py (chat/responses/tools/JSON mode, model switch, logs, multi-GPU flags).
  • UI checks: tests/test_ui.py (UI elements, assets, theme toggle wiring).
  • Run with env vars (a pre-flight check using them is sketched after this list):
    • WRAPPER_BASE=http://192.168.1.2:9093
    • UI_BASE=http://192.168.1.2:9094
    • TRUENAS_WS_URL=wss://192.168.1.2/websocket
    • TRUENAS_API_KEY=...
    • MODEL_REQUEST=<exact model id from /v1/models>
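
A minimal pre-flight check that uses the same environment variables before a full test run; this snippet is illustrative and not part of the test suite:

```python
import os
import requests

wrapper_base = os.environ["WRAPPER_BASE"]   # e.g. http://192.168.1.2:9093
ui_base = os.environ["UI_BASE"]             # e.g. http://192.168.1.2:9094

# The wrapper is OpenAI-compatible, so /v1/models lists the available ids.
listed = requests.get(f"{wrapper_base}/v1/models", timeout=30).json()
ids = [m["id"] for m in listed.get("data", [])]
print("models:", ids)

# MODEL_REQUEST must be an exact id from the list above.
requested = os.environ.get("MODEL_REQUEST")
if requested and requested not in ids:
    raise SystemExit(f"{requested!r} is not listed by /v1/models")

print("UI reachable:", requests.get(ui_base, timeout=30).status_code == 200)
```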

Runtime Validation (2026-01-04)

  • Fixed a llama.cpp init failure by setting --flash-attn on (required when KV cache quantization is enabled).
  • Confirmed TinyLlama loads and answers prompts with return_format=json.
  • Switched via UI to Qwen2.5-7B-Instruct-Q4_K_M.gguf and validated prompt success.
  • Expect a transient 503 "Loading model" response during warmup; retry after the load completes (see the retry sketch after this list).
  • Verified switching to yarn-llama-2-13b-64k.Q4_K_M.gguf from the wrapper; a tool-enabled chat request completed after the load (~107 s).
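
A sketch of a warmup-tolerant request loop, assuming the 503 surfaces as a plain HTTP status on chat requests (the polling interval and overall timeout are arbitrary):

```python
import time
import requests

def chat_when_ready(base: str, payload: dict, timeout_s: float = 300.0) -> dict:
    """Retry a chat completion while the wrapper reports 503 (model loading)."""
    deadline = time.monotonic() + timeout_s
    while True:
        resp = requests.post(f"{base}/v1/chat/completions", json=payload, timeout=180)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        if time.monotonic() > deadline:
            raise TimeoutError("model still loading after the warmup window")
        time.sleep(5)   # large models (e.g. the 13B switch above) can take minutes
```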