# llama.cpp Wrapper Notes

Last updated: 2026-01-04
## Purpose

OpenAI-compatible wrapper for the existing `llamacpp` app with a model manager UI,
model switching, and parameter management via TrueNAS middleware.
## Deployed Image

- `rushabhtechie/llamacpp-wrapper-rushg-d:20260104-112221`
## Ports (current)

- API (pinned): `http://192.168.1.2:9093`
- UI (pinned): `http://192.168.1.2:9094`
- llama.cpp native: `http://192.168.1.2:8071`
## Key Behaviors

- Model switching uses TrueNAS middleware `app.update` to update `--model` (see the sketch after this list).
- The `--device` flag is explicitly removed because it crashes llama.cpp on this host.
- The UI shows the active model and supports switching with a verification prompt.
- The UI auto-refreshes on download progress and on llama.cpp model changes (SSE).
- The UI allows editing llama.cpp command parameters (ctx-size, temp, top-k/p, etc.).
- The UI supports a dark theme toggle (persisted in localStorage).
- The UI streams llama.cpp logs via a Docker socket fallback when TrueNAS log APIs are unavailable.
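Below is a minimal sketch of that switch flow, assuming a hypothetical `middleware_call` helper for an authenticated TrueNAS API call. The `app.config` companion call, the `extra_args` key, and the exact `app.update` payload shape are assumptions; they vary with the TrueNAS SCALE version and the app's config schema.

```python
# Hypothetical sketch of the model-switch flow. `middleware_call` stands in
# for an authenticated TrueNAS websocket API call; payload/key names below
# are assumptions, not the wrapper's actual schema.
APP_NAME = "llamacpp"  # assumed app name

def switch_model(middleware_call, model_path: str) -> None:
    """Point the app's --model flag at a new GGUF and redeploy via app.update."""
    config = middleware_call("app.config", APP_NAME)   # assumed companion call
    args = list(config.get("extra_args", []))          # assumed key for CLI flags

    # Drop --device (and its value) entirely: it crashes llama.cpp on this host.
    cleaned, skip = [], False
    for arg in args:
        if skip:
            skip = False
        elif arg == "--device":
            skip = True  # also skip the flag's value
        else:
            cleaned.append(arg)

    # Rewrite --model in place, or append it if absent
    # (assumes every flag in the list is followed by its value).
    if "--model" in cleaned:
        cleaned[cleaned.index("--model") + 1] = model_path
    else:
        cleaned += ["--model", model_path]

    middleware_call("app.update", APP_NAME, {"values": {"extra_args": cleaned}})
```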
## Tools Support (n8n/OpenWebUI)

- Incoming `tools` in flat format (`{type,name,parameters}`) are normalized to
  OpenAI format (`{type:"function", function:{...}}`) before proxying to llama.cpp (sketched below).
- Legacy `functions` payloads are normalized into `tools`.
- `tool_choice` is normalized to OpenAI format as well.
- `return_format=json` is supported (falls back to a JSON-only system prompt if llama.cpp rejects `response_format`).
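A minimal sketch of this normalization; the function name and defaults are illustrative, and only the input/output shapes come from these notes (the bare-string `tool_choice` case is an assumption).

```python
# Illustrative sketch of the tools normalization described above.
def normalize_tools(body: dict) -> dict:
    """Rewrite flat/legacy tool specs into OpenAI tools format."""
    # Legacy `functions` payloads become `tools` entries.
    for fn in body.pop("functions", []) or []:
        body.setdefault("tools", []).append({"type": "function", "function": fn})

    # Flat {type, name, parameters} entries become {type: "function", function: {...}}.
    normalized = []
    for tool in body.get("tools", []):
        if "function" in tool:
            normalized.append(tool)  # already OpenAI-shaped
        else:
            normalized.append({
                "type": "function",
                "function": {
                    "name": tool.get("name"),
                    "description": tool.get("description", ""),
                    "parameters": tool.get("parameters", {}),
                },
            })
    if normalized:
        body["tools"] = normalized

    # Assumed flat form: a bare tool name string becomes the OpenAI object form.
    choice = body.get("tool_choice")
    if isinstance(choice, str) and choice not in ("auto", "none", "required"):
        body["tool_choice"] = {"type": "function", "function": {"name": choice}}
    return body
```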
## Model Resolution

- Exact string match only, with optional explicit alias mapping (see the sketch after this list).
- Requests that do not exactly match a listed model return `404`.
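A sketch of that resolution rule; the alias table contents and the error-to-`404` plumbing are illustrative.

```python
# Exact-match resolution with an explicit alias map; no fuzzy matching.
ALIASES = {"tinyllama": "TinyLlama-1.1B-Chat-v1.0-Q4_K_M.gguf"}  # illustrative entry

def resolve_model(requested: str, available: list[str]) -> str:
    requested = ALIASES.get(requested, requested)  # explicit aliases only
    if requested not in available:
        # Anything not listed verbatim in /v1/models is rejected,
        # surfaced to the client as HTTP 404.
        raise LookupError(f"unknown model: {requested}")
    return requested
```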
## Parameters UI

- Endpoint: `GET /ui/api/llamacpp-config` (active model + params + extra args)
- Endpoint: `POST /ui/api/llamacpp-config` (updates command flags + extra args; example round-trip below)
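A hypothetical round-trip against these endpoints. The request/response JSON schema isn't pinned down in these notes, so the `params` and `ctx-size` keys below are assumptions.

```python
# Assumed usage of the config endpoints; only the URLs and the
# "active model + params + extra args" description come from these notes.
import requests

UI_BASE = "http://192.168.1.2:9094"

cfg = requests.get(f"{UI_BASE}/ui/api/llamacpp-config", timeout=10).json()
print(cfg)  # active model + params + extra args

# Hypothetical update: bump the context size and resend the config.
cfg["params"]["ctx-size"] = 8192  # assumed key layout
requests.post(
    f"{UI_BASE}/ui/api/llamacpp-config", json=cfg, timeout=30
).raise_for_status()
```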
## Model Switch UI

- Endpoint: `POST /ui/api/switch-model` with `{ "model_id": "..." }`
- Verifies the switch by sending a minimal prompt (example below).
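An example switch-and-verify exchange. The `model_id` key is from these notes; the verification request mirrors the wrapper's own minimal-prompt check, and the timeouts are assumptions.

```python
# Switch the active model via the UI API, then verify with a minimal prompt.
import requests

UI_BASE = "http://192.168.1.2:9094"
API_BASE = "http://192.168.1.2:9093"
MODEL_ID = "Qwen2.5-7B-Instruct-Q4_K_M.gguf"  # must match /v1/models exactly

requests.post(
    f"{UI_BASE}/ui/api/switch-model",
    json={"model_id": MODEL_ID},
    timeout=300,  # large models can take minutes to load
).raise_for_status()

# Minimal verification prompt, as the wrapper itself sends after a switch.
reply = requests.post(
    f"{API_BASE}/v1/chat/completions",
    json={"model": MODEL_ID, "messages": [{"role": "user", "content": "ping"}]},
    timeout=120,
).json()
print(reply["choices"][0]["message"]["content"])
```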
## Tests

- Remote functional tests: `tests/test_remote_wrapper.py` (chat/responses/tools/JSON mode, model switch, logs, multi-GPU flags).
- UI checks: `tests/test_ui.py` (UI elements, assets, theme toggle wiring).
- Run with env vars:
  - `WRAPPER_BASE=http://192.168.1.2:9093`
  - `UI_BASE=http://192.168.1.2:9094`
  - `TRUENAS_WS_URL=wss://192.168.1.2/websocket`
  - `TRUENAS_API_KEY=...`
  - `MODEL_REQUEST=<exact model id from /v1/models>`
## Runtime Validation (2026-01-04)

- Fixed a llama.cpp init failure by enabling `--flash-attn on` (required with KV cache quantization).
- Confirmed TinyLlama loads and answers prompts with `return_format=json`.
- Switched via the UI to `Qwen2.5-7B-Instruct-Q4_K_M.gguf` and validated prompt success.
- Expect a transient `503 Loading model` during warmup; retry after the load completes (retry sketch below).
- Verified the `yarn-llama-2-13b-64k.Q4_K_M.gguf` model switch from the wrapper; a tool-enabled chat request completed after load (~107s).
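A sketch of the retry loop the `503` note implies: poll until warmup finishes instead of failing on the first response. The retry budget below (30 × 5s) comfortably covers the ~107s load observed above; the function name is illustrative.

```python
# Retry chat requests while llama.cpp returns 503 "Loading model" during warmup.
import time
import requests

API_BASE = "http://192.168.1.2:9093"

def chat_with_warmup_retry(payload: dict, retries: int = 30, delay: float = 5.0) -> dict:
    for _ in range(retries):
        resp = requests.post(
            f"{API_BASE}/v1/chat/completions", json=payload, timeout=180
        )
        if resp.status_code != 503:  # anything but "Loading model" is final
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)  # model still warming up; wait and retry
    raise TimeoutError("model still loading after retries")
```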