# llama.cpp OpenAI-Compatible Wrapper

This project wraps the existing llama.cpp TrueNAS app with OpenAI-compatible endpoints and a model management UI. At build time, the wrapper reads deployment details from `AGENTS.md` into `app/agents_config.json`.

## Current Agents-Derived Details

- llama.cpp image: `ghcr.io/ggml-org/llama.cpp:server-cuda`
- Host port: `8071` -> container port `8080`
- Model mount: `/mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models` -> `/models`
- Network: `ix-llamacpp_default`
- Container name: `ix-llamacpp-llamacpp-1`
- GPUs: 2x NVIDIA RTX 5060 Ti (from AGENTS snapshot)

Regenerate the derived config after updating `AGENTS.md`:

```bash
python app/agents_parser.py --agents AGENTS.md --out app/agents_config.json
```

## Running Locally

```bash
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
python -m app.run
```

Defaults:

- API: `PORT_A=9093`
- UI: `PORT_B=9094`
- Base URL: `LLAMACPP_BASE_URL` (defaults to the container name or localhost, based on the agents config)
- Model dir: `MODEL_DIR=/models`

## Docker (TrueNAS)

Example (join the existing llama.cpp network and mount the models dataset):

```bash
docker run --rm -p 9093:9093 -p 9094:9094 \
  --network ix-llamacpp_default \
  -v /mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models:/models \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e LLAMACPP_RESTART_METHOD=docker \
  -e LLAMACPP_RESTART_COMMAND=ix-llamacpp-llamacpp-1 \
  -e LLAMACPP_TARGET_CONTAINER=ix-llamacpp-llamacpp-1 \
  -e TRUENAS_WS_URL=ws://192.168.1.2/websocket \
  -e TRUENAS_API_KEY=YOUR_KEY \
  -e TRUENAS_API_USER=YOUR_USER \
  -e TRUENAS_APP_NAME=llamacpp \
  -e LLAMACPP_BASE_URL=http://ix-llamacpp-llamacpp-1:8080 \
  -e PORT_A=9093 -e PORT_B=9094 \
  llama-cpp-openai-wrapper:latest
```

## Model Hot-Swap / Restart Hooks

This wrapper does not modify llama.cpp by default. To enable hot-swap/restart when a new model is downloaded or selected, configure one of the restart methods below:

- `LLAMACPP_RESTART_METHOD=http`
- `LLAMACPP_RESTART_URL=http://host-or-helper/restart`

or

- `LLAMACPP_RESTART_METHOD=shell`
- `LLAMACPP_RESTART_COMMAND="/usr/local/bin/your-restart-script --arg"`

or (requires mounting the Docker socket)

- `LLAMACPP_RESTART_METHOD=docker`
- `LLAMACPP_RESTART_COMMAND=ix-llamacpp-llamacpp-1`

## Model switching via TrueNAS middleware (P0)

Provide TrueNAS API credentials so the wrapper can update the llama.cpp app command when a new model is selected:

```
TRUENAS_WS_URL=ws://192.168.1.2/websocket
TRUENAS_API_KEY=YOUR_KEY
TRUENAS_API_USER=YOUR_USER
TRUENAS_APP_NAME=llamacpp
TRUENAS_VERIFY_SSL=false
```

The wrapper preserves the existing flags in the compose command and only updates `--model`, optionally adding GPU split flags from `LLAMACPP_*` when they are not already set.

Optional arguments passed to the restart handlers:

```
LLAMACPP_DEVICES=0,1
LLAMACPP_TENSOR_SPLIT=0.5,0.5
LLAMACPP_SPLIT_MODE=layer
LLAMACPP_N_GPU_LAYERS=999
LLAMACPP_CTX_SIZE=8192
LLAMACPP_BATCH_SIZE=1024
LLAMACPP_UBATCH_SIZE=256
LLAMACPP_CACHE_TYPE_K=q4_0
LLAMACPP_CACHE_TYPE_V=q4_0
LLAMACPP_FLASH_ATTN=on
```

You can also pass arbitrary llama.cpp flags (space-separated) via:

```
LLAMACPP_EXTRA_ARGS="--mlock --no-mmap --rope-scaling linear"
```

## Model Manager UI

Open `http://HOST:PORT_B/`. Features:

- List existing models
- Download models via URL
- Live progress + cancel

## Testing

Tests are parameterized with 100+ cases per endpoint.

```bash
pytest -q
```
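## Example: calling the API

Once the wrapper is running, any OpenAI client can point at the API port. Below is a minimal sketch using the official `openai` Python package, assuming the default `PORT_A=9093`, the standard `/v1` path, no authentication, and a GGUF model already present in `/models`; the model name and API key are placeholders, not values defined by this project.

```python
from openai import OpenAI

# Point the client at the wrapper's OpenAI-compatible API (default PORT_A).
# The api_key value is a placeholder; the wrapper may not require one.
client = OpenAI(base_url="http://localhost:9093/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-model.gguf",  # placeholder: a model from the /models mount
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```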
## llama.cpp flags reference

Scraped from upstream docs into `reports/llamacpp_docs.md` and `reports/llamacpp_flags.txt`. Regenerate with:

```bash
pwsh scripts/update_llamacpp_flags.ps1
```
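For orientation, the typed `LLAMACPP_*` variables above correspond to upstream `llama-server` flags of the same name (`--tensor-split`, `--ctx-size`, and so on). The sketch below shows how such a merge could be expressed; it is illustrative only and is not the wrapper's actual implementation.

```python
import os
import shlex

# Illustrative mapping from LLAMACPP_* env vars to llama-server flags.
# The wrapper's real flag set comes from the scraped reference above.
FLAG_MAP = {
    "LLAMACPP_TENSOR_SPLIT": "--tensor-split",
    "LLAMACPP_SPLIT_MODE": "--split-mode",
    "LLAMACPP_N_GPU_LAYERS": "--n-gpu-layers",
    "LLAMACPP_CTX_SIZE": "--ctx-size",
    "LLAMACPP_BATCH_SIZE": "--batch-size",
    "LLAMACPP_UBATCH_SIZE": "--ubatch-size",
    "LLAMACPP_CACHE_TYPE_K": "--cache-type-k",
    "LLAMACPP_CACHE_TYPE_V": "--cache-type-v",
}

def build_server_args(model_path: str) -> list[str]:
    """Build a llama-server argument list from the environment (sketch only)."""
    args = ["--model", model_path]
    for env_name, flag in FLAG_MAP.items():
        value = os.environ.get(env_name)
        if value:  # only emit flags that are explicitly set
            args += [flag, value]
    # Arbitrary extra flags are appended verbatim, split on whitespace.
    args += shlex.split(os.environ.get("LLAMACPP_EXTRA_ARGS", ""))
    return args

if __name__ == "__main__":
    print(build_server_args("/models/your-model.gguf"))
```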