# llama.cpp OpenAI-Compatible Wrapper
This project wraps the existing llama.cpp TrueNAS app with OpenAI-compatible endpoints and a model management UI.
At build time, the wrapper parses deployment details from `AGENTS.md` into `app/agents_config.json`.
## Current Agents-Derived Details
- llama.cpp image: `ghcr.io/ggml-org/llama.cpp:server-cuda`
- Host port: `8071` -> container port `8080`
- Model mount: `/mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models` -> `/models`
- Network: `ix-llamacpp_default`
- Container name: `ix-llamacpp-llamacpp-1`
- GPUs: 2x NVIDIA RTX 5060 Ti (from AGENTS snapshot)
Regenerate the derived config after updating `AGENTS.md`:
```bash
python app/agents_parser.py --agents AGENTS.md --out app/agents_config.json
```
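For illustration, a minimal sketch of what the derived config could contain, using the values listed above. The exact key names produced by `agents_parser.py` are an assumption; only the values are taken from the list.
```python
import json

# Hypothetical shape of app/agents_config.json; key names are assumptions,
# values come from the "Current Agents-Derived Details" list above.
agents_config = {
    "image": "ghcr.io/ggml-org/llama.cpp:server-cuda",
    "host_port": 8071,
    "container_port": 8080,
    "model_mount": {
        "host": "/mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models",
        "container": "/models",
    },
    "network": "ix-llamacpp_default",
    "container_name": "ix-llamacpp-llamacpp-1",
    "gpus": ["NVIDIA RTX 5060 Ti", "NVIDIA RTX 5060 Ti"],
}

print(json.dumps(agents_config, indent=2))
```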
## Running Locally
```bash
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
python -m app.run
```
Defaults:
- API: `PORT_A=9093`
- UI: `PORT_B=9094`
- Base URL: `LLAMACPP_BASE_URL` (defaults to container name or localhost based on agents config)
- Model dir: `MODEL_DIR=/models`
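Once the wrapper is up, a quick smoke test against the OpenAI-compatible chat endpoint might look like the sketch below. The model name is a placeholder for whichever GGUF file is loaded; the endpoint shape follows the standard OpenAI API.
```python
import requests

# Hits the wrapper's OpenAI-compatible API on the default PORT_A.
resp = requests.post(
    "http://localhost:9093/v1/chat/completions",
    json={
        "model": "placeholder-model",  # placeholder; use a model present in /models
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```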
## Docker (TrueNAS)
Example (join existing llama.cpp network and mount models):
```bash
docker run --rm -p 9093:9093 -p 9094:9094 \
--network ix-llamacpp_default \
-v /mnt/fast.storage.rushg.me/datasets/apps/llama-cpp.models:/models \
-v /var/run/docker.sock:/var/run/docker.sock \
-e LLAMACPP_RESTART_METHOD=docker \
-e LLAMACPP_RESTART_COMMAND=ix-llamacpp-llamacpp-1 \
-e LLAMACPP_TARGET_CONTAINER=ix-llamacpp-llamacpp-1 \
-e TRUENAS_WS_URL=ws://192.168.1.2/websocket \
-e TRUENAS_API_KEY=YOUR_KEY \
-e TRUENAS_API_USER=YOUR_USER \
-e TRUENAS_APP_NAME=llamacpp \
-e LLAMACPP_BASE_URL=http://ix-llamacpp-llamacpp-1:8080 \
-e PORT_A=9093 -e PORT_B=9094 \
llama-cpp-openai-wrapper:latest
```
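To verify the container joined the network and can reach llama.cpp, a minimal check against the models endpoint (standard for OpenAI-compatible servers) could be:
```python
import requests

# Lists the models the wrapper exposes; 9093 is the API port published above.
resp = requests.get("http://localhost:9093/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))
```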
## Model Hot-Swap / Restart Hooks
This wrapper does not modify llama.cpp by default. To enable hot-swap/restart for new models or model selection,
provide one of the restart methods below:
- HTTP: `LLAMACPP_RESTART_METHOD=http` with `LLAMACPP_RESTART_URL=http://host-or-helper/restart` (a sketch of a compatible helper follows below)
- Shell: `LLAMACPP_RESTART_METHOD=shell` with `LLAMACPP_RESTART_COMMAND="/usr/local/bin/your-restart-script --arg"`
- Docker (requires mounting the docker socket): `LLAMACPP_RESTART_METHOD=docker` with `LLAMACPP_RESTART_COMMAND=ix-llamacpp-llamacpp-1`
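For the `http` method, the restart URL can point at any small helper you control. A minimal sketch, assuming the helper runs somewhere with access to the Docker CLI and that the wrapper POSTs to the configured `LLAMACPP_RESTART_URL` (the port and HTTP method here are assumptions):
```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

CONTAINER = "ix-llamacpp-llamacpp-1"  # from the agents-derived details above

class RestartHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Restart the llama.cpp container by name; assumes the docker CLI is available.
        result = subprocess.run(["docker", "restart", CONTAINER], capture_output=True)
        self.send_response(200 if result.returncode == 0 else 500)
        self.end_headers()
        self.wfile.write(result.stdout or result.stderr)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8090), RestartHandler).serve_forever()  # 8090 is an arbitrary choice
```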
## Model switching via TrueNAS middleware (P0)
Provide TrueNAS API credentials so the wrapper can update the llama.cpp app command when a new model is selected:
```
TRUENAS_WS_URL=ws://192.168.1.2/websocket
TRUENAS_API_KEY=YOUR_KEY
TRUENAS_API_USER=YOUR_USER
TRUENAS_APP_NAME=llamacpp
TRUENAS_VERIFY_SSL=false
```
The wrapper preserves the existing flags in the compose command and only updates `--model`; GPU split flags derived from
`LLAMACPP_*` variables are added only when the command does not already set them.
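As a rough illustration of that rewrite, a sketch of preserving every flag while swapping only `--model` (the helper name and command format are assumptions, not the wrapper's actual code):
```python
import shlex

def rewrite_model_flag(command: str, new_model_path: str) -> str:
    """Replace only the --model argument, leaving every other flag untouched."""
    args = shlex.split(command)
    out, i = [], 0
    while i < len(args):
        if args[i] == "--model" and i + 1 < len(args):
            out.extend(["--model", new_model_path])
            i += 2
        else:
            out.append(args[i])
            i += 1
    if "--model" not in out:
        out.extend(["--model", new_model_path])
    return shlex.join(out)

# Example: switch the served model while preserving GPU flags already present.
cmd = "--host 0.0.0.0 --port 8080 --model /models/old.gguf --tensor-split 0.5,0.5"
print(rewrite_model_flag(cmd, "/models/new.gguf"))
```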
Optional arguments passed to restart handlers:
```
LLAMACPP_DEVICES=0,1
LLAMACPP_TENSOR_SPLIT=0.5,0.5
LLAMACPP_SPLIT_MODE=layer
LLAMACPP_N_GPU_LAYERS=999
LLAMACPP_CTX_SIZE=8192
LLAMACPP_BATCH_SIZE=1024
LLAMACPP_UBATCH_SIZE=256
LLAMACPP_CACHE_TYPE_K=q4_0
LLAMACPP_CACHE_TYPE_V=q4_0
LLAMACPP_FLASH_ATTN=on
```
You can also pass arbitrary llama.cpp flags (space-separated) via:
```
LLAMACPP_EXTRA_ARGS="--mlock --no-mmap --rope-scaling linear"
```
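A sketch of how those variables could map onto llama.cpp server flags. The flag names below are assumptions based on common llama.cpp options; verify them against `reports/llamacpp_flags.txt`.
```python
import os
import shlex

# Assumed mapping from LLAMACPP_* env vars to llama.cpp server flags.
ENV_TO_FLAG = {
    "LLAMACPP_DEVICES": "--device",
    "LLAMACPP_TENSOR_SPLIT": "--tensor-split",
    "LLAMACPP_SPLIT_MODE": "--split-mode",
    "LLAMACPP_N_GPU_LAYERS": "--n-gpu-layers",
    "LLAMACPP_CTX_SIZE": "--ctx-size",
    "LLAMACPP_BATCH_SIZE": "--batch-size",
    "LLAMACPP_UBATCH_SIZE": "--ubatch-size",
    "LLAMACPP_CACHE_TYPE_K": "--cache-type-k",
    "LLAMACPP_CACHE_TYPE_V": "--cache-type-v",
    "LLAMACPP_FLASH_ATTN": "--flash-attn",
}

def build_extra_flags() -> list[str]:
    flags: list[str] = []
    for env, flag in ENV_TO_FLAG.items():
        value = os.environ.get(env)
        if value:
            flags.extend([flag, value])
    # LLAMACPP_EXTRA_ARGS is passed through verbatim, split on whitespace.
    flags.extend(shlex.split(os.environ.get("LLAMACPP_EXTRA_ARGS", "")))
    return flags

print(build_extra_flags())
```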
## Model Manager UI
Open `http://HOST:PORT_B/`.
Features:
- List existing models
- Download models via URL
- Live progress + cancel
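For reference, a minimal sketch of a streaming download with progress reporting, roughly the shape of the URL-download feature; the helper below is illustrative, not the wrapper's actual code.
```python
import os
import requests

def download_model(url: str, model_dir: str = "/models") -> str:
    """Stream a GGUF download into the model directory, printing percentage progress."""
    dest = os.path.join(model_dir, os.path.basename(url))
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        total = int(resp.headers.get("content-length", 0))
        done = 0
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
                done += len(chunk)
                if total:
                    print(f"\r{done * 100 // total}%", end="", flush=True)
    print()
    return dest
```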
## Testing
Tests are parameterized with 100+ cases per endpoint.
```bash
pytest -q
```
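A hedged sketch of the parameterization style against a running wrapper; the base URL, model name, and case generation are placeholders, and the actual suite's fixtures may differ.
```python
import pytest
import requests

BASE = "http://localhost:9093"  # wrapper API (PORT_A); assumption for illustration
PROMPTS = [f"test prompt {i}" for i in range(100)]  # in the spirit of 100+ cases per endpoint

@pytest.mark.parametrize("prompt", PROMPTS)
def test_chat_completions_accepts_prompt(prompt):
    resp = requests.post(
        f"{BASE}/v1/chat/completions",
        json={"model": "placeholder-model",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    assert resp.status_code == 200
```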
## llama.cpp flags reference
Flag documentation is scraped from the upstream docs into `reports/llamacpp_docs.md` and `reports/llamacpp_flags.txt`. Regenerate both with:
```bash
pwsh scripts/update_llamacpp_flags.ps1
```
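If you need the flag list programmatically, a small sketch that reads the scraped file, assuming one flag token per line (the actual file layout may differ):
```python
from pathlib import Path

# Assumption: reports/llamacpp_flags.txt holds one flag token per line.
flags = [
    line.strip()
    for line in Path("reports/llamacpp_flags.txt").read_text().splitlines()
    if line.strip().startswith("--")
]
print(f"{len(flags)} flags known, e.g. {flags[:5]}")
```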