Add training workflow, datasets, and runbook

2025-12-23 21:17:22 -08:00
commit 619e87aacc
2140 changed files with 2513895 additions and 0 deletions

# AGENTS.md - ingest-ebook-options Runbook (Deep Context + Retrain Guide)
This file captures the full context, decisions, failures, fixes, commands, and
paths used to fine-tune gpt-oss-20b and deploy it into Ollama as
`trained-options-model`. It is meant to be a literal step-by-step recipe for
retraining with new data. Read this end-to-end before touching anything.
------------------------------------------------------------------------------
## 0) Hard Requirements (User Directives)
- Use local documents in this repo only.
- Dedupe repeated docs across formats; do not ingest duplicates.
- Manually remove non-relevant ebook content (preface, index, author/publisher
pages, etc). Options-trading content only.
- Use GPU heavily (not CPU).
- If local AMD 7900XTX is not available, use the remote NVIDIA box.
- All long-running tasks must show progress and **post progress at least every
2 minutes** (print progress or size updates, not silent).
- Retraining must complete locally (no cloud).
- Final Ollama model name must be **trained-options-model**.
- Final Ollama model **must support tool/function calls**.
- Any destructive commands must require explicit approval (do not run them
silently).
------------------------------------------------------------------------------
## 1) Machines, OS, Access, and Credentials
### Local Windows
- Repo path: `C:\Users\Rushabh\projects\ingest-ebook-options`
- Local AMD GPU: 7900XTX (not used here; remote NVIDIA box was used instead).
- Local Ollama install exists but was not used for training.
### Remote TrueNAS SCALE (Used for Training + Ollama)
- Host: `192.168.1.2`
- SSH port: `55555`
- User: `rushabh`
- Password: none required (key-based / no password).
- SSH example:
- `ssh -p 55555 rushabh@192.168.1.2`
- Ollama HTTP endpoint (remote): `http://192.168.1.2:30068`
### TrueNAS UI / middlewared
- The user explicitly required that containers be created and managed as TrueNAS Apps
  (via middlewared / the TrueNAS UI), not as ad-hoc docker containers.
- If an app does not show up in the UI, check middlewared and re-create it via the UI.
------------------------------------------------------------------------------
## 2) Storage Layout and Mounts (Critical)
### Remote TrueNAS storage root
- `/mnt/fast.storage.rushg.me/datasets/apps`
### Remote training workspace (folder, not ZFS dataset)
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch`
- IMPORTANT: user requested a folder, not a ZFS dataset.
### Repo copy on remote
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options`
### Ollama model storage mount (remote)
- Host path: `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models`
- Container path: `/root/.ollama`
- Actual model store:
- `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models`
- `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models/blobs`
- `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models/manifests`
### Ollama imports folder (created by us)
- `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports`
### Hugging Face cache (remote)
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/hf_cache`
- When retraining, set `HF_HOME` or `HF_HUB_CACHE` to this path to keep downloads
on fast storage and avoid redownloading.
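Illustrative exports for a retrain run (adjust for your shell; path is the cache location above):
```
# Keep Hugging Face downloads on fast storage during retraining.
export HF_HOME=/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/hf_cache
# Alternatively, HF_HUB_CACHE can be pointed at the same location.
```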
------------------------------------------------------------------------------
## 3) TrueNAS App Setup (GPU Training + Ollama)
### Ollama App
- Container name: `ix-ollama-ollama-1`
- Exposes: `0.0.0.0:30068`
- GPU: NVIDIA RTX 5060 Ti (16 GB VRAM)
- Observed Ollama version: 0.13.5
- Uses `/root/.ollama` mapped to `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models`
### Training App (Created in TrueNAS UI)
- App name: `options-train`
- GPU: NVIDIA RTX 5060 Ti
- Reason: the user required app creation through the TrueNAS UI; it also guarantees GPU access.
- We explicitly stopped the `llamacpp` app to free GPU before training.
### Docker permission note
- Non-root user lacks docker socket permission.
- Use `sudo -n docker ...` for all docker commands on the host.
### Shell note (remote)
- Default shell is `zsh`.
- Use `bash -lc '...'` to avoid quote parsing issues and missing tools.
- `rg` is not installed on remote; use `grep`/`find`.
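An illustrative remote invocation following these notes (the commands shown are ones already used elsewhere in this runbook):
```
ssh -p 55555 rushabh@192.168.1.2 \
  "bash -lc 'sudo -n docker ps; ls -la /mnt/fast.storage.rushg.me/datasets/apps/pytorch'"
```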
------------------------------------------------------------------------------
## 4) Data Prep Pipeline (Dedup + Manual Relevance)
### Source docs
- Local docs in `eBooks/` (PDF/EPUB/etc).
- Must **manually** select relevant pages (options trading content only).
- Skip: prefaces, index, author/publisher info, boilerplate, etc.
### Step A - Extract full text and doc-level dedupe
Script: `tools/extract_corpus.py`
- Supports .pdf/.epub/.txt/.md
- Dedupes by SHA256 of normalized text across different formats.
- Outputs:
- `training_data/manifest.json`
- `training_data/corpus.txt`
- `training_data/text/*.txt`
- `training_data/rejected.json`
Example:
```
python tools/extract_corpus.py --input eBooks --out training_data --min-chars 2000
```
Dependencies:
- `pypdf`, `ebooklib`, `beautifulsoup4`, `lxml`, `chardet`
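A minimal install of those dependencies (assumes the Python environment you run the tools from):
```
python -m pip install pypdf ebooklib beautifulsoup4 lxml chardet
```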
### Step B - Page/section relevance filtering (Options-focused)
Script: `tools/select_relevant.py`
- Scores segments for options-trading keywords.
- Drops TOC/index/front matter.
- Dedupes by SHA256 of the normalized segment.
- Includes neighboring pages, controlled by `--neighbors`.
Outputs in `training_data/relevant`:
- `text/*.txt`
- `manifest.json`
- `report.csv`
- `corpus.txt`
Example:
```
python tools/select_relevant.py --input eBooks --out training_data/relevant \
--min-score 10 --min-chars 800 --neighbors 1
```
### Step C - Chunk to JSONL dataset
Script: `tools/build_dataset.py`
- Splits into overlapping chunks.
- Optional junk filter and keyword score.
Outputs (written next to the `--out` path; this run used `training_data/curated/`):
- `training_data/curated/dataset.jsonl`
- `training_data/curated/dataset.stats.json`
Example:
```
python tools/build_dataset.py \
--manifest training_data/relevant/manifest.json \
--text-dir training_data/relevant/text \
--out training_data/curated/dataset.jsonl \
--chunk-chars 6000 --overlap-chars 400 --min-chars 1200 --drop-junk
```
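A quick, illustrative sanity check on the built dataset (no field names are assumed; it prints whatever keys the first record has):
```
wc -l training_data/curated/dataset.jsonl
python3 - <<'PY'
import json
# Inspect the first JSONL record to confirm chunking produced sane output.
with open("training_data/curated/dataset.jsonl") as f:
    first = json.loads(f.readline())
print("fields in first record:", sorted(first.keys()))
PY
cat training_data/curated/dataset.stats.json
```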
### Manual curation requirement
- The scripts are helper filters only. You must still **manually review** for
relevance, especially to remove prefaces, indexes, disclaimers, etc.
- Use `training_data/relevant/corpus.txt` to scan human-readable content.
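One illustrative way to spot leftover front/back matter while reviewing (the keyword list is only an example, not from the original run):
```
grep -n -i -E 'preface|table of contents|about the author|all rights reserved|index' \
  training_data/relevant/corpus.txt | head -n 40
```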
### Dataset used in this run
- Remote dataset path:
`/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/curated/dataset.jsonl`
- Count: 1778 chunks.
------------------------------------------------------------------------------
## 5) Training Pipeline (LoRA fine-tune on NVIDIA box)
### Why local AMD GPU was not used
- User explicitly requested the remote NVIDIA box.
- Local AMD 7900XTX was not used in this run.
### Training script (repo)
- `tools/finetune_lora.py`
- Modified to fix gradient checkpointing + LoRA:
- `model.enable_input_require_grads()` is required.
- Without it, MXFP4 path fails with:
`RuntimeError: element 0 of tensors does not require grad...`
### Key training args used
- `--model openai/gpt-oss-20b`
- `--data training_data/curated/dataset.jsonl`
- `--out training_data/lora_adapter`
- `--max-length 256`
- `--epochs 1` (adjust as needed)
- `--lora-r 8 --lora-alpha 16 --lora-dropout 0.05`
- `--grad-accum 4`
- `--quant auto` (MXFP4 on GPU)
- `--log-seconds 120` (must show progress every 2 minutes)
- `--log-steps 10` (extra progress)
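Putting those flags together, an illustrative invocation (run from the repo root; the script may accept additional flags not listed here):
```
python tools/finetune_lora.py \
  --model openai/gpt-oss-20b \
  --data training_data/curated/dataset.jsonl \
  --out training_data/lora_adapter \
  --max-length 256 \
  --epochs 1 \
  --lora-r 8 --lora-alpha 16 --lora-dropout 0.05 \
  --grad-accum 4 \
  --quant auto \
  --log-seconds 120 --log-steps 10
```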
### Progress requirement (must follow)
- Use `--log-seconds 120` so training prints logs every ~2 minutes.
- For long copies or merges, print `date` + file size in a loop every 120 sec.
### GPU requirements
- NVIDIA GPU required for quantized loading; MXFP4 needs GPU.
- GPU observed: RTX 5060 Ti, 16 GB VRAM, CUDA 12.8.
### What failed and how we fixed it
1) **MXFP4 grad error**
- Error: `RuntimeError: element 0 of tensors does not require grad`
- Fix: In `tools/finetune_lora.py`, after
`model.gradient_checkpointing_enable()` add:
`model.enable_input_require_grads()`
2) **Bitsandbytes 4-bit OOM**
- With `--quant 4bit` the model OOMed even with max memory limits.
- CPU offload was not supported with this setup; it still OOMed.
- Fix: use `--quant auto` (MXFP4) instead.
3) **Triton/compile issues**
- Triton kernels required a compiler in the container.
- Fix: Use a PyTorch **CUDA devel** image (not runtime) or install
`build-essential` inside the container.
### Output artifacts (LoRA)
`training_data/lora_adapter/` contains:
- `adapter_model.safetensors`
- `adapter_config.json`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`
- `training_summary.json` (includes steps and loss EMA)
------------------------------------------------------------------------------
## 6) GGUF Conversion and Merge (Required; Ollama LoRA not supported)
### Why merge is required
- Ollama error when using ADAPTER:
`Error: 500 Internal Server Error: failed to initialize model: loras are not yet implemented`
- Therefore, must merge LoRA into base GGUF.
### llama.cpp setup (remote)
- Clone location: `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/llama.cpp`
- Build:
```
cd /mnt/fast.storage.rushg.me/datasets/apps/pytorch/llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF
cmake --build build -j $(nproc)
```
- Note: `-DLLAMA_CURL=OFF` was used because libcurl is missing on the host.
- Binaries:
- `build/bin/llama-export-lora`
- `build/bin/llama-gguf`
- When running, set:
- `LD_LIBRARY_PATH=/mnt/.../llama.cpp/build/bin`
### Convert LoRA to GGUF
Use `convert_lora_to_gguf.py` (from the llama.cpp checkout):
```
python convert_lora_to_gguf.py \
--lora /path/to/training_data/lora_adapter \
--outfile /path/to/training_data/lora_adapter/options-lora.gguf
```
### Architecture mismatch pitfall (critical)
- Base GGUF from Ollama uses `general.architecture = gptoss`
- LoRA GGUF from converter uses `general.architecture = gpt-oss`
- `llama-export-lora` throws:
`model arch and LoRA arch mismatch`
### Fix: rewrite LoRA GGUF metadata to `gptoss`
We used `gguf-py` to rewrite metadata. Example (run inside a Python container):
```
from gguf import GGUFReader, GGUFWriter, GGUFValueType
import numpy as np

inp = "options-lora.gguf"
out = "options-lora-gptoss.gguf"
r = GGUFReader(inp)
w = GGUFWriter(out, "gptoss", endianess=r.endianess)
# Copy KV fields except general.architecture (and fields the writer sets itself)
for key, field in r.fields.items():
    if key.startswith("GGUF.") or key in ("general.architecture", "general.alignment"):
        continue
    vtype = field.types[0]
    if vtype == GGUFValueType.ARRAY:
        w.add_key_value(key, field.contents(), vtype, field.types[-1])
    else:
        w.add_key_value(key, field.contents(), vtype)
# Copy tensors unchanged (see the transpose note below for lora_a/lora_b)
for t in r.tensors:
    data = t.data
    if not data.flags["C_CONTIGUOUS"]:
        data = np.ascontiguousarray(data)
    w.add_tensor(t.name, data, raw_shape=list(map(int, t.shape)),
                 raw_dtype=t.tensor_type, tensor_endianess=r.endianess)
# Flush the rewritten GGUF to disk
w.write_header_to_file()
w.write_kv_data_to_file()
w.write_tensors_to_file()
w.close()
```
### Tensor orientation mismatch (critical)
- After arch fix, merge failed with:
`GGML_ASSERT(ggml_can_mul_mat(a, b)) failed`
- Root cause: LoRA A/B tensors had orientation incompatible with base GGUF.
- Fix: transpose LoRA A and B **data** when re-serializing GGUF.
**Important GGUF detail:**
- GGUF stores tensor dims reversed internally.
- You must transpose the data while keeping the *original raw_shape*.
- Working approach:
```
if name.endswith(".lora_a") or name.endswith(".lora_b"):
data = np.ascontiguousarray(data.T)
w.add_tensor(name, data, raw_shape=shape, raw_dtype=..., ...)
```
### Working LoRA GGUF for merge
- `options-lora-gptoss-transposed2.gguf`
### Merge LoRA into base GGUF
Base GGUF path (from Ollama blob):
`/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb`
Merge command:
```
export LD_LIBRARY_PATH=/mnt/.../llama.cpp/build/bin
/mnt/.../llama.cpp/build/bin/llama-export-lora \
-m /mnt/.../ollama.models/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb \
--lora /mnt/.../training_data/lora_adapter/options-lora-gptoss-transposed2.gguf \
-o /mnt/.../training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf
```
### Merged output (final)
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf`
- Size: ~13 GB
- File type: F16
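A quick, illustrative post-merge check (assumes `gguf-py` is installed, as in the metadata check in section 11):
```
python3 - <<'PY'
from gguf import GGUFReader
# Confirm the merged file reports the gptoss architecture and a plausible tensor count.
p = "/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf"
r = GGUFReader(p)
print("arch:", r.get_field("general.architecture").contents())
print("tensors:", len(r.tensors))
PY
```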
### Intermediate artifacts kept (not deleted)
- `options-lora-gptoss.gguf`
- `options-lora-gptoss-transposed.gguf`
- `options-lora-gptoss-transposed-debug.gguf`
- `options-lora-gptoss-transposed2.gguf`
- `gpt-oss-20b-options-merged-f16-v2.gguf` (14 MB, failed)
- `gpt-oss-20b-options-merged-f16.gguf` (0 bytes, failed)
------------------------------------------------------------------------------
## 7) Ollama Integration (Final Model)
### Why ADAPTER does not work
Modelfile with ADAPTER fails:
```
Error: 500 Internal Server Error: failed to initialize model: loras are not yet implemented
```
Therefore, merged GGUF is mandatory.
### Copy merged GGUF into Ollama imports
```
mkdir -p /mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports
cp /mnt/.../gpt-oss-20b-options-merged-f16-v3.gguf \
/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/
```
### Modelfile (with tool support)
**Important:** tools only work if the TEMPLATE block matches the base model's
template. Without TEMPLATE, `ollama show --template` falls back to `{{ .Prompt }}`
and tool calls are disabled.
We extracted the template from the base model:
```
sudo -n docker exec -i ix-ollama-ollama-1 ollama show gpt-oss:20b --template \
> /mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/gptoss.template
```
Then built Modelfile:
`/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/Modelfile.trained-options-model`
```
FROM /root/.ollama/imports/gpt-oss-20b-options-merged-f16-v3.gguf
TEMPLATE """
<paste full gpt-oss:20b template here>
"""
SYSTEM """You are a knowledgeable options trading assistant.
Explain concepts clearly, use correct terminology (Greeks, volatility, spreads, assignment), and be explicit about assumptions.
If information is uncertain, say so rather than guessing."""
```
### Create the model
```
sudo -n docker exec -i ix-ollama-ollama-1 \
ollama create trained-options-model -f /root/.ollama/imports/Modelfile.trained-options-model
```
### Verify in Ollama
```
sudo -n docker exec -i ix-ollama-ollama-1 ollama list
sudo -n docker exec -i ix-ollama-ollama-1 ollama show trained-options-model
```
Expected capabilities include: `completion`, `tools`, `thinking`.
### Runtime note
- `ollama run` can take a long time to load and may time out.
- Use HTTP API for reliable results:
```
curl http://192.168.1.2:30068/api/generate -d '{
"model":"trained-options-model:latest",
"prompt":"Explain delta and gamma briefly.",
"stream":false
}'
```
------------------------------------------------------------------------------
## 8) Tool/Function Call Requirement (Mandatory)
### How to verify tool support
1) `ollama show trained-options-model` should list `tools` in Capabilities.
2) `ollama show trained-options-model --template` should show the full template
(not `{{ .Prompt }}`).
### Tool-call test (HTTP)
```
curl http://192.168.1.2:30068/api/chat -d '{
"model":"trained-options-model:latest",
"stream":false,
"messages":[
{"role":"system","content":"Use tools when available."},
{"role":"user","content":"Compute total for quantity=3 price=4. Use tool."}
],
"tools":[
{"type":"function","function":{
"name":"calc_total",
"description":"Compute total cost for a trade",
"parameters":{
"type":"object",
"properties":{"quantity":{"type":"number"},"price":{"type":"number"}},
"required":["quantity","price"]
}
}}
]
}'
```
Expected: `tool_calls` in response.
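If the curl output is awkward to eyeball, here is an illustrative equivalent check using Python's standard library (not part of the original run); it sends the same request and prints only `message.tool_calls`:
```
python3 - <<'PY'
import json, urllib.request
req = {
    "model": "trained-options-model:latest",
    "stream": False,
    "messages": [{"role": "user", "content": "Compute total for quantity=3 price=4. Use tool."}],
    "tools": [{"type": "function", "function": {
        "name": "calc_total",
        "description": "Compute total cost for a trade",
        "parameters": {"type": "object",
                       "properties": {"quantity": {"type": "number"},
                                      "price": {"type": "number"}},
                       "required": ["quantity", "price"]}}}],
}
resp = urllib.request.urlopen(urllib.request.Request(
    "http://192.168.1.2:30068/api/chat",
    data=json.dumps(req).encode(),
    headers={"Content-Type": "application/json"}))
print(json.dumps(json.loads(resp.read()).get("message", {}).get("tool_calls"), indent=2))
PY
```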
------------------------------------------------------------------------------
## 9) Known Failures + Fixes (Summary)
- **Ollama ADAPTER fails** -> Merge LoRA into GGUF.
- **Arch mismatch** (`gpt-oss` vs `gptoss`) -> Rewrite LoRA metadata.
- **ggml_can_mul_mat assertion** -> Transpose LoRA A/B data.
- **MXFP4 gradient error** -> `model.enable_input_require_grads()`.
- **Bitsandbytes 4-bit OOM** -> Use MXFP4 auto on GPU.
- **Triton compile error** -> Use PyTorch CUDA *devel* image or install gcc.
- **`convert_lora_to_gguf.py` under WSL missing `transformers`** -> Use docker or
  install `transformers` in WSL.
- **`ollama run` hangs** -> Use `/api/generate` or `/api/chat` via curl.
------------------------------------------------------------------------------
## 10) Retrain Checklist (Minimal Friction)
1) **Prepare data locally**
- Put docs in `eBooks/`.
- Run:
- `python tools/select_relevant.py ...`
- `python tools/build_dataset.py ...`
- Manually inspect `training_data/relevant/corpus.txt`.
2) **Sync to remote**
- Example (PowerShell):
- `scp -P 55555 -r .\ingest-ebook-options rushabh@192.168.1.2:/mnt/fast.storage.rushg.me/datasets/apps/pytorch/`
3) **Stop GPU-conflicting apps**
- Stop `llamacpp` app in TrueNAS UI.
4) **Train LoRA in TrueNAS app**
- Ensure GPU attached.
- Use `tools/finetune_lora.py` with `--log-seconds 120`.
- Confirm adapter saved in `training_data/lora_adapter`.
5) **Convert LoRA to GGUF**
- `convert_lora_to_gguf.py` -> `options-lora.gguf`
6) **Fix arch + transpose**
- Rewrite to `gptoss`
- Transpose LoRA A/B data
- Output `options-lora-gptoss-transposed2.gguf`
7) **Merge into base GGUF**
- Use `llama-export-lora`
- Output `gpt-oss-20b-options-merged-f16-v3.gguf`
8) **Ollama import**
- Copy GGUF to `/mnt/.../ollama.models/imports`
- Build Modelfile with TEMPLATE
- `ollama create trained-options-model -f ...`
9) **Verify tool support**
- `ollama show trained-options-model`
- `/api/chat` tool-call test
------------------------------------------------------------------------------
## 11) Commands Used in This Run (Examples)
### Remote file listing (progress + verify)
```
ssh -p 55555 rushabh@192.168.1.2 "ls -la /mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter"
```
### GGUF metadata check
```
python - <<'PY'
from gguf import GGUFReader
r = GGUFReader("options-lora.gguf")
print(r.get_field("general.architecture").contents())
PY
```
### Merge with progress updates every 2 minutes
```
BASE=/mnt/.../ollama.models/models/blobs/<base-blob>
LORA=/mnt/.../options-lora-gptoss-transposed2.gguf
OUT=/mnt/.../gpt-oss-20b-options-merged-f16-v3.gguf
export LD_LIBRARY_PATH=/mnt/.../llama.cpp/build/bin
/mnt/.../llama-export-lora -m "$BASE" --lora "$LORA" -o "$OUT" &
pid=$!
while kill -0 $pid 2>/dev/null; do date; ls -lh "$OUT" || true; sleep 120; done
wait $pid
```
------------------------------------------------------------------------------
## 12) Notes About Local Files in This Repo
- `Modelfile.trained-options-model` (local) still references ADAPTER and is
**not** valid for current Ollama (ADAPTER unsupported).
- Use the remote Modelfile in `/mnt/.../ollama.models/imports/`.
- `_tmp_*` scripts are left over from prior automation attempts (TrueNAS app
  creation, GPU checks, etc.). Use them only if you know what they do.
------------------------------------------------------------------------------
## 13) Progress Reporting Policy (Non-Negotiable)
During any long run (training, merge, large copy):
- Print a progress line every 120 seconds.
- Example: `date` + file size, or a training loss line.
- Do not allow silent runs.
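A small reusable wrapper in the same spirit as the merge loop in section 11 (illustrative helper, not an existing repo script):
```
# Run any long command in the background and print timestamp + output size every 120 s.
run_with_progress() {
  local outfile="$1"; shift
  "$@" &
  local pid=$!
  while kill -0 "$pid" 2>/dev/null; do
    date; ls -lh "$outfile" 2>/dev/null || true
    sleep 120
  done
  wait "$pid"
}
# Example: run_with_progress "$OUT" .../llama-export-lora -m "$BASE" --lora "$LORA" -o "$OUT"
```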
------------------------------------------------------------------------------
## 14) Quick Sanity Checks (After Retrain)
1) `ollama list` shows `trained-options-model:latest`
2) `ollama show trained-options-model` lists `tools`
3) `/api/generate` returns a coherent answer
4) `/api/chat` returns a tool call when tools are provided
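The first three checks as one illustrative script (container name, host, and port as documented above):
```
sudo -n docker exec -i ix-ollama-ollama-1 ollama list | grep trained-options-model
sudo -n docker exec -i ix-ollama-ollama-1 ollama show trained-options-model | grep -i tools
curl -s http://192.168.1.2:30068/api/generate -d '{
  "model":"trained-options-model:latest",
  "prompt":"Explain delta and gamma briefly.",
  "stream":false
}'
# For check 4, reuse the /api/chat tool-call request from section 8.
```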
------------------------------------------------------------------------------
## 15) Do NOT Forget These Pitfalls
- Arch mismatch (`gpt-oss` vs `gptoss`) **will break merge**.
- LoRA tensor orientation mismatch **will break merge**.
- ADAPTER in Modelfile **does not work** in current Ollama.
- Tool calls **only** work if TEMPLATE is included.
- Remote shell is zsh; use `bash -lc` for complex quoting.
- Docker requires `sudo -n`.
- Use the remote GPU as requested; do not train on CPU.
------------------------------------------------------------------------------
## 16) Current "Final" Artifacts (Reference)
### LoRA adapter
`/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/`
### Merged GGUF (final)
`/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf`
### Ollama Modelfile
`/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/Modelfile.trained-options-model`
### Ollama Model Name
`trained-options-model:latest`
------------------------------------------------------------------------------
## 17) If You Need to Rebuild Tools Support
1) Extract base template:
```
sudo -n docker exec -i ix-ollama-ollama-1 \
ollama show gpt-oss:20b --template > /mnt/.../gptoss.template
```
2) Create Modelfile with TEMPLATE block.
3) Re-run `ollama create`.
4) Verify `ollama show trained-options-model` lists `tools`.
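A minimal sketch of steps 2-3, assembling the Modelfile from the extracted template (paths are the ones used in section 7; the SYSTEM prompt is shortened here for brevity):
```
IMPORTS=/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports
{
  echo 'FROM /root/.ollama/imports/gpt-oss-20b-options-merged-f16-v3.gguf'
  echo 'TEMPLATE """'
  cat "$IMPORTS/gptoss.template"
  echo '"""'
  echo 'SYSTEM """You are a knowledgeable options trading assistant."""'
} > "$IMPORTS/Modelfile.trained-options-model"
sudo -n docker exec -i ix-ollama-ollama-1 \
  ollama create trained-options-model -f /root/.ollama/imports/Modelfile.trained-options-model
```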
------------------------------------------------------------------------------
## 18) Git Repo + Source Inventory (This Repo)
### Remote git repo
- URL (HTTP): `https://git.rushg.me/rushabh/ollama-model-training-5060ti`
- URL (git): `https://git.rushg.me/rushabh/ollama-model-training-5060ti.git`
- Auth: user will authenticate on push when prompted (username/password).
### What is committed (and why)
- `AGENTS.md` (this runbook; full end-to-end context).
- `README.md` (quick overview + links to AGENTS).
- `tools/` scripts for extraction, filtering, dataset build, and training.
- `training_data/` curated dataset, manifests, reports, and LoRA outputs used
for the run (kept for reproducibility).
- `remote/ollama/Modelfile.trained-options-model.remote` (exact remote Modelfile
used to enable tools).
- `remote/ollama/gptoss.template` (base template pulled from gpt-oss:20b).
- `Modelfile.trained-options-model` (local reference; see remote Modelfile for
tool-enabled version).
### What is excluded (and why)
- `eBooks/` raw source data (large; keep local and private).
- `_llama_cpp/` (upstream repo; clone on demand).
- `.venv/` and Python caches.
- Any base model weights or Ollama blobs (too large; download via Ollama/HF).
### How to recreate missing external assets
- Base model:
- `ollama pull gpt-oss:20b` on the Ollama host
- or `huggingface-cli download openai/gpt-oss-20b` into HF cache
- llama.cpp:
- `git clone https://github.com/ggml-org/llama.cpp.git`
- build with `-DLLAMA_CURL=OFF` if libcurl is missing.
------------------------------------------------------------------------------
End of AGENTS.md