# AGENTS.md - ingest-ebook-options Runbook (Deep Context + Retrain Guide)

This file captures the full context, decisions, failures, fixes, commands, and paths used to fine-tune gpt-oss-20b and deploy it into Ollama as `trained-options-model`. It is meant to be a literal step-by-step recipe for retraining with new data. Read it end-to-end before touching anything.

------------------------------------------------------------------------------

## 0) Hard Requirements (User Directives)

- Use local documents in this repo only.
- Dedupe repeated docs across formats; do not ingest duplicates.
- Manually remove non-relevant ebook content (preface, index, author/publisher pages, etc.). Options-trading content only.
- Use the GPU heavily (not the CPU).
- If the local AMD 7900 XTX is not available, use the remote NVIDIA box.
- All long-running tasks must show progress and **post progress at least every 2 minutes** (print progress or size updates; never run silently).
- Retraining must complete locally (no cloud).
- The final Ollama model name must be **trained-options-model**.
- The final Ollama model **must support tool/function calls**.
- Any destructive commands require explicit approval (do not run them silently).

------------------------------------------------------------------------------

## 1) Machines, OS, Access, and Credentials

### Local Windows
- Repo path: `C:\Users\Rushabh\projects\ingest-ebook-options`
- Local AMD GPU: 7900 XTX (not used here; the remote NVIDIA box was used instead).
- A local Ollama install exists but was not used for training.

### Remote TrueNAS SCALE (Used for Training + Ollama)
- Host: `192.168.1.2`
- SSH port: `55555`
- User: `rushabh`
- Password: none required (key-based / no password).
- SSH example:
  - `ssh -p 55555 rushabh@192.168.1.2`
- Ollama HTTP endpoint (remote): `http://192.168.1.2:30068`

### TrueNAS UI / middlewared
- The user explicitly required that containers be created and managed as TrueNAS Apps (middlewared / TrueNAS UI), not ad-hoc docker only.
- If an app does not show up in the UI, check middlewared and re-create it via the UI.

------------------------------------------------------------------------------

## 2) Storage Layout and Mounts (Critical)

### Remote TrueNAS storage root
- `/mnt/fast.storage.rushg.me/datasets/apps`

### Remote training workspace (folder, not ZFS dataset)
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch`
- IMPORTANT: the user requested a plain folder, not a ZFS dataset.

### Repo copy on remote
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options`

### Ollama model storage mount (remote)
- Host path: `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models`
- Container path: `/root/.ollama`
- Actual model store:
  - `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models`
  - `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models/blobs`
  - `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models/manifests`

### Ollama imports folder (created by us)
- `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports`

### Hugging Face cache (remote)
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/hf_cache`
- When retraining, set `HF_HOME` or `HF_HUB_CACHE` to this path so downloads stay on fast storage and are not re-downloaded.
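
A minimal sketch of pinning the cache for a retraining shell session, assuming the repo sits at the path above and the standard Hugging Face convention that `HF_HUB_CACHE` defaults to `$HF_HOME/hub` (setting both keeps it unambiguous):

```
# Assumed paths; adjust if the repo was synced somewhere else.
export HF_HOME=/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/hf_cache
export HF_HUB_CACHE="$HF_HOME/hub"

# Sanity check: the cache should grow on fast storage, not in ~/.cache.
mkdir -p "$HF_HUB_CACHE"
du -sh "$HF_HOME" 2>/dev/null || true
```

Run these exports in the same shell (or container) that launches `tools/finetune_lora.py`, otherwise the base model will be re-downloaded into the default cache.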
------------------------------------------------------------------------------

## 3) TrueNAS App Setup (GPU Training + Ollama)

### Ollama App
- Container name: `ix-ollama-ollama-1`
- Exposes: `0.0.0.0:30068`
- GPU: NVIDIA RTX 5060 Ti (16 GB VRAM)
- Observed Ollama version: 0.13.5
- Maps `/root/.ollama` to `/mnt/fast.storage.rushg.me/datasets/apps/ollama.models`

### Training App (Created in TrueNAS UI)
- App name: `options-train`
- GPU: NVIDIA RTX 5060 Ti
- Reason: the user required app creation through the TrueNAS UI; the app also guarantees GPU access.
- We explicitly stopped the `llamacpp` app to free the GPU before training.

### Docker permission note
- The non-root user lacks docker socket permission.
- Use `sudo -n docker ...` for all docker commands on the host.

### Shell note (remote)
- The default shell is `zsh`.
- Use `bash -lc '...'` to avoid quote-parsing issues and missing tools.
- `rg` is not installed on the remote host; use `grep`/`find`.

------------------------------------------------------------------------------

## 4) Data Prep Pipeline (Dedup + Manual Relevance)

### Source docs
- Local docs in `eBooks/` (PDF/EPUB/etc.).
- Relevant pages (options-trading content only) must be selected **manually**.
- Skip prefaces, indexes, author/publisher info, boilerplate, etc.

### Step A - Extract full text and doc-level dedupe
Script: `tools/extract_corpus.py`
- Supports .pdf/.epub/.txt/.md.
- Dedupes by SHA256 of normalized text, so the same book in different formats is ingested once.
- Outputs:
  - `training_data/manifest.json`
  - `training_data/corpus.txt`
  - `training_data/text/*.txt`
  - `training_data/rejected.json`

Example:
```
python tools/extract_corpus.py --input eBooks --out training_data --min-chars 2000
```

Dependencies:
- `pypdf`, `ebooklib`, `beautifulsoup4`, `lxml`, `chardet`

### Step B - Page/section relevance filtering (Options-focused)
Script: `tools/select_relevant.py`
- Scores segments for options-trading keywords.
- Drops TOC/index/front matter.
- Dedupes by SHA256 of the normalized segment.
- Includes neighboring pages via `--neighbors`.

Outputs in `training_data/relevant`:
- `text/*.txt`
- `manifest.json`
- `report.csv`
- `corpus.txt`

Example:
```
python tools/select_relevant.py --input eBooks --out training_data/relevant \
  --min-score 10 --min-chars 800 --neighbors 1
```

### Step C - Chunk to JSONL dataset
Script: `tools/build_dataset.py`
- Splits text into overlapping chunks.
- Optional junk filter and keyword score.

Outputs (destination follows `--out`; the example below writes to `training_data/curated/` rather than `training_data/relevant/`):
- `training_data/relevant/dataset.jsonl`
- `training_data/relevant/dataset.stats.json`

Example:
```
python tools/build_dataset.py \
  --manifest training_data/relevant/manifest.json \
  --text-dir training_data/relevant/text \
  --out training_data/curated/dataset.jsonl \
  --chunk-chars 6000 --overlap-chars 400 --min-chars 1200 --drop-junk
```

### Manual curation requirement
- The scripts are helper filters only. You must still **manually review** for relevance, especially to remove prefaces, indexes, disclaimers, etc.
- Use `training_data/relevant/corpus.txt` to scan the human-readable content.

### Dataset used in this run
- Remote dataset path: `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/curated/dataset.jsonl`
- Count: 1778 chunks.

------------------------------------------------------------------------------

## 5) Training Pipeline (LoRA fine-tune on NVIDIA box)

### Why the local AMD GPU was not used
- The user explicitly requested the remote NVIDIA box.
- The local AMD 7900 XTX was not used in this run.
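
Before launching a run on that box, it helps to spot-check the curated dataset from Step C above. A minimal sketch, assuming the JSONL has one record per line (paths as documented; this run produced 1778 chunks):

```
DATASET=training_data/curated/dataset.jsonl

# One chunk per line; expect a count matching the dataset report (1778 in this run).
wc -l "$DATASET"

# Peek at the first record to confirm it is options content, not front matter.
head -n 1 "$DATASET" | cut -c1-300
```

If the count or the first record looks wrong, go back to the manual curation step before spending GPU time.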
### Training script (repo)
- `tools/finetune_lora.py`
- Modified so gradient checkpointing works with LoRA:
  - `model.enable_input_require_grads()` is required.
  - Without it, the MXFP4 path fails with: `RuntimeError: element 0 of tensors does not require grad...`

### Key training args used
- `--model openai/gpt-oss-20b`
- `--data training_data/curated/dataset.jsonl`
- `--out training_data/lora_adapter`
- `--max-length 256`
- `--epochs 1` (adjust as needed)
- `--lora-r 8 --lora-alpha 16 --lora-dropout 0.05`
- `--grad-accum 4`
- `--quant auto` (MXFP4 on GPU)
- `--log-seconds 120` (must show progress every 2 minutes)
- `--log-steps 10` (extra progress)

### Progress requirement (must follow)
- Use `--log-seconds 120` so training prints logs every ~2 minutes.
- For long copies or merges, print `date` + file size in a loop every 120 seconds.

### GPU requirements
- An NVIDIA GPU is required for quantized loading; MXFP4 needs a GPU.
- GPU observed: RTX 5060 Ti, 16 GB VRAM, CUDA 12.8.

### What failed and how we fixed it
1) **MXFP4 grad error**
   - Error: `RuntimeError: element 0 of tensors does not require grad`
   - Fix: in `tools/finetune_lora.py`, after `model.gradient_checkpointing_enable()` add `model.enable_input_require_grads()`.
2) **Bitsandbytes 4-bit OOM**
   - With `--quant 4bit` the model OOMed even with max-memory limits.
   - CPU offload is not supported in this setup; it still OOMed.
   - Fix: use `--quant auto` (MXFP4) instead.
3) **Triton/compile issues**
   - Triton kernels required a compiler inside the container.
   - Fix: use a PyTorch **CUDA devel** image (not runtime) or install `build-essential` inside the container.

### Output artifacts (LoRA)
`training_data/lora_adapter/` contains:
- `adapter_model.safetensors`
- `adapter_config.json`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`
- `training_summary.json` (includes steps and loss EMA)

------------------------------------------------------------------------------

## 6) GGUF Conversion and Merge (Required; Ollama Does Not Support LoRA Adapters)

### Why the merge is required
- Ollama errors when a Modelfile uses ADAPTER: `Error: 500 Internal Server Error: failed to initialize model: loras are not yet implemented`
- Therefore the LoRA must be merged into the base GGUF.

### llama.cpp setup (remote)
- Clone location: `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/llama.cpp`
- Build:

```
cd /mnt/fast.storage.rushg.me/datasets/apps/pytorch/llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF
cmake --build build -j $(nproc)
```

- Note: `-DLLAMA_CURL=OFF` is used because libcurl is missing on the host.
- Binaries:
  - `build/bin/llama-export-lora`
  - `build/bin/llama-gguf`
- When running, set:
  - `LD_LIBRARY_PATH=/mnt/.../llama.cpp/build/bin`

### Convert LoRA to GGUF
Use `convert_lora_to_gguf.py`:
```
python convert_lora_to_gguf.py \
  --lora /path/to/training_data/lora_adapter \
  --outfile /path/to/training_data/lora_adapter/options-lora.gguf
```

### Architecture mismatch pitfall (critical)
- The base GGUF from Ollama uses `general.architecture = gptoss`.
- The LoRA GGUF from the converter uses `general.architecture = gpt-oss`.
- `llama-export-lora` throws: `model arch and LoRA arch mismatch`

### Fix: rewrite LoRA GGUF metadata to `gptoss`
We used `gguf-py` to rewrite the metadata.
Example (run inside a Python container):
```
# Rewrite the LoRA GGUF so general.architecture matches the base ("gptoss").
from gguf import GGUFReader, GGUFWriter, GGUFValueType
import numpy as np

inp = "options-lora.gguf"
out = "options-lora-gptoss.gguf"

r = GGUFReader(inp)
w = GGUFWriter(out, "gptoss", endianess=r.endianess)

# Copy KV fields except general.architecture (GGUF.* bookkeeping keys are skipped too).
for key, field in r.fields.items():
    if key.startswith("GGUF.") or key in ("general.architecture", "general.alignment"):
        continue
    vtype = field.types[0]
    if vtype == GGUFValueType.ARRAY:
        w.add_key_value(key, field.contents(), vtype, field.types[-1])
    else:
        w.add_key_value(key, field.contents(), vtype)

# Copy tensors unchanged.
for t in r.tensors:
    data = t.data
    if not data.flags["C_CONTIGUOUS"]:
        data = np.ascontiguousarray(data)
    w.add_tensor(t.name, data,
                 raw_shape=list(map(int, t.shape)),
                 raw_dtype=t.tensor_type,
                 tensor_endianess=r.endianess)

# Finalize the output file (header, KV data, tensor data).
w.write_header_to_file()
w.write_kv_data_to_file()
w.write_tensors_to_file()
w.close()
```

### Tensor orientation mismatch (critical)
- After the arch fix, the merge failed with: `GGML_ASSERT(ggml_can_mul_mat(a, b)) failed`
- Root cause: the LoRA A/B tensors were stored in an orientation incompatible with the base GGUF.
- Fix: transpose the LoRA A and B **data** when re-serializing the GGUF.

**Important GGUF detail:**
- GGUF stores tensor dims reversed internally.
- You must transpose the data while keeping the *original raw_shape*.
- Working approach:
```
# Transpose only the data; keep the original raw_shape (GGUF dims are stored reversed).
if name.endswith(".lora_a") or name.endswith(".lora_b"):
    data = np.ascontiguousarray(data.T)
w.add_tensor(name, data, raw_shape=shape, raw_dtype=..., ...)
```

### Working LoRA GGUF for merge
- `options-lora-gptoss-transposed2.gguf`

### Merge LoRA into base GGUF
Base GGUF path (from the Ollama blob store):
`/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb`

Merge command:
```
export LD_LIBRARY_PATH=/mnt/.../llama.cpp/build/bin
/mnt/.../llama.cpp/build/bin/llama-export-lora \
  -m /mnt/.../ollama.models/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb \
  --lora /mnt/.../training_data/lora_adapter/options-lora-gptoss-transposed2.gguf \
  -o /mnt/.../training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf
```

### Merged output (final)
- `/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf`
- Size: ~13 GB
- File type: F16

### Intermediate artifacts kept (not deleted)
- `options-lora-gptoss.gguf`
- `options-lora-gptoss-transposed.gguf`
- `options-lora-gptoss-transposed-debug.gguf`
- `options-lora-gptoss-transposed2.gguf`
- `gpt-oss-20b-options-merged-f16-v2.gguf` (14 MB, failed)
- `gpt-oss-20b-options-merged-f16.gguf` (0 bytes, failed)

------------------------------------------------------------------------------

## 7) Ollama Integration (Final Model)

### Why ADAPTER does not work
A Modelfile with ADAPTER fails:
```
Error: 500 Internal Server Error: failed to initialize model: loras are not yet implemented
```
Therefore the merged GGUF is mandatory.

### Copy merged GGUF into Ollama imports
```
mkdir -p /mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports
cp /mnt/.../gpt-oss-20b-options-merged-f16-v3.gguf \
  /mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/
```

### Modelfile (with tool support)
**Important:** tools only work if the TEMPLATE block matches the base model template. Without a TEMPLATE, Ollama shows `{{ .Prompt }}` and tools are disabled.
We extracted the template from the base model:
```
sudo -n docker exec -i ix-ollama-ollama-1 ollama show gpt-oss:20b --template \
  > /mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/gptoss.template
```

Then we built the Modelfile:
`/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/Modelfile.trained-options-model`
```
FROM /root/.ollama/imports/gpt-oss-20b-options-merged-f16-v3.gguf
TEMPLATE """
"""
SYSTEM """You are a knowledgeable options trading assistant. Explain concepts clearly, use correct terminology (Greeks, volatility, spreads, assignment), and be explicit about assumptions. If information is uncertain, say so rather than guessing."""
```
The TEMPLATE body is shown empty here; paste the exact contents of `gptoss.template` between the triple quotes, otherwise tools will not work.

### Create the model
```
sudo -n docker exec -i ix-ollama-ollama-1 \
  ollama create trained-options-model -f /root/.ollama/imports/Modelfile.trained-options-model
```

### Verify in Ollama
```
sudo -n docker exec -i ix-ollama-ollama-1 ollama list
sudo -n docker exec -i ix-ollama-ollama-1 ollama show trained-options-model
```
Expected capabilities include: `completion`, `tools`, `thinking`.

### Runtime note
- `ollama run` can take a long time to load and may time out.
- Use the HTTP API for reliable results:
```
curl http://192.168.1.2:30068/api/generate -d '{
  "model":"trained-options-model:latest",
  "prompt":"Explain delta and gamma briefly.",
  "stream":false
}'
```

------------------------------------------------------------------------------

## 8) Tool/Function Call Requirement (Mandatory)

### How to verify tool support
1) `ollama show trained-options-model` should list `tools` under Capabilities.
2) `ollama show trained-options-model --template` should show the full template (not `{{ .Prompt }}`).

### Tool-call test (HTTP)
```
curl http://192.168.1.2:30068/api/chat -d '{
  "model":"trained-options-model:latest",
  "stream":false,
  "messages":[
    {"role":"system","content":"Use tools when available."},
    {"role":"user","content":"Compute total for quantity=3 price=4. Use tool."}
  ],
  "tools":[
    {"type":"function","function":{
      "name":"calc_total",
      "description":"Compute total cost for a trade",
      "parameters":{
        "type":"object",
        "properties":{"quantity":{"type":"number"},"price":{"type":"number"}},
        "required":["quantity","price"]
      }
    }}
  ]
}'
```
Expected: `tool_calls` in the response.

------------------------------------------------------------------------------

## 9) Known Failures + Fixes (Summary)

- **Ollama ADAPTER fails** -> merge the LoRA into the GGUF.
- **Arch mismatch** (`gpt-oss` vs `gptoss`) -> rewrite the LoRA metadata.
- **ggml_can_mul_mat assertion** -> transpose the LoRA A/B data.
- **MXFP4 gradient error** -> `model.enable_input_require_grads()`.
- **Bitsandbytes 4-bit OOM** -> use MXFP4 (`--quant auto`) on the GPU.
- **Triton compile error** -> use a PyTorch CUDA *devel* image or install gcc.
- **WSL convert_lora_to_gguf.py missing transformers** -> use docker or install transformers in WSL.
- **`ollama run` hangs** -> use `/api/generate` or `/api/chat` via curl.

------------------------------------------------------------------------------

## 10) Retrain Checklist (Minimal Friction)

1) **Prepare data locally**
   - Put docs in `eBooks/`.
   - Run:
     - `python tools/select_relevant.py ...`
     - `python tools/build_dataset.py ...`
   - Manually inspect `training_data/relevant/corpus.txt`.
2) **Sync to remote**
   - Example (PowerShell):
     - `scp -P 55555 -r .\ingest-ebook-options rushabh@192.168.1.2:/mnt/fast.storage.rushg.me/datasets/apps/pytorch/`
3) **Stop GPU-conflicting apps**
   - Stop the `llamacpp` app in the TrueNAS UI.
4) **Train LoRA in the TrueNAS app**
   - Ensure the GPU is attached.
   - Use `tools/finetune_lora.py` with `--log-seconds 120`.
   - Confirm the adapter is saved in `training_data/lora_adapter`.
5) **Convert LoRA to GGUF**
   - `convert_lora_to_gguf.py` -> `options-lora.gguf`
6) **Fix arch + transpose**
   - Rewrite the metadata to `gptoss`.
   - Transpose the LoRA A/B data.
   - Output: `options-lora-gptoss-transposed2.gguf`
7) **Merge into the base GGUF**
   - Use `llama-export-lora`.
   - Output: `gpt-oss-20b-options-merged-f16-v3.gguf`
8) **Ollama import**
   - Copy the GGUF to `/mnt/.../ollama.models/imports`.
   - Build the Modelfile with TEMPLATE.
   - `ollama create trained-options-model -f ...`
9) **Verify tool support**
   - `ollama show trained-options-model`
   - `/api/chat` tool-call test

------------------------------------------------------------------------------

## 11) Commands Used in This Run (Examples)

### Remote file listing (progress + verify)
```
ssh -p 55555 rushabh@192.168.1.2 "ls -la /mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter"
```

### GGUF metadata check
```
python - <<'PY'
from gguf import GGUFReader
r = GGUFReader("options-lora.gguf")
print(r.get_field("general.architecture").contents())
PY
```

### Merge with progress updates every 2 minutes
```
BASE=/mnt/.../ollama.models/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb
LORA=/mnt/.../options-lora-gptoss-transposed2.gguf
OUT=/mnt/.../gpt-oss-20b-options-merged-f16-v3.gguf
export LD_LIBRARY_PATH=/mnt/.../llama.cpp/build/bin
/mnt/.../llama-export-lora -m "$BASE" --lora "$LORA" -o "$OUT" &
pid=$!
while kill -0 $pid 2>/dev/null; do date; ls -lh "$OUT" || true; sleep 120; done
wait $pid
```

------------------------------------------------------------------------------

## 12) Notes About Local Files in This Repo

- `Modelfile.trained-options-model` (local) still references ADAPTER and is **not** valid for current Ollama (ADAPTER is unsupported).
- Use the remote Modelfile in `/mnt/.../ollama.models/imports/`.
- `_tmp_*` scripts exist from prior automation attempts (TrueNAS app creation, GPU checks, etc.). Use them only if you know what they do.

------------------------------------------------------------------------------

## 13) Progress Reporting Policy (Non-Negotiable)

During any long run (training, merge, large copy):
- Print a progress line every 120 seconds.
- Example: `date` + file size, or a training loss line.
- Do not allow silent runs.

------------------------------------------------------------------------------

## 14) Quick Sanity Checks (After Retrain)

1) `ollama list` shows `trained-options-model:latest`
2) `ollama show trained-options-model` lists `tools`
3) `/api/generate` returns a coherent answer
4) `/api/chat` returns a tool call when tools are provided

(A combined sketch of checks 1-3 appears after section 15.)

------------------------------------------------------------------------------

## 15) Do NOT Forget These Pitfalls

- Arch mismatch (`gpt-oss` vs `gptoss`) **will break the merge**.
- LoRA tensor orientation mismatch **will break the merge**.
- ADAPTER in a Modelfile **does not work** in current Ollama.
- Tool calls **only** work if the TEMPLATE is included.
- The remote shell is zsh; use `bash -lc` for complex quoting.
- Docker requires `sudo -n`.
- Use the remote GPU as requested; do not train on the CPU.
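
A minimal sketch that bundles sanity checks 1-3 from section 14 into one script, run from the local machine and assuming the host, SSH port, container name, and HTTP endpoint documented above:

```
#!/usr/bin/env bash
set -euo pipefail

HOST=192.168.1.2
SSH="ssh -p 55555 rushabh@$HOST"

# 1) The model is registered, and 2) its capabilities include tools.
$SSH "sudo -n docker exec -i ix-ollama-ollama-1 ollama list | grep trained-options-model"
$SSH "sudo -n docker exec -i ix-ollama-ollama-1 ollama show trained-options-model | grep -i tools"

# 3) /api/generate returns a coherent answer (stream disabled for a single JSON reply).
curl -s "http://$HOST:30068/api/generate" -d '{
  "model":"trained-options-model:latest",
  "prompt":"Explain delta and gamma briefly.",
  "stream":false
}'
echo
echo "For check 4, run the /api/chat tool-call test from section 8."
```

This is a convenience wrapper only; the authoritative verification steps remain sections 7, 8, and 14.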
------------------------------------------------------------------------------

## 16) Current "Final" Artifacts (Reference)

### LoRA adapter
`/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/`

### Merged GGUF (final)
`/mnt/fast.storage.rushg.me/datasets/apps/pytorch/ingest-ebook-options/training_data/lora_adapter/gpt-oss-20b-options-merged-f16-v3.gguf`

### Ollama Modelfile
`/mnt/fast.storage.rushg.me/datasets/apps/ollama.models/imports/Modelfile.trained-options-model`

### Ollama Model Name
`trained-options-model:latest`

------------------------------------------------------------------------------

## 17) If You Need to Rebuild Tools Support

1) Extract the base template:
```
sudo -n docker exec -i ix-ollama-ollama-1 \
  ollama show gpt-oss:20b --template > /mnt/.../gptoss.template
```
2) Create the Modelfile with a TEMPLATE block.
3) Re-run `ollama create`.
4) Verify that `ollama show trained-options-model` lists `tools`.

------------------------------------------------------------------------------

## 18) Git Repo + Source Inventory (This Repo)

### Remote git repo
- URL (HTTP): `https://git.rushg.me/rushabh/ollama-model-training-5060ti`
- URL (git): `https://git.rushg.me/rushabh/ollama-model-training-5060ti.git`
- Auth: the user will authenticate on push when prompted (username/password).

### What is committed (and why)
- `AGENTS.md` (this runbook; full end-to-end context).
- `README.md` (quick overview + links to AGENTS).
- `tools/` scripts for extraction, filtering, dataset build, and training.
- `training_data/` curated dataset, manifests, reports, and LoRA outputs used for the run (kept for reproducibility).
- `remote/ollama/Modelfile.trained-options-model.remote` (exact remote Modelfile used to enable tools).
- `remote/ollama/gptoss.template` (base template pulled from gpt-oss:20b).
- `Modelfile.trained-options-model` (local reference; see the remote Modelfile for the tool-enabled version).

### What is excluded (and why)
- `eBooks/` raw source data (large; keep local and private).
- `_llama_cpp/` (upstream repo; clone on demand).
- `.venv/` and Python caches.
- Any base model weights or Ollama blobs (too large; download via Ollama/HF).

### How to recreate missing external assets
- Base model:
  - `ollama pull gpt-oss:20b` on the Ollama host
  - or `huggingface-cli download openai/gpt-oss-20b` into the HF cache
- llama.cpp:
  - `git clone https://github.com/ggml-org/llama.cpp.git`
  - Build with `-DLLAMA_CURL=OFF` if libcurl is missing.

------------------------------------------------------------------------------

End of AGENTS.md