Updated docs
commit b8067d05a9 (parent 581bc4a575)

README.md
@@ -1,6 +1,6 @@
# Backstory

-Backstory is an AI Resume agent that provides context into a diverse career narative. Backstory will take a collection of documents about a person and provide:
+Backstory is an AI Resume agent that provides context into a diverse career narrative. Backstory will take a collection of documents about a person and provide:

* Through the use of several custom Language Processing Modules (LPM), develop a comprehensive set of test and validation data based on the input documents. While manual review of the content should be performed to ensure accuracy, several LLM techniques are employed in the LPM to isolate and remove hallucinations and inaccuracies in the test and validation data.
* Using quantized low-rank adaptation (QLoRA) and parameter-efficient fine-tuning (PEFT), provide a hyperparameter-tuned and customized LLM for use in chat and content creation scenarios with expert knowledge about the individual.

@@ -20,6 +20,9 @@ Before you spend too much time learning how to customize Backstory, you may want

Backstory works by generating a set of facts about you. Those facts can be exposed to the LLM via RAG, or baked into the LLM by fine-tuning. In either scenario, Backstory needs to know your relationship with a given fact.

**WIP**

WIP notes: Right now, it just uses RAG; I'm still working on the PEFT+QLoRA code, so take this section as aspirational... (patches welcome)

To facilitate this, Backstory expects the documents it reads to be marked with information that highlights your role in relation to the document. That information is either stored within each document as [Front Matter (YAML)](https://jekyllrb.com/docs/front-matter/) or in a YAML sidecar file (a file with the same name as the content, plus the extension `.yml`).

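As a rough sketch of the two options (the file names and YAML keys below are invented placeholders, not the actual fields Backstory expects):

```bash
# Hypothetical layout -- file names and YAML keys are placeholders only.
ls docs/
#   resume.md            <- YAML front matter embedded at the top of the file
#   team-project.md      <- no front matter...
#   team-project.md.yml  <- ...so its metadata lives in this YAML sidecar instead

head -n 4 docs/resume.md
# ---
# role: "author"         # placeholder key: your role in relation to the document
# title: "My Resume"     # placeholder key
# ---
```
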
The two key items expected in the front matter / sidecar are:

@@ -100,11 +103,29 @@ To download many common models for testing against, you can use the `fetch-model

* deepseek-r1:7b
* mistral:7b

To run the script:

```bash
docker compose exec -it ollama /fetch-models.sh
```

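If you only need one of the models listed above rather than the full set, pulling it directly with the `ollama` CLI inside the container also works (same compose service name as the command above):

```bash
docker compose exec -it ollama ollama pull mistral:7b
```
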
-The persisted volume mount can grow quite large with models, GPU kernel caching, etc. During the development of this project, the `./cache` directory has grown to consume ~250G of disk space.
+The persisted volume mounts (`./cache` and `./ollama`) can grow quite large with models, GPU kernel caching, etc. During the development of this project, the cache directory has grown to consume ~250G of disk space.

Inside `./cache` you will see directories like:

| Directory | Size | What's in it? |
|:----------|:-----:|:--------------|
| datasets | 23G | If you download any HF datasets, they will be here |
| hub | 310G | All of the HF models will show up here. |
| libsycl_cache | 2.9G | Used by... libsycl. It caches pre-compiled things here. |
| modules | ~1M | Not sure what created this. It has some Microsoft code, so maybe from markitdown? |
| neo_compiler_cache | 1.1G | If you are on an Intel GPU, this is where JIT-compiled GPU kernels go. If you launch a model and it seems to stall out, `watch ls -alt cache/neo_compiler_cache` to see if Intel's compute runtime (NEO) is writing here. |

And in `./ollama`:

| Directory | Size | What's in it? |
|:----------|:-----:|:--------------|
| models | 32G | All models downloaded via `ollama pull ...`. |

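To check how much space these directories are actually consuming, standard coreutils are enough (run from the repository root, where `./cache` and `./ollama` live):

```bash
du -sh cache/* ollama/* 2>/dev/null | sort -h
```
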
## Running

@@ -132,18 +153,26 @@ docker compose up -d

That will launch all the required containers. Once loaded, the following ports are exposed:

-#### Container: backstory
+#### Container: backstory-prod

* 8911 - HTTP for the chat server. If you want HTTPS (recommended), you should put an nginx reverse proxy in front of this endpoint. See WEB_PORT in src/server.py and the docker-compose `ports` entry under the `backstory` service. This port is safe to expose to the Internet if you want to serve Backstory from your own system.
* 3000 - During interactive development of the frontend, the React development server can be found at this port. By default, static content is served through port 8911. Do not expose this port to the Internet.

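A quick sanity check that the production chat server is answering on 8911 (this assumes it serves the frontend at the root path):

```bash
curl -sf http://localhost:8911/ > /dev/null && echo "backstory-prod is up"
```
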
#### Container: backstory

* 8912 - HTTPS for the development chat server. Do not expose this port to the Internet. The chat server running on 8912 uses a 3B Qwen model instead of the larger 7B model, which allows you to run backstory-prod and backstory (for development) on the same GPU without running out of memory.
* 3000 - During interactive development of the frontend, the React development server can be found at this port. By default, static content is served through port 8911. Do not expose this port to the Internet.

#### Container: jupyter

**Do not expose these ports to the Internet**

* 8888 - Jupyter Notebook. You can access this port for a Jupyter notebook running on top of the `backstory` base container.
* 60673 - This allows you to connect to Gradio apps from outside the container, provided you launch the Gradio app on port 60673, e.g. `.launch(server_name="0.0.0.0", server_port=60673)`

#### Container: ollama

**Do not expose these ports to the Internet**

* 11434 - ollama server port. This should not be exposed to the Internet. You can use it via curl/wget locally. The `backstory` and `jupyter` containers are on the same Docker network, so they do not need this port exposed if you don't want it. See docker-compose.yml `ports` under `ollama`.

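For example, to confirm the ollama server is reachable from the host (`/api/tags` lists the models it has pulled):

```bash
curl -s http://localhost:11434/api/tags
```
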
Once the above is running, to launch the backstory shell interactively:

@@ -180,4 +209,38 @@ To monitor a device:

```bash
docker compose exec backstory ze-monitor --device 2
```

If you have more than one GPU, the device numbering can change when you reboot. You can specify the PCI ID instead of a device number:

```bash
docker compose exec backstory ze-monitor --device 8086:e20b
```

NOTE: The ability to monitor temperature sensors, etc. is restricted while running in a container. I recommend installing ze-monitor on the host system and running it there.

Sample output:

```
$ ze-monitor --device 8086:e20b --one-shot
Device: 8086:E20B (Intel(R) Arc(TM) B580 Graphics)
Total Memory: 12809404416
Free memory: [# 4% ]
Sensor 0: 38.0C
Sensor 1: 33.0C
Sensor 2: 38.0C
Power usage: 36.0W
-----------------------------------------------------------------------
PID      COMMAND-LINE
         USED MEMORY   SHARED MEMORY   ENGINE FLAGS
-----------------------------------------------------------------------
43344    /opt/ollama/ollama-bin serve
         MEM: 197795840   SHR: 0   FLAGS:
44098    /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 42341
         MEM: 1006231552   SHR: 0   FLAGS: DMA COMPUTE
85909    /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 38085
         MEM: 3873189888   SHR: 0   FLAGS: DMA COMPUTE
104468   /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 42505
         MEM: 7101763584   SHR: 0   FLAGS: DMA COMPUTE
132740   /opt/backstory/venv/bin/python /opt/backstory...orkers=32 --parent=9 --read-fd=3 --write-fd=
```