# Backstory
Backstory is an AI Resume agent that provides context into a diverse career narrative. Backstory will take a collection of documents about a person and provide:
- WIP: Through the use of several custom Language Processing Modules (LPMs), develop a comprehensive set of test and validation data from the input documents. While manual review of the content should be performed to ensure accuracy, several LLM techniques are employed in the LPMs to isolate and remove hallucinations and inaccuracies in the test and validation data.
- WIP: Utilizing quantized low-rank adaptation (QLoRA) and parameter-efficient fine-tuning (PEFT), provide a hyperparameter-tuned, customized LLM for use in chat and content-creation scenarios with expert knowledge about the individual.
- Post-training, utilize additional RAG content to further enhance the information domain used in conversations and content generation.
- An integrated document publishing workflow that transforms a "Job Description" into a customized "Resume" for the person the LLM has been trained on, incorporating a multi-stage "Fact Check" to reduce hallucination.
While it can run a variety of LLMs, Backstory currently runs Qwen2.5:7b. In addition to the standard model, the chat pipeline also exposes several utility tools the LLM can use to obtain real-time data.
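As an illustration of the tool-use mechanism, here is a hedged sketch of how a chat pipeline can advertise a real-time-data tool to the model via Ollama's `/api/chat` endpoint. The tool name and schema here are hypothetical, not Backstory's actual tool set:

```python
import requests

# Hypothetical tool definition; Backstory's real tools differ.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current UTC time",
        "parameters": {"type": "object", "properties": {}},
    },
}]

# Ask the model a question it should answer by requesting the tool.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "What time is it?"}],
        "tools": tools,
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"].get("tool_calls"))
```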
Internally, Backstory is built using PyTorch 2.6 and Python 3.11 (several pip packages were not yet available for Python 3.12, which ships with Ubuntu Oracular 24.10, the release these containers are based on).
This system was built to run on commodity hardware, for example the Intel Arc B580 GPU with 12 GB of VRAM.
## Zero to Hero
Before you spend too much time learning how to customize Backstory, you may want to see it in action with your own information. Fine-tuning the LLM with your data can take a while, so you might want to see what the system can do just by utilizing retrieval-augmented generation.
The `./docs` directory has been seeded with an AI-generated persona. That directory is only used during development; actual content should be put into the `./docs-prod` directory.
Launching with the defaults (which include the AI-generated persona), you can ask things like `Who is Eliza Morgan?`
If you want to seed your own data:

- Stop the backstory container: `docker compose down backstory`
- Remove everything from `docs/`: `rm -rf docs/*`
- Put your generic resume in `docs/resume/generic[.pdf,.md,.txt,.docx]`
- Remove everything from `chromadb/`: `rm -rf chromadb/*`
- Relaunch: `docker compose up backstory -d`
## WIP
Backstory works by generating a set of facts about you. Those facts can be exposed to the LLM via RAG, or baked into the LLM by fine-tuning. In either scenario, Backstory needs to know your relationship with a given fact.
WIP notes: Right now, it just uses RAG. I'm working on the PEFT+QLoRA code. So take this section as aspirational... (patches welcome)
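For the curious, the PEFT+QLoRA combination usually looks something like the following. This is a hedged sketch of the standard Hugging Face `peft`/`transformers` pattern, not the in-repo training code; the model name and hyperparameters are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA: load the base model 4-bit quantized (the "Q"), then attach
# trainable low-rank adapters (the "LoRA"). Values below are illustrative.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights train
```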
To provide that relationship information, Backstory expects the documents it reads to be marked with information that highlights your role in relation to the document. That information is stored either within each document as YAML front matter or in a YAML sidecar file (a file with the same name as the content, plus the extension `.yml`).
The two key items expected in the front matter / sidecar are:
```yaml
---
person:
role:
---
```
For example, a file `resume.md` could have the following either as front matter or in the sidecar file `resume.md.yml`:
```yaml
---
person: James Ketrenos
role: This resume is about James Ketrenos and refers to his work history.
---
```
A document from a project you worked on, in my case `backstory`, could have the following front matter:
```yaml
---
person: James Ketrenos
role: Designed, built, and deployed the application described in this document.
---
```
During both RAG extraction and fine-tuning, that context information is provided to the LLM so it can better respond to queries about the user and the user's specific roles.
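To make that concrete, here is a minimal sketch of how such markup could be read, preferring a `.yml` sidecar and falling back to front matter. This assumes PyYAML and is a hypothetical helper, not Backstory's actual loader:

```python
from pathlib import Path
import yaml

def load_context(doc_path: str) -> dict:
    """Return the person/role metadata for a document, from its
    .yml sidecar if present, otherwise from YAML front matter."""
    path = Path(doc_path)
    sidecar = path.with_name(path.name + ".yml")
    if sidecar.exists():
        return yaml.safe_load(sidecar.read_text()) or {}
    text = path.read_text()
    if text.startswith("---"):
        # Front matter is the YAML between the first two '---' markers.
        _, frontmatter, _ = text.split("---", 2)
        return yaml.safe_load(frontmatter) or {}
    return {}

print(load_context("docs/resume/generic.md"))  # e.g. {'person': ..., 'role': ...}
```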
This project is seeded with a minimal resume and a document about backstory. Those are present in the `docs/` directory, which is where you will place your content. If you do not replace anything and run the system as-is, Backstory will be able to provide information about me via RAG (there is fine-tuned data provided in this project archive).
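Since the RAG store lives in `./chromadb` (see the seeding steps above), here is a hedged sketch of how document chunks and their front-matter context might be stored and queried with chromadb. The collection name and metadata fields are assumptions, not the repo's actual schema:

```python
import chromadb

# Persistent store at the same path the containers bind mount.
client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection("backstory")

# Store a chunk along with its person/role context as metadata.
collection.add(
    ids=["resume-0"],
    documents=["Designed, built, and deployed the Backstory application."],
    metadatas=[{"person": "James Ketrenos",
                "role": "Designed, built, and deployed the application."}],
)

# Retrieve the chunks most relevant to a user query.
results = collection.query(query_texts=["What did James build?"], n_results=1)
print(results["documents"], results["metadatas"])
```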
## Installation
This project uses Docker containers to build. As this was originally written to work on an Intel Arc B580 (Battlemage), it requires a kernel that supports that hardware, such as the one documented at Intel Graphics Preview, which runs on Ubuntu Oracular (24.10).
NOTE: You need `docker compose` installed. See [Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/).
### Building
```
git clone https://github.com/jketreno/backstory
cd backstory
docker compose build
```
## Containers
This project provides the following containers:
| Container | Purpose |
|---|---|
| backstory | Base container with GPU packages installed and configured. Main server entry point. Exposes an HTTPS entry point for use during frontend development. |
| backstory-prod | Base container with GPU packages installed and configured. Main server entry point. Exposes an HTTP entry point for use behind nginx or another reverse proxy. Serves the static files generated by frontend. |
| frontend | Frontend development and building of the static files for backstory-prod. |
| jupyter | backstory plus a Jupyter notebook server for running Jupyter sessions. |
| miniircd | Tiny deployment of an IRC server for testing IRC agents. |
| ollama | Installation of Intel's pre-built Ollama. |
While developing Backstory, Hugging Face models loaded via PyTorch are sometimes used directly; at other times, especially during rapid development, the ollama deployment is used. This combination makes it easy to use the local GPU through either path (the local ollama server or HF code).
To see which models are easily deployable with Ollama, see the [Ollama Model List](https://ollama.com/library).
Prior to using a new model, you need to download it:
```
MODEL=qwen2.5:7b
docker compose exec -it ollama ollama pull ${MODEL}
```
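Once a model is pulled, you can sanity-check it from the host against Ollama's REST API on port 11434 (see the port notes under Running); a minimal sketch using `requests`:

```python
import requests

# One-shot generation against the ollama container's /api/generate endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:7b", "prompt": "Say hello.", "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```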
To download many common models for testing against, you can use the `fetch-models.sh` script, which will download:
- qwen2.5:7b
- llama3.2
- mxbai-embed-large
- deepseek-r1:7b
- mistral:7b
To run the script:
```
docker compose exec -it ollama /fetch-models.sh
```
The persisted volume mounts (`./cache` and `./ollama`) can grow quite large with models, GPU kernel caching, etc. During the development of this project, the cache directory has grown to consume ~250G of disk space.
Inside `cache` you will see directories like:
| Directory | Size | What's in it? |
|---|---|---|
| datasets | 23G | If you download any HF datasets, they will be here. |
| hub | 310G | All of the HF models will show up here. See `docker exec backstory shell "huggingface-cli scan-cache"`. |
| libsycl_cache | 2.9G | Used by... libsycl. It caches pre-compiled things here. |
| modules | ~1M | Not sure what created this. It has some Microsoft code, so maybe from markitdown? |
| neo_compiler_cache | 1.1G | If you are on an Intel GPU, this is where JIT-compiled GPU kernels go. If you launch a model and it seems to stall out, watch `ls -alt cache/neo_compiler_cache` to see if Intel's compute runtime (NEO) is writing here. |
I haven't kept up on pruning out old models I'm not using. Sample output of running the huggingface-cli command:
```
$ docker exec backstory shell "huggingface-cli scan-cache -vvv"
REPO ID REPO TYPE REVISION SIZE ON DISK NB FILES LAST_MODIFIED REFS LOCAL PATH
---------------------------------------------------- --------- ---------------------------------------- ------------ -------- ------------- ---------- ---------------------------------------------------------------------------------------------------------------------------------
Matthijs/cmu-arctic-xvectors dataset 36e87b347a6a70f0420445b02ec40c55556f9ed7 21.3M 1 5 weeks ago /root/.cache/hub/datasets--Matthijs--cmu-arctic-xvectors/snapshots/36e87b347a6a70f0420445b02ec40c55556f9ed7
Matthijs/cmu-arctic-xvectors dataset 5c1297a9eb6c91714ea77c0d4ac5aca9b6a952e5 2.4K 2 5 weeks ago main /root/.cache/hub/datasets--Matthijs--cmu-arctic-xvectors/snapshots/5c1297a9eb6c91714ea77c0d4ac5aca9b6a952e5
McAuley-Lab/Amazon-Reviews-2023 dataset 2b6d039ed471f2ba5fd2acb718bf33b0a7e5598e 25.2G 10 3 weeks ago main /root/.cache/hub/datasets--McAuley-Lab--Amazon-Reviews-2023/snapshots/2b6d039ed471f2ba5fd2acb718bf33b0a7e5598e
yahma/alpaca-cleaned dataset 12567cabf869d7c92e573c7c783905fc160e9639 44.3M 2 2 months ago main /root/.cache/hub/datasets--yahma--alpaca-cleaned/snapshots/12567cabf869d7c92e573c7c783905fc160e9639
IDEA-Research/grounding-dino-tiny model a2bb814dd30d776dcf7e30523b00659f4f141c71 690.3M 8 2 days ago main /root/.cache/hub/models--IDEA-Research--grounding-dino-tiny/snapshots/a2bb814dd30d776dcf7e30523b00659f4f141c71
Intel/neural-chat-7b-v3-3 model 7506dfc5fb325a8a8e0c4f9a6a001671833e5b8e 14.5G 10 3 months ago main /root/.cache/hub/models--Intel--neural-chat-7b-v3-3/snapshots/7506dfc5fb325a8a8e0c4f9a6a001671833e5b8e
Qwen/CodeQwen1.5-7B-Chat model 7b0cc3380fe815e6f08fe2f80c03e05a8b1883d8 14.5G 10 4 weeks ago main /root/.cache/hub/models--Qwen--CodeQwen1.5-7B-Chat/snapshots/7b0cc3380fe815e6f08fe2f80c03e05a8b1883d8
TheBloke/neural-chat-7B-v3-2-AWQ model f3c5e4160e0faecf91ca396558527ba13f1efb72 2.3M 6 2 months ago main /root/.cache/hub/models--TheBloke--neural-chat-7B-v3-2-AWQ/snapshots/f3c5e4160e0faecf91ca396558527ba13f1efb72
TheBloke/neural-chat-7B-v3-2-GGUF model 97de3dbd877a4b022eda57b292d0efba0187ed79 7.5G 3 2 months ago main /root/.cache/hub/models--TheBloke--neural-chat-7B-v3-2-GGUF/snapshots/97de3dbd877a4b022eda57b292d0efba0187ed79
black-forest-labs/FLUX.1-dev model 0ef5fff789c832c5c7f4e127f94c8b54bbcced44 57.9G 29 6 weeks ago main /root/.cache/hub/models--black-forest-labs--FLUX.1-dev/snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44
black-forest-labs/FLUX.1-schnell model 741f7c3ce8b383c54771c7003378a50191e9efe9 33.7G 23 6 weeks ago main /root/.cache/hub/models--black-forest-labs--FLUX.1-schnell/snapshots/741f7c3ce8b383c54771c7003378a50191e9efe9
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562 3.6G 5 2 months ago main /root/.cache/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B model 916b56a44061fd5cd7d6a8fb632557ed4f724f60 15.2G 7 2 months ago main /root/.cache/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60
intel/neural-chat-7b-v3 model 7f6ebc113310e0d2ecc92ae94daeddba5493704d 2.3M 7 2 months ago main /root/.cache/hub/models--intel--neural-chat-7b-v3/snapshots/7f6ebc113310e0d2ecc92ae94daeddba5493704d
intel/neural-chat-7b-v3-3 model 7506dfc5fb325a8a8e0c4f9a6a001671833e5b8e 2.3M 7 2 months ago main /root/.cache/hub/models--intel--neural-chat-7b-v3-3/snapshots/7506dfc5fb325a8a8e0c4f9a6a001671833e5b8e
llmware/intel-neural-chat-7b-v3-2-ov model 7a0a312108b4b9c37c739eb83b592c30c9965eb0 2.3M 5 2 months ago main /root/.cache/hub/models--llmware--intel-neural-chat-7b-v3-2-ov/snapshots/7a0a312108b4b9c37c739eb83b592c30c9965eb0
meta-llama/Llama-3.2-3B model 13afe5124825b4f3751f836b40dafda64c1ed062 9.1M 3 3 weeks ago main /root/.cache/hub/models--meta-llama--Llama-3.2-3B/snapshots/13afe5124825b4f3751f836b40dafda64c1ed062
meta-llama/Llama-3.2-3B-Instruct model 0cb88a4f764b7a12671c53f0838cd831a0843b95 9.1M 3 5 weeks ago main /root/.cache/hub/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95
microsoft/Florence-2-base model ceaf371f01ef66192264811b390bccad475a4f02 467.1M 9 2 days ago main /root/.cache/hub/models--microsoft--Florence-2-base/snapshots/ceaf371f01ef66192264811b390bccad475a4f02
microsoft/florence-2-base model ceaf371f01ef66192264811b390bccad475a4f02 2.5M 7 2 days ago main /root/.cache/hub/models--microsoft--florence-2-base/snapshots/ceaf371f01ef66192264811b390bccad475a4f02
microsoft/speecht5_hifigan model 6f01b211b404df2e0a0a20ca79628a757bb35854 50.6M 1 5 weeks ago refs/pr/1 /root/.cache/hub/models--microsoft--speecht5_hifigan/snapshots/6f01b211b404df2e0a0a20ca79628a757bb35854
microsoft/speecht5_hifigan model bb6f429406e86a9992357a972c0698b22043307d 50.7M 2 5 weeks ago main /root/.cache/hub/models--microsoft--speecht5_hifigan/snapshots/bb6f429406e86a9992357a972c0698b22043307d
microsoft/speecht5_tts model 30fcde30f19b87502b8435427b5f5068e401d5f6 585.7M 7 5 weeks ago main /root/.cache/hub/models--microsoft--speecht5_tts/snapshots/30fcde30f19b87502b8435427b5f5068e401d5f6
microsoft/speecht5_tts model a01d4f293234515125d07f68be3c36d739ccac93 585.4M 1 5 weeks ago refs/pr/28 /root/.cache/hub/models--microsoft--speecht5_tts/snapshots/a01d4f293234515125d07f68be3c36d739ccac93
mistralai/Mistral-Small-3.1-24B-Instruct-2503 model 247c7a102f360e2ab181caf6aa7e8144316fd488 96.1G 25 5 weeks ago main /root/.cache/hub/models--mistralai--Mistral-Small-3.1-24B-Instruct-2503/snapshots/247c7a102f360e2ab181caf6aa7e8144316fd488
openlm-research/open_llama_3b_v2 model 4293833c8795656cdacfae811f713ada0e7a2726 6.9G 1 2 months ago refs/pr/16 /root/.cache/hub/models--openlm-research--open_llama_3b_v2/snapshots/4293833c8795656cdacfae811f713ada0e7a2726
openlm-research/open_llama_3b_v2 model bce5d60d3b0c68318862270ec4e794d83308d80a 6.9G 6 2 months ago main /root/.cache/hub/models--openlm-research--open_llama_3b_v2/snapshots/bce5d60d3b0c68318862270ec4e794d83308d80a
openlm-research/open_llama_7b_v2 model e5961def23172a2384543940e773ab676033c963 13.5G 10 3 months ago main /root/.cache/hub/models--openlm-research--open_llama_7b_v2/snapshots/e5961def23172a2384543940e773ab676033c963
runwayml/stable-diffusion-v1-5 model 451f4fe16113bff5a5d2269ed5ad43b0592e9a14 5.5G 15 6 weeks ago main /root/.cache/hub/models--runwayml--stable-diffusion-v1-5/snapshots/451f4fe16113bff5a5d2269ed5ad43b0592e9a14
sentence-transformers/all-MiniLM-L6-v2 model c9745ed1d9f207416be6d2e6f8de32d1f16199bf 91.6M 11 2 months ago main /root/.cache/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/c9745ed1d9f207416be6d2e6f8de32d1f16199bf
stabilityai/stable-diffusion-xl-base-1.0 model 462165984030d82259a11f4367a4eed129e94a7b 7.1G 19 1 day ago main /root/.cache/hub/models--stabilityai--stable-diffusion-xl-base-1.0/snapshots/462165984030d82259a11f4367a4eed129e94a7b
unsloth/Mistral-Small-24B-Base-2501-unsloth-bnb-4bit model 4e277e563e75dc642a9947b0a5e42b16440c9546 15.7G 12 5 weeks ago main /root/.cache/hub/models--unsloth--Mistral-Small-24B-Base-2501-unsloth-bnb-4bit/snapshots/4e277e563e75dc642a9947b0a5e42b16440c9546
Done in 0.0s. Scanned 29 repo(s) for a total of 326.8G.
```
And inside `ollama`:
| Directory | Size | What's in it? |
|---|---|---|
| models | 32G | All models downloaded via `ollama pull ...`. Run `docker exec ollama ollama list`. |
Sample output of running `ollama list`:
```
$ docker exec ollama ollama list
ggml_sycl_init: found 1 SYCL devices:
NAME ID SIZE MODIFIED
mxbai-embed-large:latest 468836162de7 669 MB 15 hours ago
qwen2.5:3b 357c53fb659c 1.9 GB 10 days ago
mistral:7b f974a74358d6 4.1 GB 2 weeks ago
qwen2.5:7b 845dbda0ea48 4.7 GB 3 weeks ago
llama3.2:latest a80c4f17acd5 2.0 GB 3 weeks ago
dolphin-phi:latest c5761fc77240 1.6 GB 6 weeks ago
llama3.2-vision:latest 085a1fdae525 7.9 GB 6 weeks ago
llava:latest 8dd30f6b0cb1 4.7 GB 6 weeks ago
deepseek-r1:1.5b a42b25d8c10a 1.1 GB 7 weeks ago
deepseek-r1:7b 0a8c26691023 4.7 GB 7 weeks ago
```
## Running
In order to download Hugging Face models, you need to have a Hugging Face token. See https://huggingface.co/settings/tokens for information on obtaining a token.
Edit `.env` to add the following:

```
HF_ACCESS_TOKEN=<access token from huggingface>
HF_HOME=/root/.cache
```
HF_HOME is set so that, when running in the containers, it points to a volume-mounted directory, enabling model downloads to be persisted.
NOTE: Models downloaded by most examples will be placed in the `./cache` directory, which is bind mounted to the container.
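As a quick check that the token and cache location are wired up correctly, something like the following should download a model into the bind-mounted cache (a sketch using `huggingface_hub`; the model name is only an example):

```python
import os
from huggingface_hub import login, snapshot_download

# HF_ACCESS_TOKEN and HF_HOME come from .env; inside the containers,
# HF_HOME already points at the bind-mounted ./cache volume.
login(token=os.environ["HF_ACCESS_TOKEN"])
path = snapshot_download("Qwen/Qwen2.5-7B-Instruct")
print(path)  # resolves under $HF_HOME/hub
```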
### Backstory
If you just want to run the pre-built environment, you can run:

```
docker compose up -d
```
That will launch all the required containers. Once loaded, the following ports are exposed:
#### Container: backstory-prod

- 8911 - HTTP for the chat server. If you want HTTPS (recommended), you should use an nginx reverse proxy to provide that endpoint. See `WEB_PORT` in `src/server.py` and the docker-compose `ports` entry under the `backstory` service. This port is safe to expose to the Internet if you want to serve this from your own host.
#### Container: backstory

- 8912 - HTTPS for the development chat server. Do not expose this port to the Internet. The chat server running on 8912 uses qwen2.5:3b instead of the larger 7B model, which allows you to run backstory-prod and backstory (for development) on the same GPU without running out of memory.
- 3000 - During interactive development of the frontend, the React dev server is found at this port. By default, static content is served through port 8911. Do not expose this port to the Internet.
#### Container: jupyter

Do not expose these ports to the Internet.

- 8888 - Jupyter Notebook. You can access this port for a Jupyter notebook running on top of the `backstory` base container.
- 60673 - This allows you to connect to Gradio apps from outside the container, provided you launch Gradio on port 60673: `.launch(server_name="0.0.0.0", server_port=60673)`
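For reference, a complete minimal Gradio app bound that way might look like the following (a hypothetical example, not something shipped in the project):

```python
import gradio as gr

def echo(message: str) -> str:
    # Trivial handler; replace with your own demo logic.
    return message

# Bind to all interfaces on the port that the jupyter container maps out,
# so the app is reachable from outside the container.
gr.Interface(fn=echo, inputs="text", outputs="text").launch(
    server_name="0.0.0.0", server_port=60673
)
```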
#### Container: ollama

Do not expose these ports to the Internet.

- 11434 - ollama server port. This should not be exposed to the Internet. You can use it via curl/wget locally. The `backstory` and `jupyter` containers are on the same Docker network, so they do not need this port exposed if you don't want it. See the docker-compose.yml `ports` entry under `ollama`.
Once the above is running, launch the backstory shell interactively with:

```
docker compose exec -it backstory shell
```
### Jupyter
```
docker compose up jupyter -d
```
The default port for inbound connections is 8888 (see docker-compose.yml). `$(pwd)/jupyter` is bind mounted to `/opt/jupyter` in the container, which is where notebooks will be saved by default.
To access the Jupyter notebook, go to https://localhost:8888/jupyter.
### Monitoring
You can run `ze-monitor` within the launched containers to monitor GPU usage.
```
$ docker compose exec backstory ze-monitor --list
Container 5317c503e771 devices:
Device 1: 8086:A780 (Intel(R) UHD Graphics 770)
Device 2: 8086:E20B (Intel(R) Graphics [0xe20b])
```
To monitor a device:

```
docker compose exec backstory ze-monitor --device 2
```
If you have more than one GPU, the device numbering can change when you reboot. You can specify the PCI ID instead of a device number:

```
docker compose exec backstory ze-monitor --device 8086:e20b
```
NOTE: The ability to monitor temperature sensors, etc. is restricted while running in a container. I recommend installing ze-monitor on the host system and running it there.
Sample output:
```
$ ze-monitor --device 8086:e20b --one-shot
Device: 8086:E20B (Intel(R) Arc(TM) B580 Graphics)
Total Memory: 12809404416
Free memory: [# 4% ]
Sensor 0: 38.0C
Sensor 1: 33.0C
Sensor 2: 38.0C
Power usage: 36.0W
-----------------------------------------------------------------------
PID COMMAND-LINE
USED MEMORY SHARED MEMORY ENGINE FLAGS
-----------------------------------------------------------------------
43344 /opt/ollama/ollama-bin serve
MEM: 197795840 SHR: 0 FLAGS:
44098 /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 42341
MEM: 1006231552 SHR: 0 FLAGS: DMA COMPUTE
85909 /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 38085
MEM: 3873189888 SHR: 0 FLAGS: DMA COMPUTE
104468 /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 42505
MEM: 7101763584 SHR: 0 FLAGS: DMA COMPUTE
132740 /opt/backstory/venv/bin/python /opt/backstory...orkers=32 --parent=9 --read-fd=3 --write-fd=
```