# Backstory
Backstory is an AI Resume agent that provides context into a diverse career narrative. Backstory will take a collection of documents about a person and provide:
- Through the use of several custom Language Processing Modules (LPM), develop a comprehensive set of test and validation data based on the input documents. While manual review of content should be performed to ensure accuracy, several LLM techniques are employed in the LPM in order to isolate and remove hallucinations and inaccuracies in the test and validation data.
- Utilizing quantized low-rank adaptation (QLoRA) and parameter-efficient fine-tuning (PEFT), provide a hyperparameter-tuned and customized LLM for use in chat and content creation scenarios with expert knowledge about the individual.
- Post-training, utilize additional RAG content to further enhance the information domain used in conversations and content generation.
- An integrated document publishing workflow that will transform a "Job Description" into a customized "Resume" for the person the LLM has been trained on.
- "Fact Check" the resulting resume against the RAG content directly provided by the user in order to remove hallucinations.
While it can run a variety of LLM models, Backstory is currently running Qwen2.5:7b. In addition to the standard model, the chat pipeline also exposes several utility tools for the LLM to use to obtain real-time data.
Internally, Backstory is built using PyTorch 2.6 and Python 3.11 (several pip packages were not yet available for Python 3.12, which ships with Ubuntu Oracular 24.10, the release these containers are based on).
This system was built to run on commodity hardware, for example the Intel Arc B580 GPU with 12 GB of memory.
## Zero to Hero
Before you spend too much time learning how to customize Backstory, you may want to see it in action with your own information. Fine-tuning the LLM with your data can take a while, so you might want to see what the system can do just by utilizing retrieval-augmented generation.
Backstory works by generating a set of facts about you. Those facts can be exposed to the LLM via RAG, or baked into the LLM by fine-tuning. In either scenario, Backstory needs to know your relationship with a given fact.
WIP note: Right now, it just uses RAG. I'm working on the PEFT+QLoRA code. So take this section as aspirational... (patches welcome)
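For the curious, here is a minimal, illustrative sketch of what that fine-tuning path might look like using the Hugging Face `transformers` and `peft` libraries. The model id, LoRA hyperparameters, and the use of `bitsandbytes` 4-bit quantization are assumptions (and `bitsandbytes` may not be usable on Intel GPUs); this is not the code shipping in Backstory.

```python
# Illustrative sketch only -- the PEFT + QLoRA path in Backstory is still WIP.
# Model id, LoRA hyperparameters, and bitsandbytes usage are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example; substitute the model you use

# 4-bit quantization (the "Q" in QLoRA); requires a backend that supports bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only parameters that get trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```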
To facilitate this, Backstory expects the documents it reads to be marked with information that highlights your role in relation to the document. That information is either stored within each document as Front Matter (YAML) or as a YAML sidecar file (a file with the same name as the content, plus the extension `.yml`).
The two key items expected in the front matter / sidecar are:
---
person:
role:
---
For example, a file `resume.md` could have the following either as front matter or in the file `resume.md.yml`:
---
person: James Ketrenos
role: This resume is about James Ketrenos and refers to his work history.
---
A document from a project you worked on, in my case `backstory`, could have the following front matter:
---
person: James Ketrenos
role: Designed, built, and deployed the application described in this document.
---
During both RAG extraction and fine-tuning, that context information is provided to the LLM so it can better respond to queries about the user and that user's specific roles.
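As a rough illustration of how that metadata can be picked up, the sketch below reads the `person` and `role` fields from either a sidecar file or YAML front matter. It assumes PyYAML is available; the helper name, return shape, and example path are hypothetical, not Backstory's actual API.

```python
# Hypothetical helper, not Backstory's actual implementation.
import os
import yaml  # PyYAML


def load_document_context(path: str) -> dict:
    """Return the person/role metadata for a document, if any."""
    sidecar = path + ".yml"
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            return yaml.safe_load(f) or {}

    with open(path) as f:
        text = f.read()

    if text.startswith("---"):
        # Front matter sits between the first two '---' markers.
        _, front_matter, _ = text.split("---", 2)
        return yaml.safe_load(front_matter) or {}

    return {}


context = load_document_context("docs/resume.md")  # example path
print(context.get("person"), "-", context.get("role"))
```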
This project is seeded with a minimal resume and a document about backstory. Those are present in the `docs/` directory, which is where you will place your content. If you do not replace anything and run the system as-is, Backstory will be able to provide information about me via RAG (there is fine-tuned data provided in this project archive).
## Installation
This project uses Docker containers to build. As this was originally written to work on an Intel Arc B580 (Battlemage), it requires a kernel that supports that hardware, such as the one documented at Intel Graphics Preview, which runs in Ubuntu Oracular (24.10).
NOTE: You need 'docker compose' installed. See Install Docker Engine on Ubuntu
### Building
git clone https://github.com/jketreno/backstory
cd backstory
docker compose build
### Containers
This project provides the following containers:
| Container | Purpose |
|---|---|
| backstory | Base container with GPU packages installed and configured. Main server entry point. Also used for frontend development. |
| jupyter | backstory + Jupyter notebook for running Jupyter sessions |
| miniircd | Tiny deployment of an IRC server for testing IRC agents |
| ollama | Installation of Intel's pre-built Ollama.cpp |
While developing Backstory, sometimes Hugging Face is used directly with models loaded via PyTorch. At other times, especially during rapid development, the ollama deployment is used. This combination allows you to easily access GPUs running locally (via the local ollama or HF code).
To see which models are easily deployable with Ollama, see the Ollama Model List.
Prior to using a new model, you need to download it:
MODEL=qwen2.5:7b
docker compose exec -it ollama ollama pull ${MODEL}
To download many common models for testing against, you can use the `fetch-models.sh` script, which will download:
- qwen2.5:7b
- llama3.2
- mxbai-embed-large
- deepseek-r1:7b
- mistral:7b
To run the script:
docker compose exec -it ollama /fetch-models.sh
The persisted volume mounts (`./cache` and `./ollama`) can grow quite large with models, GPU kernel caching, etc. During the development of this project, the cache directory has grown to consume ~250G of disk space.
Inside the cache you will see directories like:
| Directory | Size | What's in it? |
|:----------|:----:|:--------------|
| datasets | 23G | If you download any HF datasets, they will be here |
| hub | 310G | All of the HF models will show up here. |
| libsycl_cache | 2.9G | Used by... libsycl. It caches pre-compiled things here. |
| modules | ~1M | Not sure what created this. It has some Microsoft code, so maybe from markitdown? |
| neo_compiler_cache | 1.1G | If you are on an Intel GPU, this is where JIT compiled GPU kernels go. If you launch a model and it seems to stall out, watch `ls -alt cache/neo_compiler_cache` to see if Intel's compute runtime (NEO) is writing here. |
And in ollama:
| Directory | Size | What's in it? |
|:----------|:----:|:--------------|
| models | 32G | All models downloaded via `ollama pull ...`. |
## Running
In order to download Hugging Face models, you need to have a Hugging Face token. See https://huggingface.co/settings/tokens for information on obtaining a token.
Edit .env to add the following:
HF_ACCESS_TOKEN=<access token from huggingface>
HF_HOME=/root/.cache
HF_HOME points to a volume-mounted directory when running in the containers, which allows model downloads to be persisted.
NOTE: Models downloaded by most examples will be placed in the ./cache directory, which is bind mounted to the container.
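As a quick sanity check that the token and cache are wired up, something like the following can be run inside the container (the model id here is only an example; downloads should land under HF_HOME, i.e. `./cache` on the host):

```python
# Minimal check that the HF token works and that downloads land in HF_HOME.
import os
from huggingface_hub import login
from transformers import AutoTokenizer

login(token=os.environ["HF_ACCESS_TOKEN"])

# Example model id; substitute whatever model you plan to use.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print("Cached under:", os.environ.get("HF_HOME", "~/.cache/huggingface"))
```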
### Backstory
If you just want to run the pre-built environment, you can run:
docker compose up -d
That will launch all the required containers. Once loaded, the following ports are exposed:
Container: backstory-prod
- 8911 - HTTP for the chat server. If you want HTTPS (recommended), you should use an nginx reverse proxy to provide this endpoint. See `WEB_PORT` in `src/server.py` and `ports` under the `backstory` service in docker-compose. This port is safe to expose to the Internet if you want to serve this from your own service.
Container: backstory
- 8912 - HTTPS for the development chat server. Do not expose this port to the Internet. The chat server running on 8912 uses qwen2.5:3b instead of the larger 7B model, which allows you to run backstory-prod and backstory (for development) on the same GPU without running out of memory.
- 3000 - During interactive development of the frontend, the React dev server is available on this port. By default, static content is served through port 8911. Do not expose this port to the Internet.
Container: jupyter
Do not expose these ports to the Internet
- 8888 - Jupyter Notebook. You can access this port for a Jupyter notebook running on top of the `backstory` base container.
- 60673 - This allows you to connect to Gradio apps from outside the container, provided you launch Gradio on port 60673: `.launch(server_name="0.0.0.0", server_port=60673)` (see the sketch below).
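For example, a minimal Gradio app launched from a notebook in the `jupyter` container might look like this (assuming the `gradio` package is installed there):

```python
# Minimal Gradio app bound to the forwarded port; gradio is assumed to be installed.
import gradio as gr

def echo(message: str) -> str:
    return f"You said: {message}"

gr.Interface(fn=echo, inputs="text", outputs="text").launch(
    server_name="0.0.0.0", server_port=60673
)
```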
Container: ollama
Do not expose these ports to the Internet
- 11434 - ollama server port. This should not be exposed to the Internet. You can use it via curl/wget locally. The `backstory` and `jupyter` containers are on the same Docker network, so they do not need this port exposed if you don't want it (see the example below). See `ports` under `ollama` in `docker-compose.yml`.
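For example, from the `backstory` or `jupyter` container you could query the ollama API over the Docker network like this (assuming the `requests` package is available and the service is reachable at the `ollama` hostname from docker-compose):

```python
# Query the ollama server from another container on the same Docker network.
# The "ollama" hostname and the installed model are assumptions from this setup.
import requests

response = requests.post(
    "http://ollama:11434/api/generate",
    json={
        "model": "qwen2.5:7b",
        "prompt": "Summarize Backstory in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```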
Once the above is running, to launch the backstory shell interactively:
docker compose exec -it backstory shell
### Jupyter
docker compose up jupyter -d
The default port for inbound connections is 8888 (see docker-compose.yml). `$(pwd)/jupyter` is bind mounted to `/opt/jupyter` in the container, which is where notebooks will be saved by default.
To access the Jupyter notebook, go to https://localhost:8888/jupyter.
## Monitoring
You can run `ze-monitor` within the launched containers to monitor GPU usage.
docker compose exec backstory ze-monitor --list
Container 5317c503e771 devices:
Device 1: 8086:A780 (Intel(R) UHD Graphics 770)
Device 2: 8086:E20B (Intel(R) Graphics [0xe20b])
To monitor a device:
docker compose exec backstory ze-monitor --device 2
If you have more than one GPU, the device numbering can change when you reboot. You can specify the PCI ID instead of a device number:
docker compose exec backstory ze-monitor --device 8086:e20b
NOTE: The ability to monitor temperature sensors, etc. is restricted while running in a container. I recommend installing ze-monitor on the host system and running it there.
Sample output:
$ ze-monitor --device 8086:e20b --one-shot
Device: 8086:E20B (Intel(R) Arc(TM) B580 Graphics)
Total Memory: 12809404416
Free memory: [# 4% ]
Sensor 0: 38.0C
Sensor 1: 33.0C
Sensor 2: 38.0C
Power usage: 36.0W
-----------------------------------------------------------------------
PID COMMAND-LINE
USED MEMORY SHARED MEMORY ENGINE FLAGS
-----------------------------------------------------------------------
43344 /opt/ollama/ollama-bin serve
MEM: 197795840 SHR: 0 FLAGS:
44098 /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 42341
MEM: 1006231552 SHR: 0 FLAGS: DMA COMPUTE
85909 /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 38085
MEM: 3873189888 SHR: 0 FLAGS: DMA COMPUTE
104468 /opt/ollama/ollama-bin runner --model /root/....threads 8 --no-mmap --parallel 1 --port 42505
MEM: 7101763584 SHR: 0 FLAGS: DMA COMPUTE
132740 /opt/backstory/venv/bin/python /opt/backstory...orkers=32 --parent=9 --read-fd=3 --write-fd=