Backstory
Backstory is an AI resume agent that provides context into a diverse career narrative.
This project provides an AI chat client. While it can run a variety of LLM models, it currently runs Qwen2.5:7b. The standard model is enhanced with a RAG expert system that chunks and embeds any text files placed in ./docs. It also exposes several utility tools the LLM can use to obtain real-time data.
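For a rough idea of what the RAG ingestion step looks like, the sketch below chunks the text files under ./docs and embeds each chunk. The chunk size, overlap, file glob, and embedding model here are illustrative assumptions, not the values Backstory actually uses:

from pathlib import Path
from sentence_transformers import SentenceTransformer  # assumed embedding backend

CHUNK_SIZE = 512  # characters per chunk (illustrative)
OVERLAP = 64      # characters shared between adjacent chunks (illustrative)

def chunk(text):
    # Split text into overlapping fixed-size character chunks.
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

chunks = []
for path in Path("./docs").glob("**/*.txt"):
    chunks.extend(chunk(path.read_text(encoding="utf-8")))

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model, for the example only
embeddings = model.encode(chunks)
print(f"{len(chunks)} chunks, embedding dimension {embeddings.shape[1]}")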
Internally, it is built using PyTorch 2.6 and Python 3.11 (several pip packages were not yet available for the Python 3.12 that ships with Ubuntu Oracular 24.10, which these containers are based on).
NOTE: Intel Arc A-series graphics processors do not support fp64, so fp64 may need to be emulated or the model may need to be quantized. It has been a while since I've had an A-series GPU to test on, so if you run into problems please file an issue; I have some routines I can put in, but don't have a way to test them.
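If you want to confirm that PyTorch inside the container can see your Intel GPU before loading a model, a quick check against PyTorch's XPU backend looks like this (a sanity-check sketch, not part of Backstory):

import torch

# List the Intel GPUs visible through PyTorch's XPU backend.
if torch.xpu.is_available():
    for i in range(torch.xpu.device_count()):
        print(f"XPU {i}: {torch.xpu.get_device_name(i)}")
else:
    print("No XPU device found; check the kernel and GPU driver setup.")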
Installation
This project uses Docker containers to build. As it was originally written to work on an Intel Arc B580 (Battlemage), it requires a kernel that supports that hardware, such as the one documented at Intel Graphics Preview, which runs in Ubuntu Oracular (24.10).
NOTE: You need 'docker compose' installed. See Install Docker Engine on Ubuntu
Want to run under WSL2? No can do...
https://www.intel.com/content/www/us/en/support/articles/000093216/graphics/processor-graphics.html
The A- and B-series discrete GPUs do not support SR-IOV, required for the GPU partitioning that Microsoft Windows uses in order to support GPU acceleration in WSL.
Building
NOTE: You need 'docker compose' installed. See Install Docker Engine on Ubuntu
git clone https://github.com/jketreno/backstory
cd backstory
docker compose build
Containers
This project provides the following containers:
Container | Purpose |
---|---|
backstory | Base container with GPU packages installed and configured. Main server entry point. Also used for frontend development. |
jupyter | backstory + Jupyter notebook for running Jupyter sessions |
miniircd | Tiny deployment of an IRC server for testing IRC agents |
ollama | Installation of Intel's pre-built Ollama.cpp |
While developing Backstory, Hugging Face is sometimes used directly, with models loaded via PyTorch. At other times, especially during rapid development, the Ollama deployment is used. This combination lets you easily exercise the local GPU through either path (the local Ollama server or the Hugging Face/PyTorch code).
To see which models are easily deployable with Ollama, see the Ollama Model List.
Prior to using a new model, you need to download it:
MODEL=qwen2.5:7b
docker compose exec -it ollama ollama pull ${MODEL}
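Once a model is pulled, you can confirm the Ollama server answers with it from the host. This is a generic check against Ollama's /api/generate endpoint, not something Backstory requires:

import requests

# Ask the Ollama server exposed on the host (port 11434) for a short completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:7b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])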
To download many common models for testing against, you can use the fetch-models.sh script, which will download:
- qwen2.5:7b
- llama3.2
- mxbai-embed-large
- deepseek-r1:7b
- mistral:7b
docker compose exec -it ollama /fetch-models.sh
The persisted volume mount can grow quite large with models, GPU kernel caching, etc. During the development of this project, the ./cache directory has grown to consume ~250G of disk space.
Running
To download Hugging Face models, you need a Hugging Face token. See https://huggingface.co/settings/tokens for information on obtaining a token.
Edit .env to add the following:
HF_ACCESS_TOKEN=<access token from huggingface>
HF_HOME=/root/.cache
HF_HOME is set so that, when running in the containers, Hugging Face caches to a volume-mounted directory, which allows downloaded models to persist across container restarts.
NOTE: Models downloaded by most examples will be placed in the ./cache directory, which is bind mounted to the container.
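To verify the token itself is valid, you can check it with the huggingface_hub client from inside the backstory container. Reading HF_ACCESS_TOKEN mirrors the .env entry above; how Backstory consumes that variable is up to the server code:

import os
from huggingface_hub import whoami

# Validate the token configured in .env (HF_ACCESS_TOKEN) against the Hugging Face API.
token = os.environ["HF_ACCESS_TOKEN"]
print(whoami(token=token)["name"])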
Backstory
If you just want to run the pre-built environment, you can run:
docker compose up -d
That will launch all the required containers. Once loaded, the following ports are exposed:
Container: backstory
- 8911 - http for the chat server. If you want https (recommended), you should use an nginx reverse proxy to provide this endpoint. See WEB_PORT in src/server.py and the docker-compose ports under the backstory service. This port is safe to be exposed to the Internet if you want to expose this from your own service.
- 3000 - During interactive development of the frontend, the React server can be found at this port. By default, static content is served through port 8911. Do not expose this port to the Internet.
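As a quick check that the chat server is up (assuming the default port 8911 and that the static frontend is served at the root path), the following can be run from the host:

import requests

# Fetch the static content served by the backstory container on port 8911.
resp = requests.get("http://localhost:8911/", timeout=10)
print(resp.status_code, resp.headers.get("content-type"))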
Container: jupyter
- 8888 - Jupyter Notebook. You can access this port for a Jupyter notebook running on top of the backstory base container.
- 60673 - This allows you to connect to Gradio apps from outside the container, provided you launch Gradio on port 60673: .launch(server_name="0.0.0.0", server_port=60673)
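For example, a minimal Gradio app launched from the jupyter container could look like the following; the echo handler is a placeholder, and the .launch() arguments are what make it reachable on port 60673:

import gradio as gr

def echo(message):
    # Placeholder handler; Backstory's real agent logic is not shown here.
    return f"You said: {message}"

# Bind to all interfaces on port 60673 so the app is reachable from outside the container.
gr.Interface(fn=echo, inputs="text", outputs="text").launch(
    server_name="0.0.0.0", server_port=60673
)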
Container: ollama
- 11434 - ollama server port. This should not be exposed to the Internet. You can use it via curl/wget locally. The backstory and jupyter containers are on the same Docker network, so they do not need this port exposed if you don't want it. See the docker-compose.yml ports under ollama.
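From inside the backstory or jupyter containers, reach Ollama by its Docker Compose service name rather than localhost. For example, listing the pulled models via Ollama's /api/tags endpoint:

import requests

# Within the Compose network, the service name "ollama" resolves to the Ollama container.
resp = requests.get("http://ollama:11434/api/tags", timeout=10)
for model in resp.json()["models"]:
    print(model["name"])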
Once the above is running, to launch the backstory shell interactively:
docker compose exec -it backstory shell
Jupyter
docker compose up jupyter -d
The default port for inbound connections is 8888 (see docker-compose.yml). $(pwd)/jupyter is bind mounted to /opt/jupyter in the container, which is where notebooks will be saved by default.
To access the Jupyter notebook, go to https://localhost:8888/jupyter.
Monitoring
You can run ze-monitor within the launched containers to monitor GPU usage.
docker compose exec backstory ze-monitor --list
Container 5317c503e771 devices:
Device 1: 8086:A780 (Intel(R) UHD Graphics 770)
Device 2: 8086:E20B (Intel(R) Graphics [0xe20b])
To monitor a device:
docker compose exec backstory ze-monitor --device 2