# RAG and Fine Tuning

Two common practices for augmenting a model with new information:

1. In-context learning
2. Fine tuning

## In-context learning

In-context learning is great for dynamic data. Rather than retrain the model, you include relevant contextual information in the prompt to the model. The approach to collecting that context information is generally:

1. Store the content in a context retrieval system
2. Based on the user's query, determine which content in #1 is most relevant
3. Augment the user's query with that context information

PROS: Easy to implement.

CONS: Retrieving the correct context can be difficult, and without the context, the model won't have the additional information and will revert to its internal knowledge.

If you only have one or two documents (totalling a few thousand tokens, well under the model's maximum input), you can include the full text of the documents in the prompt and skip the context mapping.

## Fine tuning

Fine tuning is great if you have static content, as applying the knowledge to the model requires the model to go through a training phase, which can take several hours or longer depending on the amount of information being updated. The approach to fine-tune is:

1. Process your data into a series of 'context', 'query', 'response' correlations
2. Train the model using those correlations

The main work involved is in #1, and the success of fine tuning will be greatly impacted by the method used to perform it. As having an expert manually generate queries and responses can be time consuming, the "prompt adjustment" of the first method can be used. For this, you can iteratively perform the following:

### Query generation

1. Context data
2. Prompt: Given the context, create a list of questions about the topic. Do not provide answers.

### Response generation

1. Context data
2. Prompt: Given the context, respond to the following question: {query}

Repeat the above for each piece of context data. This is now your expert system, which you can use to fine-tune your model.

PROS: Not too difficult to implement, and can give much better responses to queries about items covered in the context.

CONS: Updating the model with changes in the context requires retraining, and training takes a lot of system resources. With the Intel Arc B580, I was not able to fine-tune a full 7B parameter model and had to use a smaller 1.5B parameter model. Training on the alpaca-clean dataset takes 3-5 hours.

To generate the correlations, the full 7B model was used in order to get better questions and answers. That data was then used to fine-tune the 1.5B model. As inference using deepseek-r1 can take 15 seconds or so per query (using the 7B model) and queries on neuralchat-7b are nearly instantaneous, I will explore creating correlations using deepseek, and then use that data to train neuralchat.

# Approach taken in resume-bot

I tried several techniques and have collected example output:

1. In-context via pre-embedding context tokens (ollama TRAINING)
2. In-context via full context in the query
3. In-context via relevant text (traditional RAG)
4. Fine-tune

# Torch vs Ollama

Ollama is easy to set up, and it performs well. However, it does not expose a method for fine-tuning a model beyond the TRAINING template, which does not adjust model weights and is more akin to in-context training.

torch is a little more difficult to set up, and it too performs well. With the vast collection of libraries and infrastructure available, fine-tuning using torch is relatively straightforward.
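As a reference point, here is a minimal sketch of what weight-level fine-tuning with torch can look like, using Hugging Face `transformers`, `datasets`, and `peft` (LoRA). The base model name, dataset file name, and hyperparameters are illustrative assumptions rather than the exact resume-bot configuration, and the ipex-llm / XPU-specific setup for the Arc B580 is omitted.

```python
# Sketch: LoRA fine-tuning on 'context'/'query'/'response' correlations.
# Model name, file name, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "Qwen/Qwen2.5-1.5B-Instruct"   # assumption: a ~1.5B causal LM that fits in 12GB
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the model with a small LoRA adapter so only a few million weights are trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Expect a JSONL file of {"context": ..., "query": ..., "response": ...} correlations.
dataset = load_dataset("json", data_files="correlations.jsonl", split="train")

def to_tokens(example):
    # Flatten each correlation into a single training string.
    text = (f"Context: {example['context']}\n"
            f"Question: {example['query']}\n"
            f"Answer: {example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

dataset = dataset.map(to_tokens, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="resume-bot-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=3),
    train_dataset=dataset,
    # mlm=False gives standard causal-LM labels (labels == input_ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("resume-bot-lora")
```

The LoRA adapter keeps memory use low enough for a 12GB card; the resulting adapter can be merged into the base weights and then converted for use with ollama or loaded directly with torch.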
Once you have a fine-tuned model, you can use that model with ollama or torch.

I have run the resume-bot using both torch and ollama on an Intel Core i9-14900KS with 64GB of RAM and an Intel Arc B580 GPU with 12GB of VRAM. Below are some metrics gathered while running several query passes:

|                    | ollama-ipex-llm | pytorch w/ ipex-llm |
|:-------------------|:----------------|:--------------------|
| Query time         |                 |                     |
| Concurrent queries |                 |                     |

### How Ollama Uses the TRAINING Section

The `TRAINING` section in an Ollama Modelfile works differently than traditional fine-tuning methods. Here's how Ollama uses it:

1. **Not True Parameter Fine-tuning**:
   - Unlike traditional fine-tuning, which updates model weights through backpropagation, Ollama doesn't modify the underlying model parameters
   - The examples in `TRAINING` don't trigger a training loop or gradient updates

2. **Template-Based Learning**:
   - Ollama uses these examples as additional context when the model is created
   - The examples effectively become part of the model's "knowledge"
   - This is more like instruction-tuning through examples than actual parameter updates

3. **Implementation Details**:
   - The examples are processed during model creation
   - They're tokenized and stored alongside the model
   - When running inference, Ollama doesn't directly include these examples in every prompt
   - Instead, the model is influenced by having processed these examples during creation

4. **Technical Mechanism**:
   - The exact implementation varies by model architecture
   - For many models, Ollama prepends these examples during the model creation process
   - This shapes the model's understanding without modifying weights
   - It's similar to how system prompts work, but applied at model creation time

5. **Limitations**:
   - The effectiveness depends on the base model's capability
   - It works best for teaching patterns and preferred response styles
   - It's less effective for teaching new facts or complex reasoning
   - The number of examples is limited by the context window size

### Practical Considerations

- Use concise, high-quality examples that demonstrate the exact behavior you want
- Focus on patterns rather than specific facts
- Include diverse examples covering different aspects of the desired behavior
- For best results, combine with well-crafted system prompts
- Remember that this isn't true fine-tuning - it's more like "example-based conditioning"

This approach works well for adapting model style and format, but for more substantial changes to model behavior, traditional fine-tuning frameworks that update weights (like those in Hugging Face's ecosystem) would be more effective.
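For completeness, here is a sketch of the correlation-generation step described earlier: driving a larger model through the query-generation and response-generation prompts via the ollama Python client, and writing out a JSONL file that a fine-tuning run (like the sketch above) can consume. The model tag, context chunks, output file name, and the naive question-list parsing are all placeholder assumptions, not the exact resume-bot code.

```python
# Sketch: generating 'context'/'query'/'response' correlations with ollama.
import json
import ollama

# Placeholder context chunks; in practice these come from your source documents.
contexts = ["...resume section 1...", "...resume section 2..."]

GENERATOR = "deepseek-r1:7b"  # assumption: the larger model used to create the dataset

def ask(model, prompt):
    # ollama.chat returns the reply under message.content
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

with open("correlations.jsonl", "w") as out:
    for context in contexts:
        # Query generation: ask for questions about the context, without answers.
        questions = ask(GENERATOR,
                        "Given the context, create a list of questions about the topic. "
                        "Do not provide answers.\n\nContext:\n" + context)
        for line in questions.splitlines():
            query = line.strip("-*0123456789. ").strip()   # naive list-item cleanup
            if not query:
                continue
            # Response generation: answer each question using the same context.
            response = ask(GENERATOR,
                           f"Given the context, respond to the following question: {query}"
                           "\n\nContext:\n" + context)
            out.write(json.dumps({"context": context, "query": query,
                                  "response": response}) + "\n")
```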