Adding CAG to Local RAG by Cole

Hey everyone,

First of all, I want to say that of all the big AI YouTubers, Cole really is on another level in my opinion. Not to be a bootlicker, I just really felt like saying that.

I am a complete novice at this, so be gentle; I would really appreciate any feedback and guidance. My plan is to combine Cole's core project with my own add-on feature for my company.

https://github.com/AbelCoplet/llama-cag-n8N

I tried using the RAG template for local AI, but its core design does not fit my purpose, and then I came across the research paper "Don't Do RAG". Not really a great name, but it describes another way to handle memory management, so I dove into the rabbit hole head first. There are explanations in my workflows. My purpose is to deploy a production-ready system combining both approaches. If necessary I can explain my business case, but in short the system has to handle queries with precision against data that is limited in size, such as a company handbook or manual.

Combining that with RAG, I want to attempt a functioning production-ready system. As you probably know, the error rate of overly complex systems, especially those involving AI agents, is currently too high for real-world production use, so I try to streamline as much as possible, with the AI agent only handling the initial query.

In this project I explore various memory management possibilities for local AI, focusing on CAG, and how I think the two can be combined for optimal query processing.

Note: I focus on production readiness and tangible results rather than useless buzzwords like AI agents that are designed to remind you to poop.

I’m sharing my project llama-cag-n8n - a complete implementation of Context-Augmented Generation (CAG) for document processing using large context window models.

The purpose is to have a reliable LLM that communicates with a fixed dataset with precision. My personal use case is interacting with company handbooks and manuals to get precise answers to queries.

My work is inspired by the "Don't Do RAG" paper, which discusses CAG.

What is CAG? (I have an explanation in my GitHub repo.)

The TL;DR:

  • Instead of chunking documents into tiny pieces like traditional RAG, this system lets models process entire documents at once (up to 128K tokens)
  • It creates “memory snapshots” (KV caches) of documents that can be instantly loaded for future queries
  • Much faster responses with deeper document understanding
  • Works offline, Mac-compatible, and integrates easily via n8n
  • Complementary to RAG

The Document Processing Sauce

The core is the way it handles document preprocessing. I'm using Claude 3.7 Sonnet's 200K-token input window and its ability to output up to 128K tokens to create optimized documents. This sidesteps the need for complex chunking strategies: Claude handles redundancy removal and optimization while preserving all critical information.

It is always possible to replace this with a more involved OCR and chunking workflow, but that is not a priority for me if I can get away with a simpler solution for now.
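
For anyone curious what that preprocessing step could look like, here is a rough sketch using the Anthropic Python SDK. This is not the exact setup from my workflows: the model ID, the extended-output beta header, the prompt wording, and the file names are all assumptions for illustration.

```python
# Rough sketch: ask Claude to compress a raw document so it fits the 128K-token
# output budget. Assumptions (not from the repo): the `anthropic` Python SDK,
# the claude-3-7-sonnet-20250219 model ID, the extended-output beta header,
# the prompt wording, and the file names.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def optimize_document(raw_text: str) -> str:
    """Have Claude rewrite the document to fit the KV cache budget."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=128000,  # extended output; needs the beta header below
        extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following document so it fits within 128K tokens. "
                "Remove redundancy, but preserve every fact, figure, and policy "
                "detail exactly:\n\n" + raw_text
            ),
        }],
    )
    return response.content[0].text

with open("handbook_raw.txt") as f:
    optimized = optimize_document(f.read())
with open("handbook_optimized.txt", "w") as f:
    f.write(optimized)
```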

The workflow:

  1. Claude processes the document to fit within the 128K output window
  2. The optimized document is sent directly to the KV cache creation process
  3. The model’s internal state after reading the document is saved as a KV cache

This approach is only possible because of Claude's recently added ability to produce outputs as large as the 128K-token KV cache target, which is what keeps the design "simple".
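
To make steps 2 and 3 above concrete, here is a rough sketch of KV cache creation using llama-cpp-python. The project itself drives llama.cpp through n8n and scripts, so treat this purely as an illustration of the idea; the model path, context size, and file names are made up.

```python
# Rough sketch: create a KV cache ("memory snapshot") of the optimized document
# with llama-cpp-python. Model path, context size, and file names are assumptions.
import pickle
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-3-4b-it-Q4_K_M.gguf",  # any GGUF model with a big context
    n_ctx=131072,                                    # 128K-token context window
    verbose=False,
)

with open("handbook_optimized.txt") as f:
    doc_tokens = llm.tokenize(f.read().encode("utf-8"))

llm.eval(doc_tokens)      # the expensive pass: the model "reads" the whole document once
state = llm.save_state()  # snapshot of the model's internal state (the KV cache)

with open("handbook.kvcache", "wb") as f:
    pickle.dump(state, f)  # pickling the state object is one simple way to persist it
```

For reference, the llama.cpp CLI has a --prompt-cache option built around the same idea.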

Streamlined Retrieval

For document querying, I’ve implemented a direct approach using the CAG bridge component that loads the pre-computed KV caches. This gives responses in seconds rather than the much longer time needed to reprocess documents.
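
For illustration, the retrieval side could look roughly like the sketch below, continuing the assumptions and file names from the sketch above. This is not the bridge's actual code.

```python
# Rough sketch: answer a query by restoring the pre-computed KV cache instead of
# re-reading the document. File names and model path continue from the sketch above.
import pickle
from llama_cpp import Llama

llm = Llama(model_path="models/gemma-3-4b-it-Q4_K_M.gguf", n_ctx=131072, verbose=False)

# Restore the "model has already read the handbook" state in one step.
with open("handbook.kvcache", "rb") as f:
    llm.load_state(pickle.load(f))

# Append only the question to the cached context and sample an answer.
question = "\n\nQuestion: How many vacation days do new employees get?\nAnswer:"
q_tokens = llm.tokenize(question.encode("utf-8"), add_bos=False)

answer = []
for tok in llm.generate(q_tokens, temp=0.1, reset=False):  # reset=False keeps the cache
    if tok == llm.token_eos() or len(answer) >= 256:
        break
    answer.append(tok)

print(llm.detokenize(answer).decode("utf-8", errors="ignore"))
```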

While the primary focus is on CAG, the system is designed to work alongside traditional RAG when needed:

  • CAG provides deep understanding of specific documents
  • RAG can be used for broader knowledge when appropriate
  • An intelligent agent can choose the best approach per query
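
As a toy illustration of that last point, the routing decision can be as simple as the sketch below. The cache registry, keywords, and routing rule are hypothetical placeholders; in my setup this decision lives in the n8n workflow.

```python
# Toy sketch of a CAG-vs-RAG router. The cache registry, keywords, and the
# routing rule are hypothetical placeholders, not the repo's actual logic.

CACHED_DOCS = {
    "handbook.kvcache": {"handbook", "vacation", "policy", "benefits"},
    "manual.kvcache": {"manual", "installation", "maintenance", "error"},
}

def route_query(query):
    """Return ("cag", cache_path) when a cached document clearly matches,
    otherwise ("rag", None) so the query falls back to vector retrieval."""
    words = set(query.lower().split())
    for cache_path, keywords in CACHED_DOCS.items():
        if words & keywords:
            return "cag", cache_path
    return "rag", None

print(route_query("How many vacation days do new employees get?"))
# -> ('cag', 'handbook.kvcache')
```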

The template is not optimised, and there are various ways to take this to production.

Why This Matters

In my opinion, the 128K context window is a game-changer for document processing with small local LLMs. Instead of having models try to understand fragmented chunks, they can comprehend entire documents at once, maintaining awareness across sections and providing more coherent answers.

All the code is available in my GitHub repo with step-by-step setup instructions. There are surely smaller mistakes here and there, so I am still debugging.

Please understand this is exploratory in nature, so there are bound to be mistakes and oversights.

From a purely CAG-focused project, it turned into more of an exploration and a template for me to select and combine appropriate memory management techniques among the known variations that today's technology enables.

I would appreciate any feedback, especially on the n8n workflows. I included detailed explanations.

Again, I barely know what I am doing, so things are shaky, but I hope I am at least on the right track?

EDIT:
If anyone is interested, I am studying scenarios and applications for this: as a reliable chatbot interface; as a loop that checks RAG output for hallucinations using metadata (because local RAG is very much imperfect), a RAG enhancement system of sorts if you will; or as a standalone solution for, like I said, company handbooks, manuals, and/or very precise and rigid datasets that can be made compact enough to fit into a cache. I understand there are industrial-grade solutions out there, but those are out of reach, and I found no actionable explanations ready for local deployment today. This is a non-exhaustive overview of the scope.


Took a look at this repo and this is some fantastic work, thank you so much for sharing! I’d love to dive into this next week when I have more time and give you some feedback. Love it!


The encouraging reply is much appreciated. Your content, for real, inspired me to give it a try. Two sweaty days later, here I am.

I hope you can forgive my lack of technical ability and see past it to the interesting concepts and the intricate relationships they reveal, which have potential for real-world application as of today.


Even more impressive since you don’t come from a technical background! Nice work

Updated Repo: CAG UI

This does not override my initial post; it is a sub-process: a llama.cpp wrapper and UI to handle KV cache push and retrieval.

Edit: Updated my post with the most up-to-date functioning repo.
I used a small context for testing, but this time it seems to work as intended. It is very barebones because I did not upgrade anything else, but it should work as a proof of concept, I think?
Some leftover bits and cleanup are still needed.

I also added a fallback strategy.

TL;DR:
I am not good at this at all, so any input is valuable.

Laying down the foundation for my CAG/RAG local document management

I wanted to share an experimental project I’ve been working on called LlamaCag UI.

The README should be fairly comprehensive; the second README has added content.

After a much-needed weekend, I spent one day trying to build a functioning UI for the KV cache push and retrieval. I think I kinda succeeded. It is super janky, because I have never coded in my life; literally my first project ever was last week, the one I shared in the post above. I have no idea what I'm doing, really.

I spent a few hours putting it together, and literally half a day chasing down one minor issue to get this working.

I included screenshots in the repo to show the UI.

I focused on BASIC functionality to get this working first. This is the first brick in the overall concept design I shared at the beginning of this post.

Some additional considerations and update plans can be found in "FIXES/Readme ".

I am a COMPLETE novice, so please bear with me. I tried my best, and I am only sharing because I was able to actually get a minimum viable prototype running, so it is a proof of concept.

It’s a desktop application that explores a potentially interesting approach to document Q&A using large-context-window models like Gemma 3 and Llama 3.
The UI lets you upload a document and chat with it.
Disclaimer: janky AF, but it kinda works.
I just pushed it to GitHub, so it is barely tested. To make sure the basics work, I recommend manually cleaning the temp and cache files after a restart, just in case.

The Concept: Context-Augmented Generation

The core idea is pretty simple but might be effective. Instead of using traditional RAG with chunk retrieval from a vector database, I’m testing whether we can just load entire documents directly into the LLM’s context window.
This approach is called Context-Augmented Generation (CAG). The theory is that by giving the model the complete document, we might get better contextual understanding than with retrieved snippets. With 128K-token context windows in newer models, this approach becomes increasingly feasible for many documents.

Current State - Early Testing Phase
I want to be completely upfront: this is very much a work in progress with plenty of bugs. I’ve done some basic testing and the core concept seems promising, but there’s still a lot to validate and fix.

The application can in theory:

  • Download local models (Gemma 3, Llama 3, etc.)

  • (The built-in downloader is not working properly, so manually pull Gemma 3 4B from Hugging Face; you need bartowski/google_gemma-3-4b-it-GGUF. See the snippet after this list for one way to do it.)

  • Process documents into “KV caches” (stored document contexts)

  • Let you chat with these documents using the model’s full context window

  • Manage different document caches
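
As mentioned in the list above, the built-in model download is flaky, so here is one way to pull the GGUF manually with huggingface_hub. The exact quantization filename and target directory are assumptions; check the bartowski/google_gemma-3-4b-it-GGUF repo for the variant that fits your RAM.

```python
# Rough sketch: pull the Gemma 3 4B GGUF manually with huggingface_hub.
# The quantization filename and target directory are assumptions; pick the
# variant that fits your RAM from the bartowski/google_gemma-3-4b-it-GGUF repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-4b-it-GGUF",
    filename="google_gemma-3-4b-it-Q4_K_M.gguf",  # assumed quant name
    local_dir="models",
)
print(f"Model saved to {path}")
```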

Everything runs locally using llama.cpp, so documents stay on your machine.
Potential Advantages (If It Works Well)
If this approach pans out after more testing, it could have some interesting benefits:

  • No need for a complex vector database setup (token-limit sensitive, of course)
  • Potentially better handling of document-wide context
  • Simpler follow-up questions since the whole document remains available
  • Less preprocessing complexity than chunking strategies

Technical Implementation
The app uses Python with PyQt5 for the UI and llama.cpp for inference. Document processing is deliberately simple: we check whether the document fits in the context window and load it directly if it does.
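
Roughly, that fit check amounts to something like the sketch below. This is not the app's actual code; the headroom margin, model path, and file name are arbitrary assumptions.

```python
# Rough sketch of the "does it fit?" check: count the document's tokens and
# compare against the model's context window, leaving headroom for the chat.
# The margin, model path, and file name are arbitrary assumptions.
from llama_cpp import Llama

CHAT_HEADROOM = 4096  # tokens reserved for questions and answers

def fits_in_context(llm: Llama, document: str) -> bool:
    n_doc_tokens = len(llm.tokenize(document.encode("utf-8")))
    return n_doc_tokens + CHAT_HEADROOM <= llm.n_ctx()

llm = Llama(model_path="models/gemma-3-4b-it-Q4_K_M.gguf", n_ctx=131072, verbose=False)
with open("handbook.txt") as f:
    doc = f.read()

if fits_in_context(llm, doc):
    print("Load the whole document into the context window (CAG).")
else:
    print("Document too large: split it or fall back to RAG.")
```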

Looking for Feedback
Since this is experimental, I’d really value feedback from anyone interested in trying it out. There are likely many bugs and edge cases I haven’t encountered yet, so please be patient if you decide to test it. This is way beyond my skill level, even with all the AI in the world; this is like my first week doing stuff myself, no background, so be gentle.

If you’re curious:

  1. The repo is at [GitHub link]
  2. It currently only works on Mac (sorry Windows users)
  3. You’ll need at least 16GB RAM, preferably 32GB for larger documents
  4. Be prepared for errors and unexpected behavior!

Early Development
I’m still testing the fundamental premise, and there’s a long “to-do” list for improvements:

  • Better error handling
  • Improved document processing for different formats
  • Support for multiple documents in one context
  • More robust cache management
  • Proper handling of documents exceeding context limits

Would love to hear thoughts on whether this approach makes sense or if there are fatal flaws I haven’t considered. Has anyone tried similar direct context loading approaches?

+++++++++++++++++++++++++++++++
EDIT:

The README explains the current implementation, the debugging steps taken, and the strategies.
