Hey everyone,
First of all, I want to say that of all the big AI YouTubers, Cole really is on another level in my opinion. Not to be a bootlicker, I just really felt like saying that.
I am a complete novice at this, so be gentle; I would really appreciate any feedback and guidance. My plan is to combine Cole's core project with my add-on feature for my company.
https://github.com/AbelCoplet/llama-cag-n8N
I tried using the local AI RAG template, but its core design does not fit my purpose, and then I came across the research paper “Don't Do RAG”. Not really a great name, but it describes another way to handle memory management, so I dove into the rabbit hole head first. There are explanations in my workflows. My purpose is to deploy a production-ready system combining both approaches. If necessary I can explain my business case, but in short, the system has to handle queries with precision against data that is limited in size, such as a company handbook or manual.
Combining that with RAG, I want to attempt a functioning, production-ready system. As you probably know, the error rate of any overly complex system, especially one involving AI agents, is currently too high for real-world production use, so I try to streamline as much as possible, with an AI agent handling only the initial query.
In this project I explore various memory-management possibilities for local AI, focusing on CAG, and how I plan to combine the two for optimal query processing.
Note: I focus on production readiness and tangible results rather than useless buzzwords like AI agents that are designed to remind you to poop.
I’m sharing my project llama-cag-n8n - a complete implementation of Context-Augmented Generation (CAG) for document processing using large context window models.
The purpose is to have a reliable LLM that can query a fixed dataset with precision. My personal use case is interacting with company handbooks and manuals to get precise answers to queries.
My work is inspired by the “Don't Do RAG” paper, which discusses CAG.
What is CAG? (There is a fuller explanation on my GitHub.)
The TL;DR:
- Instead of chunking documents into tiny pieces like traditional RAG, this system lets models process entire documents at once (up to 128K tokens)
- It creates “memory snapshots” (KV caches) of documents that can be instantly loaded for future queries (see the minimal sketch after this list)
- Much faster responses with deeper document understanding
- Works offline, Mac-compatible, and integrates easily via n8n
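To make the “memory snapshot” idea concrete, here is a minimal sketch of the pattern using llama-cpp-python. This is not the actual code from my repo (the scripts and n8n workflows there do more); the model path, context size, and file names are placeholders, and the state object is simply pickled to disk here for convenience. The idea is: the model reads the whole document once, and its internal state is saved.

```python
# Minimal "memory snapshot" sketch with llama-cpp-python (illustrative only;
# model path, context size, and file names are placeholders, not the repo's scripts).
import pickle
from llama_cpp import Llama

# A long-context model and an n_ctx big enough for the whole document;
# a 128K context needs a capable model and plenty of RAM.
llm = Llama(model_path="models/my-long-context-model.gguf", n_ctx=131072)

# Read the entire document once, populating the KV cache.
document = open("handbook.txt", encoding="utf-8").read()
llm.eval(llm.tokenize(document.encode("utf-8")))

# Save the model's internal state after reading the document; this snapshot
# can be reloaded later so queries skip the expensive document pass.
with open("handbook.kvcache", "wb") as f:
    pickle.dump(llm.save_state(), f)
```

The query side of this pattern is sketched further down under Streamlined Retrieval.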
Complementary to RAG
The Document Processing Sauce
The core is the way it handles document preprocessing. I'm using Claude 3.7 Sonnet's 200K input window and its ability to output up to 128K tokens to create optimized documents. This sidesteps the need for complex chunking strategies: Claude handles redundancy removal and optimization while preserving all critical information.
It is always possible to replace this with a more involved OCR and chunking workflow, but that is not a priority for me if I can get away with a simpler solution for now.
The workflow:
- Claude processes the document to fit within the 128K output window
- The optimized document is sent directly to the KV cache creation process
- The model’s internal state after reading the document is saved as a KV cache
This approach is only possible because Claude can now produce outputs as large as the 128K-token budget targeted for the KV cache, which is what keeps the design “simple”.
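As a rough illustration of that preprocessing step (hedged: the model alias, prompt, and file names below are placeholders, not the prompt from my n8n workflow), the Claude call is conceptually one big “compress this document but keep every fact” request:

```python
# Sketch of the document-optimization step via the Anthropic SDK (placeholder
# prompt, model alias, and file names; the real prompt lives in the n8n workflow).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
raw_document = open("handbook_raw.txt", encoding="utf-8").read()

# Streaming is used because very large outputs take a while; pushing the output
# all the way to 128K tokens currently also needs Anthropic's extended-output beta.
with client.messages.stream(
    model="claude-3-7-sonnet-latest",
    max_tokens=64000,
    messages=[{
        "role": "user",
        "content": (
            "Rewrite the following document so it fits within the target token "
            "budget. Remove redundancy and boilerplate, but preserve every policy, "
            "number, and definition exactly, and keep the section structure.\n\n"
            + raw_document
        ),
    }],
) as stream:
    optimized = stream.get_final_message().content[0].text

with open("handbook_optimized.txt", "w", encoding="utf-8") as f:
    f.write(optimized)
```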
Streamlined Retrieval
For document querying, I’ve implemented a direct approach using the CAG bridge component that loads the pre-computed KV caches. This gives responses in seconds rather than the much longer time needed to reprocess documents.
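Conceptually, the bridge does the inverse of the snapshot step sketched earlier (again a hedged sketch, not the repo's actual bridge code; file names and the sample question are made up): load the saved state, feed only the question tokens, and decode, so the long document pass never happens at query time.

```python
# Sketch of the query side (illustrative only; the real system wraps this in a
# bridge service that n8n calls).
import pickle
from llama_cpp import Llama

llm = Llama(model_path="models/my-long-context-model.gguf", n_ctx=131072)

# Restore the pre-computed snapshot instead of re-reading the document.
with open("handbook.kvcache", "rb") as f:
    llm.load_state(pickle.load(f))

# Only the question is evaluated now, which is why answers come back in seconds.
question = "\n\nQuestion: How many vacation days do new employees get?\nAnswer:"
llm.eval(llm.tokenize(question.encode("utf-8"), add_bos=False))

# Simple greedy decode of a short answer.
answer = []
for _ in range(256):
    token = llm.sample(temp=0.0)
    if token == llm.token_eos():
        break
    answer.append(token)
    llm.eval([token])

print(llm.detokenize(answer).decode("utf-8", errors="ignore"))
```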
While the primary focus is on CAG, the system is designed to work alongside traditional RAG when needed:
- CAG provides deep understanding of specific documents
- RAG can be used for broader knowledge when appropriate
- An intelligent agent can choose the best approach per query
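As a toy sketch of what that per-query choice could look like (the routing rule and helper names are made up for illustration; my actual routing lives in the n8n agent logic), a thin router sends handbook-style questions to the CAG path and open-ended ones to RAG:

```python
# Toy CAG/RAG router (illustrative only; keyword rule and helpers are placeholders).
from typing import Callable

HANDBOOK_KEYWORDS = {"policy", "handbook", "manual", "procedure", "vacation"}

def route_query(
    query: str,
    ask_cag: Callable[[str], str],  # e.g. the KV-cache bridge sketched above
    ask_rag: Callable[[str], str],  # e.g. a vector-store retrieval chain
) -> str:
    """Send precise, handbook-style questions to CAG; everything else to RAG."""
    words = {w.strip("?.,!").lower() for w in query.split()}
    if words & HANDBOOK_KEYWORDS:
        return ask_cag(query)
    return ask_rag(query)

# Example: route_query("What is the vacation policy?", ask_cag, ask_rag)
```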
The template is not optimized, and there are various ways to take this to production.
Why This Matters
In my opinion, the 128K context window is a game-changer for document processing with small local LLMs. Instead of having models try to understand fragmented chunks, they can comprehend entire documents at once, maintaining awareness across sections and providing more coherent answers.
All the code is available in my GitHub repo with step-by-step setup instructions. There are surely smaller mistakes here and there, so I am still debugging.
Please understand this is exploratory in nature, so there are bound to be mistakes and oversights.
What started as a purely CAG-focused project became more of an exploration and a template for selecting and combining appropriate memory-management techniques from the variations that today's technology enables.
I would appreciate any feedback, especially on the n8n workflows. I have included detailed explanations.
Again, I barely know what I am doing, so things are shaky, but I hope I am at least on the right track?
EDIT:
If anyone is interested, I am studying scenarios and applications for this: as a reliable chatbot interface; as a loop that checks RAG output for hallucinations using metadata (local RAG is very much imperfect), a RAG-enhancement system of sorts, if you will; or as a standalone solution for, like I said, company handbooks, manuals, and/or very precise and rigid datasets that can be made compact enough to fit into the cache. I understand there are industrial-grade solutions out there, but those are out of reach, and I found no actionable explanations ready for local deployment today. This is a non-exhaustive overview of the scope.