Understanding Embeddings for Archon and more

Hello everyone,

I’m making steady progress on Archon, my AI agent that builds AI agents! :rocket:

My goal is to develop a tool that helps both in understanding the process of building AI agents and in improving my own agent over time.

My Use Case

I want to create a clean Supabase knowledge database containing all the documentation I need (e.g., Pydantic AI, LangGraph, n8n).

Over time, different versions of Archon will have access to this structured and well-maintained knowledge base.

Key Questions

:one: Embeddings Model Choice:

• Should I always use the same model to embed my documentation knowledge database?

• Is it necessary to use a model from the same company to both embed and query the content?

:two: Model-Specific Considerations:

• Each model has its own vector size (e.g., 1536 for OpenAI's text-embedding-3-small).

• How do I find the vector size of a specific model? I checked the Hugging Face model card for nomic-embed-text, but I couldn’t find this information. Any tips?

:three: Long-Term Strategy:

• What’s the best approach to building a knowledge database that can evolve over time?

• I’ve seen Cole switch between models (e.g., nomic-embed-text vs. text-embedding-3-small from OpenAI). How should I think about embedding model updates?

Would love to hear your insights! Thanks in advance for your help. :pray:

Here is the answer from ChatGPT (GPT-4o):
:one: Embeddings Model Choice

Should I always use the same model to embed my documentation knowledge database?

It’s not mandatory to use the same model indefinitely, but switching models requires recomputing all embeddings for consistency.

• If you change the embeddings model, you’ll need to recompute and replace all stored vectors.

• Using one model to embed your documents and a different model to embed your queries puts the vectors in incompatible vector spaces, so similarity search breaks down (and if the dimensions differ, it won't run at all).

Is it necessary to use a model from the same company to both embed and query the content?

No, the company doesn't matter, but you should use the exact same model for both embedding and querying a given index; even closely related models produce vectors that aren't directly comparable.

• Different models have different vector dimensions and interpret semantic meaning differently.

• If you embed your documents with an OpenAI model and your queries with a Nomic model, the similarity scores become meaningless: the two vector spaces don't align, and the dimensions usually don't match either.
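
A minimal sketch of what "same model on both sides" looks like in practice (the model name here is just an example; use whichever embedder you standardize on):

from sentence_transformers import SentenceTransformer, util

# Example model only; the point is that documents and queries go through the SAME embedder.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "Pydantic AI is a Python framework for building AI agents.",
    "n8n is a workflow automation tool.",
]
doc_vectors = model.encode(docs)                                   # embed the knowledge base
query_vector = model.encode("How do I build an agent in Python?")  # embed the query with the same model

# Cosine similarity is only meaningful when both vectors come from the same vector space.
print(util.cos_sim(query_vector, doc_vectors))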

:two: Model-Specific Considerations

How do I find the vector size of a specific model?

For OpenAI models, the vector size is documented (e.g., 1536 for text-embedding-3-small).
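
For example, with the OpenAI Python client you can confirm the size directly (a minimal sketch, assuming the openai package v1+ and an OPENAI_API_KEY in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(model="text-embedding-3-small", input="hello world")
print(len(response.data[0].embedding))  # 1536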

For Hugging Face models like nomic-embed-text, try:

  1. Looking for a dimension field in the config.json of the model repository (key names vary by architecture: hidden_size, d_model, n_embd, dim); see the sketch after this list.

  2. Running:

from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1.5 ships custom model code, so trust_remote_code=True is required
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
print(model.get_sentence_embedding_dimension())  # prints the vector size (768 for this model)

  3. Checking Hugging Face forums or documentation.
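
Here's a minimal sketch of option 1, assuming the huggingface_hub package is installed; key names differ between architectures, so it checks a few common ones:

import json
from huggingface_hub import hf_hub_download

# Download the model's config.json from the Hub and look for common dimension keys.
path = hf_hub_download(repo_id="nomic-ai/nomic-embed-text-v1.5", filename="config.json")
with open(path) as f:
    config = json.load(f)

for key in ("hidden_size", "d_model", "n_embd", "dim", "word_embedding_dimension"):
    if key in config:
        print(key, "=", config[key])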

:three: Long-Term Strategy

What’s the best approach to building a knowledge database that can evolve over time?

To ensure scalability and flexibility, consider:

Versioning your embeddings → Keep track of the model version used to generate them.

Storing raw text + metadata → So you can easily re-embed content if needed.

Choosing a widely-used embeddings model → Minimizes the risk of migration issues later.

Monitoring model updates → If a newer model offers better performance, test its impact before fully switching.
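
Concretely, the versioning and "raw text + metadata" points can be as simple as each stored chunk carrying its source text plus the embedding model name and dimension, so re-embedding later is mechanical. A minimal sketch (the field names are just an illustration, not a required schema):

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Chunk:
    content: str                # raw text, kept so it can always be re-embedded
    source_url: str             # where the documentation came from (Pydantic AI, LangGraph, n8n, ...)
    embedding: list[float]      # the vector itself
    embedding_model: str        # e.g. "nomic-embed-text-v1.5" or "text-embedding-3-small"
    embedding_dim: int          # 768, 1536, ...
    embedded_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())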

How should I think about embedding model updates?

• If your database is small, re-embedding is feasible whenever you change models.

• For large-scale databases, consider a phased migration, where you compare results before a full switch.

• Some systems allow hybrid search (combining embeddings + keyword-based retrieval), which can help if migrating embeddings is impractical.
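
As a rough sketch of a phased migration (fetch_chunks_embedded_with and store_new_embedding are hypothetical placeholders for however you read and write your Supabase rows):

from sentence_transformers import SentenceTransformer

def migrate_embeddings(old_model_name: str, new_model_name: str, batch_size: int = 64) -> None:
    """Re-embed every chunk that was embedded with old_model_name using new_model_name."""
    new_model = SentenceTransformer(new_model_name)
    chunks = fetch_chunks_embedded_with(old_model_name)   # placeholder for your storage layer
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = new_model.encode([chunk["content"] for chunk in batch])
        for chunk, vector in zip(batch, vectors):
            # Write alongside the old vectors so results can be compared before switching over.
            store_new_embedding(chunk["id"], vector.tolist(), model=new_model_name)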

Would love to hear your thoughts! :blush:

Here are my solutions and progress so far; YMMV.

Apparently you can sometimes use one model to search another model's embeddings, but it's not recommended. When I embed chunks, I add a column recording which model was used to embed them.

I tried padding the smaller vectors with zeros, but ran into numerous issues. I now have four columns for this in my knowledge graph: a 768-dimension vector plus the model used to embed it, and a 1536-dimension vector plus the model used to embed it. Then I just point whatever tool I'm playing with at the particular column it needs.

While it's space-inefficient, adding columns didn't make the data much larger, and it's relatively easy to build a simple tool to re-vectorize the data because the chunks are already split and summarized.
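
For what it's worth, the "point the tool at the right column" part can be a tiny lookup table; the column names here are hypothetical, matching the 768/1536 split above:

# Hypothetical column names for the two-vector setup described above.
EMBEDDING_COLUMNS = {
    "nomic-embed-text": ("embedding_768", 768),
    "text-embedding-3-small": ("embedding_1536", 1536),
}

def column_for(model_name: str, vector: list[float]) -> str:
    """Pick the vector column for a model and sanity-check the dimension before writing."""
    column, expected_dim = EMBEDDING_COLUMNS[model_name]
    if len(vector) != expected_dim:
        raise ValueError(f"{model_name} should produce {expected_dim}-d vectors, got {len(vector)}")
    return column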

Finally, I just stick with nomic-embed-text at 768 dimensions for most of what I do. I can run it free on local hardware, and oddly enough it runs on my CPU while my GPU handles llama3.1 (which can use tools; just bump up the context window size).
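
In case it helps, here's the shape of a single embedding call against a local Ollama server (just a sketch; adjust for however you actually serve the model):

import requests

# Assumes a local Ollama server on its default port and `ollama pull nomic-embed-text` already done.
response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Archon builds AI agents."},
)
vector = response.json()["embedding"]
print(len(vector))  # 768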

Good luck, and zap me a note if you need code to re-vectorize your data.

LOL, I compared my answer to @renaud's. Use his :rofl:

Thanks so much for your detailed explanation! As someone just starting with RAG after watching Cole’s videos, your insights are incredibly valuable.

If I understand correctly, your technique would result in a structure with 5 columns:

  1. Original text chunks
  2. Metadata for the 768 vector model
  3. The 768-dimension embeddings themselves
  4. Metadata for the 1536 vector model
  5. The 1536-dimension embeddings themselves

This also means I’d need to implement a separate similarity search function for each embedding type, which makes perfect sense.
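
Something like this is what I picture for the per-column search, with one helper parameterized by the model/column pair (a rough sketch; fetch_rows is a hypothetical placeholder for however the rows come out of Supabase):

import numpy as np
from sentence_transformers import SentenceTransformer

def search(query: str, model_name: str, embedding_column: str, top_k: int = 5):
    """Embed the query with the given model and rank rows by cosine similarity on the matching column."""
    model = SentenceTransformer(model_name)
    query_vec = model.encode(query)
    rows = fetch_rows(embedding_column)  # hypothetical: returns [{"content": ..., embedding_column: [...]}, ...]
    matrix = np.array([row[embedding_column] for row in rows])
    scores = matrix @ query_vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(scores)[::-1][:top_k]
    return [(rows[i]["content"], float(scores[i])) for i in top]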

I think I need to first decide whether I want to run things locally with Llama models or start my RAG with Claude 3.7 Sonnet, which seems ideal for Archon-like applications. Then I can choose my embedding model accordingly and implement my RAG system with that. If needed later, I can always vectorize my data again in additional columns for other models.

This approach seems especially smart if I want to compare performance between different embedding models on the same dataset!

Thanks again for the guidance - this is exactly the practical advice I needed to move forward with Archon!