Understanding Embeddings for Archon and more

Hello everyone,

I’m making steady progress on Archon, my AI agent that builds AI agents! :rocket:

My goal is to develop a tool that helps both in understanding the process of building AI agents and in improving my own agent over time.

My Use Case

I want to create a clean Supabase knowledge database containing all the documentation I need (e.g., Pydantic AI, LangGraph, n8n).

Over time, different versions of Archon will have access to this structured and well-maintained knowledge base.

Key Questions

:one: Embeddings Model Choice:

• Should I always use the same model to embed my documentation knowledge database?

• Is it necessary to use a model from the same company to both embed and query the content?

:two: Model-Specific Considerations:

• Each model has its own vector size (e.g., 1536 for OpenAI's text-embedding-3-small).

• How do I find the vector size of a specific model? I checked the Hugging Face model card for nomic-embed-text, but I couldn’t find this information. Any tips?

:three: Long-Term Strategy:

• What’s the best approach to building a knowledge database that can evolve over time?

• I’ve seen Cole switch between models (e.g., nomic-embed-text vs. text-embedding-3-small from OpenAI). How should I think about embedding model updates?

Would love to hear your insights! Thanks in advance for your help. :pray:

Here is the answer from ChatGPT (GPT-4o):
:one: Embeddings Model Choice

Should I always use the same model to embed my documentation knowledge database?

It’s not mandatory to use the same model indefinitely, but switching models requires recomputing all embeddings for consistency.

• If you change the embeddings model, you’ll need to recompute and replace all stored vectors.

• Using one model to embed your documents and a different model to embed your queries puts the vectors in incompatible vector spaces, so similarity search breaks down (and if the dimensions differ, it won't run at all).

Is it necessary to use a model from the same company to both embed and query the content?

No, the company doesn't matter, but you should use the exact same model for both embedding and querying a given index; even closely related models produce vectors that aren't directly comparable.

• Different models have different vector dimensions and interpret semantic meaning differently.

• If you embed your documents with an OpenAI model and your queries with a Nomic model, the similarity scores become meaningless: the two vector spaces don't align, and the dimensions usually don't match either.
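
A minimal sketch of what "same model on both sides" looks like in practice (the model name here is just an example; use whichever embedder you standardize on):

from sentence_transformers import SentenceTransformer, util

# Example model only; the point is that documents and queries go through the SAME embedder.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "Pydantic AI is a Python framework for building AI agents.",
    "n8n is a workflow automation tool.",
]
doc_vectors = model.encode(docs)                                   # embed the knowledge base
query_vector = model.encode("How do I build an agent in Python?")  # embed the query with the same model

# Cosine similarity is only meaningful when both vectors come from the same vector space.
print(util.cos_sim(query_vector, doc_vectors))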

:two: Model-Specific Considerations

How do I find the vector size of a specific model?

For OpenAI models, the vector size is documented (e.g., 1536 for text-embedding-3-small).
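
For example, with the OpenAI Python client you can confirm the size directly (a minimal sketch, assuming the openai package v1+ and an OPENAI_API_KEY in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(model="text-embedding-3-small", input="hello world")
print(len(response.data[0].embedding))  # 1536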

For Hugging Face models like nomic-embed-text, try:

  1. Looking for a dimension field in the config.json of the model repository (key names vary by architecture: hidden_size, d_model, n_embd, dim); see the sketch after this list.

  2. Running:

from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1.5 ships custom model code, so trust_remote_code=True is required
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
print(model.get_sentence_embedding_dimension())  # prints the vector size (768 for this model)

  3. Checking Hugging Face forums or documentation.
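
Here's a minimal sketch of option 1, assuming the huggingface_hub package is installed; key names differ between architectures, so it checks a few common ones:

import json
from huggingface_hub import hf_hub_download

# Download the model's config.json from the Hub and look for common dimension keys.
path = hf_hub_download(repo_id="nomic-ai/nomic-embed-text-v1.5", filename="config.json")
with open(path) as f:
    config = json.load(f)

for key in ("hidden_size", "d_model", "n_embd", "dim", "word_embedding_dimension"):
    if key in config:
        print(key, "=", config[key])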

:three: Long-Term Strategy

What’s the best approach to building a knowledge database that can evolve over time?

To ensure scalability and flexibility, consider:

Versioning your embeddings → Keep track of the model version used to generate them.

Storing raw text + metadata → So you can easily re-embed content if needed.

Choosing a widely-used embeddings model → Minimizes the risk of migration issues later.

Monitoring model updates → If a newer model offers better performance, test its impact before fully switching.
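
Concretely, the versioning and "raw text + metadata" points can be as simple as each stored chunk carrying its source text plus the embedding model name and dimension, so re-embedding later is mechanical. A minimal sketch (the field names are just an illustration, not a required schema):

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Chunk:
    content: str                # raw text, kept so it can always be re-embedded
    source_url: str             # where the documentation came from (Pydantic AI, LangGraph, n8n, ...)
    embedding: list[float]      # the vector itself
    embedding_model: str        # e.g. "nomic-embed-text-v1.5" or "text-embedding-3-small"
    embedding_dim: int          # 768, 1536, ...
    embedded_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())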

How should I think about embedding model updates?

• If your database is small, re-embedding is feasible whenever you change models.

• For large-scale databases, consider a phased migration, where you compare results before a full switch.

• Some systems allow hybrid search (combining embeddings + keyword-based retrieval), which can help if migrating embeddings is impractical.
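
As a rough sketch of a phased migration (fetch_chunks_embedded_with and store_new_embedding are hypothetical placeholders for however you read and write your Supabase rows):

from sentence_transformers import SentenceTransformer

def migrate_embeddings(old_model_name: str, new_model_name: str, batch_size: int = 64) -> None:
    """Re-embed every chunk that was embedded with old_model_name using new_model_name."""
    new_model = SentenceTransformer(new_model_name)
    chunks = fetch_chunks_embedded_with(old_model_name)   # placeholder for your storage layer
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = new_model.encode([chunk["content"] for chunk in batch])
        for chunk, vector in zip(batch, vectors):
            # Write alongside the old vectors so results can be compared before switching over.
            store_new_embedding(chunk["id"], vector.tolist(), model=new_model_name)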

Would love to hear your thoughts! :blush:

Here are my solutions and progress so far; YMMV.

Apparently you can sometimes use one model to search another model's embeddings, but it's not recommended. When I embed chunks, I add a column recording which model was used to embed them.

I tried padding the smaller vectors with zeros, but ran into numerous issues. I now have four columns for this in my knowledge graph: a 768-dimension vector plus the model used to embed it, and a 1536-dimension vector plus the model used to embed it. Then I just point whatever tool I'm playing with at the particular column it needs.

While it's space-inefficient, adding columns didn't make the data much larger, and it's relatively easy to build a simple tool to re-vectorize the data because the chunks are already split and summarized.
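
For what it's worth, the "point the tool at the right column" part can be a tiny lookup table; the column names here are hypothetical, matching the 768/1536 split above:

# Hypothetical column names for the two-vector setup described above.
EMBEDDING_COLUMNS = {
    "nomic-embed-text": ("embedding_768", 768),
    "text-embedding-3-small": ("embedding_1536", 1536),
}

def column_for(model_name: str, vector: list[float]) -> str:
    """Pick the vector column for a model and sanity-check the dimension before writing."""
    column, expected_dim = EMBEDDING_COLUMNS[model_name]
    if len(vector) != expected_dim:
        raise ValueError(f"{model_name} should produce {expected_dim}-d vectors, got {len(vector)}")
    return column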

Finally, I just stick with nomic-embed-text at 768 dimensions for most of what I do. I can run it free on local hardware, and oddly enough it runs on my CPU while my GPU handles llama3.1 (which can use tools; just bump up the context window size).
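
In case it helps, here's the shape of a single embedding call against a local Ollama server (just a sketch; adjust for however you actually serve the model):

import requests

# Assumes a local Ollama server on its default port and `ollama pull nomic-embed-text` already done.
response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Archon builds AI agents."},
)
vector = response.json()["embedding"]
print(len(vector))  # 768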

Good luck, and zap me a note if you need code to re-vectorize your data.

LOL, I compared my answer to @renaud's. Use his :rofl:

Thanks so much for your detailed explanation! As someone just starting with RAG after watching Cole’s videos, your insights are incredibly valuable.

If I understand correctly, your technique would result in a structure with 5 columns:

  1. Original text chunks
  2. Metadata for the 768 vector model
  3. The 768-dimension embeddings themselves
  4. Metadata for the 1536 vector model
  5. The 1536-dimension embeddings themselves

This also means I’d need to implement a separate similarity search function for each embedding type, which makes perfect sense.
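
Something like this is what I picture for the per-column search, with one helper parameterized by the model/column pair (a rough sketch; fetch_rows is a hypothetical placeholder for however the rows come out of Supabase):

import numpy as np
from sentence_transformers import SentenceTransformer

def search(query: str, model_name: str, embedding_column: str, top_k: int = 5):
    """Embed the query with the given model and rank rows by cosine similarity on the matching column."""
    model = SentenceTransformer(model_name)
    query_vec = model.encode(query)
    rows = fetch_rows(embedding_column)  # hypothetical: returns [{"content": ..., embedding_column: [...]}, ...]
    matrix = np.array([row[embedding_column] for row in rows])
    scores = matrix @ query_vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(scores)[::-1][:top_k]
    return [(rows[i]["content"], float(scores[i])) for i in top]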

I think I need to first decide whether I want to run things locally with Llama models or start my RAG with Claude 3.7 Sonnet, which seems ideal for Archon-like applications. Then I can choose my embedding model accordingly and implement my RAG system with that. If needed later, I can always vectorize my data again in additional columns for other models.

This approach seems especially smart if I want to compare performance between different embedding models on the same dataset!

Thanks again for the guidance - this is exactly the practical advice I needed to move forward with Archon!