Here are a few things I ran into trying to get this to work with Ollama:
The check is_ollama = "localhost" in base_url.lower() really screws things up, because accessing Ollama from outside Docker requires the base URL to be http://host.docker.internal:11434/v1. Instead of looking for localhost, maybe look for "11434", or just let us choose our provider from a dropdown.
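A minimal sketch of what a more forgiving check could look like (the is_ollama_url helper and the optional provider argument are hypothetical, not the current Archon code):

    from urllib.parse import urlparse

    # Hypothetical helper -- not the actual Archon check. Treats the base URL as
    # Ollama if the user explicitly picked "ollama" as the provider, or if the URL
    # points at the default Ollama port or a local/Docker-internal host.
    def is_ollama_url(base_url: str, provider: str | None = None) -> bool:
        if provider and provider.lower() == "ollama":
            return True
        parsed = urlparse(base_url.lower())
        return parsed.port == 11434 or parsed.hostname in (
            "localhost", "127.0.0.1", "host.docker.internal",
        )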
For me, using Ollama locally is great for saving on costs while sacrificing speed. I don't have a blazing-fast PC, so when I tried to add the Pydantic AI documentation to Supabase, I let it run for an hour and it was just a mess. I think this task should be separate, and we should be able to select a different model and provider for it. Why are we forced to use the same base for everything?
Revamping the environment variable setup is the next thing on my list. I'm going to make it so you can select your provider first and then provide your base URL, which will solve #1. And then I also want to separate the embedding model from the LLMs, like you're saying for #2!
If you're going to do that, can I make an additional suggestion: when you pick a provider and enter URLs and models for that provider, save those URLs and models per provider so we can quickly switch back and forth without having to re-enter them each time. That would let us switch providers on the fly depending on our needs.
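A rough sketch of how per-provider settings could be persisted (the providers.json file and the field names are made up for illustration, not part of Archon):

    import json
    from pathlib import Path

    CONFIG_PATH = Path("providers.json")  # hypothetical location for saved settings

    def save_provider(name: str, base_url: str, llm_model: str, embedding_model: str) -> None:
        # Load any previously saved providers, update this one, and write back.
        configs = json.loads(CONFIG_PATH.read_text()) if CONFIG_PATH.exists() else {}
        configs[name] = {
            "base_url": base_url,
            "llm_model": llm_model,
            "embedding_model": embedding_model,
        }
        CONFIG_PATH.write_text(json.dumps(configs, indent=2))

    def load_provider(name: str) -> dict:
        # Returns the last-saved settings for a provider, e.g. load_provider("ollama").
        return json.loads(CONFIG_PATH.read_text())[name]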
All I'm trying to do is add the Pydantic AI documentation to Supabase. It shows the crawling and the processing, but never adds anything to the database. Here are my settings:
@ColeMedin So far I can finally get chunks to save to the database, but titles and summaries are not generating; it writes "Error processing title" and "Error processing summary" to those fields instead. Here is the error: 2025-03-16 13:19:50 Error getting title and summary: 'NoneType' object is not subscriptable
Is there a way to incorporate better error handling? If it's hitting an error like this, it shouldn't keep running for the next several minutes inserting garbage into the database. Or, at the very least, there should be an Abort button or an easy way to cancel the run. I had to shut the entire container down to stop it.
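For what it's worth, a hedged sketch of the kind of guard that could turn that 'NoneType' error into a hard failure instead of bad rows, assuming an OpenAI-style response object and a prompt that asks for a JSON object with title and summary keys (the function name is illustrative, not Archon's actual code):

    import json

    def extract_title_and_summary(response) -> dict:
        # A failed call often comes back as None, which is what produces
        # "'NoneType' object is not subscriptable" when it gets indexed.
        if response is None or not getattr(response, "choices", None):
            # Fail loudly instead of writing "Error processing title" placeholders.
            raise RuntimeError("LLM returned no usable response for title/summary")
        content = response.choices[0].message.content
        if not content:
            raise RuntimeError("LLM response had empty content for title/summary")
        return json.loads(content)  # expected shape: {"title": ..., "summary": ...}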
Also, what can we add if we want to slow it down to be on the safe side when it comes to possible rate limiting? For example, OpenRouter's documentation ("API Rate Limits | Configure Usage Limits in OpenRouter") says: "If you are using a free model variant (with an ID ending in :free), then you will be limited to 20 requests per minute and 200 requests per day."
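A minimal throttling sketch, assuming the crawler makes its LLM calls in a simple loop; the 20-requests-per-minute figure comes from the OpenRouter quote above, and throttled() is just a stand-in wrapper, not an existing Archon function:

    import time

    REQUESTS_PER_MINUTE = 20              # OpenRouter's stated cap for :free models
    MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE

    _last_call = 0.0

    def throttled(fn, *args, **kwargs):
        # Call fn, sleeping first if needed to stay under the per-minute budget.
        global _last_call
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        return fn(*args, **kwargs)

Wrapping each LLM call (title/summary generation, embeddings) in a helper like this would keep a run under the free-tier per-minute limit; the 200-requests-per-day cap would still need separate bookkeeping.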
@ColeMedin Here's another update. I think we need a community table or list where we can submit which models work and which don't.
For documentation crawling, Ollama's nomic-embed-text seems to work fine as long as it's paired with the "right" primary model.
I have not been able to get it to work with Ollama as the LLM Provider.
I have not been able to get it to work with free OpenRouter models, or even with many of the paid ones. For example, Gemini 2.0 Flash returns a bunch of these errors and the majority of the processes fail: Error processing https://ai.pydantic.dev/examples/question-graph/: list indices must be integers or slices, not str
The following settings have been the most successful so far:
Like I said, it would really be nice to start building a list of which models work best for each task so we're not needlessly wasting credits. I would really like to find less expensive models that work just as well as the ones above.
Your note about using host.docker.internal instead of localhost for Ollama in Archon is good, noted!
Still thinking through how I want to do error handling, but I agree it can be a lot better. I default to continuing the scrape even when the summaries fail, just because they aren't strictly required. Maybe we should crash the script instead, though.
I've never gotten those issues with OpenRouter models before, so I'll have to do some more testing! What error do you get when using Ollama as the primary LLM for the scraping? Is it the same as that list indices error with OpenRouter?
@ColeMedin The error I got with Ollama for both primary and embedding is: 2025-03-16 13:19:50 Error getting title and summary: 'NoneType' object is not subscriptable. Many chunks fail as well; only about 85 were successful, and a lot just timed out because the model I was using runs slowly on my PC.
I've pretty much given up on trying to use Ollama for the primary; I really don't have the hardware I need to run more capable models. After all these tests, OpenAI's 4o-mini only cost $0.04 to get the docs embedded. I wonder if it's worth the trouble to save 4 cents.
@ColeMedin Which Ollama models do you recommend? They have to support tool calling, so there are only a handful of options. How well did they summarize compared to OpenAI?
EDIT: Not sure if my tests are accurate, but it seems like any Ollama model that exceeds VRAM capacity ends up failing. I tested qwen2.5:14b, which is just shy of my VRAM limit and partially runs on CPU, and got the missing titles and summaries. However, a 12b model that fits within GPU memory works. It appears that when a model tries to load into VRAM and falls back to CPU, something gets lost in the process, causing the issue.
That said, I think I'm going to stick with OpenAI as my primary. Ollama just isn't consistent: with OpenAI, inserted document chunks consistently reach 450+; with llama3.1:8b, 385; with mistral-nemo:12b, 416 on the first run and 386 on the second.
Yeah, no matter the application or agent, you'll run into issues with Ollama without the right hardware. Makes sense to stick with OpenAI or OpenRouter instead!
If you do want to keep exploring local LLMs, my general recommendation for less powerful hardware is Qwen 2.5 Instruct 7B. The new Mistral 3.1 Small models are also looking really nice!
I have a pair of bridged A6000 Adas, and the Ollama config is pretty basic, but when I started this was NOT the config I found. Just an FYI for anyone with multiple GPUs: it took me a bit to sort out, so hopefully this helps. It's a bit belt-and-suspenders, but it works better now than when I started.
ollama-gpu:
  profiles: ["gpu-nvidia"]
  <<: *service-ollama
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
  environment:
    - NVIDIA_VISIBLE_DEVICES=0,1   # A6000 GPUs
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - OLLAMA_SCHED_SPREAD=1        # Spread the model across both GPUs
    - OLLAMA_NUM_GPU=2             # Explicitly set to 2 GPUs
    - OLLAMA_GPU_LAYERS=0          # 0 = all layers on the A6000 GPUs