I’m running the fork on my MacBook Pro with an M1 chip. I’m using the Ollama framework with the qwen2.5-coder model. It’s taking 30 minutes or so to respond back with one sentence. Am I doing something wrong, or is it just my laptop?
That is pretty slow!
I would try running Ollama directly in the terminal and seeing what kind of speeds you get there.
The command for that would be:
ollama run qwen2.5-coder:7b
If it’s still slow there, it’s probably because of your computer, unfortunately!
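If you want hard numbers to compare against what bolt.new is getting, Ollama can also print its own timing stats (if I remember right, the --verbose flag reports prompt eval and response rates in tokens per second after each reply):

ollama run qwen2.5-coder:7b --verbose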
Thanks for responding. I downgraded to llama2 but got the same result. 8GB of RAM might be my issue, but I’ve seen others make it work (not with bolt.new) in the terminal. I can’t seem to figure this one out.
llama3.2 seems to work well in the terminal but not with bolt.new-any-llm
What size of Llama 3.2 are you using? The smaller ones like 3B and 1B are usually not big enough to handle the Bolt.new prompt, unfortunately.
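If you’re not sure which tag you pulled, listing your local models should show it, and you can pull a specific size explicitly (the tag below is just an example of how Ollama names them):

ollama list
ollama pull llama3.2:3b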
This is my experience as well: once your model won’t fit fully into GPU RAM alongside your machine’s usual memory needs, it falls back to the CPU and response times nosedive. For reference, I’m on an M1 Max with 32GB, and the 7B-sized qwen models are the realistic limit of what it can support.
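One way to confirm that fallback (if I recall the output correctly) is to check what Ollama reports for the loaded model; the PROCESSOR column shows whether it’s sitting fully on the GPU, on the CPU, or split between them:

ollama ps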