How to run qwen2.5-coder:7b faster with bolt.diy

I am using the new bolt.diy with pnpm and it is slow at generating. Is the context limit small, and if so, how do I increase it for Ollama qwen2.5-coder:7b?

Hi @keshav0479,
welcome to the community.

You can configure it for bolt within your .env.local. An example is in .env.example:
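If your .env.example looks like mine, the relevant variables for Ollama are the base URL and the default context size (names below assume a reasonably recent bolt.diy, so double-check against your own .env.example; adjust the value to what your VRAM can handle):

```
# .env.local — point bolt.diy at your local Ollama instance
OLLAMA_API_BASE_URL=http://127.0.0.1:11434

# Context window (in tokens) that bolt.diy requests from Ollama.
# Bigger values need more VRAM; on an 8GB card something like 8192
# is much safer than 32768.
DEFAULT_NUM_CTX=8192
```

After changing it, restart the dev server so the new values get picked up.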

If it is already slow when you try it directly in the terminal, without bolt, then I guess your hardware is too weak.

Provide some more information about your setup, then we can see.

Thanks for the info. I have an RTX 3070 Ti laptop GPU and a 12th-gen i7 with 16GB of DDR4 RAM. What would be the best configuration to get the best performance with bolt.diy, and which model should I use?

I downloaded Qwen 2.5, but when I run bolt.diy it doesn't show up in the drop-down box next to Ollama. Did you ever have this issue?

Make sure the Ollama server is running. If it’s not, you can start it with ollama serve, but I believe the default behavior is to run automatically.

Check that it’s up and running by visiting the URL http://localhost:11434 (Ollama’s default port).
It should say “Ollama is running.”

Bolt.diy should detect Ollama automatically with no other changes, but maybe you’ll want to restart the development server if you made any changes.
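If you prefer the terminal over the browser, a quick check looks something like this (assuming Ollama is on its default port):

```
# Should print "Ollama is running"
curl http://localhost:11434

# List the models Ollama has pulled — only these can show up in bolt.diy
ollama list

# If qwen2.5-coder isn't listed, pull it first
ollama pull qwen2.5-coder:7b
```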

I don’t think it makes sense to run local models with those hardware specs, as your 8GB of video memory is way too low to get reasonable speed from anything that works well with bolt. You can run 1.5B or 3B parameter models, I guess, but as I said, bolt will not work very well with them and you will not get the expected results.

I would recommend using an external provider, e.g. OpenRouter with the Qwen-Coder 32B model, which is very cheap to use. Or you could use the Google API, which is free at the moment.
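If you go that route, the keys also go into .env.local. The variable names below are the ones from the .env.example I have (double-check against yours), and the values are placeholders, not real keys:

```
# .env.local — external providers instead of local Ollama
OPEN_ROUTER_API_KEY=sk-or-...            # placeholder
GOOGLE_GENERATIVE_AI_API_KEY=AIza...     # placeholder
```

Then pick OpenRouter (or Google) as the provider in bolt.diy’s drop-down and select the Qwen 2.5 Coder 32B model from the list.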

Thanks for the info. I am new to working with LLMs, so I didn’t know; I think it would be better to use an external provider. One more query: whenever I put my OpenAI API key in there, it doesn’t work, and the bottom right corner says “unexpected error”. I only have the free plan; can’t it work like the free version of GPT-3.5 Turbo with limits? The same goes for Anthropic’s Claude: its API key gives the same error. When I put the API key into the Cursor editor, it also said to make sure I have tokens.

I think you could get by with a quantized model under 8B parameters. I tried QwQ-LCoT-7B-Instruct-GGUF and it was surprisingly good, had decent performance, and only used 1.4GB of RAM.
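If you want to run a GGUF like that under Ollama so bolt.diy can see it, the usual pattern is a small Modelfile; the .gguf filename below is just an example, so point it at whichever quantized file you actually downloaded:

```
# Modelfile — import a local GGUF into Ollama
FROM ./QwQ-LCoT-7B-Instruct.Q4_K_M.gguf   # example filename
PARAMETER num_ctx 8192
```

Then `ollama create qwq-lcot-7b -f Modelfile`, and it should show up in bolt.diy’s model list like any other Ollama model.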

Using a quantized version with a lower bit rate generally results in a loss of quality. The lower the bit rate, the faster the model typically becomes. Also, not all models perform the same. Don’t settle for the first model you obtain. Two models quantized from the same base model, even if they share the same name but are quantized by different users, may perform differently. Therefore, shop around, try many, and you may be surprised or disappointed.
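A quick way to compare quantizations is to pull two tags of the same model and look at Ollama’s timing output; the tags below should exist in the Ollama library as far as I can tell, but check what’s actually available:

```
# Two quantizations of the same base model
ollama pull qwen2.5-coder:7b-instruct-q4_K_M
ollama pull qwen2.5-coder:7b-instruct-q8_0

# --verbose prints the eval rate (tokens/s) after each response,
# so you can compare speed and judge output quality side by side
ollama run qwen2.5-coder:7b-instruct-q4_K_M --verbose
ollama run qwen2.5-coder:7b-instruct-q8_0 --verbose
```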

Thanks for the info, I will surely test it.

Yup, I will surely experiment with different versions to check which one suits best.

I saw some benchmarks where Qwen and Llama 3.3 lost only < 5% accuracy at 4-bit quantization, with a huge reduction in resource usage and improved speed. So I think for running local models, it would be better to get the best quantized version you can run on your hardware. You’d see better performance out of higher parameter counts and lose very little to quantization.

But of course your mileage will vary; it depends on the model, fine-tuning, etc.

Use Ollama through Google Colab.
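Rough idea, in case it helps: install and serve Ollama inside the Colab runtime, then expose it through a tunnel so bolt.diy on your machine can reach it. A minimal sketch (the tunnel tool, e.g. ngrok or cloudflared, is up to you and not shown):

```
# In Colab cells (GPU runtime) — sketch only, details vary
!curl -fsSL https://ollama.com/install.sh | sh
!nohup env OLLAMA_HOST=0.0.0.0 ollama serve > ollama.log 2>&1 &
!ollama pull qwen2.5-coder:7b
# Expose port 11434 with a tunnel and put the public URL into
# OLLAMA_API_BASE_URL in bolt.diy's .env.local
```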
