Llama3.3 running slowly

I hope it’s ok to post a non-bolt.diy question.

I am trying to use llama3.3, and whether I use the CLI or Open WebUI it is very slow. My system specs are below. Does anyone know if there is anything I can do to speed it up? Other models like Mistral work fine.

Intel(R) Core™ Ultra 7 265KF, 3900 MHz, 20 Core(s), 20 Logical Processor(s)
64 GB of RAM
Nvidia GeForce RTX 4080 Super
Windows 11

Hi Dale,
Which exact Mistral models are you comparing it to? I'm asking because llama3.3 is a 70B model, which needs a lot of GPU power to run properly. I think the RTX 4080 falls well short of that.

If I ask ChatGPT, I get this:

And 20 tokens/s is already not that much. I think someone here mentioned that to run bolt smoothly we would need >40 t/s.
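If you want a hard number instead of a guess, Ollama itself can report generation speed. A minimal check, assuming you run the model from the CLI (the figure in the comment is just illustrative, not a measurement from your machine):

  ollama run llama3.3 --verbose
  # after the response, Ollama prints timing stats, e.g.:
  #   eval rate:    4.2 tokens/s

That makes it easy to compare llama3.3 against Mistral on the same hardware.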

Thanks @leex279 ! I am looking at some other models now. llama3:8b runs well. I'm trying something a little larger and seeing where the sweet spot is.


I assume you are using Ollama to run Llama3.3 70b.
Since the VRAM of the Nvidia GeForce RTX 4080 Super is 16GB and Llama3.3 70b is 43GB in size, the whole model does not fit in the card's VRAM; the rest has to run from system RAM on the CPU, and that is why it is slow. You can either use a quantized version of Llama 3.3 like 70b-instruct-q2_K [ollama run llama3.3:70b-instruct-q2_K], which is 26GB in size, or use another LLM such as the following (see the note after the list for a quick way to check the GPU/CPU split):

  1. qwq 32b (20GB)
  • ollama run qwq
  2. qwen2.5 32b (20GB)
  • ollama run qwen2.5:32b
  3. qwen2.5-coder 32b (20GB)
  • ollama run qwen2.5-coder:32b
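A quick way to verify whether a model actually fits in VRAM is to check how Ollama split it between GPU and CPU while it is loaded. A rough sketch (the column values here are illustrative, not real output from your machine):

  ollama ps
  # NAME             SIZE    PROCESSOR         UNTIL
  # llama3.3:latest  47 GB   65%/35% CPU/GPU   4 minutes from now

A model that fits entirely shows "100% GPU"; any CPU share means layers were offloaded to system RAM, which is what makes generation slow.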

In addition to @123jigme123's suggestion: check out the other qwen models:

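For a 16GB card you could also try the smaller qwen2.5-coder tags. The download sizes below are approximate figures from the Ollama library, so treat them as rough guidance:

  ollama run qwen2.5-coder:14b    # ~9 GB
  ollama run qwen2.5-coder:7b     # ~4.7 GB

These should fit fully in the 4080 Super's VRAM and run much faster than the 32b/70b models.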


Thanks for this, llama3.3:70b-instruct-q2_K is significantly faster than the full Llama3.3 70b but still too slow for me. I will check around and review all the information you have provided! This is great! Thank you!
