Definitely a large investment (though I did get the 3090s used for $800 each, which brought the price down a lot), but it was well worth it to me because AI is my life, if you can’t tell from my videos.
Thanks @digitalchild! The main reason I’m using Windows is just because I use this desktop for a lot of other things as well besides LLM inference, so it isn’t just going to be an API server.
I am thinking about dual booting though, and having it run Linux when I do want it to just be an API server from time to time.
If you’re not gaming, I would put Windows in a VM on the Linux machine and have dual displays, with Windows full screen on the secondary. Then you get the power of both without the need to dual boot. This is how I ran my setup for many years before I switched to OS X, but I still run Windows natively on a secondary machine and in a VM on the Mac.
I’ve been playing with reasoning models as the “architect” in my aider setup. They seem to excel in that role rather than at raw coding ability: they can drive the conversation better than they can construct the code, if you get what I mean.
I had the same logic and got a secondhand 3090 too for that purpose. But I soon realised I wouldn’t be able to run Qwen Coder 2.5 32b with bolt efficiently… it takes aaaages. What are the two 3090s doing for you? More VRAM, sure, but is that enough for you to get good results?
Or am I just doing things wrong?
I know I can run the 7b easily, but to get results close to GPT-4o I would need the 32b. What would you advise?
And wanted to say a huge thank you for your work, it’s so awesome!
The two 3090s give me more VRAM to run larger models like 70b (a 70b model even at 4 bits per weight is still roughly 40GB of weights, so it won’t fit on one card). But having a model split across two GPUs does make inference take a lot longer, so I’ll only use 70b parameter models when I need a lot of power and don’t need a lot of speed.
A single 3090 is enough to run Qwen 2.5 Coder 32b fast! I say that with the Q4 (4 bits per weight) quantization in mind, though, because that’s the default version when you pull it through Ollama.
The full-precision model is massive and won’t fit on a single 3090, but the Q4 model works almost as well (barely a difference in output quality).
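If you want to try it yourself, here’s a minimal sketch of calling it through the Ollama Python client (assuming you’ve installed the `ollama` package and already pulled the model):

```python
# Minimal sketch: chat with the default Q4 build of Qwen 2.5 Coder 32b via Ollama.
# Assumes the `ollama` Python package is installed (pip install ollama) and the
# model has already been pulled (ollama pull qwen2.5-coder:32b).
import ollama

response = ollama.chat(
    model="qwen2.5-coder:32b",  # the default tag is the Q4 quantization
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
)

print(response["message"]["content"])
```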
Pardon my ignorance, but how much VRAM does the 4-bit quantized version of Qwen 2.5 Coder 32B use? And what quantization methods do you prefer/use? I’m assuming GGUF with Ollama?
I’m not totally sure how much VRAM it takes, but I believe it’s in the 16-20GB range. Definitely less than 24GB, since my second GPU stays at 0% usage when running Q4 32b parameter models.
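As a rough back-of-envelope estimate (the exact number depends on the quant variant and how much context you’re running):

```python
# Back-of-envelope VRAM estimate for a 4-bit quantized 32b model.
# These are assumed/approximate numbers, not measured values: the common Q4_K_M
# quant averages a bit more than 4 bits per weight, and the KV cache grows with
# context length.
params = 32.5e9          # approximate parameter count of Qwen 2.5 Coder 32b
bits_per_weight = 4.5    # assumed average for a Q4-style quant
overhead_gb = 1.5        # assumed KV cache + runtime overhead at a modest context

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights: ~{weights_gb:.1f} GB, total: ~{weights_gb + overhead_gb:.1f} GB")
# -> roughly 18-20 GB, which still fits on a single 24GB 3090
```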