Definitely a large investment (though I did get the 3090s used for $800 each, which brought the price down a lot), but it was well worth it to me because AI is my life, if you can’t tell from my videos.
Thanks @digitalchild! The main reason I’m using Windows is just because I use this desktop for a lot of other things as well besides LLM inference, so it isn’t just going to be an API server.
I am thinking about dual booting though, and having it run Linux when I do want it to just be an API server from time to time.
If you’re not gaming, I would put Windows in a VM on the Linux machine and have dual displays, with Windows full screen on the secondary. Then you get the power of both without the need to dual boot. This is how I ran my setup for many years before I switched to OS X, but I still run Windows natively on a secondary machine and in a VM on the Mac.
I’ve been playing with reasoning models as the “architect” in my aider setup. They seem to excel in that role rather than at raw coding ability: they can drive the conversation better than they can construct the code, if you get what I mean.
I had the same logic and got a secondhand 3090 too for that purpose. But I soon realised I wouldn’t be able to run Qwen Coder 2.5 32b with bolt efficiently… it takes aaaages. What are the two 3090s doing for you? More VRAM, sure, but is that enough for you to get good results?
Or am I just doing things wrong?
I know I can run the 7b easily, but to get results close to GPT-4o I would need the 32b. What would you advise?
And wanted to say a huge thank you for your work, it’s so awesome!
The two 3090s give me more VRAM to run larger models like 70b (a 70b model even at 4 bits per weight is still roughly 40GB of weights, so it won’t fit on one card). But having a model split across two GPUs does make inference take a lot longer, so I’ll only use 70b parameter models when I need a lot of power and don’t need a lot of speed.
A single 3090 is enough to run Qwen 2.5 Coder 32b fast! I say that with the Q4 (4 bits per weight) quantization in mind, though, because that’s the default version when you pull it through Ollama.
The full-precision model is massive and won’t fit on a single 3090, but the Q4 model works almost as well (barely a difference in output quality).
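If you want to try it yourself, here’s a minimal sketch of calling it through the Ollama Python client (assuming you’ve installed the `ollama` package and already pulled the model):

```python
# Minimal sketch: chat with the default Q4 build of Qwen 2.5 Coder 32b via Ollama.
# Assumes the `ollama` Python package is installed (pip install ollama) and the
# model has already been pulled (ollama pull qwen2.5-coder:32b).
import ollama

response = ollama.chat(
    model="qwen2.5-coder:32b",  # the default tag is the Q4 quantization
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
)

print(response["message"]["content"])
```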
Pardon my ignorance, but how much VRAM does the 4-bit quantized version of Qwen 2.5 Coder 32B use? And what quantization methods do you prefer/use? I’m assuming GGUF with Ollama?
I’m not totally sure how much VRAM it takes, but I believe it’s in the 16-20GB range. Definitely less than 24GB, since my second GPU stays at 0% usage when running Q4 32b parameter models.
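As a rough back-of-envelope estimate (the exact number depends on the quant variant and how much context you’re running):

```python
# Back-of-envelope VRAM estimate for a 4-bit quantized 32b model.
# These are assumed/approximate numbers, not measured values: the common Q4_K_M
# quant averages a bit more than 4 bits per weight, and the KV cache grows with
# context length.
params = 32.5e9          # approximate parameter count of Qwen 2.5 Coder 32b
bits_per_weight = 4.5    # assumed average for a Q4-style quant
overhead_gb = 1.5        # assumed KV cache + runtime overhead at a modest context

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights: ~{weights_gb:.1f} GB, total: ~{weights_gb + overhead_gb:.1f} GB")
# -> roughly 18-20 GB, which still fits on a single 24GB 3090
```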