If you don’t mind, I kind of already answered this in another post… this would be my suggestion: Local LLM Configuration - #5 by aliasfox
But I’ll expand the thought a little for your use case. Personally, I would use the Hugging Face Open LLM Leaderboard as a resource (filtered list below):
Rank | Model | Average | IFEval | BBH | MATH | GPQA | MUSR | MMLU-Pro | CO2 Cost |
---|---|---|---|---|---|---|---|---|---|
13 | rombodawg/Rombos-LLM-V2.5-Qwen-32b | 44.57% | 68.27% | 58.26% | 41.99% | 19.57% | 24.73% | 54.62% | 17.91 kg |
46 | sometimesanotion/Lamarck-14B-v0.7-rc4 | 41.22% | 72.11% | 49.85% | 36.86% | 18.57% | 21.07% | 48.89% | 1.92 kg |
361 | prithivMLmods/QwQ-LCoT-14B-Conversational | 33.17% | 40.47% | 45.63% | 31.42% | 13.31% | 20.62% | 47.54% | 1.95 kg |
708 | prithivMLmods/QwQ-LCoT2-7B-Instruct | 28.57% | 55.61% | 34.37% | 22.21% | 6.38% | 15.75% | 37.13% | 1.37 kg |
1524 | bunnycore/QwQen-3B-LCoT | 22.11% | 60.25% | 28.50% | 0.91% | 2.24% | 10.76% | 29.99% | 0.73 kg |
Installation Commands using Ollama:
ollama run hf.co/bartowski/Lamarck-14B-v0.7-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT2-7B-Instruct-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT-3B-Instruct-GGUF:Q4_K_M
I’d be curious to see whether your system can run Rombos-LLM-V2.5-Qwen-32b with 2-bit quantization (10.4 GB). It’s the best-ranked 32B model on the chart, but you may see lower-quality output due to the aggressive quantization:
ollama run hf.co/mradermacher/Rombos-LLM-V2.5-Qwen-32b-i1-GGUF:IQ2_S
Notes: The list is sorted “best” first (which also means largest/slowest), limited to models under ~9GB. You can’t really run a model of 32B parameters or more, because at 4-bit quantization (Q4_K_M, the Ollama default) it needs roughly 20GB of VRAM. Technically you can run one at 2-bit quantization instead: 4-bit quantization already cuts memory by roughly 75% versus FP16 with relatively little accuracy loss, while 2-bit quantization can cost noticeably more accuracy. Still, since a higher parameter count generally means better quality, a heavily quantized 32B model might just be the right tradeoff, and it’s worth trying.
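As a rough sanity check on those sizes (a back-of-envelope estimate only, assuming roughly 4.5 bits per weight for Q4_K_M and roughly 2.5 bits per weight for IQ2_S, and ignoring KV-cache/context overhead):
32B params × ~4.5 bits ÷ 8 ≈ 18 GB of weights, plus a few GB of overhead ≈ the ~20 GB figure above
32B params × ~2.5 bits ÷ 8 ≈ 10 GB of weights, which lines up with the ~10.4 GB IQ2_S download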
Hope that helps!
P.S. You can also increase the context size of your local LLMs in the .env.local file (copied from .env.example). If you choose a small enough model and leave some headroom, you can crank it up to DEFAULT_NUM_CTX=32768. And disregard the estimated memory usage; it’s inaccurate because it doesn’t take the parameter size into account (a 32K context should really only add ~1–2GB of RAM overhead).
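For reference, a minimal sketch of that change (assuming a standard setup where .env.local sits in the project root; your file’s other contents will differ):
cp .env.example .env.local
# then in .env.local, raise the context window (leave some VRAM headroom for the model itself)
DEFAULT_NUM_CTX=32768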