If you don’t mind, I kind of already answered this in another post… this would be my suggestion: Local LLM Configuration - #5 by aliasfox
But I’ll expand the thought a little for your use case. Personally, I would use the Hugging Face Open LLM Leaderboard as a resource (filtered list below):
Rank | Model | Average | IFEval | BBH | MATH | GPQA | MUSR | MMLU-Pro | CO2 Cost |
---|---|---|---|---|---|---|---|---|---|
13 | rombodawg/Rombos-LLM-V2.5-Qwen-32b | 44.57% | 68.27% | 58.26% | 41.99% | 19.57% | 24.73% | 54.62% | 17.91 kg |
46 | sometimesanotion/Lamarck-14B-v0.7-rc4 | 41.22% | 72.11% | 49.85% | 36.86% | 18.57% | 21.07% | 48.89% | 1.92 kg |
361 | prithivMLmods/QwQ-LCoT-14B-Conversational | 33.17% | 40.47% | 45.63% | 31.42% | 13.31% | 20.62% | 47.54% | 1.95 kg |
708 | prithivMLmods/QwQ-LCoT2-7B-Instruct | 28.57% | 55.61% | 34.37% | 22.21% | 6.38% | 15.75% | 37.13% | 1.37 kg |
1524 | bunnycore/QwQen-3B-LCoT | 22.11% | 60.25% | 28.50% | 0.91% | 2.24% | 10.76% | 29.99% | 0.73 kg |
Installation Commands using Ollama:
ollama run hf.co/bartowski/Lamarck-14B-v0.7-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT2-7B-Instruct-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT-3B-Instruct-GGUF:Q4_K_M
I’d be curious to see whether your system can run Rombos-LLM-V2.5-Qwen-32b with 2-bit quantization (10.4 GB). It’s the best-ranked 32B model on the chart, but you may see lower-quality output due to the aggressive quantization:
ollama run hf.co/mradermacher/Rombos-LLM-V2.5-Qwen-32b-i1-GGUF:IQ2_S
Notes: The list is sorted “best” first (which also means largest/slowest), limited to models under ~9GB. You can’t really run a model of 32B parameters or more, because at 4-bit quantization (Q4_K_M, the Ollama default) it needs roughly 20GB of VRAM. Technically you can run one at 2-bit quantization instead: 4-bit quantization already cuts memory by roughly 75% versus FP16 with relatively little accuracy loss, while 2-bit quantization can cost noticeably more accuracy. Still, since a higher parameter count generally means better quality, a heavily quantized 32B model might just be the right tradeoff, and it’s worth trying.
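As a rough sanity check on those sizes (a back-of-envelope estimate only, assuming roughly 4.5 bits per weight for Q4_K_M and roughly 2.5 bits per weight for IQ2_S, and ignoring KV-cache/context overhead):
32B params × ~4.5 bits ÷ 8 ≈ 18 GB of weights, plus a few GB of overhead ≈ the ~20 GB figure above
32B params × ~2.5 bits ÷ 8 ≈ 10 GB of weights, which lines up with the ~10.4 GB IQ2_S download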
Hope that helps!
P.S. You can also increase the context size of your local LLMs in the .env.local file (copied from .env.example). If you choose a small enough model and leave some headroom, you can crank it up to DEFAULT_NUM_CTX=32768. And disregard the estimated memory usage; it’s inaccurate because it doesn’t take the parameter size into account (a 32K context should really only add ~1–2GB of RAM overhead).
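For reference, a minimal sketch of that change (assuming a standard setup where .env.local sits in the project root; your file’s other contents will differ):
cp .env.example .env.local
# then in .env.local, raise the context window (leave some VRAM headroom for the model itself)
DEFAULT_NUM_CTX=32768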