Newb questions involving attachment file size and GPU limitations

I have had moderate success with Bolt.diy, but as a newbie I have problems that seem to trace back to these two questions:

  1. I would prefer to work locally using my 3060 12 GB GPU (speed is not a big issue; this is a hobby). What models should I be using? Are 8 GB models my top end?

  2. I have a 500 KB .txt file created by OCR from PDFs. The OCR formatting is terrible, and I was hoping an LLM could clean it up. I have even chopped the file into 80 KB chunks, but Bolt.diy with any model either just spits it back out or errors out. I'm not sure whether this is a context/token problem or just a beginner problem.

I do realize these are not necessarily Bolt.diy-centric questions.

Any help appreciated.
Signed “Old Amiga Guy”… Yup - that old.

If you don’t mind, I kind of already answered this in another post… this would be my suggestion: Local LLM Configuration - #5 by aliasfox

But I’ll expand on the thought a little for your use case. Personally, I would use the Hugging Face Open LLM Leaderboard as a resource (filtered list below):

Rank Model Average IFEval BBH MATH GPQA MUSR MMLU-Pro CO2 Cost
13 rombodawg/Rombos-LLM-V2.5-Qwen-32b 44.57% 68.27% 58.26% 41.99% 19.57% 24.73% 54.62% 17.91 kg
46 sometimesanotion/Lamarck-14B-v0.7-rc4 41.22% 72.11% 49.85% 36.86% 18.57% 21.07% 48.89% 1.92 kg
361 prithivMLmods/QwQ-LCoT-14B-Conversational 33.17% 40.47% 45.63% 31.42% 13.31% 20.62% 47.54% 1.95 kg
708 prithivMLmods/QwQ-LCoT2-7B-Instruct 28.57% 55.61% 34.37% 22.21% 6.38% 15.75% 37.13% 1.37 kg
1524 bunnycore/QwQen-3B-LCoT 22.11% 60.25% 28.50% 0.91% 2.24% 10.76% 29.99% 0.73 kg

Installation Commands using Ollama:
ollama run hf.co/bartowski/Lamarck-14B-v0.7-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT2-7B-Instruct-GGUF:Q4_K_M
ollama run hf.co/mradermacher/QwQ-LCoT-3B-Instruct-GGUF:Q4_K_M

I would be curious to see whether your system can run Rombos-LLM-V2.5-Qwen-32b using 2-bit quantization (10.4 GB), because it’s the best 32B model on the charts, though you may see lower-quality output due to the aggressive quantization:
ollama run hf.co/mradermacher/Rombos-LLM-V2.5-Qwen-32b-i1-GGUF:IQ2_S

Notes: The list is sorted “best” (but also largest/slowest) first, limited to models under ~9 GB. You can’t really run a model >= 32B at Ollama’s default 4-bit quantization (Q4_K_M), because it needs ~20 GB of VRAM, but technically you can run one at 2-bit quantization. Four-bit quantization generally offers about a 75% size reduction with relatively little accuracy loss, while 2-bit quantization may cost considerably more accuracy; since a higher parameter count generally gives better accuracy, though, it might be the right tradeoff and is worth trying.
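To put rough numbers on that tradeoff, here is a quick back-of-the-envelope Python sketch (my own rule of thumb, not an official formula): weight memory is roughly the parameter count times bits per weight divided by 8, plus a couple of GB for the KV cache and runtime buffers.

def approx_vram_gb(params_billion: float, bits: int, overhead_gb: float = 2.0) -> float:
    # Weights: billions of parameters * bits per weight / 8 bits per byte ~= GB of weights.
    weights_gb = params_billion * bits / 8
    return weights_gb + overhead_gb  # overhead_gb is a guess for KV cache and buffers

for params, bits in [(32, 4), (32, 2), (14, 4), (7, 4)]:
    print(f"{params}B @ {bits}-bit: ~{approx_vram_gb(params, bits):.1f} GB")
# Roughly: 32B @ 4-bit ~18 GB (close to the ~20 GB above), 32B @ 2-bit ~10 GB
# (about the 10.4 GB IQ2_S download), 14B @ 4-bit ~9 GB (tight but doable on a
# 12 GB 3060), 7B @ 4-bit ~5.5 GB.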

Hope that helps!

P.S. You can also increase the context size of your local LLMs in the .env.local file (copied from .env.example). If you choose a small enough model and leave some headroom, you can crank it up to DEFAULT_NUM_CTX=32768. Disregard the estimated memory usage; it’s inaccurate because it doesn’t take the parameter size into account (a 32K context should really only add ~1-2 GB of overhead).
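On your second question: an 80 KB chunk is roughly 20,000 tokens (at ~4 characters per token), which blows past most default context windows, so I would combine much smaller chunks with a larger num_ctx. Below is a minimal Python sketch of that idea, calling Ollama’s REST API directly instead of going through Bolt.diy; it assumes Ollama is running locally on its default port (11434), and the model name, file names, chunk size, and prompt are just placeholders to adapt.

import json
import urllib.request

MODEL = "hf.co/mradermacher/QwQ-LCoT2-7B-Instruct-GGUF:Q4_K_M"  # any model pulled above
CHUNK_CHARS = 8000  # ~2,000 tokens; keep well under the context window

def clean_chunk(text: str) -> str:
    # Ask the local model to reformat one chunk of OCR text.
    payload = {
        "model": MODEL,
        "prompt": "Clean up the formatting of this OCR text. Fix broken line "
                  "wraps and obvious OCR errors, but keep the wording:\n\n" + text,
        "stream": False,
        "options": {"num_ctx": 8192},  # room for the input plus the cleaned output
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

with open("ocr_dump.txt", encoding="utf-8", errors="ignore") as f:
    raw = f.read()

chunks = [raw[i:i + CHUNK_CHARS] for i in range(0, len(raw), CHUNK_CHARS)]
with open("ocr_cleaned.txt", "w", encoding="utf-8") as out:
    for chunk in chunks:
        out.write(clean_chunk(chunk) + "\n")

Driving the API directly for a batch job like this also keeps the tool’s own system prompt out of the context window, and makes it easy to experiment with chunk size until a given model stops choking.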

2 Likes

Thank you for the reply. I will go over the points mentioned above and the LLM guide you kindly added.

1 Like