Would love feedback and addons to make this better @ColeMedin
This is really cool @chrislassiter11! Nice work!
Definitely the kind of education people need around hardware requirements, guides for quantizing, etc.
What kind of feedback are you looking for?
Are there any models you use that are not on my list? Is there any other common knowledge that would be good to have on this list?
Made some crazy updates: https://quiet-kelpie-c3f571.netlify.app/
Hi @chrislassiter11, the link provided is invaluable, thank you for sharing. I was looking for exactly this, but after seeing the system requirements of the models, I was disappointed. There is only one model in the list that my laptop specs can support; could you please suggest more reliable models like it?
These are my laptop specs; please suggest some suitable models, preferably from Hugging Face or any other free source.
Windows 10:
Processor: Intel(R) Core™ i3-4005U CPU @ 1.70 GHz
RAM: 8.00 GB
64-bit operating system, x64-based processor
GPU: 2 GB
- Is having a GPU compulsory? Also, if I run a model that needs 2 GB of GPU memory, I can’t run another model at the same time, right? Since my GPU has only 2 GB, are there any alternatives?
- Please suggest the basic laptop requirements for running models.
@ColeMedin, could you also provide a solution for this? Also, please consider adding a feature to the bolt.diy roadmap that shows preferred specs, from basic to advanced device requirements for running different models; it would be very helpful.
Which GPU do you have with 2 GB VRAM? It might not be the fastest, but you can always run various quantized models, which are supported by Ollama and LMStudio; they should work on either GPU or CPU. You can also use more capable models through HuggingFace for free (Llama 3.3, Qwen Coder 32B, or 72B Instruct), and there’s ChatGPT-4o through Azure (GitHub) and Gemini Exp-1206 through Google (they are both pretty good).
But if running locally, I might suggest QwQ-LCoT-7B-Instruct-GGUF, which in my testing only used 1.4GB of RAM. Qwen-Coder-7B-Instruct may also be a good option or any quantized “Instruct” model < 8B Parameters (Llama, etc.).
And I know, it’s a lot to take in lol.
P.S. It might be pushing the limit, but there’s also a new quantized version of Phi-4 from Microsoft on HuggingFace, which I haven’t tried yet. It looks promising, though. But you’d have to run it locally, because there doesn’t seem to be an API endpoint available for it.
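To make the quantized-GGUF route a bit more concrete, here is a minimal Python sketch using llama-cpp-python, one of several ways to run a GGUF file on CPU or a small GPU. This is a different route from the Ollama/LMStudio apps mentioned above, and the model path is just a placeholder for whatever quantized Instruct model you download.

```python
# Minimal sketch: run a small quantized GGUF model on CPU (or partly on a 2 GB GPU).
# Assumes `pip install llama-cpp-python` and a downloaded Q4_K_M GGUF file;
# the path below is a placeholder, not something this thread provides.
from llama_cpp import Llama

llm = Llama(
    model_path="models/QwQ-LCoT-7B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,      # keep the context small to stay within 8 GB of system RAM
    n_gpu_layers=0,  # 0 = pure CPU; raise only if a few layers fit in 2 GB of VRAM
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```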
Yes, 2 GB VRAM. Amazing! I had lost hope of running models when I saw the spec requirements, but after you detailed them, it looks like this is feasible, hopefully. Thank you @aliasfox for the clarification.
I do have some doubts.
- Previously I installed Ollama models but it didn’t go well, so I will try this later as it needs more storage.
- The LMStudio website says a minimum of 16 GB of RAM is required, so that is not feasible for me either.
- As mentioned, this model [QwQ-LCoT-7B-Instruct-GGUF] used just 1.4 GB, but the file sizes listed add up to more than 20 GB (screenshot attached below). Could you please help me understand why there is a difference between the two?
As a new user, I can only attach 1 photo per post, so my message continues in 2 more parts.
- I need help adding a model. For example, I am not clear on how to add a model that is not present in the Hugging Face list to the .env.local file, or how to get the API key; when I click on “Get API key” it redirects to my API keys page on Hugging Face.
The continuation of this part is below, @aliasfox.
Continuation:
Should I create another API key for the model, or just click on “use model” as shown in the example image below and choose an option? I think if I need to proceed with “use the model” I need to pick Transformers. Am I right?
For now I have these doubts; please help me with them.
QwQ-LCoT-7B-Instruct doesn’t support the Inference API (serverless) endpoint on HuggingFace, so you will need to download and run it locally (the only option, since serverless endpoints are generally only set up by companies). Ollama is lighter than LMStudio, so use that, though it is only available through the command line. As a note, only the QwQ 32B model is available through Ollama directly, not the 7B, so we will need to set it up and add it manually.
- Download the QwQ-LCoT-7B-Instruct.Q4_K_M.gguf model.
Note: the 4_K_M version is the most used GGUF 4-bit quantization and the default for most Ollama models.
- Browse to %USERPROFILE%\.ollama\models
Note: take note of the full path to the models folder, because the config files do not resolve environment variables. Replace path\to in the following steps.
- Create the QwQ-LCoT-7B-Instruct.modelfile:
Note: I just created this from the template for QwQ, so it might need some refinement.
FROM path\to\.ollama\models\QwQ-LCoT-7B-Instruct.Q4_K_M.gguf
TEMPLATE """{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
- Create your custom model:
ollama create QwQ-LCoT-7B-Instruct --file path\to\.ollama\models\QwQ-LCoT-7B-Instruct.modelfile
P.S. I still need to test this, but wanted to get the response out with the steps up to this point.
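If you’d rather script the download in the first step instead of grabbing the file through the browser, a rough Python sketch with huggingface_hub would look like this. The repo_id is my guess at where the GGUF lives, so verify it on HuggingFace before running.

```python
# Rough sketch of the download step: fetch the Q4_K_M GGUF into Ollama's models folder.
# Requires `pip install huggingface_hub`. The repo_id is an assumption; double-check it.
import os
from huggingface_hub import hf_hub_download

models_dir = os.path.join(os.environ["USERPROFILE"], ".ollama", "models")

hf_hub_download(
    repo_id="prithivMLmods/QwQ-LCoT-7B-Instruct-GGUF",  # assumed repo id
    filename="QwQ-LCoT-7B-Instruct.Q4_K_M.gguf",
    local_dir=models_dir,
)
```

Once `ollama create` succeeds, `ollama run QwQ-LCoT-7B-Instruct` should let you chat with it, and bolt.diy can then reach it through the Ollama provider (Ollama’s default endpoint is http://localhost:11434).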
Thank you @aliasfox, but is there any better solution than installing Ollama? Are there any options to run the same model without installing other external applications? And how do I find the templates of the models? Will they be on the same page? Could you also guide me with this?
If I understand your question correctly, no. You need an application to run the LLM itself. If you don’t want to run a local model, then you can use API endpoints, which would be the recommended way anyway.
I’d recommend signing up for HuggingFace and using the models they offer (Qwen, Llama, Llava, etc.). Also sign up for Google AI Studio for access to their “experimental” models, which are free. Additionally, Microsoft offers free ChatGPT-4o through GitHub.
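As a concrete example of the endpoint route, here is a rough sketch of calling one of Google’s free experimental models with the google-generativeai package. The model name was current when this thread was written and may have changed, so treat it as illustrative only.

```python
# Illustrative only: call a free Google AI Studio "experimental" model.
# Requires `pip install google-generativeai` and an API key from https://aistudio.google.com.
# The model name below may have been replaced by a newer experimental release.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_AI_STUDIO_KEY")
model = genai.GenerativeModel("gemini-exp-1206")
response = model.generate_content("Explain what a quantized GGUF model is in two sentences.")
print(response.text)
```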
Okay @aliasfox, sorry for mentioning this again; I am repeating it for better understanding.
Windows 10:
Processor: Intel(R) Core™ i3-4005U CPU @ 1.70 GHz
RAM: 8.00 GB
64-bit operating system, x64-based processor
GPU: 2 GB
For now, I would prefer to run locally, if feasible.
As my laptop has an i3, can it also handle Ollama?
It would be good to list what GPU you have. I’m guessing it’s likely an integrated Intel chipset? I’d check. And 8 GB of RAM on Windows is a bit low; the system probably requires at least 4 GB alone, so you could only run very small quantized models on either GPU or CPU.
Why do you prefer running locally, if, for one, you don’t have the hardware, and two, you don’t want to install an additional “external” application? I feel like I’m missing something here.
With 8 GB of RAM this won’t work. You’ll need to use hosted inference or have a dedicated computer on the LAN for inference.
@aliasfox As of now I am using Hugging Face, and I haven’t used my usage quota there because I want to save it for larger models that my laptop can’t run; this is one of the main reasons. I don’t install external applications because I’m not sure whether applications like Ollama are suitable for my laptop, so I’m searching for alternatives.
HuggingFace is pretty generous on token usage and caps. Maybe just sign up for two accounts and swap between the keys (it would be cool to create a kind of load balancer). For free I use GitHub for 4o, Google Gemini exp-1210 (pretty good), and HF. I pay for OpenRouter, mostly to use Sonnet 3.5 and a few others. It’s a little dumb to jump around so much, but I’m still figuring things out myself too.
I think the best models on HuggingFace right now are Llama3.3-70B-Instruct, Qwen2.5-32B-Coder, Qwen2.5-72B-Instruct, and QwQ (but it doesn’t currently work with artifacts). And sadly you can’t filter HF specifically by Inference API (serverless) endpoints (the ones you can use with Bolt).
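For reference, calling one of those serverless models from Python is only a few lines with huggingface_hub, and swapping between two free keys (as mentioned above) is as simple as cycling a list. The tokens and the exact repo id are placeholders; check the model page for the current serverless id.

```python
# Sketch: call a HuggingFace Inference API (serverless) model directly.
# Requires `pip install huggingface_hub` and a free HF token (or two, if you want to rotate).
from itertools import cycle
from huggingface_hub import InferenceClient

tokens = cycle(["hf_token_one", "hf_token_two"])  # placeholder tokens; one is enough

client = InferenceClient(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",  # one of the serverless models listed above
    token=next(tokens),
)
reply = client.chat_completion(
    messages=[{"role": "user", "content": "Write a TypeScript hello-world."}],
    max_tokens=200,
)
print(reply.choices[0].message.content)
```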
@aliasfox these inputs are very insightful, but I think I finally got a breakthrough! This is exactly what I was looking for. I am attaching a YouTube video link that explains how to run any model locally without any external application (I suppose even > 7B, as he mentions).
I need assistance with how to implement it in bolt.diy
It is around 10 minutes long; to resolve my doubt, please spare 10 minutes of your valuable time.
I am looking forward to this solution specifically:
- How to download the model locally (one that suits bolt.diy integration)
- How to implement it in bolt.diy (adding any model that is not present in the Hugging Face list to the .env.local file)
Sorry, I don’t understand. I believe this is common knowledge.
This is what everyone does when they use Ollama, LMStudio, GPT4All, etc. They all just run local models, and while you can use the ones they provide, you can also download and install any that you’d like (generally from HuggingFace). If you want to run anything of consequence, though, you need some beefy hardware. In my experience, for the LLM to be even the least bit capable in Bolt.diy, it needs to be at least 32B parameters.
Only “Instruct” models greater than 7 billion parameters generally work with artifacts (file system, terminal, canvas, etc.), with the only exception I have found being QwQ-LCoT-3B-Instruct.
And your average user isn’t going to run a 32B LLM, which even with 4-bit quantization requires at least 20 GB of VRAM (40 GB+ for a 70B model at 4-bit). So it’s nice, fun to play around with, and very useful for learning, but I personally don’t think it’s a serious option.
And with a model like Llama 3.3 70B Instruct on OpenRouter costing less than 20¢ per million tokens, running models locally is not a compelling option for me… unless you want to, say, run them 24/7, which to me would be the only use case, especially with how inefficient at token usage systems currently are.
The only advantage I see for local models is with things like Cline VS Code plugin (and others), you could let the AI go on infinitely trying to complete a task and while it will likely run into all sorts of problems, it would cost you nothing but some power, time, and maybe some frustration. Likely an advanced RAG or workflow will be the solution to this and lead to something truly groundbreaking. That’s at least where my money is at.
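To put rough numbers on the comparison above: the VRAM figures follow from parameter count times bits per weight plus some runtime overhead, and the API side uses the OpenRouter price quoted above. The 1.25x overhead factor and the monthly token count are my assumptions.

```python
# Back-of-envelope math behind the paragraphs above. The 1.25x overhead factor
# (KV cache, activations) and the 5M tokens/month usage are assumptions; the
# $0.20 per million tokens figure is the OpenRouter price quoted above.

def vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.25) -> float:
    """Rough VRAM needed to hold the weights plus runtime overhead."""
    return params_billion * bits / 8 * overhead

print(f"32B @ 4-bit ~ {vram_gb(32):.0f} GB VRAM")  # ~20 GB
print(f"70B @ 4-bit ~ {vram_gb(70):.0f} GB VRAM")  # ~44 GB

price_per_million = 0.20      # USD, Llama 3.3 70B Instruct on OpenRouter (quoted above)
tokens_per_month = 5_000_000  # assumed fairly heavy personal usage
print(f"API cost ~ ${price_per_million * tokens_per_month / 1_000_000:.2f}/month")
```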
That’s only my two cents; you are welcome to disagree!
I got it @aliasfox, but I just want to try it with small models. I downloaded Microsoft phi-2 using Python and was able to link it to my .env.local file successfully. Please guide me on how I can add that model to the list.
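For context, loading phi-2 from Python usually looks something like the sketch below with transformers. This is a generic example, not necessarily the exact script used above, and on an 8 GB RAM laptop it will be tight and slow.

```python
# Generic sketch of running microsoft/phi-2 locally with transformers; not necessarily
# the exact script referred to above. Requires `pip install transformers torch` and a
# recent transformers version. At full precision, phi-2 (~2.7B params) wants roughly
# 10 GB of RAM, so expect swapping on an 8 GB machine.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("def reverse_string(s):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```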