Howdy good folks.
I’m wondering if we, together, can optimize Qwen2.5-Coder:32b by removing programming languages we generally do not need and, in turn, make the model fit below 7.5GiB. This should make it possible to run on normal GPUs like the 3070 and up. On my laptop with a 3070 and 32GiB of RAM, I’ve seen reasonable performance with models up to 11-12GiB. Over that, it becomes painfully slow.
The reason I’m focusing on Qwen2.5-Coder is that a lot of people are praising it.
I personally don’t have any experience with tuning models like this, but I’m hopeful that somebody here can help.
If GPU power is necessary, perhaps I can set up a Novita.ai instance and share the credentials for it. This way we can create a model that is really optimized for oTToDev. Perhaps it can be named oTToDev2.5-Coder.
Thoughts?
I am not knowledgeable about tuning models either, but I feel a bit reluctant to limit a model to a particular language.
My limited reasoning:
- I doubt that there is a “language” in the model. It’s a model trained on programming. Some, if not many, structures in imperative languages are alike. A model is an associative network in which there are no partitions for particular languages. Thus, I doubt you could even strip out Java, for example.
- Even if this were possible, I am not sure we’d know which languages are relevant. I also use the tool for Java code; there’s definitely JavaScript, TypeScript, CSS, some frameworks like React, Vue, Angular, and of course – cough – Svelte. Then, for data science, there is Python of course, and with WASM making a step into the browser, we must not forget the Rust army :wink:
Therefore, I’d rather wait for somebody to publish a “Vue with TypeScript and Vite” or a “Svelte with JavaScript and JSDoc” model than try to modify an existing one.
Just my $0.02
Thanks for your feedback @mrsimpson
According to the documentation on ollama.com, Qwen2.5-Coder 32B delivers excellent performance across more than 40 programming languages (qwen2.5-coder).
If it’s possible to strip out the ones we know we won’t have any use for, perhaps it’s also possible to make the model smaller. I really need input from people with knowledge in this area.
AFAIK, bolt.new and oTToDev focus mostly on creating web apps. This can, of course, be extended in the future, but for now we should at least support JS and TS and most of the frameworks.
Other than that, I know very little.
That is not how it works, but it would be nice to have a smaller one. You can try pruning, but then you would need additional fine-tuning.
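For illustration, unstructured magnitude pruning in PyTorch looks roughly like the sketch below (the model id and the 30% ratio are just placeholders, not a tested recipe). Note that it only zeroes out weights rather than removing a “language”, and the zeros don’t save memory unless you also move to a sparse or quantized format:

```python
# Minimal sketch of unstructured magnitude pruning; illustrative only,
# loading a 32B model in fp16 like this needs a lot of memory.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",  # assumed HF model id
    torch_dtype=torch.float16,
)

# Zero out the 30% smallest-magnitude weights in every linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Quality usually drops here, which is why additional fine-tuning is needed.
model.save_pretrained("qwen2.5-coder-32b-pruned")
```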
You could fine-tune multiple models for different tools, maybe things like Tailwind, etc. Then it may be possible to have different expert models work on each part.
There already are smaller versions than Qwen2.5-coder:32b. There is a 14b and a 7b, but sadly they don’t have the ‘critical mass’ to adequately drive oTToDev at present. Maybe tuning the prompt will help.
I had thought about that a bit too, but that’s not generally how LLMs work. Generally they are defined by the number of training tokens in billions. My thought was: could you take an LLM dataset, vectorize it, and remove duplicate data in vector space… basically generalizing the model, which would reduce the size.
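If it helps, near-duplicate removal in embedding space would look something like this rough sketch (the embedding model and the 0.95 threshold are arbitrary assumptions); note that it shrinks the training data, not an already-trained model:

```python
# Sketch: drop near-duplicate training examples by cosine similarity of embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np

texts = [
    "def add(a, b): return a + b",
    "def add(x, y): return x + y",      # near-duplicate of the first
    "SELECT * FROM users WHERE id = 1",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb = model.encode(texts, normalize_embeddings=True)

keep = []
for i, e in enumerate(emb):
    # Keep an example only if it is not too similar to anything already kept.
    if all(np.dot(e, emb[j]) < 0.95 for j in keep):
        keep.append(i)

deduped = [texts[i] for i in keep]
print(deduped)
```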
There are methods people use to deduplicate data in LLMs, though, and of course there is quantization. 4-bit quantization of larger models has shown some impressive results (in some cases <5% loss in accuracy) at a lot less memory.
However, many APIs still don’t support this.
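For local experiments, loading the model in 4-bit with transformers and bitsandbytes looks roughly like this (the model id and quantization settings are assumptions on my part):

```python
# Sketch: load Qwen2.5-Coder in 4-bit (NF4) via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",  # assumed HF model id
    quantization_config=bnb_config,
    device_map="auto",  # spread across available GPUs/CPU
)
```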
Agreed. In my experience you need at least a 32B-parameter model to be useful, and even 4-bit quantized that still requires something like 40GB of VRAM. And the 70+B models are better.
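For a rough sense of scale, here is a weights-only back-of-the-envelope estimate; it ignores the KV cache, activations, and runtime overhead, so real usage is higher:

```python
# Weights-only VRAM estimate: parameters * bits per weight / 8, in GiB.
def weight_gib(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

print(f"32B @ 16-bit ~ {weight_gib(32, 16):.1f} GiB")  # ~59.6 GiB
print(f"32B @ 4-bit  ~ {weight_gib(32, 4):.1f} GiB")   # ~14.9 GiB
print(f"70B @ 4-bit  ~ {weight_gib(70, 4):.1f} GiB")   # ~32.6 GiB
```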
You cannot remove specific knowledge from an LLM; it’s trained as a whole. You can prune the model, but there is no such thing as selective pruning that keeps just one language. It also doesn’t matter what the dataset size is: model size depends on the number of parameters, not the dataset size, so you can train a 3B model with the same amount of training data you’d use for a 70B model. It’s just a matter of how much quality data the dataset has and how much of that knowledge the parameters can capture.
There is another thing: a model’s coding ability doesn’t depend only on the training it received on one particular language, but also on the other languages. It can generalize concepts it has seen in one language and apply them in another, so keeping only one language in the training would only hurt a model trained from scratch.
And all of this is about training from scratch…
But yes, you can use fine-tuning on specific data; the model size will remain the same, though.
If you are interested in optimizing Qwen for bolt.diy, maybe something like Unsloth would be a good route, maybe with ORPO. I have started a dataset for training using their notebook. I was going to give it a go after I grew the dataset some, as we would need more data for fine-tuning; I was going to start with the 14B version, though. If anyone is interested in talking about this more, I am open:
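As a starting point, a minimal QLoRA sketch along the lines of the Unsloth notebooks might look like this; the model id, dataset file, and hyperparameters are placeholders rather than tested settings:

```python
# Sketch of a 4-bit LoRA fine-tune with Unsloth; all names/values are placeholders.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-14B-Instruct-bnb-4bit",  # assumed model id
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of extra weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical dataset file with a "text" field per example.
dataset = load_dataset("json", data_files="bolt_diy_examples.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=100,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```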