Mac Mini M4 for Ottodev with local model

bindujohal · November 11, 2024, 8:44pm

Hi All,

I am thinking of buying new Mac Mini M4 to use as a dev machine for learning. Planning to run some local LLM’s in it. Has anyone used LLM’s locally here on Mac’s for coding? How good/usefull those are? If base machine is not good should i get M4 Pro? Want small desktop for my desk at home. Would be nice to hear what others use for running LLM’s locally.

mahoney · November 11, 2024, 8:50pm

I am using an MBP M1 Max with 32GB without issues, very interested in getting an M4 running to attempt some larger model loads. ~7B parameter models are able to load fully onto GPU for me successfully which is a necessity if you want speed on local LLMs.

bindujohal · November 11, 2024, 8:55pm

i have been checking youtube review’s to see if anyone runs LLM’s locally on M4 and see how they work on it? I know they will load but do they provide response at good enough speeds is my concern. Also wanting to see if need to get M4 Pro but want to see if 800$ jump is justifiable.

Charbax · November 11, 2024, 11:33pm

I think we need to check performance running the LLM on the NPU of the M1/M4 chipset, I believe both have NPU, which is specifically able to accelerate AI workloads, and perhaps might use the GPU too, somehow. I don’t know how good the hardware acceleration is at the moment for each LLM… Consider on second hand you might find M1 Mini for $350, M1 Air for $500, M1 Pro 14" for below $1000 and M1 Max 16" for below $1300. Might not make sense to spend a lot more for M4/Pro/Max yet… it really depens how good the hardware acceleration is yet, I guess the NPU for AI workloads is much more powerful in the M4.

Charbax · November 12, 2024, 12:34am

I’m checking out Download LM Studio - Mac, Linux, Windows
with Qwen2.5-Coder 32B https://www.youtube.com/watch?v=ALArhCnz8rY
I wonder if that’ll work fine on my M1 Pro 14" for running ottodev unlimited locally…

Metasymbiont · November 12, 2024, 1:10pm

I have a older Mac. With a 2.2 GHz. Quad-core until core i7. With only 16 GB. I can use a llama3.1 8b, but I usually use. Free versions of larger models like. GTP 40 and gemini flash. I use vs coder and have done a few experiments with my working version of bolt new that I run locally and definitely get results. I created a stock market dashboard today with it. My operating system is up to date with modified version of the new mac OS Sonoma. My MacBook Pro was created. In mid 2015. I have a few older PC computers I also use occasionally. Oh yeah, it’s value is about $300 works fast enough for me. I recommend a newer model if you can get one for a fair price otherwise go for a PC.

mahoney · November 12, 2024, 3:14pm

I’ve been using Mac Max M1, 32GB for a few months now, some notes:

Works fine with qwen coder 2.5 7B
Probably does not work well with the 32B model; I haven’t tried it yet
Non Apple silicon is going to not be great unless you have some sort of Mecha-Godzilla Mac Pro with a great GPU
The important thing about the M series is the shared memory bus. This also means if you saturate the machine with LLM usage it will gladly let you hard lock your machine due to out-of-memory conditions.
The important thing about small models is quantization; they are likely performing better with higher quantization values and reducing quantization reduces value precision.

Hope this helps!

Metasymbiont · November 15, 2024, 12:26pm

So far I haven’t had any problems. Bolt.new is working good for me. I do use https://openrouter.ai/
I like Google’s models and GTP 40. I’m glad Qwen2.5-Coder 32b is available on Openrouter. Just use the MacBook Pro. For my smaller models. Around 9 billion. They work fine. But smaller models can’t compete with larger models long token count capabilities.

juan · December 4, 2024, 12:35am

I have a Mac Mini M4 24GB. Would this run LM Studio and ser oTToDev Qwen models? HAs anyone tried it yet?

CBJ · December 5, 2024, 7:52am

I returned my Mac mini M4 for a new Mac Studio M1 with 32gpu cores and 64gb ram. Obviously as it’s an old model I got a good price. The reason for this exchange which may seem strange is the speed of memory.

M1: 68 Gbps
M2/M3 : 100 Gbps
M4 : 120 Gbps
M3 Pro : 150 Gbps
M1/M2 Pro : 200 Gbps
M4 Pro : 273 Gbps
МЗ Мах : 300/400 Gbps
M1/M2 Max : 400 Gbps
M4 Max: 410/546 Gbps
M1/M2 Ultra : 800 Gbps
M4 Ultra: 546/820/1042 Gbps

64Gb with 400Gbps speed is the best value for money in my opinion for running large models even 72B.

Here is a table of Performance of llama.cpp on Apple Silicon M-serie. An M1 max 32gpu cores is as fast as an M2 max 30 cores.

juan · December 5, 2024, 9:19am

You mean an M1 Max, correct? How much was the deal if I can ask?

CBJ · December 5, 2024, 9:27am

Mac Studio M1 Max 64 GB 32GPU cores 1 TB SSD €2095 (1731.40 excluding VAT) new with 24 months warranty (I am in Europe)

juan · December 5, 2024, 10:34am

Thank you.
Same here. But I’ve brought a Mac mini M4 24GB from USA, with a good deal.
Let’s see what model can I play with.
I’ve tried with LMStudio.
Thinking of MLX command line to see if I can get better results.
With that conf, the Mac can’t move Qwen 32B yet.

CBJ · December 5, 2024, 12:19pm

You have to reduce the context size. See here from Ottodev .env file :

# Example Context Values for qwen2.5-coder:32b
# 
# DEFAULT_NUM_CTX=32768 # Consumes 36GB of VRAM
# DEFAULT_NUM_CTX=24576 # Consumes 32GB of VRAM
# DEFAULT_NUM_CTX=12288 # Consumes 26GB of VRAM
# DEFAULT_NUM_CTX=6144 # Consumes 24GB of VRAM

wonderwhy.er · December 5, 2024, 1:23pm

I am on M3 Max work laptop with 64Gb

I don’t use locally run model in any real way.
They are aether way slower or way worse in quality than hosted ones.

Like I have quantised Qwen2.5 32b instruct GGUF 17GB on sdd
It feels slower
Decided to test and record them going

Qwen starts on 2s
ChatGPT starts at 3 seconds

ChatGPT finished code block around 40s mark or 37s

Qwen finished around 70s mark or 68s to do

Do take in to account that chatgpt costs 20$ a month.
And cheapest 64gb m3 I could find locally right now costs 4852$
Let’s devide Mac cost by 3 years or 36 months
That would be 124 dollars per month for running models locally for 3 years.
Where will cloud hardware be in 2 years?

So I am big sceptic of running local models for cost reasons.

wonderwhy.er · December 5, 2024, 1:28pm

Now one more thing to mention.
Apple has its own format called MLX. LM Studio has a lot of GGUF models.
MLX is native to Mac hardware and should perform better.

LM Studio did add support recently

Wanted to test but had no time.
Checking now and there are not good models in MLX there yet…

Will test Cohere Command as its only big one from the list with 22GB on ssd.

Others are 8B or lower.

juan · December 6, 2024, 4:18pm

Thank you. Will try it.

juan · December 6, 2024, 4:21pm

Agree with that, of course. Burt we are talking about API calling vs Local calling + Privacy. Your thoughts are right. Local AI are never going to compare to cloud except for what I’ve said.

Also, regarding Ottodev, Have you tried Claude Sonnet 3.5 with it? It’s so expensive. If you want to develop it’s way better to have an infinite calls local AI. IMO

juan · December 6, 2024, 4:22pm

Yes, testing with MLX also and outside LM Studio too. Command Line. Will upload my results.