MLX is worth paying attention to. It's still pretty young (just over a year old) but the amount of activity in that ecosystem is really impressive, and it's quickly becoming the best way to run LLMs (and vision LLMs and increasingly audio models) on a Mac.
Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):
uv run --isolated --with mlx-lm python -m mlx_lm.chat
Ran it and it crapped out with a huge backtrace. I spotted `./build_bundled.sh: line 21: cmake: command not found` in it, so I guessed I needed cmake installed. `brew install cmake` and try again. Then it crapped out with `Compatibility with CMake < 3.5 has been removed from CMake.`. Then I give up.
This is typical of what happens any time I try to run something written in Python. It may be easier than setting up an NVIDIA GPU, but that's a low bar.
Python problems exist on all platforms. It's just that most people using Python have figured out their 'happy path' workarounds in the past and keep using them.
Python is awesome in many ways, one of my favourite languages, but unless you are happy with venv manipulation (or live in Conda), it's often a nightmare that ends up worse than DLL-hell.
Python is in a category of things you can't just use without being an expert in the minutiae. This is unfortunate because there are a lot of people who are not Python developers who would like to run programs which happen to be written in Python.
Python is by no means alone in this or particularly egregious. Having been a heavy Perl developer in the 2000s, I was part of the problem. I didn't understand why other people had so much trouble doing things that seemed simple to me, because I was eating, breathing, and sleeping Perl. I knew how to prolong the intervals between breaking my installation, and how to troubleshoot and repair it, but there was no reason why anyone who wanted to deploy, or even develop on, my code base should have needed that encyclopedic knowledge.
This is why, for all their faults, I count containers as the biggest revolution in the software industry, at least for us "backend" folks.
I've been running it on a 64GB M2. My favorite models to run tend to be about 20GB to download (eg Mistral Small 3.1) and use about 20GB of RAM while they are running.
I don't have a token/second figure to hand but it's fast enough that I'm not frustrated by it.
I wish apple would spend some more time paying attention to metal-jax :) it crashes with a few lines still and seems like an obvious need if apple wants to be serious about enabling ML work on their new MBPs.
MLX looks really nice from the demo-level playing around with it I've done, but I usually stick to jax so, you know, I can actually deploy it on a server without trying to find someone who racks macs.
No kidding? Might switch to CPU then. And yeah jax-metal is so utterly unreliable. I ran across an issue it turns out reduces to like a 2 line repro example which has been open on github for the better part of a year without updates
Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):