
Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.


> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the above quote? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.

Context window?

Is it about how much of the model can actively be processed, despite being fully loaded into memory, given the memory bandwidth?


With a mixture-of-experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well, which reduces the size of the tensors you write to memory.
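A rough back-of-envelope sketch of why this matters for decode speed (illustrative numbers only; it assumes 8-bit weights and a purely bandwidth-bound decode, and ignores KV-cache reads, activations, and overhead):

```python
# Decode is roughly bandwidth-bound: each generated token must stream the
# active weights from memory once. MoE reads only the routed experts.

def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Upper-bound decode rate if each token reads the active weights once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# DeepSeek-V3-style MoE: 671B total parameters, ~37B active per token,
# on a machine with ~800 GB/s of memory bandwidth (hypothetical figure).
dense_equiv = tokens_per_sec(671, 1, 800)  # if every weight were read per token
moe         = tokens_per_sec(37, 1, 800)   # only the active parameters are read

print(f"dense-equivalent: {dense_equiv:.1f} tok/s, MoE: {moe:.1f} tok/s")
```

The ~18x gap between the two numbers is the whole argument: the total parameter count determines how much memory you need, but the active parameter count determines how fast single-stream decode can go.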


What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you still need to read all the weights (which isn't a problem in itself, because prefill is compute-bound, but that in turn is a problem on a weak machine like this Mac or the "EPYC build" someone else mentioned). Even during inference, a batch size greater than 1 (more than one inference at a time) or a sequence size greater than 1 (speculative decoding) can require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here that people usually miss.
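A toy model of why batching erodes the MoE bandwidth advantage (simplistic assumption: routing is uniform and independent across tokens, which real gating networks are not, so treat this as a lower bound on real-world clustering effects):

```python
# Expected fraction of an MoE layer's expert weights that must be read in
# one decode step, as a function of batch size. With E experts and k routed
# per token, under uniform independent routing:
#   P(a given expert is untouched by one token) = 1 - k/E
#   E[fraction of experts touched by batch b]   = 1 - (1 - k/E)**b

def expert_fraction(E, k, b):
    return 1 - (1 - k / E) ** b

E, k = 256, 8  # DeepSeek-V3-like: 256 routed experts, 8 active per token
for b in (1, 4, 16, 64, 256):
    print(f"batch={b:3d}: read ~{expert_fraction(E, k, b):.0%} of expert weights")
```

At batch size 1 you read ~3% of the expert weights per step; by batch 64 you are reading most of the model every step, which is the "could require you to read the entire model, repeatedly" case above.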


No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.
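The arithmetic behind that claim, counting weights only (no KV cache or activation overhead):

```python
# Memory needed just to hold GPT-3's weights at 16-bit precision.
params = 175e9         # 175B parameters
bytes_per_param = 2    # fp16/bf16
gb = params * bytes_per_param / 1e9
print(f"{gb:.0f} GB")  # 350 GB of weights, which fits in 512 GB of unified memory
```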


Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.


Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.


Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE


Computers don't usually depreciate slowly


Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.


It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since many people buy them impulsively).

Unless you have a specific configuration, depreciation isn't much better than for an equivalently priced PC. In fact, my experience is that the long-tail value of a PC is better if you picked something high-end.


I don't know. I can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go cheap on eBay. I've had an easy time selling old Macs; the high-end desktop market might be simple too.


Macs are easy to sell if they are BTO with a custom configuration; in that case you may not lose too much. But depreciation hits the base models hard: the market is flooded, because people who buy those machines tend to change them often, or bought them just to try, got confused, etc.

Low-end PCs (mostly laptops) don't keep their value very well, but then again you probably got them cheap on a deal or something like that, so your depreciation might actually not be as bad as with an equivalent Mac. The units you are talking about are enterprise stuff that gets swapped every 3 years or so, mostly for accounting reasons, but it's not the type of thing I would advise anyone to buy brand new (the strategy would actually be to pick up a second-hand unit).

High-end PCs, whether laptops or big desktops, keep their value pretty well because they are niche by definition and very rare. Depending on your original choice, you may actually see less depreciation than with an equivalently priced Mac, because there are fewer of them on sale at any given time.

It all depends on your personal situation, the strength of the local market, the ease of reselling through platforms that provide trust, and many other variables.

What I meant is that it's not the early 2000s anymore, when you could offload a relatively new Mac (2-3 years old) very easily without taking a big depreciation hit, because they were not very common.

In my medium-sized town, there is a local second-hand electronics shop with all kinds of Macs at all kinds of price points. High-end Razers sell for more money and are a rare sight. It's much the same for iPhones: three-year-old models are hit very hard by depreciation, while some niche Android phones take a smaller hit.

Apple went through a weird strategy where they simultaneously went for luxury pricing, overcharging for the things that make the experience much better (RAM/storage), and tried to make the machines affordable to the masses (by compromising heavily on things they shouldn't have).

Apple's behavior created a shady second-hand market with lots of moving parts (units being shipped in and out of China), and this is all their doing.


Well, those listed prices are asks, not bids, so they only give an upper bound on the value. I've tried to sell obscure things before where there were few or zero other sellers, and no matter what you list it for, you might never find the buyer who wants that specific thing.

And the electronics shop is probably going to fetch a higher price than an individual seller would, due to the trust factor you mentioned. So have you managed to sell old Windows PCs for decent prices in some local market?


Pretty much. In addition, PyTorch on the Mac is abysmally bad, as is JAX. I don't know why Apple doesn't implement proper support; it seems important. There's MLX, which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.


This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.


I run Mistral Large locally on two A6000s, at 4-bit. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus, some of the strongest LLMs (Grok, DeepSeek) are now free for web use.


I hear you. I make these decisions for a public company.

When engineers tell me they want to run their own models in the cloud, I tell them they are free to play with it, but that it isn't a project going into the roadmap. OpenAI/Anthropic and others are much cheaper in terms of tokens per dollar, thanks to economies of scale.

There is still value in running your own models for privacy reasons, however, and that's why I pay attention to efforts to reduce the cost of running models locally or in your own cloud environment.


No one who is using this for home use cares about anything except batch size 1 sequence size 1.


What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.


People want to talk to their computer, not service requests for a thousand users.


For decode, MoE is nice at either bs=1 (decoding for a single user) or bs=<very large> (using expert parallelism to efficiently serve a large number of users).

Anything in between suffers.


Just to add to this point: you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow, since you would need to load/unload weights on every token.
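A toy top-k router makes this concrete: different tokens pick different experts, so a system that kept only "hot" experts in fast memory would be swapping weights nearly every token. All sizes and gating weights here are made up for illustration; real gates are learned.

```python
import random

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

# Stand-in gating matrix (learned in a real model): one score row per expert.
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def route(x):
    """Score every expert against the token embedding, keep the top-k."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate]
    return sorted(range(NUM_EXPERTS), key=lambda i: -scores[i])[:TOP_K]

tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
for t, x in enumerate(tokens):
    print(f"token {t}: experts {route(x)}")
```

Each token routes to a (usually different) pair of experts, which is why keeping the full expert set resident in fast memory is the comfortable operating point.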


Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.



