WRONG. Research shows effectively imperceptible performance difference at 4-bit and even 3-bit with GPTQ quantization. You cannot tell the difference and if you think you do you're wrong, because it barely even registers on any benchmark.
(Note: llama.cpp's 4-bit is naive round-to-nearest, not GPTQ, and it suffers for it, but they are refactoring it to use GPTQ-style quantization.)
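For anyone unfamiliar, the "naive" approach is basically round-to-nearest with one scale per block of weights. A rough sketch of the idea in numpy (block size and integer range here are illustrative, not llama.cpp's actual on-disk format):

```python
import numpy as np

def quantize_rtn_q4(weights: np.ndarray, block_size: int = 32):
    """Naive round-to-nearest 4-bit quantization, one absmax scale per block.
    Sketch of the general RTN idea only, not llama.cpp's exact Q4 format."""
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map block absmax to 7
    scale[scale == 0] = 1.0                             # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # int4 range
    return q, scale

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_rtn_q4(w)
err = np.abs(dequantize_q4(q, s) - w).mean()
```

The key difference from GPTQ is that RTN rounds each weight independently, while GPTQ uses second-order (Hessian) information to compensate for the error each rounding step introduces in the layer's output.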
References:
https://arxiv.org/abs/2210.17323 - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [Oct, 2022]
Good points, though I would gently encourage not starting a post with "WRONG." in the middle of a nuanced discussion. I remember 'way back when', when there was a flat 0.5-2% performance drop for UINT8 on some models when it was first introduced (depending upon the modality).
Like, 4-bit quantization really is probably enough for a number of usecases, and likely beats a smaller full-precision model at the equivalent total number of bits, but this really is only presenting half of the story. "You cannot tell the difference and if you think you do you're wrong, because it barely even registers on any benchmark" can be regarded as antagonistic, and also really doesn't line up with reality in a number of usecases. Sure, maybe for some models, UINT4 quantization is good enough. But there's a very large space of model architectures and problems, even within language modeling, many of which do show very demonstrable drops in performance. And at certain perplexity levels, every bit (heh) matters.
Good points, I didn't mean to come off abrasive, but I can see why I would. My intention was to get attention on a thread where my new comment would be buried under the 8 other replies, so I put a big attention grabber at the start.
But again, good points about the nuances of lower precision. For LLMs at least, 'The Case for 4-bit Precision' and 'GPTQ' seem fairly conclusive that above ~10B parameters even 3-bit precision has virtually undetectable loss with the right tricks. Losses which, if they even mattered, could easily be overcome with a little additional training.
Newer ongoing research on LLaMA specifically[0] shows we can reduce the model's size around 84% without any meaningful performance loss through a combination of GPTQ, binning, and 3-bit.
I know nothing about this so my opinion means little, but I imagine it's hard to know which parameters are important enough to use more bits for.
I do wonder if it would be possible to have the model determine during training how important each parameter is, while maybe rewarding it for having more small parameters?
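To make my hand-waving a little more concrete, here's the kind of thing I'm imagining: score each weight with a Fisher-like sensitivity statistic (grad² × weight² here is my own toy choice, not a method from the papers above) and give the most sensitive slice of parameters more bits:

```python
import numpy as np

def allocate_bits(weights: np.ndarray, grads: np.ndarray,
                  high_frac: float = 0.1, low_bits: int = 4,
                  high_bits: int = 8) -> np.ndarray:
    """Toy mixed-precision sketch: a Fisher-like score (grad^2 * weight^2)
    picks the top `high_frac` of parameters to keep at higher precision."""
    sensitivity = (grads ** 2) * (weights ** 2)
    cutoff = np.quantile(sensitivity, 1.0 - high_frac)
    return np.where(sensitivity >= cutoff, high_bits, low_bits)

rng = np.random.default_rng(1)
w = rng.standard_normal(10_000)
g = rng.standard_normal(10_000)  # stand-in for accumulated gradients
bits = allocate_bits(w, g)
avg_bits = bits.mean()  # about 4.4 with these settings
```

In real training you'd presumably accumulate the squared gradients over many batches rather than use a single sample, and you could fold a term like this into the loss to reward the model for keeping most parameters cheap.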
I looked at the numbers you posted, and am feeling concerned with how aggressively you're commenting towards a number of people on this website.
For starters, I got into this field a few years after the 2012 wave began. I've been with it for a while and have seen a lot of trends come and go. One thing that stays the same is that things are always changing. Very few things are set in stone, and for a few other reasons it takes years and years before anything even begins to be finalized.
The numbers you are quoting are from various research groups, and are days to weeks old. You've antagonized a number of users in this forum, from calling them wrong directly to saying that another person is empirically incorrect based on numbers you haven't verified yourself, and that have not had time to settle in the field yet with respect to real-world usecases. I went to one of the methods you linked, GPTQ, and it indeed showed a _good_ performance-to-size improvement, but it was not 'no difference'. This also doesn't account for the fact that 4-bit GPU support is still not well supported. On 13B, for 4-bit, a 0.1 perplexity increase is great, but I believe that is still at least a noticeable drop. The 0.42 perplexity increase for 3-bit is massive, though still very information efficient.
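To put those deltas in rough perspective: perplexity is exp(average negative log-likelihood per token), so a fixed perplexity difference costs more or fewer nats of loss depending on the base level (the base value below is purely illustrative, not a measured LLaMA number):

```python
import math

def extra_nll(base_ppl: float, delta: float) -> float:
    """Extra nats of loss per token when perplexity rises by `delta`,
    since perplexity = exp(mean negative log-likelihood per token)."""
    return math.log(base_ppl + delta) - math.log(base_ppl)

# At an illustrative base perplexity of 5.0:
gap_4bit = extra_nll(5.0, 0.10)  # ~0.0198 nats/token
gap_3bit = extra_nll(5.0, 0.42)  # ~0.0807 nats/token, roughly 4x worse
```

Which is part of why a flat "0.1 is nothing" claim doesn't generalize: the same delta matters more at lower base perplexities.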
This completely ignores the conversation about (back to the GPU side of things) kernel-level support for these operators, which is very underdeveloped. Technical, unvalidated lab numbers do not represent the real world; it's like battery technologies, where there are impressive tech demos and numbers out there, but lab results and shipping products are two very different things. Like many things, in my experience at least, it comes down to a big 'it depends'. It'll all settle out in the wash and we'll see which methods end up reigning in the long run.
Again -- please stop attacking other HN users based on a partial -- if well-researched -- understanding of the subject matter. It seems you're very involved in this topic, and I agree that more people need to hear about it. I think you could do an excellent job of sharing that news with them. That is good, and I hope the evangelism efforts go well; I wish you all the best on that front. However, it seems (and this may be an inappropriate judgement on my end) that you might have become personally entangled in what is generally a technical issue.
I am just a commenter on this website, though I have used Hacker News for a very long time at this point. I requested previously that you tamp down flaming the other users a bit, and I'd like to ask you once more. A good litmus test might be to ask yourself: "Am I including any information in this message that indicates that another person may be right or wrong, or that I might be right or wrong? How strongly do I feel that my perspective is reality vs. their incorrect perspective?" If you trip that line when writing out a comment, even if there is a strong impulse to ignore it, it may be time to step back, breathe, and separate out what is a personal issue for you from what is a technical issue that you are passionate about. You can have both at once.
Please just slow it down a bit. I want to see what you and everyone else can mutually bring to the table in this conversation. Thank you.
Many good points. I agree with essentially everything you've said, especially regarding relative perplexity.
I'm aware that I was aggressively overselling an unnuanced and overstated position on 4-bit and especially 3-bit performance. That was partially a rhetorical tactic to swing the pendulum the other way, as it were.
And partially it was simply frustration with the number of threads I've seen in the past week of LLaMA drama spreading misinformation about bit precision like "a 16bit 13B model surely outperforms a 4-bit 30B model" which could not be further from the truth. That frustration is my own responsibility to manage and I understand that.
Yeah, 7B vs 13B is basically no comparison in any situation; 16-bit 7B is def worse than 4-bit 13B. I'll be looking into 30B tomorrow. I may be able to do a full matrix of tests, 4-16 bit x 7B-30B.
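Back-of-the-envelope weight-memory math for why that comparison isn't close (raw weight bits only, ignoring quantization scales, activations, and other overhead, so real files are somewhat larger):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Weights-only size estimate: params * bits / 8 bytes, converted to GB.
    Ignores scale/zero-point overhead and activations."""
    return params_billion * 1e9 * bits / 8 / 1e9

size_13b_4bit = weight_gb(13, 4)   # 6.5 GB
size_7b_16bit = weight_gb(7, 16)   # 14.0 GB
```

So the 4-bit 13B model has nearly twice the parameters in less than half the memory of a 16-bit 7B, which is the whole appeal.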
8 bits, imo, is the minimum.