This looks pretty cool! What's the catch? e.g. why isn't this already implemente...

lupire · on March 15, 2024

It's not just a software algorithm. It's a hardware architecture optimization. To benefit, you have to build hardware that matches the dimensions of the algorithm. That's an expensive commitment.

emacs28 · on March 15, 2024

> you have to build hardware that matches the dimensions of the algorithm

Yes the benefits are realized in custom hardware designs as opposed to software, however, the hardware architectures work for multiplying matrices of arbitrary dimensions by splitting up larger matrices into smaller tiles, then summing up the tile products to form the final larger matrix products (i.e. GEMM)

SJC_Hacker · on March 18, 2024

Not so much in FPGA ... although I'm not sure top end FPGAs would beat Nvidia TPUs even with this algorithm, and even if cost were not a consideration.

emacs28 · on March 15, 2024

IMHO, for fixed-point MM accelerators, there is no catch, I think it's an overlooked algorithm. It's based on an algorithm by Winograd who coincidentally also proposed another unrelated algorithm that later became very popular for CNN acceleration which would take some visibility away from this other algorithm by Winograd... But that is speculative

yorwba · on March 15, 2024

On the other hand, if you tried it with floating point, you'd lose significant digits. Since the approach is to sum (a[i] + b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1] and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you may be taking the difference of two large values to get a small value, losing precision.

on March 15, 2024

[dead]

3abiton · on March 15, 2024

On a tangent, go is so elegant.

OJFord · on March 15, 2024

LLM hype and this submission in particular keep making me think of a lecturer I had for Topics in Large Dimensional Data Processing, circa 2016: as I recall he was enthusiastically adamant that the most important thing, breakthroughs etc., in years/decades to come was going to be faster matrix operations. Anyway, I'm pretty sure I recognise FIP (not FFIP of course) from that course.

I wish I could remember his name, I believe he left academia after my year and went to work in industry, I'd just be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment/attitude, we may not have had quite such ML hype but certainly 'big data' was all the rage at the time, it's just something that's stuck in my mind. One of those areas I always meant to study more, just realistically probably never had the mathematical chops for and certainly those I did have atrophied.

bee_rider · on March 15, 2024

Maybe I’m joking, but: our society is just a vehicle for economics at this point, our economy is built around science, our science has mostly been turned into observations about engineering, some time ago we changed all of engineering into differential equations, and differential equations can be solved by discretizing them and doing linear algebra, and most of linear algebra can be done with matrix multiplications (triangular solves and orthonormalizations if you are fancy). All you need is matmul.

OJFord · on March 15, 2024

> our science has mostly been turned into observations about engineering

You may be joking but that in particular seems pretty astute.

Superficially it seems accurate, and reasonably ascribable to economic forces, fewer concentrations of capital in people (aristocrats) spending it on a hobby interest or academic pursuit of their own - today's equivalents mostly prefer philanthropy (Musk is, I suppose, for whatever else you might think of him, a notable exception - preferring to explore space, AI, etc. not really it seems for personal monetary gain). But I wonder if that is fair, to modern scientists, or is it just that 'all the low-hanging stuff's been done'?

SJC_Hacker · on March 18, 2024

For life sciences need grad students / postdocs to do the grunt work of pipetting, dissected, plating etc. And whatever the equivalent is in chemistry (titration/GC/mass transfer I guess)?

But those tools created by engineers are pretty darn important, and allow plenty of experiments/observations to be performed that were previously out of reach.

MrYellowP · on March 17, 2024

So, what you're saying is ...

... that the Matrix creates the world around us.

Thanks.

gojomo · on March 16, 2024

Unless he was a guest lecturer, if the course was for credit, wouldn't his name appear on your official transcript?

OJFord · on March 16, 2024

I don't think so, this may be the case in your country of course. I may well have recorded it in my notes if I dig them out, but this was a fourth year course and they certainly degraded over the years.

pclmulqdq · on March 15, 2024

There are a lot of matrix multiplication algorithms out there with a lot of pluses and minuses. It's always a balance of accuracy, runtime, and scaling. This one probably has bad accuracy in floating point.

emacs28 · on March 15, 2024

For everyone discussing the reduced accuracy/numerical stability of the algorithms in floating-point, this is true. But note that the application of the algorithms in the work is explored for fixed-point MM/quantized integer NN inference, not floating-point MM/inference. Hence, there is no reduction in accuracy for that application of it compared to using conventional fixed-point MM.

pclmulqdq · on March 15, 2024

"Conventional fixed-point MM" is a large suite of algorithms. It is correct that this is a 2x reduction in MULs compared to naive fixed-point matrix multiply, but there is a large body of literature out there with other circuits. This is a cool trick to add to the group.

p1esk · on March 15, 2024

Inference world is gradually switching from INT formats to FP formats. FP8 is already supported in modern hardware, and FP4 support is coming. In my experiments I get better perplexity in language models with FP4 than with INT4.

SJC_Hacker · on March 18, 2024

How is FP fundamentally different than integers? I've done FPGA programming and it just seems like the programmer has to decide where/when to do the shifting based on the expected range of the data. I'm not sure how this is "natively supported" in hardware.

p1esk · on March 18, 2024

If you have designed FPUs you should know that FP computation involves a lot more additional operations than just shifting (e.g. rounding, subnormals, and special value handling). That’s why, for example, CPUs use different hardware blocks for INT vs FP computation.

But that’s not the point. The point is, this particular method to speed up matmul is not suitable for FP.

abetusk · on March 15, 2024

I'm no expert but I suspect this is wrong. To me, this is like saying you don't need to worry about integer overflow because your operations are only working on fixed integers. Really? You don't care if you multiply or add two large numbers and they spill over?

The more appropriate answer, I suspect, is that the numerical precision and stability sacrifices are more than adequate for normal usage.

If I'm wrong about this, I would certainly like to know.

pclmulqdq · on March 15, 2024

In hardware, you control your integer widths completely, so if you add two 32-bit ints to a 33-bit int, there is no chance of overflow. The same goes for multiplications, etc.

SJC_Hacker · on March 18, 2024

Yeah with shifts you can guarantee no overflow, but you have to decide under what circumstances is avoiding overflow/clipping worth the loss of precision.

Fixed point typically requires alot more programming, but sometimes its worth it if you know what the data ranges are.

abetusk · on March 15, 2024

I don't know why this answer is getting downvoted. This is absolutely correct.

W. Miller has a paper discussing, under conditions of numerical stability, O(n^3) multiplications is necessary [0]. Any algorithm that gets sub cubic runtime for matrix multiplication, like Strassen's or Coppersmith's, must sacrifice some amount of precision or stability.

[0] https://epubs.siam.org/doi/10.1137/0204009

gyrovagueGeist · on March 15, 2024

Another relevant paper is: https://epubs.siam.org/doi/10.1137/15M1032168

waynecochran · on March 15, 2024

The document said it outputs the exact same values as the conventional method. There is no accuracy trade off here.

pclmulqdq · on March 15, 2024

The paper cited is about hardware, where there is no accuracy tradeoff because you control the numerical precision completely and use fixed point. In a software implementation, neither is true. There is no chance that you will get the exact same values out of this method that you do out of other FP matmuls.

p1esk · on March 15, 2024

For floating point? Are you sure?

waynecochran · on March 15, 2024

Opening statement of README

    This repository contains the source code for ML hardware architectures that 
    require nearly half the number of multiplier units to achieve the same 
    performance, by executing alternative inner-product algorithms that trade 
    nearly half the multiplications for cheap low-bitwidth additions, while still 
    producing identical output as the conventional inner product.

p1esk · on March 15, 2024

I just looked at the paper: the answer is no, floating point is not supported.

pbsd · on March 15, 2024

It's not quite forgotten. It kind of lives on in the pseudo-dot product Wegman-Carter authenticators like UMAC. See Section 3 of [1] for context.

[1] https://cr.yp.to/antiforgery/pema-20071022.pdf

adastra22 · on March 15, 2024

I’ve only glanced at it so someone correct me if I’m wrong, but IIUC this is not a replacement for matrix multiplication but rather an approximation that only gives decent-ish results for the types of linear systems you see in AI/ML. But for that use case it is totally fine?

emacs28 · on March 21, 2024

It produces identical/bit-equivalent results as conventional/naive matrix multiplication for integer/fixed-point data types

mariocesar · on March 15, 2024

Perhaps it's less of a hidden gem and more of a spotlight moment.