Because of the disregard for the literature common in CS, I loved this part:
> This is achieved through ...and some cool documents from the 50s.
A bit of an anecdote: back when I was a research scientist (corporate lab) 30+ years ago, I would in fact go downstairs to the library and read — I was still a kid with a lot to learn (and still am). When I came across (by chance or by someone’s suggestion) something useful to my work, I’d photocopy the article and try it out. I’d put a comment in the code with the reference.
My colleagues in my group would give me (undeserved) credit for my supposed brilliance even though I said in the code where the idea came from and would determinedly point to the very paper on my desk. This attitude seemed bizarre as the group itself was producing conference papers and even books.
(Obviously this was not a universal phenomenon, as there were other people in the lab, and friends on the net, suggesting papers to read. But I’ve seen it a lot, from back then up to today.)
It is very unfortunate that in the world of ubiquitous information, the ability to just look up a reasonable solution to a well-known and well-studied problem that was already solved long ago is practically a superpower.
As a developer, I regularly find myself solving some large, long-standing problem with significant impact for the company, mostly with a couple of Google searches.
Once I took a batch task that needed 11 hours to complete and reduced it to about 5 seconds of work by looking up a paper on bitemporal indexes. The reactions ranged from "You are awesome!" to "You read a... paper????" to "No, we can't have this, this is super advanced code and it surely must have bugs. Can't you find an open source library that implements this?"
> It is very unfortunate that in the world of ubiquitous information, the ability to just look up a reasonable solution to a well-known and well-studied problem that was already solved long ago is practically a superpower
It's exactly because information is so ubiquitous and plentiful that finding a reasonable solution is like looking for a needle in a haystack.
In most cases, in my experience, the challenge was not actually grepping the internet for the solution.
More likely, people just went and rolled their own solution for any number of other reasons:
-- For the fun of the intellectual challenge (I am guilty of this),
-- For having the hubris to think they are the first to meet that challenge (I have been guilty of this in the past; I no longer make that mistake),
-- For having the hubris to think they have the uniquely best solution to the problem, so that it is highly unlikely to even be worth verifying,
-- For feeling superior to everybody else, and hence seeing no real reason to dignify other, lesser people by taking an interest in their work,
-- For feeling there is just too much pressure and no time to do it right, or even to take a peek at what is right and how much time it would take,
-- For not having enough experience to suspect there might already be a solution to this problem,
-- Books? Only old people read books,
-- Papers? They are for research scientists. I didn't finish a CS degree just to read papers at work,
and so on.
There have really been very few concrete problems that I could not find some solution to on the Internet.
It is more difficult to find solutions to less concrete problems, but even with these kinds of problems, if you have a habit of regularly seeking knowledge, reading various sources, etc., you will probably amass enough experience and knowledge to deal with most problems, even complex ones, on your own.
I regularly hear coders say they cannot understand other people's code, so they _had_ to write it themselves. Most people can't read papers either, it seems.
People have lost the ability to read anything longer than a tweet with focus and comprehension. Especially new developers.
I know this because whatever I put in the third and further paragraphs of my tickets is not being read or understood.
I sometimes quiz guys in review meetings, and I even sometimes put the most important information somewhere in the middle of the ticket. Always with the same result -- they are surprised.
Now, to read a paper is to try to imagine being the person who wrote it, and to appreciate not just the written text but also a lot of things that are not necessarily spelled out.
The same goes for code -- when I read somebody's code I want to recognize their style, how they approach various problems. Once you know a little bit about their vocabulary, you can really speed up reading the rest, because you can anticipate what is going to happen next.
One of the most valuable (but often under-appreciated) kinds of work people can do is to take hard-won knowledge that exists somewhere obscure and make it more accessible, e.g. by writing expository survey papers, adding material to Wikipedia, writing documentation, ...
> If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search. … I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.
Actually, it would save much more than 90 percent. More like 99.9%. Just take a powerful magnet in your hand and push it through the haystack until you find it.
Which, of course, requires that you somehow know of magnets beforehand.
Which is exactly the point -- even if you are in the business of finding needles in haystacks, it pays to have knowledge in other areas, and to basically read books and other knowledge sources. You never know when it is going to pay off.
Yes, they got to keep my solution. I basically told the guy "You are free to propose your own".
As for open-sourcing it, that's not going to happen.
Companies treat code as if it were a tangible good: if you give it to somebody else, you are losing value.
Which is super strange, because code needs to be maintained, and why not have other people do it for you if they find it valuable and worth maintaining?
I’ve seen an odd use of language, mostly from U.S. Americans, where “intelligent” = “smart” = “knowledgeable”. And conversely, an assumption that knowing many things is what makes you smart. I suspect this is what happened there; they thought you “brilliant” because you knew many things – in their minds, there might not be a difference between the two concepts.
I find that disregard for education to be tasteless, but perhaps warranted due to the modern prevalence of "bootcamps" and similar fast-but-shallow programmes.
Personally, I love when I get a chance to apply something from a published paper; I will leave a citation as a comment in the code and a brief explanation of how it works and why it was chosen. I have no regrets for having achieved a computing degree, so many years ago.
Not brilliant? Use of the written word to pass ideas down between generations is one of the most brilliant ideas in human history, and since nobody around you was doing it, the only explanation is that you independently invented the ethos of the scholar yourself.
I always found that the smell of old books filled me with the urge to defecate, to the extent that I had to deliberately go for a crap before diving into the stacks. Something of a disadvantage as a pre-internet researcher. I thought I was the only person in the world with this condition, but have subsequently found a couple of other similarly afflicted souls.
There is a semi-documented phenomenon among book store workers regarding just this: employees at retail stores like Barnes and Noble and Borders Books report famously bad cleanups in their bathrooms on a daily basis, reputedly due to this.
But: knowing where to look, being able to suss out details from papers, having an almost adequate level of conversation to try and contact the authors (if still alive...), the ability to put undecipherable formulas into code, and bearing Stack Overflow's requirements for a 'good' question. Not a set of skills that amounts to nothing. It's rare nowadays that I see any copy of a reference textbook on numerical recipes, although we do tons of signal processing and HPC stuff.
Recently I rediscovered the ability to compute logs with 64-bit integers with better precision than double precision. Man, jumping back into those depths after so many years... Not as easy as reading other people's code.
No one likes to do lit review, which is a shame. I get it - it's way less fun to find out someone already bested your cool idea than it is to write code. But it's definitely more productive to read the literature!
If someone wants a fast version of x ↦ tan(πx/2), let me recommend the approximation:
    tanpi_2 = function tanpi_2(x) {
        // Intended for |x| < 1; the 1.2732402998 / y term carries the poles at x = ±1.
        var y = 1 - x * x;
        return x * (((-0.000221184 * y + 0.0024971104) * y - 0.02301937096) * y
                    + 0.3182994604 + 1.2732402998 / y);
    };
But even better is to avoid trigonometry and angle measures as much as possible. Almost everything can be done better (faster, with fewer numerical problems) with vector methods; if you want a 1-float representation of an angle, use the stereographic projection.
Generally you can just stick to storing 2-coordinate vectors and using vector operations.
The places where you might want to convert to a 1-number representation are when you have a lot of numbers you want to store or transmit somewhere. Using the stereographic projection (half-angle tangent) instead of angle measure works better with the floating point number format and uses only rational arithmetic instead of transcendental functions.
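For concreteness, the round trip looks like this (my sketch, with made-up names; note the encoding blows up at the direction (-1, 0), which a real format has to special-case):

    // Store a unit direction (x, y) as the single float t = y / (1 + x),
    // i.e. the tangent of the half angle; recover it with rational
    // arithmetic only, no trig calls.
    static inline float dir_to_t(float x, float y) {
        return y / (1.0f + x);
    }

    static inline void t_to_dir(float t, float *x, float *y) {
        float d = 1.0f + t * t;
        *x = (1.0f - t * t) / d;  // cos(theta) = (1 - t^2) / (1 + t^2)
        *y = 2.0f * t / d;        // sin(theta) = 2t / (1 + t^2)
    }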
Around 1988, I added phase shift to my optical thin film design program written in Excel 4.0 for the Mac. At the time, this was utterly unique: each spreadsheet row represented a layer, and the matrices describing each layer could be calculated right in that row by squashing them down horizontally. The S- and P-polarization matrices could be recorded this way, and the running matrix products similarly maintained. Finally, using a simple one-input table, the reflectance of a typical 25-31 layer laser mirror could be calculated. And in less than a second on a 20 MHz 68020 (?) Mac II, for about 50 wavelengths. The best part was the graphics, which were instantaneous, beautiful, publishable, and pasteable into customer quotations. Semi-technical people could be trained to use the whole thing.
Now about the phase shift. In 1988, atan2 didn’t exist. Anywhere. Not in FORTRAN, Basic, Excel, or a C library. I’m sure phase shift calculators implemented it, each working alone. For us, it was critical. You see, we actually cared not about the phase shift but about its second derivative, the group delay dispersion. This was the beginning of the femtosecond laser era, and people needed to check whether these broadband laser pulses would be inadvertently stretched by reflection off, or transmission through, the mirror coating. So atan2, the QUADRANT-PRESERVING arc tangent, is required for a continuous, differentiable phase function. An Excel function macro did this, with IF statements correcting the quadrant. And the irony of all this?
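For reference, the quadrant correction such a macro has to perform looks roughly like this (a C sketch of the standard identity, not the original Excel code; M_PI is the POSIX math.h constant):

    #include <math.h>

    // atan(y/x) only covers half the circle; the sign of x decides
    // whether to shift the result by +/- pi. Origin handling is omitted.
    double atan2_by_hand(double y, double x) {
        if (x > 0.0) return atan(y / x);                              // quadrants I, IV
        if (x < 0.0) return atan(y / x) + (y >= 0.0 ? M_PI : -M_PI);  // quadrants II, III
        return y > 0.0 ? M_PI / 2.0 : -M_PI / 2.0;                    // on the y-axis
    }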
3BSD was released at the end of 1979 according to Wikipedia, and that's just the oldest BSD source I could find, so it may have been in even older versions.
Before ANSI/ISO C I don't think there was really a "math.h" to look for, as it may have differed across systems, but as so many things derive from BSD I wouldn't be surprised if it was "de facto" standardised to what BSD supported.
Nice. Reminds me of an optimisation trick: I was bottlenecked by one of these trigonometric functions years ago when working with a probabilistic data structure... then I figured the input domain was pretty small (a couple dozen values), so I precomputed those and used an array lookup instead. A huge win in terms of perf, though obviously only applicable in extreme cases like these.
To put some numbers on it, using N terms of the Taylor series for sin(x) [1] with |x| <= 0.1, the maximum error percentage cannot exceed [2]:
    N   Error Limit
    1   0.167%           (1/6%)
    2   8.35 x 10^-5%    (1/11980%)
    3   1.99 x 10^-8%    (1/50316042%)
    4   2.76 x 10^-12%   (1/362275502328%)
Even for |x| as large as 1, the sin(x) = x approximation is within 20%, which is not too bad when you are just doing a rough ballpark calculation, and a two-term approximation gets you under 1%. Three terms is under 0.024% (1/43%).
For |x| up to Pi/2 (90°) the one-term approximation falls apart. The guarantee from the Taylor series is within 65% (in reality it does better, but it is still off by 36%). Two terms is guaranteed to be within 8%, three within 0.5%, and four within 0.02%.
Here's a quick and dirty little Python program to compute a table of error bounds for a given x [3].
[1] x - x^3/3! + x^5/5! - x^7/7! + ...
[2] In reality the error will usually be quite a bit below this upper limit. The Taylor series for a given x is a convergent alternating series. That is, it is of the form A0 - A1 + A2 - A3 + ... where all the A's have the same sign, |Ak| is a decreasing sequence past some particular k, and |Ak| has a limit of 0 as k goes to infinity. Such a series converges to some value, and if you approximate that by just taking the first N terms the error cannot exceed the first omitted term as long as N is large enough to take you to the point where the sequence from there on is decreasing. This is the upper bound I'm using above.
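A minimal sketch of that computation (mine, in C rather than the Python mentioned above):

    #include <math.h>
    #include <stdio.h>

    // The truncation error of the alternating sin series after N terms is
    // at most the first omitted term, |x|^(2N+1) / (2N+1)!; print it as a
    // percentage of sin(x).
    int main(void) {
        double x = 0.1;
        double term = x;  // running term: x^(2n+1) / (2n+1)!
        for (int n = 1; n <= 4; ++n) {
            term *= x * x / ((2.0 * n) * (2.0 * n + 1.0));
            printf("N=%d  bound = %.3g%%\n", n, 100.0 * term / sin(x));
        }
        return 0;
    }

For x = 0.1 this reproduces the table above (0.167%, 8.35 x 10^-5%, ...).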
The sin(x) = x approximation is actually exact (in terms of doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens to cover 48.6% of all doubles.
Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
Finally, sin(double(pi)) tells you the error in double(pi) - that is, how far the double representation of pi is away from the "real", mathematical pi [3].
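On that last point, a two-line check (assuming a correctly rounded libm; M_PI is the POSIX constant):

    #include <math.h>
    #include <stdio.h>

    // Since sin(pi - e) ~ e for tiny e, sin of the double nearest to pi
    // reveals how far that double is from the mathematical pi.
    int main(void) {
        printf("%.17g\n", sin(M_PI));  // prints ~1.2246e-16
        return 0;
    }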
That is precisely the technique discussed in the article: it's the first term of the Taylor expansion. Except that the article used more terms of the expansion, and also used very slightly "wrong" coefficients to improve the overall accuracy within the small region.
That's what I assumed would have been a reasonable optimization!
What I really found amazing was that rather than reducing the number of multiplications to 2 (to calculate x^3), you can reduce it to 0, and it would still do surprisingly well.
Tangential at best, but why was the 'r' dropped from that term? Or why not call it caching? Why the weird "memo-ization"? It makes me think of a mass extinction event where everything is turned into a memo.
It's explained right in the linked Wikipedia page:
> The term "memoization" was coined by Donald Michie in 1968[3] and is derived from the Latin word "memorandum" ("to be remembered"), usually truncated as "memo" in American English, and thus carries the meaning of "turning [the results of] a function into something to be remembered". While "memoization" might be confused with "memorization" (because they are etymological cognates), "memoization" has a specialized meaning in computing.
The term memoization likely precedes the word caching (as related to computing; obviously weapon caches are far older). Memoization was coined in 1968, around the same time the first CPU caches were appearing, as main memory fell behind processor speeds.
As Wikipedia outlines, the 'r' was dropped because of 'memo'. It's derived from the Latin word memorandum, which does contain the 'r' (just like 'memory'), but apparently it was meant more as an analogy to written memos.
The "small error" in this article isn't so small when people want exact results. Unfortunately, a high-degree polynomial is necessary if you want 24 bit precision across the input range for functions like this.
That said, I wonder if a rational approximation might not be so bad on modern machines...
Indeed. I love Padé approximants -- they're very underused. Another really fun trick for approximating expensive functions is turning them into Chebyshev expansions, like Nick Trefethen's chebfun package does. You can trivially differentiate, integrate, and compute chebfun approximations of really horrible special functions (e.g. DE solutions) with fantastic accuracy. Criminally underused.
No, I haven't -- thank you very much for sharing. I've only ever used them, e.g. for spectral decomposition in magnetic resonance in a medical setting (1) with the fast Padé transformation, which has a wonderfully deep relationship with the DEs describing magnetic resonance (2).
Thanks also for the lecture - I'll enjoy it. Nick Trefethen is such a good speaker. He taught me in a graduate course in numerical methods -- honestly, I'd watch those lectures on a loop again.
24-bit precision would be possible with the Hastings approximation using 15 terms instead of 6. A naive estimate is that this requires 2.5 times as much work, so it could run at 6 GB/s and take a little less than 5 cycles per element.
I think that remains to be seen. Exact calculations will give you 24 bits, but in computers, floating point calculations typically aren’t exact.
If, as this article does, you compute all your intermediate values as 32-bit IEEE floats, it would hugely surprise me if you still got 24 bits of precision after 1 division (to compute y/x or x/y), 15 multiplications and 14 additions (to evaluate the polynomial) and, sometimes, a subtraction from π/2.
(That’s a weakness in this article. They point out how fast their code is, but do not show any test that it actually computes approximations to atan2, let alone how accurate the results are.)
Getting all the bits right in floating point is expensive.
The good news is that the high-order terms of the polynomial don't notably affect the error. With FMA, Horner's method converges to 0.5 ULP accuracy. The subtraction from pi/2 can also be done without adding inaccuracy by using compensated summation.
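For the pi/2 part, one common shape is the fdlibm-style split constant (a sketch; I computed the constants for float, treat them as illustrative):

    // pi/2 split into the nearest float plus a float-sized correction, so
    // the bits of pi/2 beyond float precision are restored after the
    // subtraction rounds.
    static const float PIO2_HI = 1.57079637e+0f;   // float nearest to pi/2
    static const float PIO2_LO = -4.3711390e-8f;   // pi/2 - PIO2_HI
    static inline float pio2_minus(float x) {
        return (PIO2_HI - x) + PIO2_LO;
    }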
> I wonder if a rational approximation might not be so bad on modern machines
Uops.info says the throughput of 32-byte FP32 vector division is 5 cycles on Intel, 3.5 cycles on AMD.
The throughput of 32-byte vector multiplication is 0.5 cycles. The same is for FMA, which is a single instruction computing a*b+c for 3 float vectors.
If the approximation formula only has 1 division where both numerator and denominator are polynomials, and the degree of both polynomials is much lower than 17, the rational version might be not too bad compared to the 17-degree polynomial version.
A Padé approximation is generally like that. Especially when you are dealing with singularities, it should need a much smaller polynomial than 17 terms in the numerator and denominator. I think I'm going to have to start a blog, and this will be on the list.
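For concreteness (my example, not the poster's): the [3/2] Padé approximant of atan about 0 is x(15 + 4x^2) / (15 + 9x^2), a single division and two tiny polynomials:

    // Illustrative [3/2] Pade approximant of atan(x) about x = 0. Only
    // good for small |x|, so a real implementation pairs it with argument
    // reduction.
    static inline float atan_pade_3_2(float x) {
        float x2 = x * x;
        return x * (15.0f + 4.0f * x2) / (15.0f + 9.0f * x2);
    }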
I initially used Chebfun to approximate them; then a teammate implemented the same Carathéodory-Fejér method in C. We subsequently convert from the Chebyshev basis to monomials.
> if we’re working with batches of points and willing to live with tiny errors, we can produce an atan2 approximation which is 50 times faster than the standard version provided by libc.
Which libc, though? I assume glibc, but it's frustrating when people talk about libc as though there were a single implementation. Each vendor supplies their own; libc is just a common interface defined by the C standard. There is no "standard version" provided by libc.
In particular, glibc's math functions are not especially fast--Intel's and Apple's math libraries are 4-5x faster for some functions[1], for example, and often more accurate as well (and both vendors provide vectorized implementations). Even within glibc, there have been enormous improvements across versions over the last decade or so, and for some functions there are big performance differences depending on whether or not -fno-math-errno is specified. (I would also note that atan2 has a lot of edge cases, and more than half the work in a standards-compliant libc is in getting those edge cases with zeros and infinities right, which this implementation punts on. There's nothing wrong with that, but it's a bigger tradeoff for most users than the small loss of accuracy, and important to note.)
So what are we actually comparing against here? Comparing against a clown-shoes baseline makes for eye-popping numbers, but it's not very meaningful.
None of this should really take away from the work presented, by the way--the techniques described here are very useful for people interested in this stuff.
[1] I don't know the current state of atan2f in glibc specifically; it's possible that it's been improved since last I looked at its performance. But the blog post cites "105.98 cycles / element", which would be glacially slow on any semi-recent hardware, which makes me think something is up here.
Note that the hardware is not particularly recent (Q3 2017), but we tend to rent servers which are not exactly on the bleeding edge, so that was my platform.
Thanks for the interesting writeup. I wonder if it is because glibc has a less-optimized atan2f (for floats). The double version seems quite involved in glibc anyway:
Without benchmarking, I would expect atan2f to be around 20-30 cycles per element or less with either Intel's or Apple's scalar math library, and proportionally faster for their vector libs.
Ah, regarding edge cases: the only edge cases I do not handle are infinities and the origin. In my case these are both uninteresting: we deal with points from panoramic images, so those both make no sense.
I did want to deal with 0s in either coordinate, though, and to preserve NaNs.
You're right that atan2 has a lot of edge cases, which is why it made for an interesting subject. I thought it was neat that the edge cases I did care about fell out fairly naturally from the efficient bit-twiddling.
That said, the versions presented in the article are _not_ conforming, as I remark repeatedly. The article is more about getting people curious about some topics that I think are neat than a guide on how to implement those standard functions correctly.
A comparison with one of the many SIMD math libraries would have been fairer than with plain libm. A long time ago I wrote such a dual-platform library for the PS3 (Cell processor) and the x86 architecture (outdated, but still available here [1]). Depending on how the standard libm implements atan2f, a speedup of 3x to 15x is achieved, without sacrificing accuracy.
A math library implementor will generally be familiar with at least minimax and Chebyshev approximations, and usually Padé and Carathéodory-Fejér approximations as well. (Source: I implemented math libraries for a decade. We used all of these frequently.)
I have developed very fast, accurate, and vectorizable atan() and atan2() implementations, leveraging AVX/SSE capabilities.
You can find them here [warning: self-signed SSL-Cert].
Little side note: the algorithm as given is scalar; however, it's branch-free and defined entirely in the header file, so compilers will typically be able to vectorize it and achieve a speedup directly proportional to the vector size. I see a potential [but architecture-dependent] optimization using the Estrin scheme for evaluating the polynomial.
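To illustrate the difference (a sketch with placeholder coefficients, not the library's actual code):

    // Degree-7 odd polynomial p(x) = c0*x + c1*x^3 + c2*x^5 + c3*x^7.
    // Horner's method chains the multiply-adds into one dependency chain;
    // Estrin pairs terms so several multiply-adds can run in parallel on
    // a superscalar core.
    static inline float poly7_estrin(float x, const float c[4]) {
        float x2 = x * x;
        float x4 = x2 * x2;
        float lo = c[0] + c[1] * x2;   // c0 + c1*x^2
        float hi = c[2] + c[3] * x2;   // c2 + c3*x^2
        return x * (lo + hi * x4);
    }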
Yes, the aim was to be accurate down to 1 LSB while being significantly faster. Feel free to drop terms from the polynomial if you can live with less accurate results!
The coefficients were generated by a package called Sollya; I've used it a few times to develop accurate Chebyshev approximations of functions.
Please, would you mind one of these days updating your blog post with the instructions you gave to Sollya? I'm trying something stupid with log1p and can't get Sollya to help, mostly because I'm not putting in enough time to read all the docs...
CORDIC was great for devices that were too simple to include hardware multipliers.
CORDIC, in its basic form, produces just 1 bit of result per clock cycle. Only for very low precisions (i.e. with few bits in each result) would you have a chance to overlap enough parallel atan2 computations to achieve a throughput comparable to polynomial approximations on modern CPUs with pipelined multipliers.
CORDIC remains useful only for ASIC/FPGA implementations, in the cases when the area and the power consumption are much more important than the speed.
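For reference, the basic vectoring-mode iteration looks like this (an illustrative C model, valid for x > 0; real hardware replaces atan() and ldexp() with a small ROM table and bit shifts):

    #include <math.h>

    // Each step rotates (x, y) toward the x-axis by +/- atan(2^-i) and
    // accumulates the angle, gaining roughly one result bit per iteration.
    double cordic_atan2(double y, double x, int iters) {
        double angle = 0.0;
        for (int i = 0; i < iters; ++i) {
            double d = ldexp(1.0, -i);   // 2^-i, a shift in hardware
            double step = atan(d);       // table lookup in hardware
            double xn, yn;
            if (y > 0.0) { xn = x + y * d; yn = y - x * d; angle += step; }
            else         { xn = x - y * d; yn = y + x * d; angle -= step; }
            x = xn; y = yn;
        }
        return angle;  // cordic_atan2(1.0, 1.0, 24) gives pi/4 to ~7 digits
    }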
I would have wished to see the error analysis section expanded a bit, or some sort of test to validate the max error. In particular, if the mathematical approximation function arctan* has a max error of 1/10000 degrees, then I'd naively expect the float-based implementation to have a worse error. Furthermore, it's not obvious whether additional error is introduced by the division
float atan_input = (swap ? x : y) / (swap ? y : x);
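A brute-force check would settle it; here's a minimal harness (my sketch, `approx` being whatever implementation is under test):

    #include <math.h>
    #include <stdio.h>

    // Sweep angles around the circle, staying strictly inside (-pi, pi)
    // to avoid the seam at +/- pi, and compare against double atan2.
    static double max_abs_error(float (*approx)(float, float)) {
        double max_err = 0.0;
        const int N = 1000000;
        for (int i = 0; i < N; ++i) {
            double a = -M_PI + 2.0 * M_PI * (i + 0.5) / N;
            float y = (float)sin(a), x = (float)cos(a);
            double err = fabs((double)approx(y, x) - atan2((double)y, (double)x));
            if (err > max_err) max_err = err;
        }
        return max_err;
    }

    int main(void) {
        // Sanity check: libm's own atan2f should show only float rounding
        // noise, on the order of 1e-7 rad.
        printf("atan2f: %.3g rad\n", max_abs_error(atan2f));
        return 0;
    }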
Mineiro's faster approximate log2, with < 1.4% relative error for x in [1/100, 10]. Here's the simple non-SSE version:
    #include <stdint.h>

    static inline float
    fasterlog2 (float x)
    {
      // Reinterpret the IEEE-754 bits of x as an integer: scaled by 2^-23,
      // the bit pattern is already a rough piecewise-linear log2(x).
      union { float f; uint32_t i; } vx = { x };
      float y = vx.i;
      y *= 1.1920928955078125e-7f;  // 2^-23
      return y - 126.94269504f;     // exponent bias, tuned to cut the error
    }
This fastapprox library also includes fast approximations of some other functions that show up in statistical / probabilistic calculations -- gamma, digamma, the Lambert W function. It is BSD licensed and originally lived on Google Code; copies of the library live on in GitHub, e.g. https://github.com/etheory/fastapprox
It's also interesting to read through libm. E.g. compare Sun's ~1993 atan2 & atan:
(Undoubtedly) stupid question: would it be any faster to project (x, y) to the unit circle (x', y'), then compute acos(x') or asin(y'), and then correct the result based on the signs of x & y? When converting Cartesian coordinates to polar, the value of r=HYPOT(x, y) is needed anyway, so the projection to the unit circle would be a single division by r.
How do you handle arrays whose lengths are not a multiple of 8 in this kind of vectorized code? Do you zero-pad them before handing them to the vectorized function, or do you run a second loop element by element over the remaining elements after the main one? What happens if you try to do `_mm256_load_ps(&ys[i])` with < 8 elements remaining?
You could pad the input as you said, which avoids a "tail loop," but otherwise you usually do a partial load (load <8 elements into a vector) and store. Some instruction set extensions provide "masked load/store" instructions for this, but there are ways to do it without those too.
To your last question specifically, if you _mm256_load_ps(&ys[i]) and you're at the edge of a page, you'll get a segfault. Otherwise you'll get undefined values (which can be okay, you could ignore them instead of dealing with a partial load).
RISC-V has a solution to this which seems quite elegant - the vector length is set by the program, up to some maximum, and for the final block, it is simply whatever is left:
> Do you zero-pad them before handing them to the vectorized function, or do you run a second loop element by element over the remaining elements after the main one?
Typically the latter. That's why I find SVE so interesting. It should improve code density.
Typically you have the vectorized loop followed by a non-vectorized loop handling the remaining items, or even just explicit if-then-else statements for each possible number of remaining items.
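In code, the usual shape is something like this (an illustrative AVX example, not from the article):

    #include <immintrin.h>
    #include <stddef.h>

    // Vectorized main loop over full 8-float blocks, then a scalar tail
    // loop for the remaining 0..7 elements.
    void scale(float *xs, size_t n, float k) {
        __m256 vk = _mm256_set1_ps(k);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(&xs[i]);  // unaligned load is safe here
            _mm256_storeu_ps(&xs[i], _mm256_mul_ps(v, vk));
        }
        for (; i < n; ++i)
            xs[i] *= k;
    }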
Nice writeup and interesting results. I hadn't seen perf_event_open(2) used directly in code before; that looks cool.
The baseline is at a huge disadvantage here because the call to atan2 in the loop never gets inlined and the loop doesn't seem to get unrolled (which is surprising actually). Manually unrolling by 8 gives me an 8x speedup. Maybe I'm missing something with the `-static` link but unless they're using musl I didn't think -lm could get statically linked.
Could it be that calls to glibc never get inlined, since inlining is essentially static linking by another name? Or since they're in separate compilation units, without LTO you'd never perform the analysis to figure out if it's worth it. Would LTO inspect across a dynamic loading boundary? Just speculating, I really have no idea. Everything I work with is statically linked, so I've never really had to think about it.
You've made a silly mistake and just did 1/8th of the work, so of course it was an 8x speedup. I am still wondering how the numbers would look if you could inline atan2f from libc. Too bad that isn't easier.
I did something similar for tanh once, though I found I could get to 1 ulp.
Part of the motivation was that I could get 10x faster than libc. However, I then tried on my FreeBSD machine and could only get 4x faster. After a lot of head scratching and puzzling, it turned out there was a bug in the version of libc on my Linux box that slowed things down. It kind of took the wind out of the achievement, but it was still a great learning experience.
This is pretty similar to my quest to make my own cos() when my friend didn't have access to libc. It was fun! Though I don't have the math or low-level knowledge that this author does.