Because of the disregard for the literature common in CS, I loved this part:
> This is achieved through ...and some cool documents from the 50s.
A bit of an anecdote: back when I was a research scientist (corporate lab) 30+ years ago, I would in fact go downstairs to the library and read — I was still a kid with a lot to learn (and still am). When I came across (by chance or by someone’s suggestion) something useful to my work, I’d photocopy the article and try it out. I’d put a comment in the code with the reference.
My colleagues in my group would give me (undeserved) credit for my supposed brilliance even though I said in the code where the idea came from and would determinedly point to the very paper on my desk. This attitude seemed bizarre as the group itself was producing conference papers and even books.
(Obviously this was not a universal phenomenon, as there were other people in the lab, and friends on the net, suggesting papers to read. But I’ve seen it a lot, from back then up to today.)
It is very unfortunate that in the world of ubiquitous information, the ability to just look up a reasonable solution to a well-known and well-studied problem that was already solved long ago is practically a superpower.
As a developer, I regularly find myself solving some large, long-standing problem with significant impact for the company, mostly with a couple of Google searches.
Once I took a batch task that needed 11 hours to complete and reduced it to about 5 seconds of work by looking up a paper on bitemporal indexes. The reactions ranged from "You are awesome!" to "You read a... paper????" to "No, we can't have this, this is super advanced code and it surely must have bugs. Can't you find an open source library that implements this?"
> It is very unfortunate that in the world of ubiquitous information, the ability to just look up a reasonable solution to a well-known and well-studied problem that was already solved long ago is practically a superpower
It's exactly because information is so ubiquitous and plentiful that finding a reasonable solution is like looking for a needle in a haystack.
In most cases, in my experience, the challenge was not actually grepping the internet for the solution.
More likely, people just went and rolled their own solution for any number of other reasons:
-- For the fun of the intellectual challenge (I am guilty of this),
-- For having the hubris to think they are the first to meet that challenge (I have been guilty of this in the past; I no longer make that mistake),
-- For having the hubris to think they have the uniquely best solution to the problem, so that it is highly unlikely to even be worth verifying,
-- For feeling superior to everybody else, and hence seeing no real reason to dignify other, lesser people by taking an interest in their work,
-- For feeling there is just too much pressure and no time to do it right, or even to take a peek at what is right and how much time it would take,
-- For not having enough experience to suspect there might already be a solution to this problem,
-- Books? Only old people read books,
-- Papers? They are for research scientists. I didn't finish a CS degree just to read papers at work,
and so on.
There have really been very few concrete problems that I could not find some solution to on the Internet.
It is more difficult to find solutions to less concrete problems, but even with these kinds of problems, if you have a habit of regularly seeking knowledge, reading various sources, etc., you will probably amass enough experience and knowledge to deal with most problems, even complex ones, on your own.
I regularly hear coders say they cannot understand other people's code, so they _had_ to write it themselves. Most people can't read papers either, it seems.
People have lost the ability to read anything longer than a tweet with focus and comprehension. Especially new developers.
I know this because whatever I put in the third and further paragraphs of my tickets is not being read or understood.
I sometimes quiz guys in review meetings, and I even sometimes put the most important information somewhere in the middle of the ticket. Always with the same result -- they are surprised.
Now, to read a paper is to try to imagine being the person who wrote it, and to appreciate not just the written text but also a lot of things that are not necessarily spelled out.
The same goes for code -- when I read somebody's code I want to recognize their style, how they approach various problems. Once you know a little bit about their vocabulary, you can really speed up reading the rest, because you can anticipate what is going to happen next.
One of the most valuable (but often under-appreciated) kinds of work people can do is to take hard-won knowledge that exists somewhere obscure and make it more accessible, e.g. by writing expository survey papers, adding material to Wikipedia, writing documentation, ...
> If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search. … I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.
Actually, it would save much more than 90 percent. More like 99.9%. Just take a powerful magnet in your hand and push it through the haystack until you find it.
Which, of course, requires that you somehow know of magnets beforehand.
Which is exactly the point -- even if you are in the business of finding needles in haystacks, it pays to have knowledge in other areas, and to basically read books and other knowledge sources. You never know when it is going to pay off.
Yes, they got to keep my solution. I basically told the guy "You are free to propose your own".
As for open-sourcing it, that's not going to happen.
Companies treat code as if it were a tangible good: if you give it to somebody else, you are losing value.
Which is super strange, because code needs to be maintained, and why not have other people do it for you if they find it valuable and worth maintaining?
I’ve seen an odd use of language, mostly from U.S. Americans, where “intelligent” = “smart” = “knowledgeable”. And conversely, an assumption that knowing many things is what makes you smart. I suspect this is what happened there; they thought you “brilliant” because you knew many things – in their minds, there might not be a difference between the two concepts.
I find that disregard for education to be tasteless, but perhaps warranted due to the modern prevalence of "bootcamps" and similar fast-but-shallow programmes.
Personally, I love when I get a chance to apply something from a published paper; I will leave a citation as a comment in the code and a brief explanation of how it works and why it was chosen. I have no regrets for having achieved a computing degree, so many years ago.
Not brilliant? Use of the written word to pass ideas down between generations is one of the most brilliant ideas in human history, and since nobody around you was doing it, the only explanation is that you independently invented the ethos of the scholar yourself.
I always found that the smell of old books filled me with the urge to defecate, to the extent that I had to deliberately go for a crap before diving into the stacks. Something of a disadvantage as a pre-internet researcher. I thought I was the only person in the world with this condition, but have subsequently found a couple of other similarly afflicted souls.
There is a semi-documented phenomenon among book store workers regarding just this: employees at retail stores like Barnes and Noble and Borders Books report famously bad cleanups in their bathrooms on a daily basis, reputedly due to this.
But: knowing where to look, being able to suss out details from papers, having an almost adequate level of conversation to try and contact the authors (if still alive...), the ability to put undecipherable formulas into code, and bearing Stack Overflow's requirements for a 'good' question. Not a set of skills that amounts to nothing. It's rare nowadays that I see any copy of a reference textbook on numerical recipes, although we do tons of signal processing and HPC stuff.
Recently I rediscovered the ability to compute logs with 64-bit integers with better precision than double precision. Man, jumping back into those depths after so many years... Not as easy as reading other people's code.
No one likes to do lit review, which is a shame. I get it - it's way less fun to find out someone already bested your cool idea than it is to write code. But it's definitely more productive to read the literature!
If someone wants a fast version of x ↦ tan(πx/2), let me recommend the approximation:
    tanpi_2 = function tanpi_2(x) {
        // Intended for |x| < 1; the 1.2732402998 / y term carries the poles at x = ±1.
        var y = 1 - x * x;
        return x * (((-0.000221184 * y + 0.0024971104) * y - 0.02301937096) * y
                    + 0.3182994604 + 1.2732402998 / y);
    };
But even better is to avoid trigonometry and angle measures as much as possible. Almost everything can be done better (faster, with fewer numerical problems) with vector methods; if you want a 1-float representation of an angle, use the stereographic projection.
Generally you can just stick to storing 2-coordinate vectors and using vector operations.
The places where you might want to convert to a 1-number representation are when you have a lot of numbers you want to store or transmit somewhere. Using the stereographic projection (half-angle tangent) instead of angle measure works better with the floating point number format and uses only rational arithmetic instead of transcendental functions.
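For concreteness, the round trip looks like this (my sketch, with made-up names; note the encoding blows up at the direction (-1, 0), which a real format has to special-case):

    // Store a unit direction (x, y) as the single float t = y / (1 + x),
    // i.e. the tangent of the half angle; recover it with rational
    // arithmetic only, no trig calls.
    static inline float dir_to_t(float x, float y) {
        return y / (1.0f + x);
    }

    static inline void t_to_dir(float t, float *x, float *y) {
        float d = 1.0f + t * t;
        *x = (1.0f - t * t) / d;  // cos(theta) = (1 - t^2) / (1 + t^2)
        *y = 2.0f * t / d;        // sin(theta) = 2t / (1 + t^2)
    }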
Around 1988, I added phase shift to my optical thin film design program written in Excel 4.0 for the Mac. At the time, this was utterly unique: each spreadsheet row represented a layer, and the matrices describing each layer could be calculated right in that row by squashing them down horizontally. The S- and P-polarization matrices could be recorded this way, and the running matrix products similarly maintained. Finally, using a simple one-input table, the reflectance of a typical 25-31 layer laser mirror could be calculated. And in less than a second on a 20 MHz 68020 (?) Mac II, for about 50 wavelengths. The best part was the graphics, which were instantaneous, beautiful, publishable, and pasteable into customer quotations. Semi-technical people could be trained to use the whole thing.
Now about the phase shift. In 1988, atan2 didn’t exist. Anywhere. Not in FORTRAN, Basic, Excel, or a C library. I’m sure phase shift calculators implemented it, each working alone. For us, it was critical. You see, we actually cared not about the phase shift but about its second derivative, the group delay dispersion. This was the beginning of the femtosecond laser era, and people needed to check whether these broadband laser pulses would be inadvertently stretched by reflection off, or transmission through, the mirror coating. So atan2, the QUADRANT-PRESERVING arc tangent, is required for a continuous, differentiable phase function. An Excel function macro did this, with IF statements correcting the quadrant. And the irony of all this?
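For reference, the quadrant correction such a macro has to perform looks roughly like this (a C sketch of the standard identity, not the original Excel code; M_PI is the POSIX math.h constant):

    #include <math.h>

    // atan(y/x) only covers half the circle; the sign of x decides
    // whether to shift the result by +/- pi. Origin handling is omitted.
    double atan2_by_hand(double y, double x) {
        if (x > 0.0) return atan(y / x);                              // quadrants I, IV
        if (x < 0.0) return atan(y / x) + (y >= 0.0 ? M_PI : -M_PI);  // quadrants II, III
        return y > 0.0 ? M_PI / 2.0 : -M_PI / 2.0;                    // on the y-axis
    }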
3BSD was released at the end of 1979 according to Wikipedia, and that's just the oldest BSD source I could find, so it may have been in even older versions.
Before ANSI/ISO C I don't think there was really a "math.h" to look for, as it may have differed across systems, but as so many things derive from BSD I wouldn't be surprised if it was "de facto" standardised to what BSD supported.
Nice. Reminds me of an optimisation trick: I was bottlenecked by one of these trigonometric functions years ago when working with a probabilistic data structure... then I figured the input domain was pretty small (a couple dozen values), so I precomputed those and used an array lookup instead. A huge win in terms of perf, though obviously only applicable in extreme cases like these.
To put some numbers on it, using N terms of the Taylor series for sin(x) [1] with |x| <= 0.1, the maximum error percentage cannot exceed [2]:
    N   Error Limit
    1   0.167%           (1/6%)
    2   8.35 x 10^-5%    (1/11980%)
    3   1.99 x 10^-8%    (1/50316042%)
    4   2.76 x 10^-12%   (1/362275502328%)
Even for |x| as large as 1, the sin(x) = x approximation is within 20%, which is not too bad when you are just doing a rough ballpark calculation, and a two-term approximation gets you under 1%. Three terms is under 0.024% (1/43%).
For |x| up to Pi/2 (90°) the one-term approximation falls apart. The guarantee from the Taylor series is within 65% (in reality it does better, but it is still off by 36%). Two terms is guaranteed to be within 8%, three within 0.5%, and four within 0.02%.
Here's a quick and dirty little Python program to compute a table of error bounds for a given x [3].
[1] x - x^3/3! + x^5/5! - x^7/7! + ...
[2] In reality the error will usually be quite a bit below this upper limit. The Taylor series for a given x is a convergent alternating series. That is, it is of the form A0 - A1 + A2 - A3 + ... where all the A's have the same sign, |Ak| is a decreasing sequence past some particular k, and |Ak| has a limit of 0 as k goes to infinity. Such a series converges to some value, and if you approximate that by just taking the first N terms the error cannot exceed the first omitted term as long as N is large enough to take you to the point where the sequence from there on is decreasing. This is the upper bound I'm using above.
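A minimal sketch of that computation (mine, in C rather than the Python mentioned above):

    #include <math.h>
    #include <stdio.h>

    // The truncation error of the alternating sin series after N terms is
    // at most the first omitted term, |x|^(2N+1) / (2N+1)!; print it as a
    // percentage of sin(x).
    int main(void) {
        double x = 0.1;
        double term = x;  // running term: x^(2n+1) / (2n+1)!
        for (int n = 1; n <= 4; ++n) {
            term *= x * x / ((2.0 * n) * (2.0 * n + 1.0));
            printf("N=%d  bound = %.3g%%\n", n, 100.0 * term / sin(x));
        }
        return 0;
    }

For x = 0.1 this reproduces the table above (0.167%, 8.35 x 10^-5%, ...).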
The sin(x) = x approximation is actually exact (in terms of doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens to cover 48.6% of all doubles.
Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
Finally, sin(double(pi)) tells you the error in double(pi) - that is, how far the double representation of pi is away from the "real", mathematical pi [3].
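On that last point, a two-line check (assuming a correctly rounded libm; M_PI is the POSIX constant):

    #include <math.h>
    #include <stdio.h>

    // Since sin(pi - e) ~ e for tiny e, sin of the double nearest to pi
    // reveals how far that double is from the mathematical pi.
    int main(void) {
        printf("%.17g\n", sin(M_PI));  // prints ~1.2246e-16
        return 0;
    }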
That is precisely the technique discussed in the article: it's the first term of the Taylor expansion. Except that the article used more terms of the expansion, and also used very slightly "wrong" coefficients to improve the overall accuracy within the small region.
That's what I assumed would have been a reasonable optimization!
What I really found amazing was that rather than reducing the number of multiplications to 2 (to calculate x^3), you can reduce it to 0, and it would still do surprisingly well.
Tangential at best, but why was the 'r' dropped from that term? Or why not call it caching? Why the weird "memo-ization"? It makes me think of a mass extinction event where everything is turned into a memo.
It's explained right in the linked Wikipedia page:
> The term "memoization" was coined by Donald Michie in 1968[3] and is derived from the Latin word "memorandum" ("to be remembered"), usually truncated as "memo" in American English, and thus carries the meaning of "turning [the results of] a function into something to be remembered". While "memoization" might be confused with "memorization" (because they are etymological cognates), "memoization" has a specialized meaning in computing.
The term memoization likely precedes the word caching (as related to computing; obviously weapon caches are far older). Memoization was coined in 1968, around the same time the first CPU caches were appearing, as main memory fell behind processor speeds.
As Wikipedia outlines, the 'r' was dropped because of 'memo'. It's derived from the Latin word memorandum, which does contain the 'r' (just like 'memory'), but apparently it was meant more as an analogy to written memos.
The "small error" in this article isn't so small when people want exact results. Unfortunately, a high-degree polynomial is necessary if you want 24 bit precision across the input range for functions like this.
That said, I wonder if a rational approximation might not be so bad on modern machines...
Indeed. I love Padé approximants -- they're very underused. Another really fun trick for approximating expensive functions is turning them into Chebyshev expansions, like Nick Trefethen's chebfun package does. You can trivially differentiate, integrate, and compute chebfun approximations of really horrible special functions (e.g. DE solutions) with fantastic accuracy. Criminally underused.
No, I haven't -- thank you very much for sharing. I've only ever used them, e.g. for spectral decomposition in magnetic resonance in a medical setting (1) with the fast Padé transformation, which has a wonderfully deep relationship with the DEs describing magnetic resonance (2).
Thanks also for the lecture - I'll enjoy it. Nick Trefethen is such a good speaker. He taught me in a graduate course in numerical methods -- honestly, I'd watch those lectures on a loop again.
24-bit precision would be possible with the Hastings approximation using 15 terms instead of 6. A naive estimate is that this requires 2.5 times as much work, so it could run at 6 GB/s and take a little less than 5 cycles per element.
I think that remains to be seen. Exact calculations will give you 24 bits, but in computers, floating point calculations typically aren’t exact.
If, as this article does, you compute all your intermediate values as 32-bit IEEE floats, it would hugely surprise me if you still got 24 bits of precision after 1 division (to compute y/x or x/y), 15 multiplications and 14 additions (to evaluate the polynomial) and, sometimes, a subtraction from π/2.
(That’s a weakness in this article. They point out how fast their code is, but do not show any test that it actually computes approximations to atan2, let alone how accurate the results are.)
Getting all the bits right in floating point is expensive.
The good news is that the high-order terms of the polynomial don't notably affect the error. With FMA, Horner's method converges to 0.5 ULP accuracy. The subtraction from pi/2 can also be done without adding inaccuracy by using compensated summation.
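For the pi/2 part, one common shape is the fdlibm-style split constant (a sketch; I computed the constants for float, treat them as illustrative):

    // pi/2 split into the nearest float plus a float-sized correction, so
    // the bits of pi/2 beyond float precision are restored after the
    // subtraction rounds.
    static const float PIO2_HI = 1.57079637e+0f;   // float nearest to pi/2
    static const float PIO2_LO = -4.3711390e-8f;   // pi/2 - PIO2_HI
    static inline float pio2_minus(float x) {
        return (PIO2_HI - x) + PIO2_LO;
    }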
> I wonder if a rational approximation might not be so bad on modern machines
Uops.info says the throughput of 32-byte FP32 vector division is 5 cycles on Intel, 3.5 cycles on AMD.
The throughput of 32-byte vector multiplication is 0.5 cycles. The same is for FMA, which is a single instruction computing a*b+c for 3 float vectors.
If the approximation formula only has 1 division where both numerator and denominator are polynomials, and the degree of both polynomials is much lower than 17, the rational version might be not too bad compared to the 17-degree polynomial version.
A Padé approximation is generally like that. Especially when you are dealing with singularities, it should need a much smaller polynomial than 17 terms in the numerator and denominator. I think I'm going to have to start a blog, and this will be on the list.
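For concreteness (my example, not the poster's): the [3/2] Padé approximant of atan about 0 is x(15 + 4x^2) / (15 + 9x^2), a single division and two tiny polynomials:

    // Illustrative [3/2] Pade approximant of atan(x) about x = 0. Only
    // good for small |x|, so a real implementation pairs it with argument
    // reduction.
    static inline float atan_pade_3_2(float x) {
        float x2 = x * x;
        return x * (15.0f + 4.0f * x2) / (15.0f + 9.0f * x2);
    }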
I initially used Chebfun to approximate them; then a teammate implemented the same Carathéodory-Fejér method in C. We subsequently convert from the Chebyshev basis to monomials.
> if we’re working with batches of points and willing to live with tiny errors, we can produce an atan2 approximation which is 50 times faster than the standard version provided by libc.
Which libc, though? I assume glibc, but it's frustrating when people talk about libc as though there were a single implementation. Each vendor supplies their own; libc is just a common interface defined by the C standard. There is no "standard version" provided by libc.
In particular, glibc's math functions are not especially fast--Intel's and Apple's math libraries are 4-5x faster for some functions[1], for example, and often more accurate as well (and both vendors provide vectorized implementations). Even within glibc, there have been enormous improvements across versions over the last decade or so, and for some functions there are big performance differences depending on whether or not -fno-math-errno is specified. (I would also note that atan2 has a lot of edge cases, and more than half the work in a standards-compliant libc is in getting those edge cases with zeros and infinities right, which this implementation punts on. There's nothing wrong with that, but it's a bigger tradeoff for most users than the small loss of accuracy, and important to note.)
So what are we actually comparing against here? Comparing against a clown-shoes baseline makes for eye-popping numbers, but it's not very meaningful.
None of this should really take away from the work presented, by the way--the techniques described here are very useful for people interested in this stuff.
[1] I don't know the current state of atan2f in glibc specifically; it's possible that it's been improved since last I looked at its performance. But the blog post cites "105.98 cycles / element", which would be glacially slow on any semi-recent hardware, which makes me think something is up here.
Note that the hardware is not particularly recent (Q3 2017), but we tend to rent servers which are not exactly on the bleeding edge, so that was my platform.
Thanks for the interesting writeup. I wonder if it is because glibc has a less-optimized atan2f (for floats). The double version seems quite involved in glibc anyway:
Without benchmarking, I would expect atan2f to be around 20-30 cycles per element or less with either Intel's or Apple's scalar math library, and proportionally faster for their vector libs.
Ah, regarding edge cases: the only edge cases I do not handle are infinities and the origin. In my case these are both uninteresting: we deal with points from panoramic images, so those both make no sense.
I did want to deal with 0s in either coordinate, though, and to preserve NaNs.
You're right that atan2 has a lot of edge cases, which is why it made for an interesting subject. I thought it was neat that the edge cases I did care about fell out fairly naturally from the efficient bit-twiddling.
That said, the versions presented in the article are _not_ conforming, as I remark repeatedly. The article is more about getting people curious about some topics that I think are neat than a guide on how to implement those standard functions correctly.
A comparison with one of the many SIMD math libraries would have been fairer than with plain libm. A long time ago I wrote such a dual-platform library for the PS3 (Cell processor) and the x86 architecture (outdated, but still available here [1]). Depending on how the standard libm implements atan2f, a speedup of 3x to 15x is achieved, without sacrificing accuracy.
A math library implementor will generally be familiar with at least minimax and Chebyshev approximations, and usually Padé and Carathéodory-Fejér approximations as well. (Source: I implemented math libraries for a decade. We used all of these frequently.)
I have developed very fast, accurate, and vectorizable atan() and atan2() implementations, leveraging AVX/SSE capabilities.
You can find them here [warning: self-signed SSL-Cert].
Little side note: the algorithm as given is scalar; however, it's branch-free and defined entirely in the header file, so compilers will typically be able to vectorize it and achieve a speedup directly proportional to the vector size. I see a potential [but architecture-dependent] optimization using the Estrin scheme for evaluating the polynomial.
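To illustrate the difference (a sketch with placeholder coefficients, not the library's actual code):

    // Degree-7 odd polynomial p(x) = c0*x + c1*x^3 + c2*x^5 + c3*x^7.
    // Horner's method chains the multiply-adds into one dependency chain;
    // Estrin pairs terms so several multiply-adds can run in parallel on
    // a superscalar core.
    static inline float poly7_estrin(float x, const float c[4]) {
        float x2 = x * x;
        float x4 = x2 * x2;
        float lo = c[0] + c[1] * x2;   // c0 + c1*x^2
        float hi = c[2] + c[3] * x2;   // c2 + c3*x^2
        return x * (lo + hi * x4);
    }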
Yes, the aim was to be accurate down to 1 LSB while being significantly faster. Feel free to drop terms from the polynomial if you can live with less accurate results!
The coefficients were generated by a package called Sollya; I've used it a few times to develop accurate Chebyshev approximations of functions.
Please, would you mind one of these days updating your blog post with the instructions you gave to Sollya? I'm trying something stupid with log1p and can't get Sollya to help, mostly because I'm not putting in enough time to read all the docs...
CORDIC was great for devices that were too simple to include hardware multipliers.
CORDIC, in its basic form, produces just 1 bit of result per clock cycle. Only for very low precisions (i.e. with few bits in each result) would you have a chance to overlap enough parallel atan2 computations to achieve a throughput comparable to polynomial approximations on modern CPUs with pipelined multipliers.
CORDIC remains useful only for ASIC/FPGA implementations, in the cases when the area and the power consumption are much more important than the speed.
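For reference, the basic vectoring-mode iteration looks like this (an illustrative C model, valid for x > 0; real hardware replaces atan() and ldexp() with a small ROM table and bit shifts):

    #include <math.h>

    // Each step rotates (x, y) toward the x-axis by +/- atan(2^-i) and
    // accumulates the angle, gaining roughly one result bit per iteration.
    double cordic_atan2(double y, double x, int iters) {
        double angle = 0.0;
        for (int i = 0; i < iters; ++i) {
            double d = ldexp(1.0, -i);   // 2^-i, a shift in hardware
            double step = atan(d);       // table lookup in hardware
            double xn, yn;
            if (y > 0.0) { xn = x + y * d; yn = y - x * d; angle += step; }
            else         { xn = x - y * d; yn = y + x * d; angle -= step; }
            x = xn; y = yn;
        }
        return angle;  // cordic_atan2(1.0, 1.0, 24) gives pi/4 to ~7 digits
    }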
I would have wished to see the error analysis section expanded a bit, or some sort of test to validate the max error. In particular, if the mathematical approximation function arctan* has a max error of 1/10000 degrees, then I'd naively expect the float-based implementation to have a worse error. Furthermore, it's not obvious whether additional error is introduced by the division
float atan_input = (swap ? x : y) / (swap ? y : x);
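A brute-force check would settle it; here's a minimal harness (my sketch, `approx` being whatever implementation is under test):

    #include <math.h>
    #include <stdio.h>

    // Sweep angles around the circle, staying strictly inside (-pi, pi)
    // to avoid the seam at +/- pi, and compare against double atan2.
    static double max_abs_error(float (*approx)(float, float)) {
        double max_err = 0.0;
        const int N = 1000000;
        for (int i = 0; i < N; ++i) {
            double a = -M_PI + 2.0 * M_PI * (i + 0.5) / N;
            float y = (float)sin(a), x = (float)cos(a);
            double err = fabs((double)approx(y, x) - atan2((double)y, (double)x));
            if (err > max_err) max_err = err;
        }
        return max_err;
    }

    int main(void) {
        // Sanity check: libm's own atan2f should show only float rounding
        // noise, on the order of 1e-7 rad.
        printf("atan2f: %.3g rad\n", max_abs_error(atan2f));
        return 0;
    }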
Mineiro's faster approximate log2, with < 1.4% relative error for x in [1/100, 10]. Here's the simple non-SSE version:
    #include <stdint.h>

    static inline float
    fasterlog2 (float x)
    {
      // Reinterpret the IEEE-754 bits of x as an integer: scaled by 2^-23,
      // the bit pattern is already a rough piecewise-linear log2(x).
      union { float f; uint32_t i; } vx = { x };
      float y = vx.i;
      y *= 1.1920928955078125e-7f;  // 2^-23
      return y - 126.94269504f;     // exponent bias, tuned to cut the error
    }
This fastapprox library also includes fast approximations of some other functions that show up in statistical / probabilistic calculations -- gamma, digamma, the Lambert W function. It is BSD licensed and originally lived on Google Code; copies of the library live on in GitHub, e.g. https://github.com/etheory/fastapprox
It's also interesting to read through libm. E.g. compare Sun's ~1993 atan2 & atan:
(Undoubtedly) stupid question: would it be any faster to project (x, y) to the unit circle (x', y'), then compute acos(x') or asin(y'), and then correct the result based on the signs of x & y? When converting Cartesian coordinates to polar, the value of r=HYPOT(x, y) is needed anyway, so the projection to the unit circle would be a single division by r.
How do you handle arrays whose lengths are not a multiple of 8 in this kind of vectorized code? Do you zero-pad them before handing them to the vectorized function, or do you run a second loop element by element over the remaining elements after the main one? What happens if you try to do `_mm256_load_ps(&ys[i])` with < 8 elements remaining?
You could pad the input as you said, which avoids a "tail loop," but otherwise you usually do a partial load (load <8 elements into a vector) and store. Some instruction set extensions provide "masked load/store" instructions for this, but there are ways to do it without those too.
To your last question specifically, if you _mm256_load_ps(&ys[i]) and you're at the edge of a page, you'll get a segfault. Otherwise you'll get undefined values (which can be okay, you could ignore them instead of dealing with a partial load).
RISC-V has a solution to this which seems quite elegant - the vector length is set by the program, up to some maximum, and for the final block, it is simply whatever is left:
> Do you zero-pad them before handing them to the vectorized function, or do you run a second loop element by element over the remaining elements after the main one?
Typically the latter. That's why I find SVE so interesting. It should improve code density.
Typically you have the vectorized loop followed by a non-vectorized loop handling the remaining items, or even just explicit if-then-else statements for each possible number of remaining items.
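In code, the usual shape is something like this (an illustrative AVX example, not from the article):

    #include <immintrin.h>
    #include <stddef.h>

    // Vectorized main loop over full 8-float blocks, then a scalar tail
    // loop for the remaining 0..7 elements.
    void scale(float *xs, size_t n, float k) {
        __m256 vk = _mm256_set1_ps(k);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(&xs[i]);  // unaligned load is safe here
            _mm256_storeu_ps(&xs[i], _mm256_mul_ps(v, vk));
        }
        for (; i < n; ++i)
            xs[i] *= k;
    }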
Nice writeup and interesting results. I hadn't seen perf_event_open(2) used directly in code before; that looks cool.
The baseline is at a huge disadvantage here because the call to atan2 in the loop never gets inlined and the loop doesn't seem to get unrolled (which is surprising actually). Manually unrolling by 8 gives me an 8x speedup. Maybe I'm missing something with the `-static` link but unless they're using musl I didn't think -lm could get statically linked.
Could it be that calls to glibc never get inlined, since inlining is essentially static linking by another name? Or since they're in separate compilation units, without LTO you'd never perform the analysis to figure out if it's worth it. Would LTO inspect across a dynamic loading boundary? Just speculating, I really have no idea. Everything I work with is statically linked, so I've never really had to think about it.
You've made a silly mistake and just did 1/8th of the work, so of course it was an 8x speedup. I am still wondering how the numbers would look if you could inline atan2f from libc. Too bad that isn't easier.
I did something similar for tanh once, though I found I could get to 1 ulp.
Part of the motivation was that I could get 10x faster than libc. However, I then tried on my FreeBSD machine and could only get 4x faster. After a lot of head scratching and puzzling, it turned out there was a bug in the version of libc on my Linux box that slowed things down. It kind of took the wind out of the achievement, but it was still a great learning experience.
This is pretty similar to my quest to make my own cos() when my friend didn't have access to libc. It was fun! Though I don't have the math or low-level knowledge that this author does.