AVX512/VBMI2: A Programmer’s Perspective (singlestore.com)
98 points by ingve on Aug 14, 2021 | 55 comments


I work with GPUs professionally, but I've written a lot of AVX-512 code (e.g., https://NN-512.com) and AVX-512 is a big step forward from AVX2.

The regularity and simplicity and completeness of the instruction set is a big win.

The lane predication (masking) of every instruction is useful when the data doesn't fit the vectors (and in loop tails) but it has numerous other uses, too. For example, it makes blending (vector splicing) and partial operations easy.
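Concretely, a loop tail handled entirely with a mask might look like this (a minimal sketch, AVX-512F only; names are illustrative):

    #include <immintrin.h>
    #include <stddef.h>

    // Add b[] into a[] for n floats; the final partial vector goes through
    // the same code path via a lane mask instead of a scalar tail loop.
    void add_arrays(float* a, const float* b, size_t n) {
        for (size_t i = 0; i < n; i += 16) {
            size_t left = n - i;
            __mmask16 m = left >= 16 ? (__mmask16)0xFFFF
                                     : (__mmask16)((1u << left) - 1);
            __m512 va = _mm512_maskz_loadu_ps(m, a + i);
            __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
            _mm512_mask_storeu_ps(a + i, m, _mm512_add_ps(va, vb));
        }
    }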

The flexible permute instructions (two data vectors in, one control vector in, and one data vector out) are fast and enormously useful. Anyone who has puzzled with AVX2 will breathe a sigh of relief.
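For instance, a single permutex2var covers patterns that take several shuffles and blends to emulate in AVX2 (a sketch):

    // dst lane i takes lane idx[i] from a (indices 0-15) or b (16-31).
    __m512 interleave_low_halves(__m512 a, __m512 b) {
        const __m512i idx = _mm512_setr_epi32(0, 16, 1, 17, 2, 18, 3, 19,
                                              4, 20, 5, 21, 6, 22, 7, 23);
        return _mm512_permutex2var_ps(a, idx, b);  // a0,b0,a1,b1,...
    }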

The register file is big enough (32 vectors, each 16 floats wide) that register starvation typically isn't an issue. For example, you don't worry about having registers for your permutation control vectors. And the predicate masks are in another set of registers, which again are plentiful (eight!).

The easing of alignment requirements (unaligned load/store instructions operating on aligned addresses are as fast as aligned loads/stores) is also a big win.

AVX-512 is a real pleasure to use.


AVX512 probably needs bpermute for parity with GPU shared memory.

Arbitrary swizzles in both the forward direction (permute, currently in AVX512) and the backward direction (bpermute, GPU only) are as necessary and proper as gather + scatter. Any program that uses one is highly likely to use the other.

--------

Butterfly permutes should be especially accelerated, as that pattern keeps showing up. Arbitrary permutes / bpermutes seem expensive to implement in hardware, but butterfly permutes are the fundamental building block.

Butterfly permutes are needed everywhere from FFT to scan/prefix-sum operations. They're also fundamental (and simpler at the hardware level than an arbitrary bpermute/permute).
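To make the log-step structure concrete, here's a 16-lane sum sketched with AVX-512's general permute standing in for a dedicated butterfly instruction (each step exchanges lane i with lane i ^ stride):

    // After the loop, every lane of v holds the sum of all 16 lanes.
    __m512 butterfly_sum(__m512 v) {
        const __m512i iota = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                               8, 9, 10, 11, 12, 13, 14, 15);
        for (int stride = 8; stride >= 1; stride >>= 1) {
            __m512i idx = _mm512_xor_si512(iota, _mm512_set1_epi32(stride));
            v = _mm512_add_ps(v, _mm512_permutexvar_ps(idx, v));
        }
        return v;
    }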

AMD implements the butterfly permute in DPP instructions. NVidia provides a simple shfl.bfly PTX instruction.

Butterfly networks (and inverse butterfly) are how pdep and pext are implemented under the hood. http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCom...

-----------

IIRC, the butterfly / inverse butterfly network can implement any arbitrary permute in just log2(n) steps.


I think you need 2 log2(n) - 1 butterfly stages to realize an arbitrary permutation (a butterfly followed by an inverse butterfly, sharing the middle stage, i.e. a Beneš network).


I work on some tooling for Hexagon DSPs and I wonder how the HVX instructions compare against AVX-512 (for integers at least). Different targets / use cases but I'm curious how it stacks up.


Question is, how many cores did we have to sacrifice to get AVX512? Intel moved from shipping 10-core CPUs to 8-core CPUs with AVX512. Clock speed may not go down much, but the price of that is very high power consumption on AVX512 workloads.

As best I can tell, it would be reasonable to expect around 12 AVX2 cores operating on the same power budget as the current 8 AVX512 cores. Does the average user have enough AVX512 workloads to make that tradeoff worth it?


For SIMD workloads, the performance increase of AVX512 over AVX2 is 2x. How many cores would you need to add to double the performance of an 8-core CPU?

Any media processing can generally gain significant performance with SIMD instructions. Even web browsing is a workload that is affected, as (among other things) libjpeg-turbo uses SIMD instructions to accelerate decoding. It doesn't yet use AVX512, but it does use AVX2. If there are gains with AVX2, why wouldn't there be with AVX512?

Even if AVX512 had remained 256 bits wide, it would be an improvement over AVX2, as the new instructions enable acceleration of workloads that weren't possible before, plus there is an increase in productivity/ease of use.

However, AVX512 won't be used much for consumer applications until there is big enough market penetration of CPUs that support these instructions, as has been the case with every new SIMD instruction set when it was first introduced.


> Any media processing can generally gain significant performance with SIMD instructions

To a limit. JPEG (and many video codecs based on JPEG) use 8x8 macroblocks, which means the "easiest" SIMD parallelism is 64-way. And AVX512 taken 8 bits at a time is, in fact, 64-way SIMD.

To get further parallel processing after that, you'll probably have to change the format. GPUs go up to 1024-way NVidia blocks (or AMD thread groups), which are basically SIMD units ganged together so that thread-barrier instructions can keep them in sync better. 1024 work items correspond to a 32x32 pixel working area.

But that's no longer the format of JPEG. It'd have to be some future codec. Maybe modern codecs are seeing the writing on the wall and are increasing macroblock size for better parallel processing 10 years into the future (they are a surprisingly forward looking group in general).


> Maybe modern codecs are seeing the writing on the wall and are increasing macroblock size for better parallel processing 10 years into the future

We did indeed do this for JPEG XL - the future is now :) 256x256 pixel groups are independently decodable (multi-core), each with >= 64-item (float) SIMD.


AVX-512 lines up with 64-byte cache lines, it seems like it would be a huge change to go bigger.


NVidia GPUs are 32 wide warps, AMD CDNA are 64 wide. That's 1024 bit and 2048 bit respectively.

Cache lines are probably 64 bytes for the purpose of burst length 8 (a 64-bit bus at burst length 8 is 64 bytes / 512 bits).


Amdahl's law. JPEG decoding especially - the majority of the decode time these days is entropy decoding that cannot be parallelized by either SIMD or threads.


I'm confused by your argument that this is similar to prior rollouts of new instruction sets. In the past when Intel pushed a new instruction set, it generally went to their whole line of chips very quickly. That's not what I've seen with AVX at all.


I was referring to software adoption. The software adoption cycle is similar: wider software adoption happens only when there is enough market penetration of CPUs that support the new instructions.

There was AMD's 3DNow!, which saw limited software adoption because Intel didn't support it. Newer instruction sets are getting adopted progressively slower, as consumers are replacing computers less often, AMD is slower at adopting each new AVX instruction set, and Intel is getting more aggressive with market segmentation. Because market penetration of new instruction sets is getting slower, SW adoption is also much slower.


Totally agree software uptake is the pain point.

Not even all gamer CPUs have SSE4 (https://store.steampowered.com/hwsurvey), so it seems that runtime dispatch is unavoidable.

Given that, if we can afford to generate code for new instruction sets and bundle it all into one slightly larger binary/library, the problem is solved, right?

Highway makes it much easier to do that - no need to rewrite code for each new instruction set. As to Intel's market segmentation, we target 'clusters' of features, e.g. Haswell-like (AVX2, BMI2), Skylake (AVX-512 F/BW/DQ/VL), and Icelake (VNNI, VBMI2, VAES, etc.), instead of all the possible combinations.
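For illustration, the dispatch itself needs nothing exotic; a bare-bones sketch without any library, using GCC/Clang's __builtin_cpu_supports (Highway automates this plus the per-target code generation):

    // Each implementation lives in its own translation unit, compiled with
    // the matching -m flags; this file only picks one at startup.
    void scale_avx512(float* x, int n, float k);
    void scale_avx2(float* x, int n, float k);
    void scale_scalar(float* x, int n, float k);

    void (*scale)(float*, int, float) = scale_scalar;

    __attribute__((constructor))
    static void pick_scale(void) {
        if (__builtin_cpu_supports("avx512f"))   scale = scale_avx512;
        else if (__builtin_cpu_supports("avx2")) scale = scale_avx2;
    }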


> For SIMD workload performance increase of AVX512 over AVX2 is 2x

It's less than 2x, because the clock penalty for AVX-512 on older CPUs is larger than for AVX2: clocks drop to roughly 60% vs 85% of nominal on Skylake, so 2 x 0.60/0.85 gives only a ~1.4x speedup. Newer CPU architectures do not downclock, though.


> Newer CPU architectures do not downclock, though.

Newer CPU architectures may not have enforced downclocks, but they will eventually respond to the higher thermal output.


To be fair, a multicore workload also causes the CPU to operate at a lower frequency than a single-core workload does. In the end, multi-threading and SIMD are both types of parallel processing, and each has pros and cons.


Sunny/Cypress Cove had lots of area growth beyond just AVX-512 [1]. The better comparison is Skylake vs Skylake-SP, from which AVX512 is estimated to cost about 5% of the tile area [2]. So that's an area cost of about 2 cores in the 28-core chip.

[1] https://en.wikichip.org/wiki/intel/microarchitectures/sunny_...

[2] https://www.realworldtech.com/forum/?threadid=193291&curpost...


> The better comparison is Skylake vs Skylake-SP, from which AVX512 is estimated to cost about 5% of the tile area

That analysis might be slightly underestimating the area penalty of AVX512, because the consumer Skylake cores that didn't have AVX512 execution units still reserved space for the AVX512 register file. (And given that fact, it's all the more surprising that while Intel was repeatedly refreshing 14nm Skylake for the consumer market, they never added the rest of the AVX512 bits or redid the layout of the consumer cores to reclaim the blank space of the register file.)


Unlike the ALUs, it's trivial to locate and determine the area of a 512 bit x 168 register file. No way would Kanter have missed that in his analysis.


It's not really a matter of whether Kanter missed those spots, but of whether subtracting them from the total core area is relevant to the question he was trying to answer. (Also, both the ALUs and the register files are trivial to locate by comparing die photos of consumer and server versions of the Skylake core.)

If you're going to hypothesize about a rearranged Skylake core that no longer reserves space for the full 512 bit vectors in the register file, then you probably should also try to estimate the other area savings that would result from truly removing AVX512 from the core design at a high level, rather than merely masking off a few regions of silicon that serve no purpose other than AVX512 support. And answering that question is a lot more subjective than simply tallying up the area occupied by the extra execution units hanging off the side of the AVX512-enabled server Skylake cores.

Or, to put it another way: there unquestionably is an area cost to AVX512 beyond that of the extra execution units tacked on to SKL-X cores. But that cost was already being paid by consumer CPUs years before SKL-X/SKL-SP shipped. So it wasn't really a contributing factor to the loss of two cores when Skylake-derived Comet Lake was succeeded by Rocket Lake.


I often wonder how much performance is being left behind by these extensions not being used explicitly (are compilers using these extensions automatically?). SSE, AVX... Making them simpler to use would seem to be a huge win, since people starting to use them clearly improves performance significantly.

It seems most new CPUs support a fairly large subset of the older ones.

I haven't done low-level work for a while, but I remember doing some analysis of signal processing code going from PA-RISC to Intel. The math compiler libraries for PA-RISC made the initial code transition so much slower on Intel. Using the Intel compiler and tweaking things started making things work so much better.


Compilers are using them,^1 but aren’t necessarily using them enough. They (generally) can’t change alignments of a struct willy-nilly, nor can they just, say, split a big FP reduction into many parallel ones without violating 754. In most loop-related cases however, I would say a nudge in the form of #pragma omp simd goes a long way.^2
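A sketch of the kind of nudge meant here: the reduction clause licenses the compiler to split the sum into partial accumulators, which plain autovectorization may not do under strict 754 semantics.

    #include <stddef.h>

    // Compile with e.g. -O2 -fopenmp-simd.
    float sum(const float* x, size_t n) {
        float s = 0.0f;
        #pragma omp simd reduction(+:s)
        for (size_t i = 0; i < n; ++i)
            s += x[i];  // may now become several parallel partial sums
        return s;
    }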

And then there’s the genius stuff like SIMD UTF8 decode and whatnot. Compilers don’t magically figure those out. Heck, they sometimes have trouble with shuffles, which is why the author mentions that the new features will help.

^1 Oh, not GCC at -O2, IIRC. There was some talk about turning autovec on at that level like clang does, but I'm not sure it went anywhere… I also think MSVC doesn't do that.

^2 This pragma also turns on autovec even if the compiler is not told to do that with the flags. Although my original point was about the extra information you can provide with it.


In general, almost all of it. If you haven't laid out your data to make it accessible to SIMD instructions, at best the program will get a modest speedup, with most of the instructions spent rearranging data to be SIMD-compatible.

By far the simplest way of making the correct data-layout is to use the intrinsics manually, so that you discover any issues that arise.

A step further is to change the program logic to better take advantage of SIMD-instructions, for instance Fabian Giesen has written a great deal on making compression code that splices the jobs in creative ways in order to utilize SIMD-instructions.
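The classic illustration of the layout point is array-of-structs vs struct-of-arrays (a sketch):

    // AoS: x, y, z interleaved in memory; loading 16 consecutive x values
    // needs gathers or shuffles first.
    struct PointAoS { float x, y, z; };

    // SoA: each component contiguous; one vector load grabs 16 consecutive
    // x values, and e.g. sqrt(x*x + y*y + z*z) vectorizes trivially.
    struct PointsSoA { float *x, *y, *z; };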


The big problem with Intel’s compiler is that their dispatch function requires the CPU identify itself as “GenuineIntel” (for the vectorization advantage), which AMD processors don’t do. https://www.agner.org/optimize/blog/read.php?i=49


Interesting, and kinda sad, and based on the comments an ongoing problem.

One hopes that ARM's ascendance would make "team x86-64" work in cooperative competition for better performance through compilers (if they can't through silicon).


AMD and recently Intel both moving to LLVM is a good step in that direction IMO. LLVM also already comes with a dispatcher, so hopefully they aren’t going to… do it again. Personally I’m more excited about being able to port __attribute__ rich code more easily.

(IIRC Intel has tuned down the Cripple AMD thing in MKL around late 2020 by providing a specialized Zen code path. It was slower than what you get with the detection backdoor, but only slightly.)
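As a sketch of the attribute-based style mentioned above (GCC/Clang function multi-versioning; the compiler emits one clone per target plus a resolver that picks at load time):

    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float* x, int n, float k) {
        for (int i = 0; i < n; ++i)
            x[i] *= k;  // autovectorized differently in each clone
    }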


It's all application-specific. It's not even true that compiling with -march=${cpu} gives you the best performance on that CPU. As an example, it was once discovered that for a certain large program -march=haswell was counterproductive on Haswell, and choosing a subset of those features was faster in practice. The same turned out to be true for Skylake. And BMI2 did not work right on AMD processors until very recently, so the fact that PDEP and PEXT existed was not helpful. Your program would run, but the instructions had high and data-dependent latency.

In short, you need to measure the combination of compiler options that get you the best performance on your real platform. Most people can probably pick up a quick 10-20% win by recompiling their MySQL or whatever for arch=sandybridge, instead of k8-generic, but beyond that it gets trickier.


> It's not even true that compiling with -march=${cpu} gives you the best performance on that CPU

Are you thinking of -mtune? I'd never heard of people using -march for performance, I thought it was just specifying the ISA.


-march tells the compiler which instruction set extensions it is allowed to use. For example, the difference between arch=haswell and arch=sandybridge is AVX2, MOVBE, FMA, BMI, and BMI2. If you build with haswell, or any given CPU, the compiler might emit instructions that won't run on older hardware. That's why most Linux distributions for PC are compiled for k8-generic, an x86-64 platform with few extensions (MMX, SSE, and SSE2, if I recall correctly).

In the particular case as I recall it was AVX2 that was counterproductive on haswell. Disabling it made the program faster, even though AVX2 was supposed to be a headline feature of that generation.


I was recently wondering how much CPU time would be regained if we got an instruction to do x * 10 ^ y where x is an int64/y an int (or something like that) and do so without error. This is a really, really common operation and currently needs a bigint library to do properly in many cases. It's also slow; I think in aggregate it's around 500MB/s to 1GB/s, but the slow path is < 100MB/s.

But pretty much every JSON/XML/Number.from_string is using an operation like this, so the scale is quite large.


Converting a number from string can be done without any bignum operations at all, this has been well-understood for a while now. This is mostly just widely-used libraries having not caught up with the state of the art.

There is some opportunity for ISA support to speed it up, but multiplication by powers of ten is not the bottleneck, that’s just a table lookup and full-width product in high-performance implementations (i.e. a load and a mul instruction).
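A rough sketch of that core step (the table and names here are illustrative, not any particular library's API): given the parsed digits w and decimal exponent q, the whole "multiply by 10^q" is one table load and one widening multiply.

    #include <stdint.h>

    // Illustrative: precomputed 64-bit truncated significands of powers of ten.
    extern const uint64_t POW10_SIG[];

    // High 64 bits of w * 10^q's significand; the low half tells you whether
    // truncation could have affected rounding.
    static inline uint64_t pow10_mul_hi(uint64_t w, int q, uint64_t* lo) {
        unsigned __int128 p = (unsigned __int128)w * POW10_SIG[q];
        *lo = (uint64_t)p;
        return (uint64_t)(p >> 64);
    }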


Got links? Even Daniel Lemire's Fast Double parser will dump to strtod when the numbers are not handleable because of this.

You cannot represent 10^x, or even all 5^x beyond a certain point, in IEEE 754 doubles, but you need to do the full y * 10^x operation without loss.

Doing it lossily is easy (0 to 1 or 2 ulp off optimal)


You don’t need to do it without loss, you need to do it with loss smaller than the closest that the digit sequence could be to an exact halfway point between two representable double-precision values.

This means you may need arbitrary-precision for arbitrary-precision decimal strings, but in practice these libraries are “always” converting strings that came from formatting exact doubles (eg round-tripping through JSON), and so have bounded precision. Thus you can tightly bound the precision required.

This precisely mirrors how formatting fp numbers used to require bignum arithmetic, but all the recent algorithms (Ryu, Dragonbox, SwiftDtoa) do it with fixed int128 arithmetic and deliver always correct results. We’ll see analogous algorithms for the other direction adopted in the next few years—the only reason this direction came later is that the other was a bigger performance problem originally.


I use dragonbox and it makes the serialization task quite easy and fast. It would be great for a mirror library that did so for deserialization of arbitrary decimal numbers. The current libraries are not fast, and it isn't the parsing, it's the math they do with those characters.


Adding hardware to speed up decimal number handling strikes me as an old-school approach, the type of thing you might expect from 1960s mainframes. Isn't the more significant opportunity at the application architecture level instead? Why exchange numbers between computers in decimal? Even for people who are just devoted to the idea of human readability, which is an idea in need of scrutiny in my opinion, you can still exchange real numbers in base 16 format, 0xh.hPd.
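For reference, C has had this exchange format since C99 via the %a conversion:

    #include <stdio.h>

    int main(void) {
        printf("%a\n", 0.1);  // 0x1.999999999999ap-4, exact and round-trippable
        double x;
        sscanf("0x1.999999999999ap-4", "%la", &x);
        printf("%d\n", x == 0.1);  // 1
        return 0;
    }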


I wouldn’t expect that from 1960’s mainframes. In cases where ingestion and outputting of decimal numbers could be a bottleneck, they would be programmed in COBOL, with a compiler using BCD (https://en.wikipedia.org/wiki/Binary-coded_decimal) to store numbers.

Also, x87 has/had the FBLD and FBSTP instructions that can be used to convert between floating point and packed decimal (https://en.wikipedia.org/wiki/Intel_BCD_opcode)

If these still are supported, I doubt Intel or AMD spend lots of effort making/keeping them fast, though.


hex floats are great, but text formats using decimal floats are ubiquitous and never going away.


Some compilers these days offer 128-bit floats (different from IEEE 754 quad) with basically twice the mantissa of a standard double. IIRC multiplications and additions are relatively cheap.


> A Programmer’s Perspective

I'm a programmer too but I don't share that perspective.

In my line of work, AVX512 is useless. This won't change until at least 25% market penetration on clients, which may or may not happen.

Not all programmers are writing server code. Also, very few of them are Intel partners with early access to the chips.

> In AVX terminology, intra-lane movement is called a `shuffle` while inter-lane movement is called a `permute`

I don't believe that's true. _mm_shuffle_ps( x, x, imm ) and _mm_permute_ps( x, imm ) are 100% equivalent. Also, _mm256_permute_ps permutes within lanes, _mm256_permutevar8x32_ps permutes across lanes.


> In my line of work, AVX512 is useless. This won't change until at least 25% market penetration on clients, which may or may not happen.

Why not use it where available? github.com/google/highway lets you write code once using 'platform-independent intrinsics', and generate AVX2/AVX-512/NEON/SVE into a single binary; dynamic dispatch runs the best available codepath for the current CPU.

Disclosure: I am the main author. Happy to discuss.


> Why not use it where available?

AFAIK things like that (highway, OpenMP 4.0+, automatic vectorizers) are only good for vertical operations.

When the only thing you’re doing is billions of vertical operations, computing that thing on CPU is often a poor choice because GPUs are way faster at that, and way more power efficient too.

Therefore, when I optimize CPU-running things, that code is usually not a good fit for GPUs. Sometimes the data size is too small (PCIe latency gonna eat all the profit from GPU), sometimes too many branches, but most typical reason is horizontal operations.

An example is multiplying 6x24 by 24x1 matrices. A good approach is a loop with 12 iterations, in the loop body 2x _mm256_broadcast_sd, then _mm256_blend_pd (these 3 instructions are making 3 vectors [v1,v1,v1,v1], [v1,v1,v2,v2], [v2,v2,v2,v2] ), then 3x _mm256_fmadd_pd with memory source operands. Then after the loop add 12 values into 6 with shuffles and vector adds.
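A sketch of one iteration of that loop body, assuming the 6x24 matrix is column-major so each pair of adjacent columns is 12 contiguous doubles (col, x, and the accumulators are illustrative names):

    // Two broadcasts + one blend produce [v1,v1,v1,v1], [v1,v1,v2,v2],
    // [v2,v2,v2,v2] from the input pair x[2i], x[2i+1].
    const __m256d v1  = _mm256_broadcast_sd( x + 2 * i );
    const __m256d v2  = _mm256_broadcast_sd( x + 2 * i + 1 );
    const __m256d mid = _mm256_blend_pd( v1, v2, 0b1100 );

    // Three FMAs (with the loads folded into memory source operands)
    // accumulate 12 partial sums, reduced to 6 after the loop.
    acc0 = _mm256_fmadd_pd( v1,  _mm256_loadu_pd( col + 0 ), acc0 );
    acc1 = _mm256_fmadd_pd( mid, _mm256_loadu_pd( col + 4 ), acc1 );
    acc2 = _mm256_fmadd_pd( v2,  _mm256_loadu_pd( col + 8 ), acc2 );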


> things like that (highway, OpenMP 4.0+, automatic vectorizers) are only good for vertical operations.

I agree autovec and #pragma omp simd struggle with any kind of shuffles, but Highway does not belong in that category - it offers more control, similar to intrinsics.

There are about two dozen non-vertical ops (see https://github.com/google/highway/blob/master/g3doc/quick_re... and "blockwise" and "swizzle") and it sounds like you want Set and ConcatUpperLower.

In your experience, have you needed any non-vertical ops that are not in that list?


> have you needed any non-vertical ops that are not in that list?

Yes indeed. I rarely use SIMD for vertical-only ops; for such use cases GPUs are very often better than CPUs.

I already wrote an example in my previous comment. It's possible to emulate with the stuff you have, however _mm256_blend_pd is a very fast instruction, a single cycle of latency. Highway's emulation is going to be way more expensive. You are probably compiling your UpperHalf() into _mm256_extractf128_pd and Combine() into _mm256_insertf128_pd; that's 2 instructions and (on Skylake) 6 cycles of latency instead of 1 cycle.

6 cycles instead of 1 cycle is a large overhead in that context. That particular small matrix multiplication is called rather often. I only optimize code when the profiler tells me to. For the majority of CPU-bound code in that project, Eigen's implementation is actually good enough.

I’ve searched the source code of that project (CAM/CAE software). Here’s the list of the shuffle intrinsics I use, some of them a lot: _mm256_blend_pd, _mm_blend_ps, _mm_blend_epi32, _mm256_permute2f128_pd, _mm256_permute_ps, _mm256_permute4x64_pd, _mm256_permutevar8x32_ps, _mm256_permutevar8x32_epi32, _mm_permute_ps, _mm_permute_pd, _mm_insert_ps, _mm_movehdup_ps, _mm_moveldup_ps, _mm_loaddup_pd, _mm_extract_ps, _mm_dp_ps, _mm_extract_epi32, _mm_extract_epi64, _mm_shuffle_epi32.

A similar list for this project https://github.com/Const-me/Vrmac (a GPU-centric library for 3D and 2D graphics, not using any AVX): _mm_shuffle_epi8, _mm_alignr_epi8, _mm_shuffle_epi32, _mm_shuffle_ps, _mm_addsub_ps (that one is vertical but still missing from highway), _mm_insert_epi32, _mm_insert_ps, _mm_extract_ps, _mm_extract_epi16, _mm_movehdup_ps, _mm_dp_ps. BTW the project is portable between AMD64 and ARMv7, I have tons of #ifdef there to support NEON which differs substantially, there’s stuff like vrev64q_f32 and vextq_f32, 64-bit SIMD vectors, and quite a few other instructions missing on AMD64.

Even if you expose all the missing horizontal stuff in Highway, it won't be much better than intrinsics. Such code ain't gonna use AVX512 when available. It's only going to inflate the software complexity for no good reason, by adding an unneeded layer of abstraction between the application's code and the actual hardware.


> _mm256_blend_pd is very fast instruction, a single cycle of latency. The highway’s emulation is going to be way more expensive.

Actually it uses exactly the same instruction :) https://github.com/google/highway/blob/master/hwy/ops/x86_25...

Thanks for sharing the list! Out of curiosity, can you expand on what _mm256_permute4x64_pd is used for?

> _mm_addsub_ps (that one is vertical but still missing from highway)

Indeed, we have not needed complex-valued functions yet. AFAIK NEON does not have such an instruction but it could be emulated with an extra XOR+constant. Will add to wishlist, we implement them whenever there's a use case.
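Something like this, sketched in SSE terms (the NEON version is analogous):

    // Emulate _mm_addsub_ps: flip the sign of lanes 0 and 2 of b, then add.
    __m128 addsub_ps(__m128 a, __m128 b) {
        const __m128 sign = _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f);  // lanes 0, 2
        return _mm_add_ps(a, _mm_xor_ps(b, sign));
    }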

> I have tons of #ifdef there

Seems reasonable if you only target v7 and ASIMD; but #ifdef is increasingly infeasible now that SVE and Risc-V V are coming, no?

> to support NEON which differs substantially, there’s stuff like vrev64q_f32 and vextq_f32

Those can actually be bridged just fine, they correspond to Shuffle2301 and CombineShiftRightLanes.

> Even if you expose all the missing horizontal stuff in highway — won’t be much better than intrinsics. Such code ain’t gonna use AVX512 when available.

Because of the hardcoded vector size? I agree that's best avoided, in your case perhaps by batching the matrix mul and applying SIMD over that dimension instead? Even if not, I'd rather have type-safety, friendlier function names and gaps in the ISA filled, but tastes differ.


> can you expand on what _mm256_permute4x64_pd is used for?

In one place I was doing single-pass marching cubes on multiple FP32 scalar fields. The source data for every cube is 8 FP32 numbers in 1 AVX register. Sometimes I need to broadcast FP32 lanes from the vector, where the source index is known at compile time. For most of these indices, a combo of _mm256_movehdup_ps/_mm256_moveldup_ps + _mm256_permute4x64_pd is the fastest way to do so.

In another place I needed to broadcast FP64 values from a non-zero lane.

In another place I wanted to avoid MXCSR error flag for 3D FP64 vectors:

    // Duplicate Z into W to avoid division by zero
    size = _mm256_permute4x64_pd( size, _MM_SHUFFLE( 2, 2, 1, 0 ) );
In a few other places it’s just math e.g.

    const __m256d b_xxy = _mm256_permute4x64_pd( b, _MM_SHUFFLE( 3, 1, 0, 0 ) );        // [ b.x, b.x, b.y, b.w ]
    const __m256d bw_yzz = _mm256_permute4x64_pd( bWeight, _MM_SHUFFLE( 3, 2, 2, 1 ) ); // [ bw.y, bw.z, bw.z, bw.w ]
    __m256d tmp = _mm256_mul_pd( b_xxy, bw_yzz );                                       // elementwise products
> we have not needed complex-valued functions yet

Wasn’t related to complex numbers, I was doing something like that, saving a few instructions: https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_lin...

> but #ifdef is increasingly infeasible now that SVE and Risc-V V are coming, no?

I think that’s wishful thinking like all these years of Linux on desktop, or rewriting everything in Rust. I think ARM is good enough for most applications, except very small niches (very low power, very price-sensitive, mostly embedded).

> Those can actually be bridged just fine

Yeah, but other things can't. For one, ARM has 8-byte SIMD vectors. Quite useful for a library which implements non-trivial algorithms processing 2D vector data with tons of FP32 2D vectors. Another thing: this whole source file https://github.com/Const-me/DtsDecoder/blob/master/Utils/sto... implements an equivalent of a single vst3q_s16 NEON instruction. That code is called in a loop which does not do much else and is inlined, i.e. these 9 shuffle constants stay in vector registers.


Thanks for sharing the examples! Looks like broadcast and replicating one element, which works if we know there are only 4 elements.

> I was doing something like that, saving a few instructions

Oh, nice one.

> I think ARM is good enough for most applications, except very small niches (very low power, very price-sensitive, mostly embedded).

Are you talking about the ARM arch vs Risc-V? That may be, but ARM is also getting SVE, and adding yet another #if for that is going to be expensive.

> Yeah, but other things can’t. For one, ARM has 8-byte SIMD vectors.

hm, Highway does support half vectors, which map nicely to the non-q instructions. We also support 8-bit StoreInterleaved3, which is implemented similarly to yours (more instructions but fewer constants): https://github.com/google/highway/blob/master/hwy/ops/x86_12...

In my experience, integer muls are the hardest to bridge, but at least for dot products even that is feasible.


Are you looking for a job by any chance :)?


If you're building software as a service, that changes this calculus a fair bit. You only need to wait for AWS/Azure/GCP/CSP of choice to support the hardware (you don't need to wait for mass-market use of it). For example, many (all?) database-as-a-service offerings nowadays run on very precisely configured hardware.


> many (all?) database-as-a-service offerings nowadays run on very precisely configured hardware

I'm not sure AVX512 is a win even in that case. AMD Epyc peaks at 64 cores/socket, Intel Xeon at 40 cores/socket. It's not immediately obvious AMD's performance advantage over Intel is smaller than the Intel-only win from AVX512.


Are sockets the best measurement? AMD has a pretty complex network inside that socket: 8 dies + a switch.

The 40-core Xeon is just one die, which means L3 cache and memory are more unified.

L3 cache is probably king (data may not fit, but maybe some indexes?), but it's not obvious whether EPYC's separate L3 caches are comparable to Xeon's (or Power's).


AMD is much faster overall. This page has a benchmark of a MySQL database, the slide is called MariaDB: https://www.servethehome.com/amd-epyc-7763-review-top-for-th... The only benchmark where AMD didn't have substantial advantage is chess, I think the reason is AMD's slow pdep/pext instructions.


The 6258R is just 28 cores though, 56 total across the dual sockets.

AMD Zen3 has single cycle pext / pdep now and the chess benchmarks are better as a result.


Good point. Indeed, for DB-only workloads the Xeon appears to be better: https://www.phoronix.com/scan.php?page=article&item=intel-xe...

Still, this doesn’t mean Epyc is only good for HPC, e.g. this is some enterprise Java-based benchmark where it’s substantially faster: https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scal...

BTW, at my job I don't do databases, but I do quite a lot of HPC stuff.


Add the fact that MariaDB doesn't use AVX512 instructions yet, and you can see that Intel's unified L3 cache is just better for database-like workloads than AMD's split L3 cache.

x265 is more of an AVX512 test scenario, at which the 40-core Xeon also demonstrates proficiency over and above the 64-core EPYC.

------------

I'm mostly impressed at how well the split "chiplet" strategy is doing in all the other benchmarks, however. The AMD "I/O die" is clearly a winner. It's probably not a latency winner, but it is efficiently providing the memory bandwidth and distributing it to all of the dies.



