
How about trying "xor ecx, ecx; inc ecx"? Or the even shorter "mov cl, 1"?

It is very strange to me that the instruction used to set the shift count register can make the SHLX instruction 3× slower.

I suspect this is a width restriction in the bypass/forwarding network.

The 32-bit vs. 64-bit operand size distinction is especially surprising to me as SHLX only looks at the bottom 6 bits of the shift count.

Unfortunately the dependency analysis circuitry seems not Intel-ligent enough to make that distinction.
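For reference, the count-masking behavior of 64-bit SHLX can be modeled in a few lines (a sketch of the architectural semantics, not of the hardware):

```python
MASK64 = (1 << 64) - 1

def shlx64(value, count):
    # 64-bit SHLX uses only the low 6 bits of the count register,
    # so a count of 65 shifts by 1 and a count of 64 shifts by 0.
    return (value << (count & 0x3F)) & MASK64

assert shlx64(1, 65) == shlx64(1, 1) == 2
assert shlx64(1, 64) == 1  # shift by 64 acts as shift by 0
```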



> How about trying "xor ecx, ecx; inc ecx"?

Fast. But `inc rcx` is slow.

> Or the even shorter "mov cl, 1"?

Fast.

> I suspect this is a width restriction in the bypass/forwarding network...

I think we just found the explanation on Twitter: https://x.com/corsix/status/1874965887108976858

Alder Lake adds support for mov/add/sub with small immediates to the register renamer. So `add rcx, 1` gets handled by the renamer, potentially with zero latency.

Unfortunately shifts are slow when the shift count is renamed in this way. This leads to fun things like `mov rcx, 1024` being fast while `mov rcx, 1023` is slow. I'll update the blog post in a bit.
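If that's right, eligibility is just a range check on the immediate (a sketch; the [-1024, 1023] bounds are the ones reported from experiments elsewhere in this thread):

```python
def renamed_mov_imm(imm):
    # Sketch of the assumed rule: immediates in [-1024, 1023] are
    # absorbed by the register renamer; anything else is an ordinary uop.
    return -1024 <= imm <= 1023

assert not renamed_mov_imm(1024)  # `mov rcx, 1024`: plain uop, dependent shift stays fast
assert renamed_mov_imm(1023)      # `mov rcx, 1023`: renamed, dependent SHLX goes slow
```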


Ok... That does make some sense.

Essentially, Intel have allocated a bunch of fake renaming registers to hold all small immediates. Like how MIPS had a zero register, Intel now have over 1000 constant registers. These registers don't really exist; the immediate is just reconstituted at the execution unit.

And I'm guessing these small immediates predate Golden Cove; it's probably how they have always represented shift-by-constant and any instruction with an 8-bit immediate (presumably at least as far back as Sandy Bridge). The only change is that now the renamer can execute simple movs/adds.

So essentially these variable shifts are being converted to constant shifts on the fly. The problem is that there is no constant-shift version of shlx, so it wasn't in the test suite. There is probably some kind of stupid implementation bug in the scheduler that simply wasn't exposed until the renamer changes landed.


I don't think there are a thousand constant registers. I think the renamer just represents (reg, 10-bit offset) pairs rather than just registers.

Also the problem affects SHL as well as SHLX; I didn't realize that until just now.


The speculation of a "reg + 10-bit offset" representation feels wrong to me.

That requires a whole bunch of extra 64-bit full-adders everywhere one of these pairs might be consumed (so realistically, on every single read port of the register file). 64-bit adders take quite a bit of latency, so you don't want extra adders on all your critical paths.

In the case where it appears to be holding a reg + offset pair, what I think has actually happened is that renamer (and/or uop fusion) has rewritten the uop to a 3-input add, with the offset as the third input.

> Also the problem affects SHL as well as SHLX, I didn't realize until just now.

And presumably SHR/SHRX/SAR/SARX too?


You don't quite need a full 64-bit adder to materialize the proper value, as one argument is only a 10-bit int. So a 10-bit adder, a 54-bit incrementer, and a mux that selects between the incremented and original upper bits depending on the carry out of the add.
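As a sanity check, that split datapath can be sketched like so (a toy model; non-negative 10-bit offsets only, for simplicity):

```python
MASK10 = (1 << 10) - 1
MASK64 = (1 << 64) - 1

def add_via_split(reg, imm10):
    # Hypothetical datapath: a 10-bit adder on the low bits, then a
    # 54-bit incrementer on the high bits selected by the carry out.
    low = (reg & MASK10) + (imm10 & MASK10)
    carry = low >> 10
    high = (reg >> 10) + carry          # incremented or passed through
    return ((high << 10) | (low & MASK10)) & MASK64

assert add_via_split(0x3FF, 1) == 0x400            # carry ripples into the top
assert add_via_split(123456789, 512) == 123456789 + 512
```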

And, presumably, the OP's shift case here is in fact a case of there not being a built-in immediate adder, and thus a need for fixup uops to be inserted to materialize it?


Right. Actually it turns out it's 11 bits, since [-1024, 1023] are all supported by the immediate add renamer.

In general I think people are overstating the delay of an additional 64-bit add on register file reads (though I'm not a hardware guy, so maybe someone can correct me). There are carry-lookahead adders needing only log2(64) = 6 gate delays. Carry-save adders might also be relevant to how they can do multiple dependent adds with 0-cycle latency.

> And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?

No, the perf counters show 1 uop dispatched/retired in both the slow and fast cases.


Ah, good to know on the uop count. Still could be (or, well, has to be to some extent) the same concept, just pipelined within one uop.


I dunno, you could imagine it happens [speculatively?] in parallel, at the cost of a read port and adder for each op that can be renamed in a single cycle:

1. Start PRF reads at dispatch/rename

2. In the next cycle, you have the result and compute the 64-bit result. At the same time, the scheduler is sending other operands [not subject to the optimization] to read ports.

3. In the next cycle, the results from both sets of operands are available


Seems rather pointless to implement it like that. You would save space in the scheduler because the uops have been fused, but execution time and execution unit usage are the same as just executing the original ops before this optimisation was implemented.

It also adds an extra cycle of scheduling latency, which may or may not be an issue (I really have no idea how far ahead these schedulers can schedule).

If you think about the "1024 constant registers" approach: it allows you to have a small 10-bit adder in the renamer which handles long chains of mov/add/sub/inc/dec ops, as long as the values stay within that small range. This frees the main adders for bigger sums, or maybe you can power-gate them.

And in the case where one of the inputs is larger than 1024, or its value is unknown during renaming, the renamer can still merge two two-input adds into a single three-input add uop.
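That chain-folding idea can be sketched as a toy rename table (an assumed model for illustration, not Intel's actual design):

```python
# Toy sketch: the rename table maps an architectural register either to
# a physical register or to a known small constant, so chains of
# mov/add with small immediates never reach an ALU while in range.
class Renamer:
    def __init__(self):
        self.table = {}  # arch reg -> ("phys", tag) or ("const", value)

    def mov_imm(self, reg, imm):
        if -1024 <= imm <= 1023:
            self.table[reg] = ("const", imm)  # absorbed at rename time
            return None                        # no uop reaches the ALUs
        self.table[reg] = ("phys", imm)        # needs a real register...
        return ("mov", reg, imm)               # ...and a real uop

    def add_imm(self, reg, imm):
        kind, val = self.table.get(reg, ("phys", None))
        if kind == "const" and -1024 <= val + imm <= 1023:
            self.table[reg] = ("const", val + imm)
            return None
        return ("add", reg, imm)               # out of range: real uop

r = Renamer()
assert r.mov_imm("rcx", 1) is None         # `mov ecx, 1` handled by renamer
assert r.add_imm("rcx", 2) is None         # chain folded to known constant 3
assert r.table["rcx"] == ("const", 3)
assert r.add_imm("rcx", 2000) is not None  # leaves the range: real add uop
```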


An add3 operation will have 3 inputs, though. Do we have other integer operations with three 64-bit inputs?


Yeah, I am pretty sure that the renamer adds the tracked immediate(s) to the emitted uop, but that's not inconsistent with tracking "reg + offset" pairs, is it?


> So `add rcx, 1` gets handled by the renamer, potentially with zero latency.

> Unfortunately shifts are slow when the shift count is renamed in this way.

That's because the value still has to make it to the shifter... and the renamer's adder output clearly takes a different, higher-latency path than a value that doesn't go through it, confirming my original suspicion that this is a limitation of the bypass/forwarding network.

Ultimately it seems they are just playing around with where things happen; the addition can be done here but the value ultimately needs to go there for the shift, or the addition and shift can both happen there. One could envision a future change that puts the shifter in the renamer too. They're really constrained by the laws of physics.


We set CX only once and then use it 10,000 times. The problem is not the slow calculation of CX per se, but the slow shift once we get CX from the renamer.


Oooh, I forgot about this. Only thing I can think of is something like:

1. Assume that some values can be treated internally as "physical register plus a small immediate"

2. Assume that, in some cases, a physical register is known to be zero and the value can be represented as "zero plus a small immediate" (without reference to a physical register?)

3. Since 'count' is always expected to be <64, you technically don't need to use a physical register for it (since the immediate bits can always just be carried around with the name).

4. The 3-cycle case occurs when the physical register read cannot be optimized away??

(Or maybe it's just that the whole shift operation can occur at rename in the fast case??)


> 4. The 3-cycle case occurs when the physical register read cannot be optimized away??

No... the 3-cycle case seems to be when the physical register read is optimized away. I think it's some kind of stupid implementation bug.


Like, the scheduler is just waiting unnecessarily for both operands, despite the fact that 'count' has already been resolved at rename?


The 3-cycle latency casts massive suspicion on the bypass network. But I don't see how the bypass network could be bugged without causing incorrect results. So the scheduler doesn't know how to bypass this "shift with small immediate" micro-op.

Or maybe the bypass network is bugged, and what we are seeing is a chicken bit set by the microcode that disables the bypass network for this one shift micro-op.


> But I don't see how the bypass network could be bugged without causing the incorrect result.

Maybe if they really rely on this kind of forwarding in many cases, it's not unreasonable for the latency to come from having to recover from an "incorrect PRF read" (I imagine there's also a recovery case for "incorrect forwarding").


Yeah, "incorrect PRF read" is something that might exist.

I know modern CPUs will sometimes schedule uops that consume the result of a load instruction under the assumption that the load will hit the L1 cache. If the load actually missed L1, that's not discovered until the dependent uop tries to read the value coming in from L1 over the bypass network. So that uop needs to be aborted and rescheduled later. And I assume this is generic enough to catch any "incorrect forwarding", because there are other variable-latency instructions (like division) that would benefit from this optimistic scheduling.
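That hit-speculation/replay loop can be sketched roughly like this (a toy model; the 4-cycle hit latency is just a placeholder):

```python
# Toy model (assumed numbers): the scheduler wakes a load's dependents
# assuming an L1 hit; if the load actually missed, the dependent uop
# reads a bogus value off the bypass network and must be replayed.
L1_HIT_LATENCY = 4  # hypothetical hit latency in cycles

def dependent_uop_state(load_issue_cycle, load_hit_l1, now):
    if now < load_issue_cycle + L1_HIT_LATENCY:
        return "waiting"
    return "executed" if load_hit_l1 else "replayed"

assert dependent_uop_state(0, True, 4) == "executed"   # hit: forwarded data valid
assert dependent_uop_state(0, False, 4) == "replayed"  # miss: wrong bypass value
assert dependent_uop_state(0, True, 2) == "waiting"
```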

But my gut is to only have these checks on the bypass network, and only ever schedule PRF reads after you know the correct value has been stored.


Maybe the bypass network doesn't include these "constant registers"? A bit like Zen 5, where some 1-cycle SIMD ops are executed in 2 cycles, probably due to shortcomings of the same network.


I got nothing here other than: this is very cool.


I would have thought that the movs would have been dumped right into a ROB entry without even going through the bypass network, much like a zeroing idiom.



