The slow state is when rcx has a special "hidden immediate" attached it, from the `mov rcx, 1` or `add rcx, 1`. If the last option on the register doesn't fit that narrow template, it is fast, so loading from memory in any way, like pop, xrestor, plan load, etc, will be in fast mode.
I wonder whether IRET fixes it. Can you try MOV RCX, 1; push an IRET frame; IRET; SHLX to see if IRET fixes it?