These SSE instructions that operate only on aligned data are a pain. It's not well known that Linux/x86 stack frames must always be 16 byte aligned. GCC uses this knowledge to use the SSE aligned instructions when accessing certain fields on the stack.
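For anyone who wants to see what "16-byte aligned" means in practice, here is a minimal sketch in portable C11. The helper names (`is_aligned`, `aligned_local_is_aligned`) are made up for illustration and are not from GCC or any library. It only checks addresses and does not issue any SSE instructions itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of `alignment`.
 * `alignment` must be a power of two (16 for aligned SSE accesses). */
bool is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Hypothetical helper: asks the compiler for a 16-byte aligned local,
 * then verifies that the address it picked really is aligned. */
bool aligned_local_is_aligned(void)
{
    _Alignas(16) float v[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    return is_aligned((uintptr_t)(const void *)v, 16);
}
```

A compiler that trusts the incoming stack pointer to be 16-byte aligned can place such a local at a fixed offset from it. If the caller broke that assumption, the local ends up misaligned and an aligned SSE access to it faults.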
Unfortunately a while back the OCaml compiler generated non-aligned stack frames. Which is no problem for pure OCaml code and even saves a little bit of memory. However if the code called out to C, then sometimes and unpredictably (think different call stacks, ASLR) the C code would crash. That was a horrible bug to track down:
> It's not well known that Linux/x86 stack frames must always be 16 byte aligned.
Always wasn't always always; that sad story is the source of your OCaml problems, among many others. Linux on x86 originally used 4-byte alignment, and 4-byte alignment is what you see if you RTFM¹. Later, gcc decided that they were in control, and unilaterally switched to 16-byte alignment. Backwards compatibility? Screw you. Other tools? Screw you.²
The worst part is that 16-byte alignment is no longer necessary today: x86 can do unaligned vector loads with little to no penalty, while keeping the stack aligned all the time still has a cost.
I had the same problem with my JIT, which also generated stack frames that weren't 16-byte aligned. My test program crashed on an SSE instruction in the Rust standard library (I don't recall whether this bug only occurred in release mode; it may have been already-compiled code). I was pretty proud when I fixed this. Although I have to admit that after finding out that the accessed address was actually valid, I already suspected that alignment was the problem. Fixing it was then straightforward since it was my own toy compiler.
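The usual shape of that kind of fix, sketched in C: round the frame size up to a multiple of 16 before emitting the prologue. This is not the code from the JIT described above. `align_up`, `frame_size` and `STACK_ALIGN` are hypothetical names, and the sketch assumes the stack pointer is already 16-byte aligned right before the frame is allocated (on x86-64 System V that is the case after `push rbp`).

```c
#include <assert.h>
#include <stddef.h>

#define STACK_ALIGN 16u /* alignment required at call sites */

/* Round n up to the next multiple of `alignment` (a power of two). */
size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical code-generator helper: number of bytes to subtract from
 * the stack pointer in the prologue. Assumes the stack pointer is
 * 16-byte aligned just before the subtraction, so a multiple of 16
 * keeps it aligned at every call made from this frame. */
size_t frame_size(size_t locals_bytes)
{
    return align_up(locals_bytes, STACK_ALIGN);
}
```

With this, a function with 24 bytes of locals gets a 32-byte frame. The padding is the small memory cost mentioned elsewhere in the thread.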
Agreed, I've always found them unusual and perhaps a bit of a shortsighted decision: they've been making processors seamlessly handle any alignment with perhaps an extra cycle, even for the MMX instructions, yet somehow felt the need to restrict many of the SSE ones to aligned operands and provide only one unaligned move.
The stack alignment restriction is also annoying when handwriting Asm, although fortunately it's only when calling into other C libraries that it needs to be minded.
> seamlessly handle any alignment with perhaps an extra cycle
I'm not up to date on the latest mitigation strategies, but the hairball of cache implications caused by unaligned access makes me suspicious of that claim. If you (or your compiler) signal that you want performance by using vector instructions, I think it's completely fair for Intel to demand that you pay attention to alignment.
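One concrete piece of that hairball can be sketched in C: an unaligned 16-byte access can straddle two cache lines, while a 16-byte aligned one never does when the line size is a multiple of 16. The 64-byte line size and the name `crosses_cache_line` are assumptions for illustration. This only does address arithmetic and says nothing about the actual cost on any given CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed line size; common on x86, not guaranteed */

/* Hypothetical helper: true if the `size` bytes starting at addr
 * touch more than one cache line. */
bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE);
    uintptr_t first_line = addr / CACHE_LINE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE;
    return first_line != last_line;
}
```

For a 16-byte vector, address 48 stays inside one line, while address 56 spills into the next one.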
OCaml bug report: https://caml.inria.fr/mantis/view.php?id=5700#c10779