These SSE instructions that operate only on aligned data are a pain. It's not we...

kps · on Nov 7, 2016

  > It's not well known that Linux/x86 stack frames must always be 16 byte aligned.

Always wasn't always always; that sad story is the source of your OCaml problems, among many others. Linux on x86 originally used 4-byte alignment, and 4-byte alignment is what you see if you RTFM¹. Later, gcc decided that they were in control, and unilaterally switched to 16-byte alignment. Backwards compatibility? Screw you. Other tools? Screw you.²

¹ https://refspecs.linuxfoundation.org/

² https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38496

gpderetta · on Nov 7, 2016

The worst part is that today 16-byte alignment is no longer necessary, as x86 can do unaligned vector loads with little to no penalty, while keeping the stack aligned all the time still has a cost.

readittwice · on Nov 7, 2016

I had the same problem with my JIT, which also generated stack frames that were not aligned to 16 bytes. My test program crashed on an SSE instruction in the Rust standard library (I don't recall if this bug only occurred in release mode; it may have been already-compiled code). I was pretty proud when I fixed this, although I have to admit that once I found out the accessed address was actually valid, I already suspected that alignment was the problem. Fixing it was then straightforward, since it was my own toy compiler.
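For illustration, one common way to keep generated frames aligned is to round each frame size up to a multiple of 16 before emitting the prologue. The following is a minimal sketch in C, not code from the JIT described above; the helper names `align_up` and `is_aligned` are hypothetical, and the exact adjustment a real code generator needs also depends on what has already been pushed (return address, saved registers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round `size` up to the next multiple of `align`.
 * Assumption: `align` is a nonzero power of two (16 for the SSE case). */
static size_t align_up(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Check whether an address is a multiple of `align` (power of two). */
static int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

With these helpers, a frame that needs 20 bytes of locals would reserve `align_up(20, 16)` bytes, and `is_aligned` can serve as a debug check on the stack pointer before a call into compiled library code.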

userbinator · on Nov 7, 2016

Agreed, I've always found them unusual and perhaps a bit of a shortsighted decision --- they've been making processors seamlessly handle any alignment with perhaps an extra cycle, even for the MMX instructions, yet somehow felt the need to restrict much of the SSE ones into aligned and only provide one unaligned move.

The stack alignment restriction is also annoying when handwriting Asm, although fortunately it's only when calling into other C libraries that it needs to be minded.

jjoonathan · on Nov 7, 2016

> seamlessly handle any alignment with perhaps an extra cycle

I'm not up to date on the latest mitigation strategies, but the hairball of cache implications caused by unaligned access makes me suspicious of that claim. If you (or your compiler) signal that you want performance by using vector instructions, I think it's completely fair for Intel to demand that you pay attention to alignment.

qb45 · on Nov 8, 2016

Case in point: a simple example of code which copies 1MB of data with SSE and slows down on misalignment:

https://news.ycombinator.com/item?id=12718625

Presumably due to the hairball of cache implications, as you put it.

But it is also true that the choice between aligned and unaligned instructions makes no difference if the array is actually aligned.
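A minimal sketch of what such a copy loop might look like in C, assuming SSE2 intrinsics are available. This is not the code from the linked thread, and the function name `copy_unaligned` is made up for illustration. It uses only the unaligned load/store forms, which accept any address, so the same loop can be timed with aligned and misaligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes using 16-byte vector moves where SSE2 is available.
 * Only the unaligned forms are used, so dst and src may have any alignment.
 * The buffers must not overlap. */
static void copy_unaligned(void *dst, const void *src, size_t n) {
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
#endif
    /* Copy the tail (or the whole buffer when SSE2 is not available). */
    memcpy(d + i, s + i, n - i);
}
```

Swapping `_mm_loadu_si128`/`_mm_storeu_si128` for `_mm_load_si128`/`_mm_store_si128` gives the aligned variant, which faults if either pointer is not a multiple of 16.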

pcwalton · on Nov 7, 2016

> yet somehow felt the need to restrict much of the SSE ones into aligned

Because it wasn't worth spending die space on that as opposed to other things that matter a lot more for performance, presumably.

yuhong · on Nov 7, 2016

When I wrote the code in https://bugzilla.mozilla.org/show_bug.cgi?id=1283585, I had to spend most of the time dealing with alignment.