
Luckily BF16 is just a truncated FP32. That means the hardware can do BF16; you just don't get any performance benefit compared to FP32 (and depending on the hardware design, you might also have to space the data 4 bytes apart rather than 2, so you lose the memory bandwidth and RAM usage benefits too).


At that point it’d be better to do everything in fp32. The hardware can’t do bf16 in the way you’re saying; the conversions would consume all your time.


Compute in F32, but then round and pack a pair of BF16 into 4 bytes.
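For illustration, that round-and-pack step might look like the following sketch in C. The helper names are made up, the rounding uses the standard add-half-with-tie-to-even bit trick, and real code would also need to special-case NaN inputs:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helper: convert an F32 to BF16 with round-to-nearest-even.
 * NaN inputs would need a separate check in production code. */
static uint16_t f32_to_bf16_rne(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                    /* bit-cast without aliasing UB */
    uint32_t bias = 0x7FFFu + ((x >> 16) & 1u);  /* round to nearest, ties to even */
    return (uint16_t)((x + bias) >> 16);
}

/* Pack a pair of rounded BF16 values into one 32-bit word. */
static uint32_t pack_bf16_pair(float a, float b) {
    return (uint32_t)f32_to_bf16_rne(a) | ((uint32_t)f32_to_bf16_rne(b) << 16);
}
```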


The conversions are just a mask and shift? Super cheap
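A minimal sketch of that cheap conversion in C, assuming truncation (the helper name is illustrative; here the "mask" falls out of the narrowing cast):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helper: truncate an IEEE-754 single to BF16 by keeping
 * its top 16 bits -- just a shift plus a narrowing cast. */
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* bit-cast without aliasing UB */
    return (uint16_t)(bits >> 16);
}
```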


You still get a perf benefit from half the memory traffic and keeping twice as much data in caches, since you can do the expansion to f32 when loading into registers.
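The expansion on load is equally cheap. A sketch in C (name is illustrative): widen the stored 16 bits by zero-filling the low half of the mantissa:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helper: expand a stored BF16 back to F32 by placing it in
 * the top 16 bits and zero-filling the 16 low mantissa bits. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit-cast without aliasing UB */
    return f;
}
```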


Conversions from IEEE-32 to BF16 don't round?


I don't believe any standard defines it. Fast software implementations often truncate (i.e. round toward zero), though hardware conversion instructions (e.g. x86's AVX-512 VCVTNEPS2BF16, the "NE" meaning nearest-even) round to nearest.

Remember, BF16 was invented specifically to be backwards compatible with existing silicon - and pulling 2 bytes out of 4 is a far cheaper operation than any rounding.


Just to elaborate, as I was confused about this and had to look it up: BF16 is indeed designed to be just a truncated F32. You can grab the top 16 bits of an F32 value and it'll still "make sense": the sign bit is in the same place in both (unsurprisingly), and the exponent fields of BF16 and F32 are both 8 bits. For the mantissa, you end up grabbing the top 7 bits of the F32's 23-bit mantissa, so it all works out - this "rounds" the value toward zero.
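A quick sketch in C showing the round-toward-zero behavior (the helper name is illustrative): zeroing the low 16 mantissa bits can only shrink the magnitude, never grow it:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helper: truncate an F32 to BF16 precision in place by
 * clearing the 16 low mantissa bits, keeping sign + 8-bit exponent +
 * top 7 mantissa bits. The result's magnitude is <= the input's. */
static float bf16_truncate(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits &= 0xFFFF0000u;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Values already representable in BF16 (like 1.0) pass through unchanged; everything else moves toward zero, for both positive and negative inputs.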


There's no standardized definition of BF16.



