It's more like "I will assume UTF-8 and ignore edge case encoding problems which...

koito17 · on April 29, 2024

I mention Japanese because I deal with Japanese text daily. I could mention some Chinese documents and sites using GBK to save space (since such encodings use exactly 2 bytes per character whereas the average size in UTF-8 is strictly larger than 2 bytes). But I am not very familiar with it. Overall, I would not say these are "strange reasons".

samatman · on April 29, 2024

Other encodings exist, yes. But they can all be mapped to UTF-8 without loss of information[0]. If someone wants to save space, they should use compression, which will reduce any information, regardless of encoding, to approximately the same size. So it's perfectly reasonable to write software on the assumption that data encoded in some other fashion must be first reëncoded as UTF-8

[0]: Except Japanese, people hasten to inform us every time this comes up. Why? Why haven't your odd characters and icons been added to Unicode, when we have cuneiform? That's the strange part. I don't understand why it's the case.

Cloudef · on April 30, 2024

Unicode did kind of dumb thing with CJK, the unified Chinese and Japanese kanjis makes displaying the CJK text much harder problem than it should be, as it now relies also on a language specific font to be displayed correctly[0]. I guess this could be bandaided by some sort of language marker in the UTF8 bytestring which then a text shaping engine would have to understand and switch the font accordingly..

0: https://heistak.github.io/your-code-displays-japanese-wrong/

mnau · on April 30, 2024

Wasn't it solved by https://en.wikipedia.org/wiki/Han_unification#Ideographic_Va... ?

Kind of a band-aid (it's necessary to stuff a variant selector after a CJK codepoint), but should work.

These decisions were made back in 1992 and codepoint in 16-bit was one of desired goals. Non-unified CJK wouldn't fit. In hindsight, it looks like a rather unfortunate decision, but having more codepoints that would fit to 16 bits could seriously hamper adoption and different standard would win (compute resources were far more limiting back then).

In either case, it's like 4 byte addressing in IPv4, in hindsight, 6+ bytes would be better, but what's done is done.

Edit: Even in 2000s, when C# was released, string was just a sequence of 16-bit code units (not codepoints), so they could deal with BMP without problems and astral planes were ... mostly DIY. They added Rune support (32-bit codepoint) only in .NET Core 3.0 (2019).

Cloudef · on April 30, 2024

Seems so, I wonder when the whole stack starts supporting it (IMEs, fonts, text shaping engines)

Adobe / Google seems to have a font https://ken-lunde.medium.com/improving-font-information-proc...

EDIT: Seems IVS is really old, but its still a problem so not holding my breath

int_19h · on April 30, 2024

They aren't strange, but they are sort of self-inflicted, so it's not unreasonable for others to say, "we're not going to spend time and effort to deal with this mess".

I'm Russian. 20 years ago that meant having to deal with two other common encodings aside from UTF-8 (CP1251 and KOI8-R). 25 years ago, it was three encodings (CP866 was the third one). Tricks like what the article describes were very common. Things broke all the time anyway because heuristics aren't reliable.

These days, everything is in UTF-8, and we're vastly better off for it.

kelnos · on April 30, 2024

Unless the Unicode Consortium decides to undo the Han Unification stuff, I don't think it's going to get better for Japanese users, and programmers who build for a Japanese audience will have to continue to suffer with Shift-JIS.

samatman · on April 30, 2024

There will be no undoing of anything, fortunately. Unicode is committed to complete backward compatibility, to the point where typos in a character name are supplemented with an alias, rather than corrected. Han Unification was an unforced error based on the proposition, which was never workable, that sixteen bits could work for everyone. This is entirely Microsoft's fault, by the way. But it shouldn't be, and won't be, fixed by breaking compatibility. That way lies madness.

There are two additional planes set aside for further Hanzi, the Supplementary and Tertiary Ideographic Planes, the latter is still mostly empty. Eventually the last unique ideograph used only to spell ten known surnames from the 16th century will also be added as a codepoint.

I view the continued use of Shift-JIS in Japan as part of a cultural trend, related to the continued and widespread use of fax machines, or the survival of floppy disks for many years after they were effectively dead everywhere else. That isn't at all intended as an insult, it's that matters Japanese stay within Japan to a high degree. Japanese technology has less outside pressure for cross-compatibility.

Shift-JIS covers all the corner cases of the language, and Unicode has been slow to do likewise, and it isn't like Japanese computers don't understand UTF-8, so people have been slow to switch. It's the premise of "unaware of how it works in the rest of the world" that I object to. It's really just Japan. Everywhere else, including the Chinese speaking parts of the world, there's Unicode data and legacy-encoded data, and the solution to the later is to encode it in the former.