Hacker News

WTF business do emojis have in Unicode? The BMP is all there ever should have been. Standardize the actual writing systems of the world, so everyone can write in their language. And once that is done, the standard doesn't need to change for a hundred years.

What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that. I guess the BMP is a good start, even though it already contains superfluous crap like "dingbats" and boxes.



> WTF business do emojis have in Unicode?

Unicode didn't invent emoji; it incorporated them because they were already popular in Japan, and not incorporating them would have greatly reduced Japanese adoption.

Keep in mind that Unicode was intended to unify all the disparate encodings that had been brewed up to support different languages and which made exchanging documents between non-English speaking countries a nightmare. The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about. And they weren't alone, of course [1].
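The failure is easy to reproduce; a minimal sketch in Python (the string is just an example), decoding Shift-JIS bytes as the Western codepage CP1252:

```python
text = "文字化け"                        # "mojibake", written in Japanese
raw = text.encode("shift_jis")          # the bytes actually stored/sent
garbled = raw.decode("cp1252")          # what a CP1252-assuming reader shows
print(garbled)                          # → •¶Žš‰»‚¯ (garbage)
assert raw.decode("shift_jis") == text  # only the right charset round-trips
```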

> What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that.

Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).

You may never need anything outside the BMP, but that doesn't make the rest of the planes worthless. Ignoring the value of including dead and nearing-extinct languages for preservation purposes (not being able to type a language will basically guarantee its extinction, with inventing a new encoding and storing text as jpgs being the only real alternatives), there are a lot of people speaking languages found in the SMP [2][3] ([2] has 83 million native speakers, for example).
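For a sense of scale, here is a Python sketch of where those scripts live (U+11600 is the first code point of the Modi block mentioned in [2]):

```python
# Code points above U+FFFF sit outside the BMP. Plane = code point >> 16.
modi = "\U00011600"            # first code point of the Modi block
assert ord(modi) > 0xFFFF      # beyond the 16-bit BMP
assert ord(modi) >> 16 == 1    # plane 1: the Supplementary Multilingual Plane
```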

[0]: https://en.wikipedia.org/wiki/Mojibake

[1]: https://segfault.kiev.ua/cyrillic-encodings/

[2]: https://en.wikipedia.org/wiki/Modi_(Unicode_block)

[3]: https://en.wikipedia.org/wiki/Chakma_(Unicode_block)


> The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about.

Mojibake was not a "Japan has too many encodings" problem. It was a "western developers assume everyone is using CP1252" problem.

> Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).

Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.


Mojibake is a universal problem wherever multiple charsets are in use and no charset is specified in metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.

Unicode/UTF-8 is widely adopted and recommended in Japan, and there is no widely used alternative. Japanese companies tend to still use SJIS, but that's just laziness. Han unification isn't a problem if you only handle Japanese text: just use a Japanese font everywhere. Handling multilingual text is a pain, but there are no alternatives anyway.


> Mojibake is a universal problem wherever multiple charsets are in use and no charset is specified in metadata. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.

In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
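One reason for that asymmetry can be sketched in Python (the string is just an example): a validating decoder can often rule out EUC-JP outright, while a single-byte Western codepage accepts nearly anything:

```python
data = "こんにちは".encode("shift_jis")  # SJIS-encoded Japanese

# EUC-JP has strict lead-byte ranges (0xA1-0xFE, 0x8E, 0x8F), so SJIS
# bytes like 0x82 are simply invalid and a careful guesser can reject it:
try:
    data.decode("euc_jp")
    euc_ok = True
except UnicodeDecodeError:
    euc_ok = False
assert not euc_ok

# CP1252, by contrast, "succeeds" and silently produces mojibake:
print(data.decode("cp1252"))  # → ‚±‚ñ‚É‚¿‚Í
```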

> Japanese companies tend to still use SJIS, but that's just laziness.

It's not just laziness; switching to unicode is a downgrade, because in practice it means you're going to get your characters rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.

> Handling multilingual text is a pain, but there are no alternatives anyway.

Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using unicode versus letting each span specify its own encoding.
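A hypothetical sketch of the structure being described (Python; the `Span` type and the language tags are illustrative, not any real API). With Unicode the payload is just a string plus a language tag; in the per-span-encoding design each span would carry raw bytes plus a charset instead:

```python
from dataclasses import dataclass

@dataclass
class Span:
    lang: str    # language tag, e.g. "ja", "zh-Hans"
    text: str    # Unicode text; a renderer picks a font per lang

# Same unified code point (U+76F4, 直), to be rendered in different
# regional glyph styles depending on the span's language metadata:
doc = [Span("ja", "直"), Span("zh-Hans", "直")]
assert all(ord(c) == 0x76F4 for s in doc for c in s.text)
```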


Maybe the guess order reasonably depends on locale. My GP comment is mainly from my experience with old ja-JP-localed Windows software. IIRC Unix software tended to be bad at guessing, so maybe you're referring to that.

Nowadays I rarely see new EUC-JP content (or I just don't recognize it), but I still sometimes encounter mojibake in Chrome when visiting old homepages (maybe once a month). For web pages, most modern pages (including SJIS ones) don't rely on guessing but have a <meta charset> tag, so mojibake very rarely happens. For plaintext files, I still see UTF-8 files shown as SJIS in Chrome on Windows.

Viewing Japanese-only UTF-8 text is totally fine on Japanese-localed Windows/Mac (Linux too, but YMMV). So your case is viewing the text under a non-Japanese locale. That could be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML/Word do?

I believe no developer wants to deal with foreign charsets like GBK/Big5/whatever. There is very little information about them. If a developer can switch the charset used to read a file, they can also switch the font.


> Viewing Japanese-only UTF-8 text is totally fine on Japanese-localed Windows/Mac (Linux too, but YMMV). So your case is viewing the text under a non-Japanese locale.

The issue is that realistically a certain proportion of customers are going to have the wrong locale setting or wrong default font set.

> That could be a problem, but how did SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/language, like HTML/Word do?

Certainly Firefox will use a Japanese font by default for SJIS whereas it will use a generic (i.e. Chinese) font by default for UTF-8. I would expect most encoding-aware programs would do the same?

> If a developer can switch the charset used to read a file, they can also switch the font.

Sure, but it works both ways. And it's actually much easier for a lazy developer to ignore the font case because it's essentially only an issue for Japan. Whereas if you make a completely encoding-unaware program it will cause issues in much of Europe and all of Asia (well, it did pre-UTF8 anyway).


I think by far the largest contributor to coining "mojibake" was e-mail MTAs. Some e-mail implementations assumed 7-bit ASCII for all text and dropped the MSB of 8-bit SJIS/Unicode/etc., ending up as corrupt text at the receiving end. Next up were texts written in EUC (Extended UNIX Code)-JP, probably by someone running either a real Unix (likely a Solaris) or early GNU/Linux, and floppies from a classic MacOS computer. Those must have defined the term, and various edge cases on the web, like header-encoding mismatches, popularized it.

"Zhonghua fonts" issue is not necessarily linked to encoding, it's an issue about assuming or guessing locales - that has to be solved by adding a language identifier or by ending han unification.


> Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.

This is an absolute shame, and there is no excuse for not fixing it so that variations of unified characters could be encoded before unimportant things like skin tones were added.


> So rather than treat the issue as a rich text problem of glyph alternates, Unicode added the concept of variation selectors, first introduced in version 3.2 and supplemented in version 4.0.[10] While variation selectors are treated as combining characters, they have no associated diacritic or mark. Instead, by combining with a base character, they signal the two character sequence selects a variation (typically in terms of grapheme, but also in terms of underlying meaning as in the case of a location name or other proper noun) of the base character. This then is not a selection of an alternate glyph, but the selection of a grapheme variation or a variation of the base abstract character. Such a two-character sequence however can be easily mapped to a separate single glyph in modern fonts. Since Unicode has assigned 256 separate variation selectors, it is capable of assigning 256 variations for any Han ideograph. Such variations can be specific to one language or another and enable the encoding of plain text that includes such grapheme variations. - https://en.m.wikipedia.org/wiki/Han_unification

This is what you’re asking for, right? Control characters that designate which version of a unified character is to be displayed.

Sure looks like it exists.
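Concretely, in Python (辻 is a commonly cited character with registered one-dot and two-dot variants; the specific selector here is illustrative):

```python
# An Ideographic Variation Sequence: a base Han character followed by a
# variation selector. VS17 (U+E0100) is the first of the supplementary
# selectors used for Han ideographs. The pair is plain text, not markup.
base = "\u8fbb"              # 辻
ivs = base + "\U000E0100"    # base + VS17 selects a registered glyph variant
assert len(ivs) == 2         # two code points...
assert ivs.startswith(base)  # ...but the same base character; a font
                             # without the variant glyph just shows 辻
```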


Have emoji not become part of our writing structure though? A decent percentage of online chats and comments, especially on social networks, includes at least one emoji that couldn't be easily or accurately represented in the regular written language.


Recently implementers of unicode have censored the gun emoji in a way that changes the meaning of many existing online chats and comments. So you can't easily or accurately represent things even with unicode.

Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period, and often not even that. Given that unicode implementers are ok with erasing the meaning of some of them, it should be ok to eliminate more of them.


> Emoji have never been effective or consistent at conveying meaning; at best they convey something to someone in the same subculture and time period

Isn't that the same with all words though? Think how much English usage changes in a generation. For instance, my girlfriend will use the term "I'm dead!" in a similar context to where I would say "LOL" and where my father would have said "What the fuck is loll?"


There's a spectrum. Subculture-specific slang changes quickly, but most words have a longer lifetime; reading Chaucer today is difficult but doable. Given that we don't encode words but only letters, for English you have to go back to the disappearance of þ to get a change that's relevant to text encoding. Emoji shift faster and are less effective at conveying meaning than any "real" language.


This argument was lost the moment Unicode was created. Japanese carriers had created their own standard for emoji encoding for sms. And they would not switch to Unicode unless the emoji were ported over.

It’s a tricky situation. Maybe allowing an arbitrary bitmap character to represent any emoji would have been better, but then we could have ended up in a situation where normal text, meaningful punctuation, or perhaps even whole fonts would get encoded as bitmaps.

For something like a face or hand gesture, a bitmap likely would have been better since it would at least look the same on all platforms.


I don't think that argument holds water. Emoji could just as well have been encoded as markup. There were for instance long-established conventions of using strings starting with : and ; . Bulletin boards extended that to a convention using letters delimited by : for example :rolleyes: . Not to mention that those codes can be typed more efficiently than browsing in an Emoji Picker box.
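That markup convention is trivial to implement; a minimal sketch in Python (the two-entry table is made up, real systems ship hundreds of mappings):

```python
import re

# Hypothetical shortcode table mapping :name: codes to emoji.
SHORTCODES = {"smile": "\U0001F604", "rolleyes": "\U0001F644"}

def expand(text: str) -> str:
    # Replace :name: with its emoji if known, else leave it untouched.
    return re.sub(r":(\w+):",
                  lambda m: SHORTCODES.get(m.group(1), m.group(0)),
                  text)

assert expand("fine :rolleyes:") == "fine \U0001F644"
assert expand("unknown :shrug: stays") == "unknown :shrug: stays"
```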

Because emoji became characters, text rendering and font formats had to be extended to support them. There are four different ways to encode emoji in OpenType 1.8:

* Apple uses embedded PNGs (the sbix table)

* Google uses embedded colour bitmaps (the CBDT/CBLC tables)

* Microsoft uses flat glyphs in different colours layered on top of one another (COLR/CPAL)

* Adobe and Mozilla use embedded SVG (the SVG table).


> Emoji could just as well have been encoded as markup.

They could have, but they were already being encoded as character codepoints in existing charactersets. So any character encoding scheme that wanted to replace all use cases for existing charactersets needed to match that. If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.


> If switching charactersets meant you lost the ability to use emoji until you upgraded all your applications to support some markup format, people would just not switch.

You need to upgrade those applications to support Unicode too.


Not necessarily; most applications already supported multiple encodings, so having the OS implement one of the Unicode encodings was often all that was needed.


I'd argue the important part was that Japanese carriers were weaponizing flip-phone culture to gatekeep PCs and open-standard smartphones out of their microtransaction ecosystem. Emoji were one of the keys to disproving the FUD that the iPhone couldn't be equal to flip phones, and to establishing first-class-citizen status.


You are underestimating how much language evolves. In fact, you are proposing brakes to stop it evolving. If nothing else, new currency symbols need to be incorporated every few years. The initial emoji were part of the actual writing systems of the world, even if they were relatively new and only being used by foreigners. Or maybe they have been part of world culture since the 1950s :-) ? https://en.wikipedia.org/wiki/Smiley


The BMP was a failed concept even without emoji: 16 bits aren't enough to contain all CJK characters.
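A quick Python sketch of the overflow (U+20000 is the start of CJK Unified Ideographs Extension B, in plane 2):

```python
han = chr(0x20000)                 # 𠀀, a plane-2 ideograph
assert ord(han) > 0xFFFF           # doesn't fit in 16 bits

# In UTF-16 it must be encoded as a surrogate pair, two 16-bit units:
units = han.encode("utf-16-be")
assert len(units) == 4
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
```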



