My last name contains an ü and it has been consistently horrible.
* When I try to preemptively replace ü with ue many institutions and companies refuse to accept it because it does not match my passport
* Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again
* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or a box. This often triggers errors downstream that I have no way of fixing
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.
I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and go off that string? Officially change my name?
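For anyone wondering where the � comes from: a minimal sketch of one common cause, assuming one system writes Latin-1 and the next one reads the same bytes as UTF-8. The name and the two-system setup are made up for illustration; real pipelines can break in other ways too.

```python
# Hypothetical example name; any non-ASCII Latin-1 letter behaves the same way.
name = "Günter"

# System A stores the name as Latin-1: ü becomes the single byte 0xFC.
latin1_bytes = name.encode("latin-1")

# System B assumes UTF-8. 0xFC is not valid UTF-8, so a lenient decoder
# substitutes U+FFFD REPLACEMENT CHARACTER and the original letter is lost.
mangled = latin1_bytes.decode("utf-8", errors="replace")

# The reverse mistake: UTF-8 bytes read as Latin-1 give two-character mojibake.
mojibake = name.encode("utf-8").decode("latin-1")

print(mangled)   # G�nter
print(mojibake)  # GÃ¼nter
```

Once the replacement character has been written back to storage there is nothing left to recover, which is probably why these errors cannot be fixed downstream.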
The part I came to love about France in general is that while all of these are broken, the people dealing with it will completely agree it's broken and amply sympathize, but just accept your name is printed as G�nter.
Same for names that don't fit field lengths, addresses that require street numbers etc. It's a real pain to deal with all of it and each system will fail in its own way to make your life a mess, but people will embrace the mess and won't blink an eye when you bring papers that just don't match.
That's a pretty unexpected twist, and I'm thrilled with it.
I don't see every institution come up with a fix anytime soon, but having it clear that they're breaking the law is such a huge step. That will also have a huge impact on bank system development, and I wonder how they'll do it (extend the current system to have the customer facing bits rewritten, or just redo it all from top to bottom)
There is the tale of Mizuho bank [0], botching their system upgrade project so hard they were still seeing widespread failures after a decade into it.
> I don't see every institution come up with a fix anytime soon, but having it clear that they're breaking the law is such a huge step.
It's excellent, but also sad that it takes legislation to motivate companies to fix their crappy legacy systems, and they will likely fight tooth and nail rather than comply.
This is not just about convenience; it also has spoofing security implications for all names. C and C++ have been insecure in this respect since C11 and C++11. https://github.com/rurban/libu8ident/blob/master/doc/c11.md
Most other programming languages and OS kernels also.
> Does it mean Z̸̰̈́̓a̸͖̰͗́l̸̻͊g̸͙͂͝ǒ̷̬́̐ can finally have a bank account?
I wonder if this also means one can require a European bank to have a name on file in Kanji, Thai script or some other alphabet that is not well known in Europe.
A bank can specially request it to be the name on a passport or domestic ID card. That's one way to make sure that the name falls within some parameters, though that can be tough on the customer in some conditions.
I guess every country has a technical document on what's allowed in names, but then say EU banks have to cater for full superset of EU rules.
As far as the passports go, ICAO 9303-3 allows for latin characters, additional latin characters, such as Þ and ß, and "diacritics", so something not too crazy, i.e. Z̷̪͘a̵͈͘l̷̹̃g̷̣̈́ő̶͍ would still be plausible.
Since work on a central ID in Europe moves slowly, banks will only need to bother with local name rules for now, since only local names are valid. I am guessing we will have normalization rules in the end, and that looks completely implausible.
Ahah, I can relate to that. My driving license doesn't spell my name correctly, and somehow nobody cares. I somehow like this "nah, who cares" attitude
> * Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again
In Unicode, umlaut and diaeresis are both represented by the same codepoint, U+0308 COMBINING DIAERESIS.
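This is easy to check with Python's standard `unicodedata` module. A small sketch (the helper name `split_mark` is made up here) that decomposes a precomposed letter and reports which combining mark it carries:

```python
import unicodedata

def split_mark(ch: str):
    """Decompose a precomposed letter into its base letter and combining mark."""
    # NFD turns e.g. ü into 'u' followed by one combining character.
    base, mark = unicodedata.normalize("NFD", ch)
    return base, f"U+{ord(mark):04X}", unicodedata.name(mark)

print(split_mark("\u00fc"))  # ü -> ('u', 'U+0308', 'COMBINING DIAERESIS')
print(split_mark("\u00eb"))  # ë -> ('e', 'U+0308', 'COMBINING DIAERESIS')
```

So the dots themselves are identical in both cases; only the base letter differs.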
The only solution is going to be a lot of patience, unfortunately.
Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.
But thanks to institutional inertia, it will be a very long time before everything works that way.
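To make the comparison point concrete, here is a minimal sketch using NFC from the standard library. The function name `same_name` is just an illustration, and NFC is an arbitrary pick among the normalization forms, in line with "doesn't matter which, as long as it's consistent".

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "G\u00fcnter"  # ü as a single code point
decomposed = "Gu\u0308nter"  # u followed by COMBINING DIAERESIS

print(precomposed == decomposed)           # False: different code point sequences
print(same_name(precomposed, decomposed))  # True: equal after normalization
```

Both strings look identical on screen, which is exactly why a lookup that skips this step can fail to find a record that visibly exists.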
> Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.
This will result in misprinting Japanese names (or misprinting Chinese names depending on the rest of your system).
Can we please talk about Unicode without the myth of Han Unification being bad somehow? The problem here is exactly the lack of unification in Roman alphabets!
> Can we please talk about Unicode without the myth of Han Unification being bad somehow?
It's not a myth, as anyone living in Japan knows, and the "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.
> The problem here is exactly the lack of unification in Roman alphabets!
Problems caused by failing to unify characters that look the same do not mean it was a good idea to unify characters that look different!
> "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.
The alternative would be that the software used Shift_JIS with a Japanese font. If the software used a Japanese font for Japanese it wouldn't need metadata anyway.
There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language; you don't need to configure metadata. If you don't you are always going to run into missing codepoint problems.
In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.
> The alternative would be that the software used Shift_JIS with a Japanese font.
As far as I know all Shift_JIS fonts are Japanese; you would have to be wilfully perverse to make one that wasn't.
> If the software used a Japanese font for Japanese it wouldn't need metadata anyway.
If it just uses the system default font for that encoding, as almost all software does, then it will also behave correctly.
> There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language
Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.
> In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.
I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.
> Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.
Most games that I know of that target CJK + English (and are either CJK-developed, or have a local publisher based in East Asia) do indeed switch fonts depending on language (and on TC vs. SC).
> I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.
I'm considering 3 scenarios:
1. You are configuring for the Japanese-speaking market. In which case, fix a font, or fonts.
2. You are localizing into multiple languages and care about localization quality. In which case, yes, you need to know that localization in Unicode is more than just replacing content strings, but this is comparable to dealing with multiple encodings.
3. You are localizing into multiple languages and do not care about localization quality, or Japanese is not a localization target. In which case Japanese (user input / replaced strings) in your app / website will appear childish and shoddy, but it is still a better experience than mojibake.
In any case, it seems to me that it is not a worse experience than pre-Unicode. It's just that people who have no experience in localization expect Unicode systems to do things it cannot do by just replacing strings. You indeed frequently run into issues even in European languages if you just think it's a matter of replacing strings.
> Japanese programs aren't globalized and already rely on the system being fine tuned for Japanese
Right, because unicode-based systems don't work well in Japan. E.g. a unicode-based application framework that ships its own font and expects to use it will display ok everywhere that's not Japan. So Japan is increasingly cut off from the paradigms that the rest of the world is using.
Custom fonts are often a mistake for any language, especially google fonts often look wrong. Due to this browsers often have an option to force usage of system fonts and set minimum size to improve readability.
If the tag mechanism was used consistently and handled by all software, no. But in practice the only way that would happen is if the tag mechanism was required for many languages. Unicode is, in practice, a system that works the same way for ~every human language except Japanese, which makes it much worse than the previous "byte stream + encoding" system where any program written to support anything more than just US English would naturally work correctly for every other language, including Japanese.
> Unicode is, in practice, a system that works the same way for ~every human language except Japanese
This is simply not true. As I've pointed out in a sibling comment, Unicode has a lot of surprising and frustrating behaviors with many European languages as well if you use it without locale data. The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected if the application is not locale aware.
> The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected
This is quite a different situation from Japan. A lot of applications don't do searching, sorting, or case-insensitive comparisons, but virtually every application displays text.
Both problems are missing the point: you cannot handle Unicode correctly without locale information (which needs to be carried alongside as metadata outside of the string itself).
To a Swede or a Finn, o and ö are different letters, as distinct as a and b (ö sorts at the very end of the alphabet). A search function that mixes them up would be very frustrating. On the other hand, to an American, a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating. Back in Sweden, v and w are basically the same letter, especially when it comes to people's last names, and should probably be treated the same. Further south, if you try to lowercase an I and the text is in Turkish (or in certain other Turkic languages), you want a dotless i (ı), not a regular lowercase i. This is extremely spooky if you try to do case-insensitive equality comparisons and aren't paying attention, because if you do it wrong and end up with a regular lowercase i, you've lost information and uppercasing again will not restore the original string.
There are tons and tons of problems like this in European languages. The root cause is exactly the same as the Han unification gripes: Unicode without locale information is not enough to handle natural languages in the way users expect.
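The Turkish case can be reproduced with Python's built-in case mapping, which takes no locale. A minimal sketch: `lower_tr` is a made-up helper covering only the two Turkish special cases, not a full locale-aware implementation.

```python
dotless = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I

upper = dotless.upper()     # "I": the same capital that plain i maps to
round_trip = upper.lower()  # "i": locale-unaware lowercasing picks the dotted form

print(upper, round_trip, round_trip == dotless)  # I i False

def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish rules for I and İ."""
    # Map the two Turkish-specific pairs first, then lowercase the rest.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

print("ISPARTA".lower())    # isparta (wrong for Turkish text)
print(lower_tr("ISPARTA"))  # ısparta
```

The function only gives the right answer if the caller already knows the text is Turkish, which is the locale metadata the string itself does not carry.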
If you mean in-band language tagging inside the string itself, the page you're linking to points out that this is deprecated. The tag characters are now mostly used for emoji stuff. If you only need to be compatible with yourself you can of course do whatever you like, but otherwise, I agree with what the linked page says:
> Users who need to tag text with the language identity should be using standard markup mechanisms, such as those provided by HTML, XML, or other rich text mechanisms. In other contexts, such as databases or internet protocols, language should generally be indicated by appropriate data fields, rather than by embedded language tags or markup.
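As a rough sketch of what that recommendation looks like in practice: keep the language in a field next to the string and emit it as markup at render time. The `TaggedText` record and `to_html` function are invented for this example; 直 is used because it is an often-cited character whose preferred glyph differs between Japanese and Chinese fonts.

```python
from dataclasses import dataclass
from html import escape

@dataclass(frozen=True)
class TaggedText:
    """Hypothetical record: the text plus a BCP 47 language tag kept beside it."""
    text: str
    lang: str

def to_html(item: TaggedText) -> str:
    # The lang attribute gives the renderer what it needs to pick a suitable font.
    return f'<span lang="{escape(item.lang, quote=True)}">{escape(item.text)}</span>'

print(to_html(TaggedText("直", "ja")))       # <span lang="ja">直</span>
print(to_html(TaggedText("直", "zh-Hans")))  # <span lang="zh-Hans">直</span>
```

The code point is the same in both records; only the out-of-band field differs.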
The interesting question is why you agree, the deprecation fact isn't telling much, the quote also doesn't explain anything, like, the "appropriate data fields" might not exist for mixed content, a rather common thing, and why resort to the full ugliness of XML just for this?
(and that emojis have had their positive impact in forcing apps into better Unicode support would be a + for the use of a tag)
Most applications do not do anything useful with in-band language tags. They never had widespread adoption in the first place and have been deprecated since 2008, so this is unsurprising. If you're using them in your strings and those strings might end up displayed by any code you don't control, you'll probably want to strip out the language tags to avoid any potential problems or unexpected behaviors. Out-of-band metadata doesn't have this problem.
As I said though, if you're in full control and only need to be compatible with yourself, you can do whatever you want.
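A minimal sketch of that stripping step, assuming language tags take the form U+E0001 LANGUAGE TAG followed by tag characters. The function names are made up. Tag characters that do not follow U+E0001 are deliberately left alone, since emoji flag sequences reuse them.

```python
import re

# U+E0001 LANGUAGE TAG introduces a tag; its content is spelled with tag
# characters in U+E0020..U+E007F. Runs not introduced by U+E0001 are kept,
# because emoji tag sequences (e.g. subdivision flags) use the same characters.
_LANGUAGE_TAG = re.compile("\U000E0001[\U000E0020-\U000E007F]*")

def as_tag_chars(ascii_text: str) -> str:
    """Hypothetical helper: spell ASCII text with the matching tag characters."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)

def strip_language_tags(text: str) -> str:
    """Remove in-band language tags before handing text to code you don't control."""
    return _LANGUAGE_TAG.sub("", text)

tagged = "\U000E0001" + as_tag_chars("ja") + "名前"
england_flag = "\U0001F3F4" + as_tag_chars("gbeng") + "\U000E007F"

print(strip_language_tags(tagged) == "名前")                # True
print(strip_language_tags(england_flag) == england_flag)  # True
```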
In 2008 UTF-8 was only ~20% of all web pages! Again, that deprecation fact is not meaningful: a quick search shows that the RFC for tagging is dated 1999, so that's just 10 years before deprecation. That's a tiny timeframe for such things, so I agree, it's not surprising there was no widespread use.
Out-of-band metadata has plenty of other problems besides the fact that it doesn't exist in a lot of cases
Unicode reuses codepoints for characters that the committee decided were in some sense "the same", including Japanese and Chinese characters that are written differently from each other (different numbers of strokes etc.). This is a minor irritation for everyday text, but can be quite upsetting when it's someone's name that's getting printed wrong.
No system will get support for unicode by just the passing of time. Software needs to be upgraded/replaced for that to happen. Reluctant institutions will not just do that, and need external pressure.
> a normative subset of Unicode Latin characters, sequences of base characters and diacritic signs, and special characters for use in names of persons, legal entities, products, addresses etc
My German last name also contains an ü, so when we emigrated to an English-speaking country and obtained dual citizenship we used 'ue' for that passport and I now use 'ue' on a day-to-day basis. This also means I have two slightly different legal surnames depending on which passport I go by.
At least German transliteration is 1-to-1. Slavic names among others often have multiple transliterations available. The Russian name Валерий can be rendered for example as Valery, Valeriy, or Valeri. It's very confusing for documents that require the person's name.
Also don't forget Chinese, which due to different romanizations or different dialects being used for the romanization, can result in different outputs depending on whether a person is from PRC, ROC, Macao, Hong Kong, or Singapore.
just out of curiosity, can you port the ue back to Germany (or wherever) or will they automatically transform it to ü? (could you change your name in a German speaking country to Mueller et al?)
In Germany, there are some names that use ue, ae or oe instead of ü, ä, ö, and you run into issues with some systems wrongly autocorrecting it to the umlaut. Usually not a big deal, but having the umlaut is less error prone than the transliteration in Germany.
> Give up and ask people to consistently use an ASCII-only name?
> Officially change my name?
Yes. That's the only one that's going to actually work. You can go on about how these systems ought to work until the cows come home, and I'm sure plenty of people on HN will, but if you actually want to get on with your life and avoid problems, legally change your name to one that's short and ASCII-only.
a friend of mine in china had a character in his name that was not in the recognized set of characters. he refused to change his name and instead submitted the character to be added to unicode (which i believe eventually happened)
in the meantime he was unable to own the company he founded (instead made his wife the owner), had a national ID card with a different character, and i am not sure if he had a bank account, but i think the bank didn't care because laws that enforced the names to match the passport/ID only came later. i don't know how the ID didn't automatically imply a name change, but the IDs were issued automatically and maybe he filed a complaint about his name being wrong.
Name changes are only permitted under a very narrow set of conditions in my place of residence, and this would not be one of them. I imagine that's the case in many nations.
Interestingly, it seems that Japan does have a procedure for foreigners to officially adopt a Japanese name. Changing your name is often very hard, and doing it in a country where you're not a citizen might be completely impossible, depending on the country.
> Japan does have a procedure for foreigners to officially adopt a Japanese name.
Sort of but not really. The post-2012 residence cards do not display a registered alias anywhere, and since those cards are what banks are required to KYC you on, a lot of banks won't allow you to use a registered alias which in turn means it's hard to use it for anything else (credit cards, phone, pension...). It's very non-joined-up government.
We clearly need to phase out name-based identification within software. "What's your name?" should never be a question heard from workers as any means of locating one's official identity in any system.
Some form of biometrics to pull up an ID in a globally agreed-upon system is certainly the way forward. Whether or not it is close to what a final solution should be, World ID is making some effort into solving global identification problems https://worldcoin.org/world-id
Can ü be printed on a passport rather than a u? I have a ş and a ç so I have been successfully substituting s and c for them in a somewhat consistent manner.
In the human-readable zone ("VIZ" in ICAO 9303) yes, see part 3 section 3.1 [1]. In the MRZ, however, no: it is limited to Latin alphanumerics only, see section 4.3. How to transliterate non-Latin characters is left to the discretion of the issuing government, and that has been a consistent source of annoyance for people who have identity cards issued by different governments (e.g. dual nationals of Western European and Turkish, Arabic or Cyrillic-using Slavic countries).
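To give a feel for what the reduction to the MRZ alphabet involves, here is a rough sketch. The mapping table is a small illustrative subset chosen for this example, not the ICAO table, and `to_mrz_name` is a made-up helper; as noted above, the actual choice of variant (e.g. UE versus U for Ü) is up to the issuing government.

```python
import unicodedata

# Illustrative subset only: issuing states choose their own variants.
# ß needs no entry because str.upper() already turns it into "SS".
MRZ_MAP = {"Ä": "AE", "Ö": "OE", "Ü": "UE", "Æ": "AE", "Ø": "OE", "Þ": "TH"}

def to_mrz_name(name: str) -> str:
    """Hypothetical helper: reduce a name to A-Z plus the '<' filler."""
    out = []
    # NFC first, so a decomposed u + U+0308 is seen as the single letter Ü.
    for ch in unicodedata.normalize("NFC", name).upper():
        if ch in MRZ_MAP:
            out.append(MRZ_MAP[ch])
        elif ch == " ":
            out.append("<")
        else:
            # Fallback: drop combining marks (Ç -> C, Ş -> S); anything
            # else outside A-Z is simply dropped in this sketch.
            base = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in base if "A" <= c <= "Z"))
    return "".join(out)

print(to_mrz_name("Günter"))  # GUENTER
print(to_mrz_name("Çağlar"))  # CAGLAR
```

A second government using a different table (say, Ü to U) would produce GUNTER for the same person, which is how the two documents end up disagreeing.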
When my child was born, one of the requirements I had to choose his name was that it shouldn't have any accent (or character that's not in the 26 universal letters basically).
ah, possibly. the way it is worded i didn't read it that way. but i get it.
we did something comparable to make sure our kids had names that transliterated nicely into chinese so that they could use the same or at least a similar name in english and chinese, instead of having two names like it is common for many expats and locals in china.