
Might we run out of Unicode code points, like we seem to be running out of IPv4 addresses?

As another comment mentions, once you add all these snowmen, with/without snow, male, female, and gender-neutral, in a few skin colour options (plus neutral)... it adds up. Plus, there's exponential growth once you consider families of snowmen (different numbers/genders/races of "parents", different numbers/genders/races of "children", and so on...).



There is no reason to believe the current rate (roughly 35,000 codepoints over the period 2010-2020) will change rapidly, so we are probably safe for this century. You should be aware that emoji gender and skin color are encoded as character sequences and modifiers rather than as atomic characters, precisely in order to avoid that exponential growth.

And in the unlikely case that Unicode somehow accumulates that many characters, you can always extend it: http://ucsx.org/


OK, but what about all the cryptocurrency symbols? Those will probably accelerate the rate.

Perhaps not by a significant or even measurable amount. Nonetheless, it's a great reason to start investigating a blockchain alternative to Unicode.


The successful bitcoin sign proposal [1] explicitly deals with such a criticism:

> Will Unicode be flooded with symbols for many crypto-currencies?

> Most other crypto-currencies have learned from the difficulty that a non-Unicode symbol causes for Bitcoin, and use a symbol already in Unicode. For instance, Dogecoin uses Đ, Ethereum uses Ξ, Litecoin uses Ł, Namecoin uses ℕ, Peercoin uses Ᵽ and Primecoin uses Ψ. Some, like Ripple, use Roman capital letters (XRP), mimicking ISO 4217 currency codes.

> While it is possible another crypto-currency will have a non-Unicode symbol that is extensively used in text, this is unlikely.

I think this section was crucial for the eventual acceptance, because Unicode people do care (a lot) about long-term consequences of proposals.

[1] https://www.unicode.org/L2/L2015/15229-bitcoin-sign.pdf


It seems to me that this is something best handled with tag characters, like ¤XBT + (U+E007F) = ₿ (where the letters are from the tag block, U+E00xx). This mirrors one of the two systems for rendering national flags[0], just with a different starting codepoint, and can easily accommodate all the ISO 4217 currency codes and common unofficial extensions. If a system doesn't know how to render a particular glyph, it can just fall back to showing the Roman capital letters.

The downside of this approach is size: each tag codepoint (including the end marker) requires four bytes in UTF-8, plus two for ¤, so the sequence above is 18 bytes long.
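
Here's a minimal Python sketch of the proposed sequence (the scheme itself is this comment's hypothetical, not an actual Unicode mechanism for currencies); it also confirms the 18-byte figure:

    # currency sign + tag letters for the ISO 4217 code + CANCEL TAG
    TAG_BASE = 0xE0000         # tag characters mirror ASCII: TAG "X" is U+E0058
    CANCEL_TAG = chr(0xE007F)

    seq = "\u00A4" + "".join(chr(TAG_BASE + ord(c)) for c in "XBT") + CANCEL_TAG
    # 2 bytes for the currency sign, 4 bytes for each of the four tag codepoints
    print(len(seq.encode("utf-8")))  # 18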

[0] https://en.wikipedia.org/wiki/Tags_(Unicode_block)#Current_u...


That sounds interesting, but modern currency symbols are already fast-tracked anyway (they almost always get assigned in the next version of Unicode), and more than one currency symbol can exist for a given ISO 4217 code, so I don't think it would work.


> modern currency symbols are already fast-tracked anyway

For national currencies, perhaps. New national currencies aren't introduced all that often, and there is a lot of pressure to support them quickly, since their use is often mandatory for anyone living in that jurisdiction. For new private currencies, including crypto-currencies, we don't see quite the same eagerness. The observation that new crypto-currencies were more likely to reuse existing Unicode symbols than to invent new ones was a consideration in getting the Bitcoin symbol adopted, as the Unicode people didn't want to open the floodgates to large numbers of new currency symbols. The tag-based system offers a compromise.

> and more than one currency symbol can exist for a given ISO 4217 code, so I don't think it would work

That is a bit of a problem, but it could be handled with the variation selector codepoints, for example ¤MOP = MOP$, ¤MOP(VS1) = 圓, and ¤MOP(VS2) = 元, if the symbols have the same meaning. To save some space the VS could replace the end codepoint. For fractional units there could be a different prefix, such as ¢ for 1/100 or ₥ for 1/1000 in place of the ¤, or one of the Unicode fraction codepoints could be incorporated for other ratios up to ⅞ (or ⅑ or ⅒). These would be rendered verbatim in the fallback version, like ¢USD. A sketch of this variant follows below.
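
Extending the sketch from the earlier comment, the variation-selector variant would look something like this in Python (again purely hypothetical, with VS1 taking the place of the end marker):

    # currency sign + tag letters + a variation selector picking one of
    # several symbols used for the same ISO 4217 code
    TAG_BASE = 0xE0000
    VS1 = "\uFE00"             # U+FE00 VARIATION SELECTOR-1

    seq = "\u00A4" + "".join(chr(TAG_BASE + ord(c)) for c in "MOP") + VS1
    print([hex(ord(c)) for c in seq])
    # ['0xa4', '0xe004d', '0xe004f', '0xe0050', '0xfe00']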


> emoji gender and skin color are encoded as character sequences

A good tool to see this broken down is https://unicode-x-ray.vercel.app/?t=%E2%9C%8C%F0%9F%8F%BC%F0... (edit: fixed url to use percent encoded emoji)


> Might we run out of Unicode code points, like we seem to be running out of IPv4 addresses?

No. There are currently 144,697 codepoints allocated, out of a possible 1.1 million, and most updates allocate a few hundred. The large allocations (thousands at a time) overwhelmingly consist of additions of CJK unified ideographs (see: 13.0 with 4,969 out of 5,930 new codepoints, 10.0 with 7,494 out of 8,518, 8.0 with 5,771 out of 7,716).

There have been large additions of historical scripts (9.0 added the entire Tangut script, 7.0 added 23 different scripts) but those occurrences have slowed down a lot.


The snowmen are in Unicode because they existed in a character set before the Unicode standard was created. Unicode was deliberately created as a superset of all existing character sets at the time.


Some of the glyphs you mention are combining sequences, i.e. multiple code points that render as a single character. So you add a gender modifier and a skin color modifier to change the appearance; you don't allocate a new code point for every combination.

It's your device rendering these multi-codepoint sequences as single icons/emojis.


> So you add a gender modifier and a skin color modifier to change the appearance; you don't allocate a new code point for every combination.

FWIW that's true for the skin colors (there are 5 Fitzpatrick scale modifiers, U+1F3FB to U+1F3FF), but it's not true for gender: the basic gendered characters (e.g. U+1F468 "MAN", U+1F469 "WOMAN") were part of the original set "merged" from Japanese emoji, so the gender-neutral equivalent (e.g. U+1F9D1 "ADULT") was added as a separate codepoint.
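
A quick Python demonstration of the modifier mechanism (the codepoints are the real ones named above; the snippet is just illustrative):

    man = "\U0001F468"      # U+1F468 MAN
    dark = "\U0001F3FF"     # U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
    print(man + dark)       # 👨🏿: two codepoints, one displayed emoji
    print(len(man + dark))  # 2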


According to this document [0], there are "Gender Alternates", which change the gender of an emoji. The relevant part starts near the end of page 2.

[0]: https://www.unicode.org/L2/L2016/16181-gender-zwj-sequences....


We are nowhere close to running out of code points. Unicode as currently defined has 1.1 million, but even that could be increased if there was a need. There isn't, since only 114 thousand are defined.

There are not separate code points for all combinations of genders and skin colors; the characters are made as combinations.


Things like skin tone variations are not defined as individual code points. They are sequences of code points that combine to make the full, customized glyph. So you have one code point for "medical", one for "professional", one for "female", one for "brown skin", one for "blond hair", and from that you get a more specific picture of a doctor.
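
For instance, the "woman health worker" emoji is a ZWJ sequence of real codepoints; a small Python sketch:

    woman = "\U0001F469"      # U+1F469 WOMAN
    zwj = "\u200D"            # U+200D ZERO WIDTH JOINER
    medical = "\u2695\uFE0F"  # U+2695 STAFF OF AESCULAPIUS + U+FE0F (emoji style)
    print(woman + zwj + medical)  # 👩‍⚕️ shown as one glyph on supporting systems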


We already did! That's what happened when the original 16-bit code space was exhausted, which was never the original plan. Just like the IPv4 internet degraded into a mess of hacks once addresses ran short (like NAT), so too did Unicode start becoming wildly more complex.

Amongst other things, hitting the limit of 16 bits meant the introduction of:

- The concept of "planes"

- UTF-16 surrogate pairs

- UTF-32

- The newfound desire to encode emoji using combining characters, which means many apparently simple emoji are actually hacked together out of a mini programming language (e.g. black man = man emoji + skin tone modifier). Same thing for flags, which are actually two English letters mapped into a different part of the code space and then combined; e.g. the British flag is G+B (see the sketch below).
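
A short Python sketch of the flag mechanism (the codepoints are the real regional indicators; the mapping arithmetic is just one way to compute them):

    # Two REGIONAL INDICATOR SYMBOL LETTERS (U+1F1E6..U+1F1FF, mirroring A..Z)
    # combine into one flag.
    RI_BASE = 0x1F1E6 - ord("A")
    flag = "".join(chr(RI_BASE + ord(c)) for c in "GB")
    print(flag)                         # 🇬🇧 on systems that render flag emoji
    print([hex(ord(c)) for c in flag])  # ['0x1f1ec', '0x1f1e7']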

It's one reason why emoji broke so much software. Before emoji, nobody cared about characters beyond the Basic Multilingual Plane and simply ignored them. Then emoji came along and broke everything that assumed a UTF-16 code unit == a character.


1) There are only ~150k Unicode values defined. If we assume a signed int for available space, we have 2,147,333,647 of 2,147,483,647 remaining, more so if the int is unsigned. We're fine.

2) They use values that combine, like ligatures, to create the variants. There isn't a combinatorial explosion, because color is a modifier value, as is sex, applied to the underlying symbol. It's not a unique symbol for each combination.

IPv4 ran out because everything needs an IP to be on the net, and there are more humans than available addresses, and more gear than humans.

We don't need different characters per human, only to document existing languages and to account for the slow growth of modern hieroglyphs.


We can't assume a signed int, as character encodings limit the number of codepoints: "Excluding surrogates and noncharacters leaves 1,111,998 code points available for use." -- https://en.wikipedia.org/wiki/Unicode#:~:text=Excluding%20su...


But character encodings don't limit the number of codepoints. Unicode is just a big list of correspondences between an integer and a glyph. There's no limit to how many integers you can assign.

Unicode encodings are separate standards that give correspondences between Unicode code points (integers) and byte sequences. If Unicode changes in a way that invalidates an encoding, that just calls for a new encoding.


Yes, it could technically be extended, but the transition would be a massive undertaking, so in practice the encodings do limit the number of codepoints. UTF-16, which creates the limitation, is very widely used and required by major programming language standards like ECMAScript. A lot of software still can't cope with codepoints outside the BMP, even though the supplementary planes were established along with UTF-16 in 1996.


Besides the difference between the abstract and unlimited Unicode and the encodings, our current "modern" encodings, UTF-8 and the new UTF-16, are artificially restricted and could trivially be expanded to a huge number of codepoints just by removing those restrictions.


New UTF-16? I'm only aware of the original 1996 one, which uses all of its 20 surrogate-pair bits for the codepoint (unlike UTF-8, which can use bits to extend to more bytes). In my understanding, "just" removing that restriction would mean completely replacing the encoding, like UCS-2 being replaced with UTF-16. The new one may have some overlap, but transitioning to it would still be a huge undertaking, and far from trivial (quite a few programs today still use UCS-2, a quarter of a century after UTF-16 was introduced to replace it).
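
For reference, a Python sketch of how those 20 surrogate-pair bits work, and why they cap the code space at U+10FFFF:

    def to_surrogates(cp: int) -> tuple[int, int]:
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                      # 20 bits remain
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    print([hex(u) for u in to_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
    # total assignable: 0x110000 values minus the 2,048 surrogates
    print(0x110000 - 0x800)                           # 1112064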


Unicode has been limited to 21 bits for a while, so that UTF-8 is guaranteed to encode no more than four bytes per code point. The encoding could support the full 31-bit code space of its original design, but changing now would break a lot of validation code.


> If we assume a signed int for available space

Note that as it is currently defined, the Unicode codespace ranges from U+0000 to U+10FFFF (1,114,112 values), and excluding the 2,048 codepoints reserved for surrogates yields a total of 1,112,064 assignable code points.


> as it is currently defined

I find it completely implausible that this will ever change: the current size is baked in too heavily.

• The abomination UTF-16, which is distressingly popular, cannot possibly support it. Replacing UTF-16 would be a massive upheaval in many ecosystems (e.g. JavaScript, Qt, Windows), and there’s no real prospect of most of those environments moving away from UTF-16, because it’s a massive breaking change for them by now. Rather, if the code space were running out, they’d devise something along the lines of second-level surrogate pairs. (And then we’d curse UTF-16 even more, because it’d have ruined Unicode for everyone again.)

• All code that performs Unicode validation (which isn’t as much as it should be, but is still probably a majority) would need to be upgraded. Any systems not upgraded would either mangle or more commonly fail on new characters.

• UTF-8 software would also need to be adjusted, since it's artificially limited to the 21-bit space; and it wouldn't be just a matter of flipping a few switches here and there to remove that limit. There will be lots of small places that bake in the assumption that representing a scalar value requires no more than four UTF-8 code units.


1,112,064 code points ought to be enough for anybody. — Bill Gates


> If we assume a signed int for available space

While UTF-8 was originally defined as able to encode 31 bits, RFC 3629, because of the limitations of UTF-16, explicitly restricted the Unicode code space to 21 bits (or about 1.1 million codepoints).
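
A Python sketch of the resulting length rules (the 5- and 6-byte forms of the original design are gone under RFC 3629):

    def utf8_len(cp: int) -> int:
        if cp < 0x80: return 1        # ASCII
        if cp < 0x800: return 2
        if cp < 0x10000: return 3     # rest of the BMP
        if cp <= 0x10FFFF: return 4   # supplementary planes
        raise ValueError("beyond the RFC 3629 limit")

    print(utf8_len(0x10FFFF))  # 4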


> We don't need different characters per human

Unicode NFTs here we come


I think the current approach is to just invent yet another "meta layer" of characters and declare that this particular sequence of bytes/codepoints/surrogate pairs/grapheme clusters/extended grapheme clusters/ZWJ sequences/whatever else you can think of has a special meaning and does not behave like you think it does. See also Henri Sivonen's essay on Unicode string length [1].

So in a way, Unicode is already long past the point where you invent NATs and other hacks to buy time with the scarcity problem.

[1] https://hsivonen.fi/string-length/
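
To make the essay's point concrete, one user-perceived character can have several defensible "lengths" (Python; the emoji is the real "woman health worker" ZWJ sequence):

    s = "\U0001F469\u200D\u2695\uFE0F"     # 👩‍⚕️
    print(len(s))                          # 4  (code points)
    print(len(s.encode("utf-8")))          # 13 (UTF-8 bytes)
    print(len(s.encode("utf-16-le")) // 2) # 5  (UTF-16 code units)
    # grapheme clusters: 1, but counting those needs a library such as regex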


> it adds up.

It really, really doesn't.

According to UTS #51, as of Unicode 14 (and its ~140,000 allocated codepoints) there are under 3,500 codepoints classified as emoji.

And do keep in mind that # and ® are classified as emoji.

And incidentally, U+2654 "WHITE CHESS KING" (♔) was in Unicode 1.0. The moral panic around emoji is really tiring; it's absolute, utter nonsense, every single time.



