I still don't understand why Unicode has all these obscure symbols but still hasn't added a complete set of superscript/subscript numbers and letters.

https://stackoverflow.com/questions/6638471/why-does-the-uni...

To quote a reply from the above StackOverflow thread: "So, they added a snowman with snow ☃ AND a snowman without snow ⛄, so that the weather forecaster of this world can avoid the dull snowflake ❄, but we will never get our missing superscript q‽"



I don't understand why Unicode must (should?) contain superscript and subscript glyphs at all. The declared goal of Unicode is to encode all characters used by all languages, past and present. No language uses subscripts and superscripts as separate characters; they are a typesetting property and should be handled by other means, not by character/glyph encoding. Should Unicode include struck-out variants of ALL characters? Underlined? Double-underlined? Small-caps variants of every letter for languages whose typographic tradition uses small caps?

And, BTW, what do you mean by "all letters"? Should Unicode contain sub/superscript variants of Hangul or Devanagari, or of letters from hundreds of other languages that don't use the Latin alphabet? Then Unicode would have to roughly triple in size, the hieroglyphic part aside (and why shouldn't hieroglyphs be sub/superscripted?).


This is probably an edge case, but I work in lab software that uses chemical symbols and having sub and super characters saves lots of headaches. I can just store "CO₂" in a database, query it, and display it back as a simple string, or display values in scientific notation like 1,3×10³, without having to use any formatting.
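A minimal sketch of that workflow in Python, assuming formulas arrive as plain ASCII such as "CO2". The helper names `chemical_formula` and `scientific` are made up for illustration and are not from any lab software mentioned here.

```python
# Translation tables from ASCII digits (and minus) to existing Unicode
# subscript/superscript characters.
SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
SUPERSCRIPT_DIGITS = str.maketrans("0123456789-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁻")


def chemical_formula(ascii_formula: str) -> str:
    """Turn every digit into a subscript, e.g. 'CO2' -> 'CO₂'."""
    return ascii_formula.translate(SUBSCRIPT_DIGITS)


def scientific(mantissa: str, exponent: int) -> str:
    """Format mantissa×10^exponent using superscript digits."""
    return f"{mantissa}×10{str(exponent).translate(SUPERSCRIPT_DIGITS)}"


print(chemical_formula("CO2"))
print(scientific("1,3", 3))
```

Note that this naive version subscripts every digit, so a leading coefficient as in "2H2O" would be converted too; a real implementation would need to tell coefficients from atom counts.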

But to be honest, I'm not sure what the parent comment wants to see added, because at the moment having all the letters from A-Z, the numbers from 0-9, and the plus, minus, and equals signs as both subscript and superscript seems to be enough.


Upper-case subscripts are missing, for one: I'm not allowed to talk about the normal force F_N in plain text email. Superscript and subscript Greek letters would also be nice to have, e.g. in the context of relativity.


Why not Devanagari then? This Europe-centric point of view bothers me.

Also, I've seen a lot of different symbols as subscripts in mathematical and physical articles, like squares, triangles, arrows, etc.


> Why not Devanagari then? This Europe-centric point of view bothers me.

Sure: As I mentioned in another comment, I'd add markers to enable arbitrary super and subscripting.

However, the question I responded to was asking what specifically people were missing in practice, and the examples I gave are things I personally would have used if they had been available.


> Should Unicode contain sub/superscript variants of Hangul or Devanagari, or of letters from hundreds of other languages that don't use the Latin alphabet?

Nope, you'd use markers similar to U+200E (LEFT-TO-RIGHT MARK) and U+200F (RIGHT-TO-LEFT MARK) that already exist to indicate text direction (which is also a typesetting property).
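For reference, the two existing marks can be inspected with Python's standard `unicodedata` module. This is only a sketch of what the marks are today; it says nothing about how a hypothetical super/subscript marker would be specified.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

for mark in (LRM, RLM):
    print(
        f"U+{ord(mark):04X}",
        unicodedata.name(mark),
        unicodedata.category(mark),       # general category
        unicodedata.bidirectional(mark),  # bidi class
    )

# The marks are invisible when rendered but are ordinary code points in the string.
text_with_mark = "abc" + RLM
print(len(text_with_mark))
```

Both are zero-width format characters (general category Cf) that behave like a strong left-to-right or right-to-left character for the bidirectional algorithm.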


They are relevant because Unicode had to define bidirectional rendering, and not every rendering can be inferred automatically from the logical (abstract) characters. Unicode has no reason to define general text rendering, subscripts and superscripts included, so it has no reason to define control characters for them.


> Unicode had to define bidirectional rendering

Why? They could have left this for a higher layer to handle.


Unicode defines characters, their semantics and (very flexible) guidelines for rendering them. Unlike, say, bold, italic or super/subscripts, bidirectionality is an intrinsic property of those characters and can't be easily refactored.
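The "intrinsic property" point can be checked directly: every character carries a bidirectional class in the Unicode Character Database. A small sketch using Python's standard `unicodedata` module, with sample characters chosen purely for illustration:

```python
import unicodedata

samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "digit 1": "1",
}

# Map each sample to its bidi class as recorded in the character database.
bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}

for label, bidi in bidi_classes.items():
    print(f"{label}: {bidi}")
```

There is no comparable per-character property for bold, italic, or raised/lowered placement, which is the asymmetry this comment is pointing at.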


Should a universal text encoding provide a way to encode the names of mathematical and physical quantities?

In my opinion, yes. If it can't, it's not fit for purpose, no matter what is or is not an intrinsic property of some characters...


> Unicode defines characters, their semantics

Unicode specifically states that it doesn't define the semantics of characters. That would seriously interfere with its purpose of defining characters.

There are some notable exceptions, and they are acknowledged to be mistakes.


> Unicode specifically states that it doesn't define the semantics of characters.

The Unicode Standard explicitly says otherwise:

> Characters have well-defined semantics. These semantics are defined by explicitly assigned character properties, rather than implied through the character name or the position of a character in the code tables (see Section 3.5, Properties). [1]

> The Unicode Standard associates a rich set of semantics with characters and, in some instances, with code points. The support of character semantics is required for conformance; see Section 3.2, Conformance Requirements. [2]

To be fair, it refers to "character" semantics, which is more or less abstracted by character properties. It is not as if, for example, △ (U+25B3 WHITE UP-POINTING TRIANGLE) can only ever be used to denote triangles. But it has defined semantics in the sense that the character has the properties expected of such symbols.

[1] https://www.unicode.org/versions/Unicode14.0.0/ch02.pdf#page...

[2] https://www.unicode.org/versions/Unicode14.0.0/ch04.pdf#page...
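To make "semantics defined by properties" concrete, here is a sketch that reads a few properties through Python's standard `unicodedata` module. The `describe` helper is a made-up name, and it shows only a handful of the properties the standard defines.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),
    }


triangle = describe("\u25b3")         # the white triangle symbol
superscript_two = describe("\u00b2")  # SUPERSCRIPT TWO

print(triangle)
print(superscript_two)
```

The superscript digit carries a `<super>` compatibility decomposition to the plain digit 2, so the standard records it as a presentation variant of an existing character, which is relevant to the argument upthread.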


Unicode superscripts and subscripts are not intended for mathematical use [1].

[1] https://unicode.org/faq/ligature_digraph.html#Pf8


That's a cop-out. You could equally say that new emojis shouldn't be added because you should use inline images for those. Or RTL markers shouldn't be added because you should use dedicated text styling for that.

There are a ton of places that don't support superscript markup.


> You could equally say that new emojis shouldn't be added because you should use inline images for those.

If emojis hadn't been allocated out of compatibility concerns, this would have been exactly my opinion from day 1. To be honest, I'm still not happy with the current emoji assignments and semantics. Even the Unicode people aren't satisfied; there are numerous proposals for replacing emoji with something else (example keyword: QID emoji).

> RTL markers shouldn't be added because you should use dedicated text styling for that.

> There are a ton of places that don't support superscript markup.

Unlike most text attributes, bidirectionality is an intrinsic property of abstract characters and thus squarely within Unicode's scope. Ideally you can't and shouldn't make an LTR character behave like an RTL character or vice versa. Bidi control characters only exist to correct automatic rendering, and can be presented out of band (the Bidi specification is explicitly designed with this use case in mind [1]).

[1] https://www.unicode.org/reports/tr9/#Markup_And_Formatting


> You could equally say that new emojis shouldn't be added because you should use inline images for those.

Well, that's really a better solution. Or a Unicode character that allows you to set a pixel on a 256x256 grid and one to compose them. Strike that. Better not give anyone bad ideas.


Almost sounds like you reinvented DEC Sixel.


Should we also have slanted, bold, semi-bold, light and underlined versions of every code point? Versions with/without serifs? For monospaced text? Those are all presentational matters. That we have super/subscripts in Unicode in the first place seems to have been just a hack to help terminal emulator software deal with obsolete encodings like ISO-8859-1: https://www.unicode.org/L2/L2000/00159-ucsterminal.txt


𝐒𝐡𝐨𝐮𝐥𝐝 𝘄𝗲 𝗵𝗮𝘃𝗲 𝗯𝗼𝗹𝗱 𝙖𝙣𝙙/𝘰𝘳 𝑠𝑙𝑎𝑛𝑡𝑒𝑑 𝒄𝒉𝒂𝒓𝒂𝒄𝒕𝒆𝒓𝒔 𝗶𝗻 𝗨𝗻𝗶𝗰𝗼𝗱𝗲? 𝓘𝓽 𝓼𝓮𝓮𝓶𝓼 𝓈𝑜𝓂𝑒𝑜𝓃𝑒 𝔱𝔥𝔬𝔲𝔤𝔥𝔱 𝖘𝖔!


Those are intended for maths, not for formatted text. Variables in mathematics are usually a single character, so there is a great variety of ways to format the characters to create different symbols. Diacritical marks, underlines, etc. are also used for this.
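One practical consequence, sketched in Python: the mathematical alphanumerics are compatibility characters, so NFKC normalization folds them back to plain letters. The `to_math_bold` helper is a made-up name for illustration; it relies on the Mathematical Bold capitals and small letters starting at U+1D400 and U+1D41A respectively.

```python
import unicodedata


def to_math_bold(text: str) -> str:
    """Map ASCII letters to Mathematical Bold letters; leave the rest alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


bold = to_math_bold("Should")
folded = unicodedata.normalize("NFKC", bold)

print(bold, "->", folded)
```

Search engines and other software that normalize with NFKC will therefore treat such "styled" text as ordinary letters, while software that does not normalize may fail to match it at all.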



> but they still haven't added all superscript/subscript numbers and letters

That would triple the size of Unicode.


They would just need to add one Unicode modifier for superscript and one for subscript like there is for gender and skin color.
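For comparison, this is how the existing skin-tone mechanism looks at the code point level, sketched in Python. A super/subscript modifier is hypothetical and does not exist in Unicode; gender variants, as far as I know, are built from zero-width-joiner sequences rather than a modifier character.

```python
import unicodedata

THUMBS_UP = "\U0001F44D"
MEDIUM_SKIN_TONE = "\U0001F3FD"  # one of five skin-tone modifier characters

# A modified emoji is simply the base character followed by the modifier.
toned = THUMBS_UP + MEDIUM_SKIN_TONE
names = [unicodedata.name(cp) for cp in toned]

print(len(toned), names)
```

The rendering engine decides whether the pair is drawn as one glyph; software that doesn't support the modifier shows the base emoji followed by a separate color swatch.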


Fair enough, but general formatting codes would overlap with what is already supported in rich-text formats like HTML or LaTeX. Unicode is a standard for encoding characters; it is not supposed to be a rich-text document format itself.


I mean they could at least add q.


I've been told we'll never run out of space in Unicode.



