Python 3's approach is the most correct: Unicode defines text as a sequence of code points. UTF-whatever is an implementation detail.


Python 3’s approach snatched defeat from the jaws of victory.

They aimed to work with a nice, clean, abstract concept, untrammelled by encoding squabbles. They failed badly by choosing code points rather than scalar values (Unicode strings are sequences of scalar values, not arbitrary code points: '\udead' is a valid Python string, but you can’t encode it into any UTF-* format, since [U+DEAD] is not a valid Unicode string).
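
A minimal sketch of that surrogate problem, in plain Python 3 (standard library only):

    # A lone surrogate is storable as an element of a Python str...
    s = '\udead'
    print(len(s))          # 1

    # ...but it cannot be serialised to any UTF-* form.
    try:
        s.encode('utf-8')
    except UnicodeEncodeError as err:
        print(err)         # surrogates not allowed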

Then they also neglected to observe that they were optimising for something that you should practically never be doing, so that now everyone has to pay the costs. As the article summarises it part-way through: “The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing.”

Seriously, Python 3’s approach is almost the worst of all available worlds. I loathe UTF-16 with such fiery passion that I can’t quite bring myself to say Python 3’s approach is worse than weak UTF-16, but it’s of similar badness in practical terms. The decisions were very clearly made by people who were not experts in the domain and who were caught up in a Concept of Mathematical Purity. They’ve since walked some of it back as far as they could, and I think they did recognise it all as a mistake (no citation, just a vague memory of seeing such an admission), but they can’t fix it all properly without a breaking change.


> Unicode defines text as a sequence of code points.

Does it? Do you have a link?

[edit] I looked up the spec and here is what it says.

> The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters. An encoded character is represented by a number from 0 to 10FFFF₁₆, called a code point. A text element, in turn, is represented by a sequence of one or more encoded characters. [1]

The definition of 'text' in the context of Unicode seems explicitly not to be a sequence of code points, but rather a more nebulous sequence of aggregations of code points. It's probably closest to a grapheme cluster, but they seem to want to avoid pinning it down.

[1] https://www.unicode.org/versions/Unicode15.0.0/UnicodeStanda... p. 7 (1.3 - Text Handling), PDF page 33.


Review chapter 2.2 Unicode Design Principles in the Unicode Standard: "Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."

Text elements are an abstract concept whose definition depends upon what is being processed. It might be a grapheme, it might be a word, etc.


There might be something a little imprecise here: code points vs code units vs character codes.

I'm open to being wrong but I would be very surprised if they defined text as a "series of code units" the count of which can vary by encoding even for the same character. IMO in this context 'character codes' would likely be far more consistent with 'code points' and they're just trying to differentiate between styled and un-styled text. Whereas the 1.3 definition appears to be trying to make an authoritative definition of 'text.'

If we read 2.2's "character codes" as code points, then that is consistent with 1.3, where a text element can be made up of multiple code points.

[edit] I originally flipped 'units' and 'codes' - cleaned it up.


"Character code" is short for "character code point" or just code point. All Unicode algorithms and properties are defined in terms of the code point. UTF encodings are just a way of encoding a code point. From Unicode's perspective, you care about what is encoded (i.e. the code point) and not how it is encoded (i.e. UTF-8).

Unicode is one of the most poorly understood topics. I think the confusion stems from 1. most programming languages getting the abstraction wrong, and 2. programmers trying to reconcile their non-technical interpretation of what "character" means.


I agree with everything you said; I think I'm just trying to reconcile that with the top of the thread saying Python was the most correct because it was returning '7 code points' and that 'UTF-whatever is an implementation detail'.

But 7 is not the number of code points/USVs - that's the number of UTF-16 code units. The string is 5 USVs. If UTF-whatever is an implementation detail, wouldn't the correct answer to length be 5?

What am I missing haha.


Python does return 5. JavaScript returns 7. Python is returning the number of code points, JavaScript is returning the number of UTF-16 code units.
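
For reference, assuming the facepalm emoji string from the article (U+1F926 U+1F3FC U+200D U+2642 U+FE0F), a quick Python sketch of where each number comes from:

    # The article's example, built explicitly from its five code points
    s = '\U0001F926\U0001F3FC\u200D\u2642\uFE0F'

    print(len(s))                            # 5  code points (Python's len)
    print(len(s.encode('utf-16-le')) // 2)   # 7  UTF-16 code units (what JS .length counts)
    print(len(s.encode('utf-8')))            # 17 UTF-8 bytes (what Rust's str::len counts)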


There's my mistake. Thank you. I flipped them in my head; it's been a long day.


Treating Unicode strings as a sequence of code points is a completely valid thing to do, but is usually not what you actually care about when dealing with text. Really, are code points any less of an implementation detail?


Code points are what you care about when you do any kind of text-based format encoding or decoding. Any of JSON, XML, HTML, YAML or whatever is defined as a sequence of code points. There is no reason to complicate these with visual representation-specific concepts.

If you have to care about the visual representation of text then you probably need to be familiar with other concepts as well.
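
As a small illustration (the document and key name here are made up, not from the article): the same sequence of code points can travel as UTF-8 or UTF-16, and a parser defined over code points sees identical text either way.

    import json

    doc = '{"name": "Zoë"}'   # arbitrary example containing a non-ASCII code point

    # json.loads accepts bytes and auto-detects UTF-8/16/32; the parse is
    # defined over the decoded code points, not over the raw bytes.
    assert json.loads(doc.encode('utf-8')) == json.loads(doc.encode('utf-16'))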


But, given the root ancestor of this comment, it’s worth clarifying that Python’s approach to strings doesn’t help at all with things like decoding JSON/XML/HTML/YAML; what Python gives you is random access by code point index, which you won’t ever need to use in such tasks.


I think the parent means the most correct of the three examples given, from Rust, JS and Python.

Especially because the article says that Python's take is the worst.


Yes.

They are less of an implementation detail.

Grapheme > Code point > Encoding > Endianness > Media

It's all "implementations", but some are lower than others.
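
To make the layering concrete, a small Python sketch (grapheme clusters sit above all of this and need a segmentation library, so they're left out here):

    s = 'héllo'
    print(len(s))                 # 5 code points: the abstract layer Python exposes
    print(s.encode('utf-8'))      # 6 bytes: encoding chosen
    print(s.encode('utf-16-le'))  # 10 bytes: encoding plus endianness chosen
    print(s.encode('utf-16-be'))  # same code points and encoding, different byte order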


Unicode defines text as a number of different types of things: sequences of code points, sequences of graphemes, sequences of grapheme clusters. Furthermore, code points are different depending on how you normalize them. Accented characters can be written in two different ways and have a different number of code points depending on how you write them (and whether normalization is used).


Graphemes are a made-up human thing that, while useful, is locale dependent. Most people, when they talk about grapheme clusters, mean the default "locale-independent" graphemes, but that's not the only segmentation (in Hungarian, for example, 'ly' is a single letter). Having the same string be two different lengths in two countries is… let's go with surprising. The common denominator where everyone computes the same number is code points.
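
A sketch of default grapheme clusters vs code points, assuming the third-party regex package (the standard library has no grapheme iterator; \X matches one default cluster):

    import regex   # pip install regex

    s = 'e\u0301'                         # "é" written as e + combining acute accent
    print(len(s))                         # 2 code points
    print(len(regex.findall(r'\X', s)))   # 1 default grapheme cluster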


Except they won't (unless they normalize). à and à can have different numbers of code points (the first is U+00E0, the second is U+0061 followed by combining U+0300).
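
A quick check of the two spellings with the standard library's unicodedata:

    import unicodedata

    precomposed = '\u00e0'    # à as a single code point
    decomposed = 'a\u0300'    # à as a + combining grave accent

    print(len(precomposed), len(decomposed))   # 1 2
    print(precomposed == decomposed)           # False
    # Normalising both to the same form (here NFC) makes the counts agree.
    print(unicodedata.normalize('NFC', decomposed) == precomposed)   # True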


There is no "most correct", since the "length" of UTF-encoded text is ambiguous. The point of the post is to highlight which semantics are the most useful, and the trade-offs.



