UTF-8 Everywhere

Manifesto

Purpose of this document

Our goal is to promote usage and support of the UTF-8 encoding, and to convince the reader that it should be the default choice of encoding for storing text strings in memory or on disk, for communication and all other uses. We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.

In particular, we believe that the very popular UTF-16 encoding (mistakenly used as a synonym to ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs (except for specialized libraries, which deal with text).

This document recommends choosing UTF-8 as string storage in Windows applications, despite the fact that this standard is less popular there, due to historical reasons and the lack of native UTF-8 support by the API. Yet, we believe that, even on this platform, the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.

We recommend avoiding C++ application code that depends on UNICODE or _UNICODE defines. This includes TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow and _tcslen. We also recommend alternative ways to reach the goals of these APIs.

We also believe that, if an application is not supposed to specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for beginners, but it lacks the most important part: how a programmer should proceed, if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the naïve assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).

However, it was soon discovered that 16 bits per character will not do for Unicode. In 1996, the UTF-16 encoding was created so existing systems would be able to work with non-16-bit characters. This effectively nullified the rationale behind choosing 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109449 characters, about 74500 of them being CJK ideographs.

[Figure: A little child playing an encodings game in front of a large poster about encodings. Nagoya City Science Museum. Photo by Vadim Zlotnik.]

Microsoft has, ever since, mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for both ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with UNICODE. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.

At the same time, in the Linux and the Web worlds, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on the planet Earth. Even though it gives a strong preference to English and therefore to computer languages (such as C++, HTML, XML, etc) over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.

Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs (e.g. UI code and file system APIs), but there is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.

Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 further advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.

In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from the implementations, though, is that the basic execution character set would be capable of storing any Unicode data. Then, every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode compatible’—and with UTF-8, it is also easy to do.

The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:

How to do text on Windows

The following is what we recommend to everyone else for compile-time checked Unicode correctness, ease of use and better portability of the code across platforms. This substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research into these recommendations led to the same conclusion. So here it goes:

Working with files, filenames and fstreams on Windows

Conversion functions

This guideline uses the conversion functions from the Boost.Nowide library (it is not yet a part of Boost):

std::string narrow(const wchar_t *s);
std::wstring widen(const char *s);
std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);

The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files, as well as means of reading and writing UTF-8 through iostreams.

These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.
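As an illustration of "any other conversion routine", here is a minimal portable sketch of a hand-written UTF-8↔UTF-16 converter. It is not the Boost.Nowide implementation. The names utf8_to_utf16 and utf16_to_utf8 are hypothetical, std::u16string stands in for the Windows std::wstring so that the code behaves the same on every platform, and replacing malformed input with U+FFFD is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode scalar value to `out` as UTF-8 (hypothetical helper).
static void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// UTF-16 -> UTF-8. Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}

// UTF-8 -> UTF-16. Each malformed byte is replaced with U+FFFD.
std::u16string utf8_to_utf16(const std::string& in) {
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80) { cp = b; len = 1; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; len = 2; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; len = 3; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; len = 4; }
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, encoded surrogates and out-of-range values.
        if (!ok || cp < min_cp[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += static_cast<char16_t>(0xFFFD);
            ++i;
            continue;
        }
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the same logic could back narrow() and widen() by copying between std::u16string and std::wstring.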

FAQ

  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am a Windows fan. I believe that they made the wrong choice in the text domain, because they made it earlier than others.—Pavel

  2. Q: Are you an Anglophile? Do you secretly think English alphabet and culture are superior to any other?

    A: No, and my country is non-ASCII speaking. I do not think that using a format which encodes ASCII characters in single byte is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers.

  3. Q: Why do you guys care? I program in C# and/or Java and I don’t need to care about encodings at all.

    A: Not true. Both C# and Java offer a 16 bit char type, which is less than a Unicode character, congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.

    Furthermore, you have to mind encodings when you are writing your text to files on disk, network communications, external devices, or any place for another program to read from. Please be so kind as to use System.Text.Encoding.UTF8 (.NET) in these cases, never Encoding.ASCII, UTF-16 or cellphone PDU, regardless of the assumptions about the contents.

    Web frameworks like ASP.NET do suffer from the poor choice of internal string representation in the underlying framework: the expected string output (and input) of a web application is nearly always UTF-8, resulting in significant conversion overhead in high-throughput web applications and web services.
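    The substring problem mentioned in this answer can be sketched in C++, with std::u16string standing in for a .NET or Java string (an assumption made only for illustration). The helper names prefix_units and is_well_formed_utf16 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mimics a code-unit-based Substring(0, n): it knows nothing about
// surrogate pairs and will happily cut one in half.
std::u16string prefix_units(const std::u16string& s, std::size_t n) {
    assert(n <= s.size());
    return s.substr(0, n);
}

// True if every surrogate in `s` is part of a high+low pair.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 1 == s.size() ||
                s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return false;                        // no low half follows
            }
            ++i;                                     // skip the low half
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;                            // stray low surrogate
        }
    }
    return true;
}
```

    For the four code units of "a", U+1F428 (d83d dc28) and "b", taking the first two units yields a string that ends in a lone high surrogate, which is not valid UTF-16.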

  4. Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?

    A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts. While it is ‘ANSI codepage’ for some, for others, it means ‘this code is broken and does not support non-English text’. In our programs, it means Unicode-aware UTF-8 string. This diversity is a source of many bugs and much misery: this additional complexity is something that the world does not really need, and the result is much Unicode-broken software, industry-wide.

  5. Q: UTF-16 characters that take more than two bytes are extremely rare in the real world. This practically makes UTF-16 a fixed-width encoding, giving it a whole bunch of advantages. Can’t we just neglect these characters?

    A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything, except for making software testing harder? What does matter, however, is that text manipulations are relatively rare in real applications—compared to just passing strings around as-is. This means the "almost fixed width" has little performance advantage (see Performance), while having shorter strings may be significant.

  6. Q: Why do you turn on the UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?

    A: This is a precaution against plugging a UTF-8 char* string into ANSI-expecting functions of the Windows API. We want it to generate a compiler error. It is the same kind of hard-to-find bug as passing an argv[] string to fopen() on Windows: it assumes that the user will never pass non-current-codepage filenames. You are unlikely to find this kind of bug by manual testing, unless your testers are trained to supply Chinese file names occasionally, and yet it is broken program logic. Thanks to the UNICODE define, you get an error for that.

  7. Q: Isn’t it quite naïve to think that Microsoft will stop using widechars one day?

    A: Let’s first see when they start supporting CP_UTF8 as a valid locale. This should not be very hard to do. Then, we see no reason why anybody would continue using the widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some of the existing Unicode-broken programs and libraries.

    Some say that adding CP_UTF8 support would break existing applications that use the ANSI API, and that this was supposedly the reason why Microsoft had to resort to creating the wide string API. This is not true. Even some popular ANSI encodings are variable length (Shift JIS, for example), so no correct code would become broken. The reason Microsoft chose UCS-2 is purely historical. Back then UTF-8 did not yet exist, Unicode was believed to be ‘just a wider ASCII’, and it was considered important to use a fixed-width encoding.

  8. Q: What are characters, code points, code units and grapheme clusters?

    A: Here is an excerpt of the definitions according to the Unicode Standard with our comments. Refer to the relevant sections of the standard for more detailed description.

    Code point
    Any numerical value in the Unicode codespace.[§3.4, D10] For instance: U+3243F.
    Code unit
    The minimal bit combination that can represent a unit of encoded text.[§3.9, D77] For example, UTF-8, UTF-16 and UTF-32 use 8-bit, 16-bit and 32-bit code units respectively. The above code point will be encoded as ‘f0 b2 90 bf’ in UTF-8, ‘d889 dc3f’ in UTF-16 and ‘0003243f’ in UTF-32. Note that these are just sequences of groups of bits; how they are stored further depends on the endianness of the particular encoding. So, when storing the above UTF-16 code units on an octet-oriented media, they will be converted to ‘d8 89 dc 3f’ for UTF-16BE and to ‘89 d8 3f dc’ for UTF-16LE.
    Abstract character

    A unit of information used for the organization, control, or representation of textual data.[§3.4, D7] The standard further says in §3.1:

    For the Unicode Standard, [...] the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.

    The definition is indeed abstract. Whatever one can think of as a character—is an abstract character. For example, tengwar letter ungwe is an abstract character, although it is not yet representable in Unicode.

    Encoded character
    Coded character

    A mapping between a code point and an abstract character.[§3.4, D11] For example, U+1F428 is a coded character which represents the abstract character 🐨 koala.

    This mapping is neither total, nor injective, nor surjective:

    • Surrogates, noncharacters and unassigned code points do not correspond to abstract characters at all.
    • Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character ‘Ω’, and must be treated identically.
    • Some abstract characters cannot be encoded by a single code point. These are represented by sequences of coded characters. For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.

    Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.

    User-perceived character
    Whatever the end user thinks of as a character. This notion is language dependent. For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
    Grapheme cluster
    A sequence of coded characters that ‘should be kept together’.[§2.11] Grapheme clusters approximate the notion of user-perceived characters in a language independent way. They are used for, e.g., cursor movement and selection.
    Character

    May mean any of the above. The Unicode Standard uses it as a synonym for coded character.[§3.4]

    When some programming language or library documentation says ‘character’, it almost always means a code unit. When an end user is asked about the number of characters in a string, she will count the user-perceived characters. When a programmer tries to count the number of characters, she will count the number of code units, code points, or grapheme clusters, according to the level of her expertise. All this is a source of confusion, as people conclude that, if for the length of the string ‘🐨’ the library returns a value other than one, then it ‘does not support Unicode’.
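    The code unit example from the definitions above (U+3243F) can be reproduced with a short sketch. The function names encode_utf8, encode_utf16 and serialize_utf16 are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Units16 = std::vector<std::uint16_t>;

// Only Unicode scalar values (code points other than surrogates) are encodable.
bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// One code point -> one to four 8-bit code units.
Bytes encode_utf8(std::uint32_t cp) {
    assert(is_scalar_value(cp));
    if (cp < 0x80) {
        return {static_cast<std::uint8_t>(cp)};
    }
    if (cp < 0x800) {
        return {static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    }
    if (cp < 0x10000) {
        return {static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    }
    return {static_cast<std::uint8_t>(0xF0 | (cp >> 18)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
}

// One code point -> one or two 16-bit code units.
Units16 encode_utf16(std::uint32_t cp) {
    assert(is_scalar_value(cp));
    if (cp < 0x10000) {
        return {static_cast<std::uint16_t>(cp)};
    }
    cp -= 0x10000;
    return {static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
            static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))};
}

// 16-bit code units -> octets, in UTF-16BE or UTF-16LE byte order.
Bytes serialize_utf16(const Units16& units, bool big_endian) {
    Bytes out;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        out.push_back(big_endian ? hi : lo);
        out.push_back(big_endian ? lo : hi);
    }
    return out;
}
```

    Note that the endianness step applies only to the 16-bit code units. The UTF-8 output is already a sequence of octets and needs no such step.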

  9. Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% of the memory per character?

    A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate any other. This includes XML, HTTP, filesystem paths and configuration files—they all use almost exclusively ASCII characters, and in fact UTF-8 is used just as often in those countries.

    For a dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. Anyway, if storage is at a premium, a lossless compression will be used. In such cases, UTF-8 and UTF-16 will take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).

    Here are the results of a simple experiment. The space used by the HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012–01–01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.

                       HTML Source (Δ UTF-8)    Dense text (Δ UTF-8)
    UTF-8              767 KB (0%)              222 KB (0%)
    UTF-16             1 186 KB (+55%)          176 KB (−21%)
    UTF-8 zipped       179 KB (−77%)            83 KB (−63%)
    UTF-16LE zipped    192 KB (−75%)            76 KB (−66%)
    UTF-16BE zipped    194 KB (−75%)            77 KB (−65%)

    As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, it only saves 20% for dense Asian text, and hardly competes with general purpose compression algorithms.
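    The effect is easy to reproduce on a tiny scale. The following sketch computes how many bytes a well-formed UTF-8 string would occupy in UTF-16. The function names utf16_bytes and compare_sizes and the two sample strings are assumptions chosen for illustration, not data from the experiment above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Size in bytes of the UTF-16 form of `utf8`, assuming well-formed input.
std::size_t utf16_bytes(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) continue;  // continuation byte, already counted
        units += (b >= 0xF0) ? 2 : 1;      // 4-byte sequences need a surrogate pair
    }
    return units * 2;                      // two bytes per 16-bit code unit
}

// Markup-heavy text grows in UTF-16, while dense CJK text shrinks.
void compare_sizes() {
    const std::string markup = "<p>\xE6\x97\xA5\xE6\x9C\xAC</p>";       // <p>日本</p>
    const std::string dense  = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";  // 日本語
    assert(markup.size() == 13 && utf16_bytes(markup) == 18);
    assert(dense.size() == 9 && utf16_bytes(dense) == 6);
}
```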

  10. Q: What do you think about Byte Order Marks?

    A: From the Unicode Standard (v6.2, p.30): Use of a BOM is neither required nor recommended for UTF-8.

    Byte order issues are yet another reason to avoid UTF-16. UTF-8 has no endianness issues, and the UTF-8 BOM exists only to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, most UTF-8 text files omit BOMs today.

  11. Q: What do you think about line endings?

    A: All files shall be read and written in binary mode since this guarantees interoperability—a program will always give the same output on any system. Since the C and C++ standards use \n as in-memory line endings, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows; however, any decent text viewer understands such line endings.
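    A minimal sketch of this policy with standard C++ streams follows. The helper names are hypothetical, and path encoding is deliberately left out of scope: on Windows, a UTF-8 path would still have to be widened before opening the file.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Writes `text` byte for byte: std::ios::binary disables the
// "\n" <-> "\r\n" translation that text mode performs on Windows.
void write_binary(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary);
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
}

// Reads the whole file back without any line-ending translation.
std::string read_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Writes, reads back, deletes the file and reports whether the bytes survived.
bool round_trips_unchanged(const std::string& path, const std::string& text) {
    write_binary(path, text);
    const std::string back = read_binary(path);
    std::remove(path.c_str());
    return back == text;
}

// Usage: assert(round_trips_unchanged("demo.tmp", "first\nsecond\n"));
```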

  12. Q: But what about performance of text processing algorithms, byte alignment, etc?

    A: Is it really better with UTF-16? Maybe so. ICU uses UTF-16 for historical reasons, thus it is quite hard to measure. However, most of the time strings are treated as cookies, not sorted or reversed every second use. Smaller encoding is then favorable for performance.

  13. Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?

    A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.

  14. Q: Is it really a fault of UTF-16 that people misuse it, assuming that it is 16 bits per character?

    A: Not really. But yes, safety is an important feature of every design.

  15. Q: If std::string means UTF-8, wouldn’t that get confused with code that stores plain text in std::strings?

    A: There is no such thing as plain text. There is no reason for storing codepage-ANSI or ASCII-only text in a class named ‘string’.

  16. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way. It’s either when calling the system, or when interacting with the rest of the world. Even if your interaction with the system is more frequent in your application, here is a little experiment.

    A typical use of the OS is to open files. This function executes in (184 ± 3)μs on my machine:

    void f(const wchar_t* name)
    {
        HANDLE f = CreateFile(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }

    While this runs in (186 ± 0.7)μs:

    void f(const char* name)
    {
        HANDLE f = CreateFile(widen(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }

    (Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. It was averaged over 5 runs. We used an optimized widen that relies on std::string contiguous storage guarantee given by C++11.)

    This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.

  17. Q: How do I write UTF-8 string literal in my C++ code?

    A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it is not a problem.

    If you still want to embed a special character, you can do it in C++11 as follows:

    u8"∃y ∀x ¬(x ≺ y)"

    With compilers that do not support ‘u8’ you can hard-code the UTF-8 code units as follows:

    "\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"

    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:

    "∃y ∀x ¬(x ≺ y)"

    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, this makes it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).

  18. Q: How can I check for presence of a specific ASCII character, e.g. apostrophe (') for SQL injection prevention, or HTML markup special characters, etc. in a UTF-8 encoded string?

    A: Do as you would for an ASCII string. Every non-ASCII character is encoded in UTF-8 as a sequence of bytes, each of them having value greater than 127. This leaves no place for collision for a naïve algorithm—simple, fast and elegant.

    Also, you can search for a UTF-8 encoded substring in a UTF-8 string as if it was a plain byte array—no need to mind code point boundaries. This is a design feature of UTF-8—a leading byte of an encoded code point can never hold value corresponding to one of trailing bytes of any other code point.
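    Both properties mean that the ordinary byte-oriented std::string search functions are sufficient. The wrappers below are hypothetical and only make the intent explicit. They assume that the haystack and the needle are well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte-wise scan for an ASCII character. No decoding is needed, because
// every byte of a multi-byte UTF-8 sequence has a value of 0x80 or above.
bool contains_ascii(const std::string& utf8, char ascii) {
    assert(static_cast<unsigned char>(ascii) < 0x80);
    return utf8.find(ascii) != std::string::npos;
}

// Byte-wise substring search. A match can only start on a code point
// boundary, because lead bytes and trailing bytes use disjoint value ranges.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

    The returned position is a byte offset (a code unit index), which is exactly what std::string::substr and similar functions expect.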

  19. Q: I have a complex large char-based Windows application. What is the easiest way to make it Unicode-aware?

    A: Keep the chars. Define UNICODE and _UNICODE to get compiler errors where narrow()/widen() should be used (this is done automatically by setting Use Unicode Character Set in Visual Studio project settings). Find all fstream and fopen() uses, and use wide overloads as described above. By now, you are almost done.

    If you use 3rd-party libraries that do not support Unicode, e.g. forwarding file name strings as-is to fopen(), you will have to work around it with tools such as GetShortPathName() as shown above.

  20. Q: What about Python? I heard they worked hard in v3.3 to support Unicode better.

    A: Perhaps, they should have done less and the support would be better. In the CPython v3.3 reference implementation, the internal string representation was changed. The UTF-16 was replaced by one of three possible encodings (ISO-8859-1, UCS-2 or UCS-4) depending on the actual string content. To add a single non-ASCII or non-BMP character, the entire string will often be implicitly converted to a different encoding. The internal encoding is transparent to the script. This design is meant to optimize the performance of indexing operations on Unicode code points. However, we argue that counting or indexing code points should not be important for the majority of uses—compared, for instance, to grapheme clusters. To our knowledge, Python currently provides no support of the latter.

    Therefore, we oppose representation-agnostic handling of strings, in favor of a representation-transparent API with a UTF-8 internal representation. Indexing operations would count code units rather than code points, as they in fact did before the change. This would simplify the implementation and also improve performance, e.g. in scripts dealing with the Web, which is already dominated by UTF-8 encoded text, thus making the Python programming language more applicable in the server-side world. One may argue about the safety of string-cutting operations by script programmers, but then again, the same argument is valid for splitting grapheme clusters. Even though Unicode is now fully supported, we believe that Python, as a modern tool with less historical burden to carry, must do a better job in text handling.

    Other than that, Jython and IronPython continue to rely on the less fortunate encoding used by their hosting platforms (Java and .NET, respectively) and care must be taken to handle the surrogate pairs correctly there.

  21. Q: I already use this approach and I want to make our vision come true. What can I do?

    A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.

    If you are a C or C++ library author, use char* and std::string with UTF-8 implied, and refuse to support ANSI code pages—since they are inherently Unicode-broken.

    If you are a Microsoft employee, push for implementing support of the CP_UTF8 as one of narrow API code pages.

Myths

Note: If you are not familiar with the Unicode terminology, please read this FAQ first.

Note: For the purpose of this discussion, indexing into the string is also a kind of character counting.

Counting characters can be done in constant time with UTF-16.

This is a common mistake by those who think that UTF-16 is a fixed-width encoding. It is not. In fact UTF-16 is a variable length encoding. Refer to this FAQ if you still deny the existence of non-BMP characters.

Many try to fix this statement by switching encodings, and come with the following statement:

Counting characters can be done in constant time with UTF-32.

Now, the truth of this statement depends on the meaning of the ambiguous and overloaded word ‘character’. The only interpretations that would make the claim true are ‘code units’ and ‘code points’, which coincide in UTF-32. However, code points are not characters, neither according to Unicode nor according to the end user. Some of them are non-characters. These should not be interchanged though. So, assuming we can guarantee that the string does not contain non-characters, each code point would represent a single coded character, and we could count them.

But is this such an important achievement? Why does the above concern arise at all?

Counting coded characters or code points is important.

The importance of code points is frequently overstated. This is due to misunderstanding of the complexity of Unicode, which merely reflects the complexity of human languages. It is easy to tell how many characters are there in ‘Abracadabra’, but it is not so simple for the following string:

Приве́т नमस्ते שָׁלוֹם

The above string consists of 22 (!) code points but only 16 grapheme clusters. So, ‘Abracadabra’ consists of 11 code points, while the above string consists of 22 code points, or 20 if converted to NFC. Yet, the number of code points is irrelevant to almost any software engineering question, with perhaps the only exception of converting the string to UTF-32. For example:

See also: How Twitter counts characters.

In NFC each code point corresponds to one user-perceived character.

No, because the number of user-perceived characters that can be represented in Unicode is virtually infinite. Even in practice, most characters do not have a fully composed form. For example, the NFD string from the example above, which consists of three real words in three real languages, will consist of 20 code points in NFC. This is still far more than the 16 user-perceived characters it has.

The string length() operation must count user-perceived or coded characters. If not, it does not support Unicode properly.

Unicode support of libraries and programming languages is frequently judged by the value returned for the ‘length of the string’ operation. According to this evaluation of Unicode support, most popular languages, such as C#, Java, and even the ICU itself, would not support Unicode. For example, the length of the one character string ‘🐨’ will often be reported as 2 where UTF-16 is used for the internal string representation, and as 4 in languages that internally use UTF-8. The source of the misconception is that the specifications of these languages use the word ‘character’ to mean a code unit, while the programmer expects it to be something else.
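The numbers quoted for ‘🐨’ (U+1F428) can be checked with a short sketch. The count_code_points overloads and koala_lengths are hypothetical helpers that assume well-formed input. The standard size() member reports code units in both encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Code points in well-formed UTF-8: every byte that is not a
// continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Code points in well-formed UTF-16: low surrogates only complete a pair.
std::size_t count_code_points(const std::u16string& utf16) {
    std::size_t n = 0;
    for (char16_t u : utf16) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}

// size() reports code units: 4 in UTF-8 and 2 in UTF-16 for one koala.
void koala_lengths() {
    const std::string k8 = "\xF0\x9F\x90\xA8";
    const std::u16string k16{char16_t(0xD83D), char16_t(0xDC28)};
    assert(k8.size() == 4 && count_code_points(k8) == 1);
    assert(k16.size() == 2 && count_code_points(k16) == 1);
}
```

Even the code point count is not the user-perceived length: ‘е́’ written as U+0435 followed by U+0301 is two code points but one user-perceived character.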

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov. It is a result of our experience and research of real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes to make Unicode-aware programming easier, ultimately improving the experience of users of those programs written by human engineers. None of us is involved in the Unicode Consortium. Special thanks to Glenn Linderman for providing information about Python.

Much of the text was inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there. Additional inspiration came from the development conventions at VisionMap and Michael Hartl’s tauday.org.


Bitcoin donate to: 1UTF8gQmvChQ4MwUHT6XmydjUt9TsuDRn
The cash will be used for research and promotion.
Last modified: 2014-09-30