Sunday, April 17, 2011

[IT] Short Discourse on Encoding (ASCII, UTF-8, etc.)

As Joel Spolsky, co-founder of Fog Creek Software and the very popular Q&A site Stack Overflow, says, every programmer ought to know the basics of character encoding, considering how many of us don't. It's really quite sad. So this will be a brief gist of what I gleaned from reading his article.

In the old days, when English still dominated the computer world, a code called ASCII was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days used 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare. Codes below 32 were unprintable and were used for control purposes, like 7, which made the computer beep.
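A minimal sketch of that mapping in Python, whose built-in ord() and chr() agree with ASCII for code points 0-127:

```python
# The ASCII values mentioned above.
print(ord(" "))   # space is 32
print(ord("A"))   # the letter "A" is 65
print(chr(65))    # and back again: "A"

# Code 7 is the unprintable "bell" control character that made
# the computer beep; in string literals it's the "\a" escape.
print(chr(7) == "\a")
```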

But the codes from 128-255 were left entirely open to the geographic region's inclination. For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel, they would arrive as rגsumגs.
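You can reproduce this clash today, since the relevant codecs still ship with Python's standard library: CP437 is the old American PC codepage and CP862 is the Hebrew one.

```python
# The same byte, 130 (0x82), means different things in different codepages.
raw = bytes([130])
print(raw.decode("cp437"))  # é on an American PC
print(raw.decode("cp862"))  # ג (gimel) on an Israeli PC

# A résumé written on a CP437 machine, read on a CP862 machine:
resume = "résumé".encode("cp437")
print(resume.decode("cp862"))  # rגsumג
```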

The eventually adopted ANSI standard locked down everything below 128 but left everything above 128 to your regional taste. Before the Internet, this patchwork worked to a degree, but as soon as the Internet came along, something had to change. Enter Unicode.

Unicode

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode makes you think a different way about encoding characters. Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is another story.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041.
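Python makes the code-point idea concrete: string escapes use the same hexadecimal numbers as the U+ notation, and ord() recovers the code point from a character.

```python
# The Arabic letter Ain, U+0639, and the English letter A, U+0041.
ain = "\u0639"
print(hex(ord(ain)))   # 0x639
print(hex(ord("A")))   # 0x41
```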

Encodings

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, it could, depending on whether the machine stored its bytes in high-endian or low-endian order. So people were forced to come up with the bizarre convention of storing an FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark, and if you are swapping your high and low bytes it will look like FF FE.
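A sketch of the two byte orders and the BOM, using Python's UTF-16 codecs:

```python
import codecs

# The two possible byte orders for "Hello" in the two-byte encoding.
print("Hello".encode("utf-16-be").hex())  # 00480065006c006c006f
print("Hello".encode("utf-16-le").hex())  # 480065006c006c006f00

# The byte order mark itself, in each order.
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF16_LE.hex())  # fffe

# The plain "utf-16" codec prepends a BOM so a reader can tell which
# order was used (the order itself depends on the platform).
bom = "Hello".encode("utf-16")[:2]
print(bom in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))  # True
```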

For a while it seemed like that might be good enough, but American programmers were complaining. English rarely used code points above U+00FF and the thought of wasted bytes shocked them. Besides, who's going to convert the ASCII character sets over?

Thus was invented UTF-8. UTF-8 is another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or more bytes (the original design allowed up to 6; the modern standard caps it at 4).

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F.
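Both halves of that claim are easy to check in Python: ASCII text encodes to exactly its ASCII bytes, while higher code points take multiple bytes.

```python
# "Hello" in UTF-8 is byte-for-byte identical to ASCII.
print("Hello".encode("utf-8").hex())   # 48656c6c6f

# The Arabic letter Ain, U+0639, takes two bytes in UTF-8.
print("\u0639".encode("utf-8").hex())  # d8b9

# Decoding round-trips back to the original string.
print(b"\x48\x65\x6c\x6c\x6f".decode("utf-8"))  # Hello
```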

There are actually loads of other encodings out there, but since December 2007 the most popular encoding on the web has been UTF-8, so it's good to know more about it. It's important to know the encoding of any message sent over the web. There's no such thing as plain text anymore!

For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this seems to be a Catch-22: "how can you read the HTML file until you know what encoding it's in?!" Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be pretty much the first thing in the <head> section, because as soon as the web browser sees this tag it's going to stop parsing the page and start over, reinterpreting the whole page using the encoding you specified.

What if a poorly informed web designer doesn't include this tag, though? Apparently, every browser does something different to try to guess the character set. Sometimes they get it right. Sometimes they get it wrong, and the reader is left staring at gibberish while the browser tries to figure out what encoding the web designer actually meant. Is it Chinese? Hindi? Arabic? Good luck!

Thus the central point: for every web document you code up, be sure to declare the encoding!

I heartily recommend reading his whole article here:
