February 13, 2009

Because You ASCIId, Part III

Filed under: Main — admin @ 12:01 am

Though there are only 8 bits in a byte, and only 256 possible values for those 8 bits, there are far more than 256 characters to display on a computer. Add in the characters used for Japanese, Chinese, Arabic, Russian, and other languages, plus a host of common symbols from all over, and there’s no way you could stuff them all into 8 bits. That’s why computer scientists went to 16 bits and developed the Unicode standard.

Work on the Unicode scheme began in 1987, with the goal of superseding ASCII and defining character codes for use in computers worldwide. Version 1.0 of the standard was published in 1991.

The original Unicode spec used 16 bits, which gave it room to define 65,536 characters. The scientists were proud of themselves when they used far fewer than that. Later, however, they discovered that language researchers wanted even more characters, including ancient Egyptian hieroglyphs and scripts not used in thousands of years. So the Unicode spec was updated to make room for over a million character codes (1,114,112, to be exact).
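To put numbers on that growth, here is a small Python sketch. It assumes Python 3.3 or later, where sys.maxunicode reports the highest Unicode character code; the variable names are just for illustration.

```python
import sys

# How many values fit into 16 bits.
sixteen_bit_values = 2 ** 16

# The highest character code Unicode allows (U+10FFFF), as reported
# by Python 3.3 or later.
highest_code_point = sys.maxunicode
total_code_points = highest_code_point + 1

# How many 16-bit-sized blocks ("planes") the full code space holds.
planes = total_code_points // sixteen_bit_values

print(sixteen_bit_values)        # 65536
print(hex(highest_code_point))   # 0x10ffff
print(total_code_points)         # 1114112
print(planes)                    # 17
```

In other words, today's Unicode code space is 17 times the size of a 16-bit one.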

To help keep the players straight, you need to know a few terms:

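UTF-8 is the 8-bit Unicode Transformation Format and it’s the closest thing to the original ASCII in the Unicode world: the ASCII characters keep their single-byte codes, and every other character takes two to four bytes. Most of the basic text stored in a computer file today is UTF-8, not ASCII.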

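UTF-16 is the 16-bit Unicode Transformation Format. It stores the most common characters (those with codes below 65,536) in a single 16-bit value and uses a pair of 16-bit values, called a surrogate pair, for everything else. UTF-16 can be found on your PC: Windows uses it internally, and it’s also available as a file format. In Windows programming, UTF-16 text is often referred to as wide character text because it takes at least two bytes (16 bits) to store each character.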

UTF-32 is the 32-bit format. It stores every one of the million-plus character codes in a fixed four bytes.
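To see how the three formats differ in practice, the following Python sketch encodes a few sample characters and counts the bytes each format needs. The sample characters and the encoded_sizes helper are my own picks for illustration, and the little-endian codecs are used so that Python doesn't add a byte order mark to the count.

```python
# Sample characters: ASCII letter, accented letter, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F609"]

def encoded_sizes(ch):
    """Return the number of bytes ch occupies in each Unicode format."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Plain ASCII text is byte-for-byte identical when saved as UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))
```

The letter A needs 1, 2, and 4 bytes respectively, while the winking emoji at U+1F609 needs 4 bytes in all three formats; in UTF-16 those 4 bytes are a surrogate pair.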

Normally the files you save on your computer are encoded using UTF-8, but you do have a choice. For example, in the Windows Notepad program, you can use the Encoding drop-down menu to choose the text format for your text documents, as shown here:

[Image: Notepad’s Encoding drop-down menu]

The options listed are ANSI, which is the traditional Notepad text encoding, then Unicode (UTF-16, little endian), Unicode Big Endian (UTF-16 with the byte order reversed), and finally UTF-8. Saving a file in a Unicode format means that older programs that don’t understand Unicode will not properly read the text.
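Here is a rough Python sketch of what byte order means for 16-bit text, and of what an older program might show when it reads Unicode text as if it were ANSI. It assumes that ANSI means the Windows-1252 code page, which is typical on Western-language systems but not universal; the sample text is made up.

```python
import codecs

text = "A\u00e9"  # the letter A followed by an accented e

# The same two characters with the bytes of each 16-bit value swapped.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little.hex())  # 4100e900
print(big.hex())     # 004100e9

# The byte order marks a program can write at the start of a file
# to say which order was used.
print(codecs.BOM_UTF16_LE.hex())  # fffe
print(codecs.BOM_UTF16_BE.hex())  # feff

# An older program that assumes ANSI (here: Windows-1252) misreads
# the two UTF-8 bytes of the accented e as two separate characters.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)
```

The misread text shows the accented e as two unrelated characters, which is the kind of garbage you see when a program guesses the wrong encoding.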

From the 128 ASCII characters all the way up to the million-plus character codes available in Unicode, I think that we’re pretty well set for the future. Well, that is, until the Zlaxons invade and we have to deal with their new alphabet. Hopefully their conquest is still a few years off.

2 Comments

  1. Or the Transylvanians invade… I’m not sure which one I’d rather.

    Also, aren’t the newer Unicode things only good under Windows 2000\XP\Vista? And Microsoft released an unsupported patch for older OSes for the Unicode, or something like that? Or is that some freaky dream of mine? 😉

    Comment by Douglas — February 13, 2009 @ 4:13 am

  2. Yes, the Big Endian Unicode thing works best with the most recent versions of Windows/Mac OS. Even then, not all the characters may be available or properly understood by the OS. I’m surprised they don’t debug this thing more. Eventually they *might* get it right!

    Transylvanians! Now that’s a jump to the left!

    Comment by admin — February 13, 2009 @ 10:24 am
