UniCode

Unicode: What it is and why we need it

Scenario(s):
You are teaching an intermediate class in Russian. You would like to create an on-line discussion forum where you can post excerpts of Russian literature and have your students discuss possible interpretations... and you would like to use the Cyrillic characters (e.g., ё ж з и й к). Perhaps you are teaching Japanese and would like to publish an informational web page for your students demonstrating the use of Kanji (e.g., 日本語), Katakana (e.g., ニホンゴ), and Hiragana (e.g., にほんご) with explanations in English. Or perhaps you are teaching Hindi and are considering a class project where your students author web pages telling about themselves in both their native tongue and Hindi (e.g., यूनिकोड क्या है?).


As language professionals, many of us are aware of the basic steps for creating documents using a non-Latin alphabet. First there is locating the particular font, then installing it on your computer, then changing the keyboard setting, and then locating and using an authoring program that supports that particular alphabet. Web documents have extended the task by often requiring authors to work in different programs simultaneously: one for composing the HTML, the other for writing the non-Latin text. But these steps will only take you as far as producing a document in one particular character set. What about producing documents that contain characters from two or more character sets?

Encode this!
An often overlooked component of document authoring, for web pages and otherwise, and a critically important one for language-related documents, is the page encoding, which tells the software program or web browser which character set the page is composed in. Different languages, which use different character sets, must have the proper encoding declared in order for the program or browser to render the characters correctly on the screen. In other words, we need to put code into the HTML (or other authoring language) that says, basically, "The characters in this page are Chinese characters."
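
As a rough illustration of what a program does with such a declaration, the following Python sketch looks for a charset name in the first part of a page and uses it to decode the raw bytes. The function name decode_page and the fallback encoding are assumptions made for this example; real browsers follow a more elaborate procedure.

```python
import re


def decode_page(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode raw HTML bytes using the charset named in the page itself."""
    # The declaration is plain ASCII, so it can be searched for
    # before the real encoding of the page is known.
    head = raw[:1024].decode("ascii", errors="ignore")
    match = re.search(r"charset=([\w-]+)", head, re.IGNORECASE)
    encoding = match.group(1) if match else fallback
    return raw.decode(encoding)


# A small page that declares itself to be Simplified Chinese (GB2312).
page = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"><p>中文</p>'
raw = page.encode("gb2312")

print(decode_page(raw))          # characters render as intended
print(raw.decode("iso-8859-1"))  # same bytes, wrong encoding: garbage
```

The bytes on disk are identical in both print calls; only the declared encoding differs.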

When dealing solely within the confines of one particular character set, this is often a transparent process, and the restrictions imposed by the character set are rarely evidenced. Language professionals notwithstanding, few people worry about changing the default encoding of their text documents, if they even know there is a default encoding to change. However, the affordances of the Internet for global document sharing and the desire to plug technology into multilingual environments make one quickly aware of the limitations of traditional encoding formats.

The two primary problems with traditional encodings have been that (1) it has been impossible to display characters from different character sets (read: languages) in the same document, and (2) authoring, sharing, and archiving documents globally has meant manually trying possible encodings until the documents rendered correctly.

The first problem has been a significant hurdle for language professionals who wish to have multiple languages displayed correctly within one web document. Consider a discussion forum aimed at English-speaking learners of Korean, or a collaboratively constructed multilingual document.

The second problem was a serious frustration for editing, sharing and archiving data. Encodings would have to be both known and preserved for each data set, which is time-consuming and error-prone when dealing not just with multiple languages but with character sets for which there are multiple encodings. For example, there are five possible encodings for the Cyrillic character set, and characters encoded in Cyrillic (KOI8-R) will appear broken when read as Cyrillic (ISO). Consider a Russian-language classroom sharing documents with two different classrooms in Russia, each using a different document encoding format.
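
The KOI8-R versus ISO mismatch can be reproduced in a few lines of Python. This is only a sketch; the sample word is an arbitrary choice, and koi8_r and iso8859_5 are the names Python uses for these two Cyrillic encodings.

```python
text = "Привет"

# Save the word the way a KOI8-R system would store it.
koi8_bytes = text.encode("koi8_r")

# Read the same bytes back under each of the two encodings.
right = koi8_bytes.decode("koi8_r")
wrong = koi8_bytes.decode("iso8859_5")

print(right)  # the original word
print(wrong)  # same number of characters, but not the ones that were written
```

Nothing is lost in the bytes themselves; the text only looks broken because the reader assumed the wrong encoding.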

These problems are the result of character sets that are limited to 256 possible characters. With such a low number of possible characters, including multiple character sets within one encoding format has been impossible.
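
A short Python sketch of this limit, using two single-byte encodings: each one can represent its own alphabet but not the other's. The helper name can_encode is hypothetical, introduced only for this illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("café", "iso8859_1"))    # Western European text in ISO-8859-1
print(can_encode("жизнь", "iso8859_1"))   # Cyrillic text in ISO-8859-1
print(can_encode("жизнь", "koi8_r"))      # Cyrillic text in KOI8-R
print(can_encode("café жизнь", "koi8_r")) # mixed text in KOI8-R
```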

This was before Unicode. Unicode is, simply stated, a character standard that defines 96,382 characters (in the most recent version at the time of writing), and thus has ample room to include support for multiple character sets. Unicode encoding allows the creation of documents containing characters from any number of character sets, and allows documents to be easily shared between any number of geographic or linguistic regions.
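
To make this concrete, here is a minimal Python sketch in which one string mixes four scripts and survives a round trip through UTF-8. The sample text and the helper name describe are assumptions for the example.

```python
def describe(ch: str) -> tuple[str, int]:
    """Return a character's Unicode code point and its size in UTF-8 bytes."""
    return f"U+{ord(ch):04X}", len(ch.encode("utf-8"))


mixed = "English, Русский, 日本語, हिन्दी"

# One encoding holds all four scripts at once.
data = mixed.encode("utf-8")
restored = data.decode("utf-8")
print(restored == mixed)

# Every character has its own number, whatever script it belongs to.
for ch in "Aж日":
    print(ch, *describe(ch))
```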

A Brief History of Unicode:
Those who have been working with computers since the late 1980s (or earlier!) remember the inability of personal computers to display graphics, let alone render text in multiple alphabets. Character display was crude at best, made worse by the poor performance of dot matrix or daisy-wheel printers. However, demand from a developing multinational audience began pushing developers to consider supporting non-Latin character sets. Xerox and Apple were among the first companies to recognize the multinational software market and to develop software targeting different languages. However, these early programs were limited by their contemporary encoding formats, and were unable to deal with more than one language at a time.

In the late 1980s both the International Organization for Standardization (ISO) and the Unicode Project began separate initiatives to develop a single unified character set. Both efforts realized their common goal would be best served by developing compatible standards, and thus the evolving versions of the Unicode standard map directly onto the evolving ISO recommendations.

Unicode, however, goes beyond ISO standards, including, for example, recommendations for rendering bidirectional text and algorithms for sorting and comparing text strings. Today, the Unicode format is a dominant force, and will likely shape the multilingual authoring environment beyond the near future.  

Unicode & the Web:
From multilingual discussion forums, to on-line dictionaries, to syllabi and instructional materials, the web provides a perfect environment for Unicode. Data-driven web sites can store their information in databases in one consistent scheme, and rely on one charset encoding for page delivery.

Setting the character encoding in standard HTML is handled by a meta tag included in the head element of the page. The following declaration is adapted from the W3C (World Wide Web Consortium) website, and can be used to set a document's encoding to Unicode.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

(Note that UTF-16 is another encoding form of Unicode. It covers the same characters as UTF-8, but is not as yet supported by all browsers or authoring programs.)
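
UTF-8 and UTF-16 are two ways of writing the same Unicode characters as bytes. The Python sketch below, using an arbitrary sample string, shows that both round-trip the same text and differ only in the bytes they produce.

```python
text = "日本語 и English"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")

# Both encodings can represent the same characters...
same_text = utf8.decode("utf-8") == utf16.decode("utf-16") == text
print(same_text)

# ...but the bytes written to disk are different.
print(len(utf8), "bytes in UTF-8;", len(utf16), "bytes in UTF-16")
```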

Page editors insert code declaring a default charset when new documents are generated. Macromedia's Dreamweaver MX includes the following default code in newly created documents:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Microsoft's Frontpage 2002 sets its default encoding with:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

These default character sets (ISO-8859-1 and windows-1252) are for the Western European character encoding, although one uses Western European (ISO) and the other Western European (Windows). Both of these defaults can be changed to Unicode by simply setting charset=UTF-8.
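
The two Western European encodings are close but not identical. As a small Python sketch (cp1252 is Python's name for windows-1252), the byte 0x80 is one of the places where they disagree, while the range that holds the accented letters is the same in both.

```python
byte = b"\x80"

windows_char = byte.decode("cp1252")   # the euro sign in windows-1252
iso_char = byte.decode("iso8859_1")    # a control character in ISO-8859-1
print(repr(windows_char), repr(iso_char))

# Bytes 0xA0 through 0xFF (accented letters and similar) agree in both.
shared = bytes(range(0xA0, 0x100))
print(shared.decode("cp1252") == shared.decode("iso8859_1"))
```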

Authors who prefer to avoid code can use their respective editor to set the page encoding for them. In Dreamweaver MX, following the path Modify -> Page Properties will display a drop-down menu with the heading Document Encoding, where the option UTF-8 (Unicode) can be selected. In Frontpage 2002, following Tools -> Page Options and then selecting Default Font will bring up a scroll menu with the heading Language (character set), where the option Unicode (UTF-8) can be selected.

For pages created in PHP, page encoding is set using the header function, which must be called before any output is sent, and so should precede all other code on the page. The following code is adapted from the documentation at PHP.net.

header('Content-Type: text/html; charset=UTF-8');

For data-driven sites, it is important that the pages (or program) used to input data into the database, and the pages used to query and display information from the database are all set to Unicode encoding.
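
The round trip through a database can be sketched as follows. This example uses Python with an in-memory SQLite database purely as a stand-in; the table name posts and the helper round_trip are hypothetical, and other database systems may need their own encoding settings.

```python
import sqlite3


def round_trip(text: str) -> str:
    """Store text in a throwaway database and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE posts (body TEXT)")
        # Input side: the page or program that writes to the database.
        conn.execute("INSERT INTO posts (body) VALUES (?)", (text,))
        # Output side: the page that queries and displays the data.
        (stored,) = conn.execute("SELECT body FROM posts").fetchone()
        return stored
    finally:
        conn.close()


sample = "Русский, 日本語, हिन्दी"
print(round_trip(sample) == sample)
```

If either side used a different encoding than the other, the comparison at the end is the kind of check that would reveal it.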

Unicode and Other Applications:
Using Unicode in the many other applications we rely on for document authoring comes down to two things: first, knowing the default encoding of the application, and second, knowing how to change the encoding to Unicode if needed. While the list of applications can seem endless, and the defaults and change procedures may vary from version to version, a small sampling of applications is provided here. Note: you will always need to have the appropriate fonts installed on your computer for your applications to access.

Users of Microsoft Word 2000 (or greater), arguably among the most popular authoring programs, will be pleased to know that Word is by default Unicode, and can display characters from multiple non-Latin sets simultaneously.

WordPad defaults to the Rich Text Format (RTF) when saving, but provides the output option "Unicode text document". WordPad has the ability to display characters from multiple non-Latin sets simultaneously, making it quite useful for editing multilingual Unicode documents.

Notepad is by default ANSI, but provides an easily accessible drop-down menu on the "Save as" screen to allow users to select Unicode encoding, and can display characters from multiple non-Latin sets simultaneously.

BBEdit for Mac OS X can open Unicode documents and export documents in Unicode. However, it can only display characters from one non-Latin set at a time, so editing multilingual texts can be difficult.

Dreamweaver MX, like BBEdit, can open and export Unicode documents, and is a popular environment for creating web pages, but it is likewise limited by its inability to display more than one non-Latin set at a time.

Frontpage 2002 can display multiple non-Latin character sets, although these characters may need to be authored elsewhere and copied into Frontpage, as this author had difficulty entering non-Latin characters directly in Frontpage.