BasicOverview: Creating Multilingual Web Documents
Outlined below are the basic steps for creating multilingual web documents. Aspects of these steps are covered in more detail on the OS-specific pages (Windows95or98orME, Windows2000, WindowsXP, MacOS9, MacOSX). A more detailed description of these steps follows below.
1.) Install all necessary components for creating documents using particular languages (includes installing language/language groups, keyboard/IMEs, and font/font-sets).
2.) Use a unicode-based document editor to author the multilingual document.
3.) Ensure the document has the necessary coding to tell the web browser that the document's content is encoded as Unicode.
4.) Consider that recipients of these documents must also have all necessary components (described in step one) installed locally to view the various character sets.
5.) To print these documents, a given printer must have a "printer driver" installed that accepts Unicode characters. Support and availability for these vary from manufacturer to manufacturer, and from machine to machine. Updated printer drivers can often be obtained by contacting the manufacturer, or visiting that manufacturer's website.
Note: Install only signed drivers; installing unsigned drivers may cause system instability.
Basic Steps: Detailed
1.) The first step in creating a multilingual document is to install all the necessary tools onto your computer required for authoring in a non-English language. There are three primary components to this:
Install language/language group: Installing a "language" in this step refers to installing a "character set" which may be used by one or more actual languages. For example, authoring in Japanese requires the installation of "Japanese", while authoring in the languages of Azerbaijani, Bulgarian, Buryat, Byelorussian, Karakalpak, Kazakh, Khalkha, Kirghiz, Macedonian, Moldavian, Russian, Serbian, Tajik, Turkmen, Ukrainian and Uzbek requires the installation of "Cyrillic".
Install/Set up necessary keyboard/IMEs: IME is an acronym for "Input Method Editor", and refers to the relationship between the keys on the keyboard and the character(s) they produce when pressed. Most languages have a preset Keyboard/IME configuration that can be changed by the user if desired. Keyboard/IMEs vary by region; for example, in Windows 2000 there are three built-in IME options for Belgian: Belgian (Comma), Belgian Dutch and Belgian French, while Russian has two keyboard/IME options: Russian and Russian (Typewriter). Thus, there are often multiple keyboard/IME configurations for each language.
Install necessary fonts: Once the language is installed and the keyboard/IME is selected, the final step in this process is to install a Unicode font set that can display the authoring language. There may be multiple Unicode fonts available for each language. This is true of English as well: Lucida Sans Unicode is an obvious Unicode font, but general Windows fonts such as Arial, Times New Roman and Courier New are also Unicode-based. For example, Arial GEO and Courier New GEO are fonts for Georgian characters. Chinese fonts are generally divided into Chinese Simplified and Chinese Traditional, with several fonts available for each. Fonts for use with Japanese characters include Kozuka Mincho Pro Acro, MS Gothic, and MS Mincho. SimSun is a popular font that can display both Chinese and Japanese (Kanji, Katakana or Hiragana).
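To judge which language groups and fonts a given text calls for, it can help to list the scripts its characters belong to. The following is a minimal Python sketch, not part of the original procedure: the function names are hypothetical, and using the first word of each character's Unicode name as a script label is a rough approximation only.

```python
import unicodedata


def script_of(ch):
    """Return the first word of the character's Unicode name as a rough script label."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(text):
    """Collect the rough script labels of all non-whitespace characters in text."""
    return {script_of(ch) for ch in text if not ch.isspace()}


# Russian, Chinese and Georgian words in one string (sample text is illustrative).
sample = "Привет 今天 ქართული"
print(sorted(scripts_in(sample)))
```

Each label in the result hints at a language group (for example Cyrillic) and a font that must be installed before the text can be authored or viewed.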
2.) The second step is to use a Unicode-based document editor in authoring the document. Inside an authoring environment one can see the characters (say the Chinese "今天天气很好") typed on the screen. This does not mean, however, that the numerical coding that stores these characters is Unicode. It may be a proprietary coding scheme, or it may be a language-specific encoding, such as Cyrillic (KOI8-R) or Chinese Simplified (GB2312). When authoring in one particular language, encoding issues are often transparent (i.e., the encoding is chosen without direct selection by the author).
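The difference between what appears on screen and what is stored can be illustrated with a short Python sketch. It uses only Python's built-in codecs; the helper name is hypothetical and the sample strings are chosen for illustration.

```python
def stored_lengths(text, encodings):
    """Return the number of bytes each encoding needs to store the same text."""
    return {name: len(text.encode(name)) for name in encodings}


chinese = "今天天气很好"
print(len(chinese), stored_lengths(chinese, ["utf-8", "gb2312"]))

russian = "Привет"
print(len(russian), stored_lengths(russian, ["utf-8", "koi8_r"]))

# Bytes saved as GB2312 but read back as UTF-8 no longer give the original text.
misread = chinese.encode("gb2312").decode("utf-8", errors="replace")
print(misread == chinese)
```

The characters look identical in the editor either way; only the stored bytes differ, which is why the encoding has to be known (or declared) before a browser can display the text.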
Why Unicode? As addressed in UniCode, the two principal reasons are (1) universal compatibility, and (2) displaying multilingual characters. The first reason may be reason enough for using Unicode; here, however, we focus on the second: multilingual character display.
Authoring in a Unicode editor ensures that the characters will display correctly once we publish the web document and tell the browser to use Unicode (UTF-8) encoding to render them.
For authors writing with multilingual characters, whether a particular editor is Unicode-based will become quite evident as one begins entering various multilingual characters. A quick test you can perform is to copy the character tests from CharacterTests and paste them into an editor. Keep in mind that you will still need the necessary font sets installed on your machine to render the characters. If characters from multiple character sets display correctly, then it is a good bet that the editor is Unicode-based.
Note that this test works because the multilingual characters on CharacterTests are unicode encoded! If you choose to test your editor by copying characters of another web page, ensure first that the encoding for that web page is unicode.
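A complementary check is to look at the bytes the editor actually saved. The sketch below is an assumption-laden illustration, not a definitive test: the function name is hypothetical, and a file that contains only plain ASCII will also pass, so the check is only meaningful for files that contain multilingual characters.

```python
def is_valid_utf8(data):
    """Return True if the raw bytes can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same Russian word as saved by a UTF-8 editor and by a KOI8-R editor.
saved_as_utf8 = "Привет".encode("utf-8")
saved_as_koi8 = "Привет".encode("koi8_r")
print(is_valid_utf8(saved_as_utf8), is_valid_utf8(saved_as_koi8))
```

To apply this to a saved document, read it in binary mode (for example `open(path, "rb").read()`, where `path` is the file your editor produced) and pass the result to the function.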
Many popular editors are now Unicode-based. These include TextEdit, Mozilla Composer 1.1, and Netscape Composer 6.2 for Mac OS X or OS 9, and WordPad, Word 2000/2002, and FrontPage 2000 for Windows. For a comprehensive listing of editors and their degree of Unicode support, see Alan Wood's Unicode and Multilingual Programs and Utilities.
Authors composing truly multilingual documents will no doubt use each language's keyboard/IME to input their characters. However, individual characters can also be inserted into web documents by using a numeric, hexadecimal or character entity reference. What this means is that in addition to typing characters through a keyboard, specific characters can be added one at a time using a reference such as &#165; (numeric reference), &#xA5; (hexadecimal reference) or &yen; (character entity reference) to produce the Yen sign ¥, or &#198; or &AElig; to produce the AE diphthong Æ. Hence, a line of code in HTML could be written as <P>The price is 400&yen;</P>, which will appear as 'The price is 400¥' on the web page. Many programs, such as Word, have menus that allow the insertion of individual Unicode characters.
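The relationship between a character and its references can be sketched in Python. `html.unescape` is part of the Python standard library; the helper `to_numeric_reference` is a hypothetical name introduced here for illustration.

```python
import html


def to_numeric_reference(ch, hexadecimal=False):
    """Build a numeric character reference from a character's Unicode code point."""
    code = ord(ch)
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"


print(to_numeric_reference("¥"))                    # decimal form
print(to_numeric_reference("¥", hexadecimal=True))  # hexadecimal form

# Resolving references back into characters, as a browser would when rendering.
print(html.unescape("The price is 400&yen;"))
print(html.unescape("&#198; and &AElig;"))
```

All three reference styles resolve to the same character; which one to use is largely a matter of readability in the page source.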
3.) Even though the editor in which the document was created is Unicode-based, when preparing documents for the web it is crucial that the document contains code that will inform the user's web browser (e.g. Internet Explorer or Mozilla) to render the content of the page in Unicode. Two languages commonly used for web publishing, HTML and PHP, each provide a line command that tells the browser to use Unicode (or any other) encoding for the document's content.
In HTML, the line command is <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">, and is placed in the HEAD area of the document code. In PHP, this is set using the line command <?php header("Content-Type: text/html; charset=utf-8"); ?> and must precede any output of content in the PHP code (generally, then, this command is the first line in the PHP document).
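Before publishing, it can be useful to confirm that a page really carries this declaration. The following Python sketch uses the standard library's `html.parser`; the class and function names are hypothetical, and it only looks for the HTTP-EQUIV form of the META tag described above.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the charset named in the first content-type META tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <META ...> matches too.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the declared charset in lowercase, or None if no declaration is found."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = '<HTML><HEAD><META HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8"></HEAD><BODY></BODY></HTML>'
print(declared_charset(page))
```

A result of None means the page leaves the browser to guess the encoding, which is exactly the situation this step is meant to avoid.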
However, many authors choose not to work directly with HTML, PHP or other programming code, and instead rely on their editor to either "convert" the document to HTML (as in the case of Word), or build the code "behind the scenes" as the author enters content using a GUI or WYSIWYG interface (as in the case of FrontPage). In these cases, the line command that tells the browser to render the content using Unicode can be set through the file menus and document properties dialogue options. (Specific instructions for various document editors will be added shortly.)
4.) Not really a "step" per se, but something which must be factored in when creating multilingual web documents, is that recipients of these documents (i.e., the users who load your document into their web browsers) must also have the same languages/fonts installed on their computers to read or edit the documents. The author will seldom notice this problem when previewing the web document on the machine on which it was created, as that machine will more than likely have all the fonts installed.
Consider that web page code normally calls a string of fonts, used on a "first try this; if that font is not present, try the next one" basis, such as BODY {font-family:Arial, Verdana, serif}. If a document was authored using the font SimSun, and the user does not have that particular font, or any other capable of displaying Asian characters, the characters may not render correctly on the page. This problem intensifies as an author uses less commonly supported languages.
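The fallback behaviour can be modelled with a small Python sketch. This is a simplified illustration only: the function name, the set of installed fonts and the coverage table are all hypothetical, and real browsers apply more elaborate rules.

```python
def pick_font(font_stack, installed, coverage, script):
    """Return the first font in the stack that is installed and covers the script."""
    for font in font_stack:
        if font in installed and script in coverage.get(font, set()):
            return font
    return None  # nothing suitable: the characters may not render correctly


# Hypothetical coverage table, for illustration only.
coverage = {
    "Arial": {"Latin", "Cyrillic"},
    "SimSun": {"Latin", "CJK"},
}
stack = ["SimSun", "Arial"]

print(pick_font(stack, {"SimSun", "Arial"}, coverage, "CJK"))  # author's machine
print(pick_font(stack, {"Arial"}, coverage, "CJK"))            # reader without SimSun
```

On the author's machine the first font in the stack is found; on a machine without a font covering the script, the search falls through and nothing suitable is returned.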
5.) A final "step" (again, not so much a step as a consideration) concerns printing. Printers rely on "drivers" to read the document encoding. If you print a document that contains multilingual characters on a printer whose driver does not accept Unicode characters, the extended characters are printed as square boxes, even though they are displayed correctly on the screen.