BBS Shuimu Tsinghua (SMTH): Digest Area
From: cybergene (基因~也许以后~~), Board: Linux
Subject: How to Use Tcl 8.1 Internationalization Features
Posted at: BBS Shuimu Tsinghua (Thu Dec 14 15:54:36 2000)
How to Use Tcl 8.1 Internationalization Features
Tcl's new internationalization facilities allow you to create Tcl
applications that support any multi-byte language, including Chinese and
Japanese. Tcl also now includes support for message catalogs, which
makes it easier to create localized versions of applications and
packages. Tcl is the first cross-platform scripting language to help
developers to deploy both commercial and enterprise network applications
on a global scale.
This document provides a quick overview of the internationalization
features introduced in Tcl 8.1. Topics include:
Character Encoding Overview
Character Encodings and the Operating System
General String Manipulation
Channel Input/Output
Sourcing Scripts in Different Encodings
Converting Strings to Different Encodings
Fonts, Encodings, and Tk Widgets
Message Catalogs
Internationalization and the Tcl C APIs
Summary: Tcl Internationalization Support at a Glance
Character Encoding Overview
A character encoding is simply a mapping of characters and symbols
used in written language into a binary format used by computers. For
example, in the standard ASCII encoding, the upper-case "A" character
from the Latin character set is represented by the byte value 0x41 in
hexadecimal. Other widely used character encodings include ISO 8859-1,
used by many European languages, Shift-JIS and EUC-JP for Japanese
characters, and Big5 for Chinese characters.
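The Unicode Standard is a fixed-width, uniform encoding scheme for
virtually all characters used in the world's major written languages.
In the form that Tcl 8.1 supports, Unicode uses a 16-bit code value for
every text element (later versions of the standard also define
characters beyond 16 bits). These text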
elements include letters such as "w" or "M", characters such as those
used in Japanese Hiragana to represent syllables, or ideographs such
as those used in Chinese to represent full words or concepts. The
Unicode Standard does not specify the visual representation of a
character, which is known as a glyph. For more information on the
Unicode Standard, visit the Unicode web site at http://www.unicode.org.
UTF-8 is a standard transformation format for Unicode characters. It
is a method of transforming all Unicode characters into a variable
length encoding of bytes; a single Unicode character can be
represented by one, two, or three bytes. The advantage of the UTF-8
standard is that it and the Unicode standard were designed so that
Unicode characters corresponding to the standard ASCII set (up to
ASCII value 0x7F in hexadecimal) have the same byte values in both UTF-8
and ASCII encoding. In other words, an upper-case "A" character is
represented by the single-byte value 0x41 in both UTF-8 and ASCII
encoding.
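The variable-width behavior is easy to observe from a script. The following is a minimal sketch, assuming a Tcl 8.1 or later interpreter with the standard utf-8 encoding. The helper name utf8Hex is made up for this illustration. It converts a string to UTF-8 bytes and prints them in hexadecimal:

```tcl
# Hypothetical helper: return the UTF-8 bytes of a string as hex digits.
proc utf8Hex {str} {
    set bytes [encoding convertto utf-8 $str]
    binary scan $bytes H* hex
    return $hex
}

puts [utf8Hex "A"]        ;# 41      (one byte, same value as ASCII)
puts [utf8Hex "\u00e9"]   ;# c3a9    (two bytes)
puts [utf8Hex "\u306f"]   ;# e381af  (three bytes)
```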
Beginning in Tcl 8.1, Tcl represents all strings internally as Unicode
characters in UTF-8 format. Tcl 8.1 also ships with built-in support for
approximately 30 common character encoding standards, and can convert
strings from one encoding to another. The encoding names command
displays a list of all known encodings. You can create additional
encodings as described in the Tcl_GetEncoding.3 reference page.
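As a rough sketch of how a script might use encoding names, the following checks for one encoding before relying on it and then prints the full list. The variable name wanted and the choice of euc-jp are only for illustration, and the list you see depends on your installation:

```tcl
# Check whether an encoding is available before relying on it.
set wanted euc-jp
if {[lsearch -exact [encoding names] $wanted] >= 0} {
    puts "$wanted is available"
} else {
    puts "$wanted is not available in this installation"
}

# Print every known encoding, one per line, in sorted order.
puts [join [lsort [encoding names]] \n]
```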
Tip: Because 7-bit ASCII characters have the same encoding in UTF-8
format, legacy Tcl scripts that use only 7-bit ASCII characters function
the same in Tcl 8.1 as they did in Tcl 8.0. Furthermore, because the
use of Unicode/UTF-8 encoding is internal to Tcl, most string handling
in legacy Tcl scripts works the same in Tcl 8.1 as it did in Tcl 8.0.
Most problems in converting from Tcl 8.0 to 8.1 occur in: 1) using
non-Latin characters, 2) reading and writing strings from a channel, and
3) writing code that assumes that each character in a string is a fixed
byte width (for example, one byte per character).
Character Encodings and the Operating System
The system encoding is the character encoding used by the operating
system for items such as file names and environment variables. Text
files used by text editors and other applications are usually encoded in
the system encoding as well, unless the application that produced
them explicitly saves them in another format (for example, if you use
a Shift-JIS text editor on an ISO 8859-1 system).
Tcl automatically converts strings from UTF-8 format to the system
encoding and vice versa whenever it communicates with the operating
system. For example, Tcl automatically handles any encoding conversion
needed if you execute commands such as:
% glob *
or
% set fd [open "Español.txt" w]
The Tcl source command also reads files using the system encoding, and
strings passed to and from the Tcl exec command are converted to and
from the system encoding.
Tcl attempts to determine the system encoding during initialization
based on the platform and locale settings. Tcl usually can determine a
reasonable default system encoding based on these settings, but if for
some reason it cannot, it uses ISO 8859-1 as the default system
encoding.
You can override the default system encoding with the encoding system
command. Ajuba Solutions recommends that you avoid using this command if
at all possible. If you set the default system encoding to anything
other than the actual encoding used by your operating system, Tcl will
likely find it impossible to communicate properly with your operating
system.
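Calling encoding system with no argument only reports the current setting, so it is safe to use for diagnostics. A minimal sketch (the variable name sysenc is arbitrary, and the value printed depends on your platform and locale):

```tcl
# Query the current system encoding without changing it.
set sysenc [encoding system]
puts "System encoding: $sysenc"
```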
Note: For reading and writing files in an encoding other than the system
encoding, you need to use the fconfigure -encoding command (not the
encoding system command) as described in the "Channel Input/Output"
section of this document. Also see the "Sourcing Scripts in Different
Encodings" section of this document for special instructions for
sourcing files in formats other than the system encoding.
General String Manipulation
Beginning in Tcl 8.1, all Tcl string manipulation functions expect and
return Unicode strings encoded in UTF-8 format. Because the use of
Unicode/UTF-8 encoding is internal to Tcl, you should see no
difference in Tcl 8.0 and 8.1 string handling in your scripts.
The Tcl string functions properly handle multi-byte UTF-8 characters
as single characters. For example in the following commands, Tcl
treats the string "Café" as a four-character string, even though the
internal representation in UTF-8 format requires five bytes. (As with
previous versions of Tcl, string indexes start with "0"; that is, the
first character is index "0", the second character is index "1", etc.)
% set unistr "Café"
Café
% string length $unistr
4
% string index $unistr 3
é
Furthermore, the new regular expression implementation introduced in Tcl
8.1 handles the full range of Unicode characters.
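The difference between characters and bytes can be checked directly. This sketch assumes Tcl 8.1 or later and writes the accented letter with a numeric escape so that the result does not depend on the encoding of the script file:

```tcl
set unistr "Caf\u00e9"    ;# "Café", with the last letter given by code value

# Character count versus the number of bytes in the UTF-8 form.
set chars [string length $unistr]
set bytes [string length [encoding convertto utf-8 $unistr]]
puts "$chars characters, $bytes bytes in UTF-8"   ;# 4 characters, 5 bytes in UTF-8

# Case conversion also handles the non-ASCII character.
set upper [string toupper $unistr]
puts [string length $upper]                       ;# 4
```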
The "\uxxxx" escape sequence allows you to specify a Unicode character
by its four-digit, hexadecimal Unicode code value. For example, the
following assigns to a variable two ideograph characters corresponding
to the Chinese transliteration of "Tcl" (TAI-KU):
set tclstr "\u592a\u9177"
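To confirm what such a string contains, a script can walk it one character at a time and print each code value. This is only an illustrative sketch; the loop variable names are arbitrary:

```tcl
set tclstr "\u592a\u9177"

puts [string length $tclstr]          ;# 2

# Print each character's Unicode code value in hexadecimal.
foreach ch [split $tclstr ""] {
    scan $ch %c code
    puts [format "U+%04X" $code]
}
```

The loop prints U+592A and U+9177, one per line.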
Channel Input/Output
When reading and writing data on a channel, you need to ensure that
Tcl uses the proper character encoding for that channel. The default
encoding for newly opened channels (both files and sockets) is the
same as the platform- and locale-dependent system encoding used for
interfacing with the operating system. (See the "Character Encodings and
the Operating System" section of this document for more information.)
In most cases, you don't need to do anything special to read or write
data because most text files are created in the system encoding. You
need to take special steps only when accessing files in an encoding
other than the system encoding (for example, reading a file encoded in
Shift-JIS format when your system encoding is ISO 8859-1).
The fconfigure -encoding option allows you to specify the encoding for a
channel. Thus, to read from a file encoded in Shift-JIS format, you
should execute the following commands:
set fd [open $file r]
fconfigure $fd -encoding shiftjis
Tcl then automatically converts any text you read from the file into
standard UTF-8 format.
Similarly, if you are writing to a channel, you can use fconfigure
-encoding to specify the target character encoding and Tcl automatically
converts strings from UTF-8 to that encoding on output.
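The following sketch puts both directions together. It assumes the shiftjis encoding is installed and that the current directory is writable. The file name sjis-demo.txt is hypothetical and the file is deleted at the end. It writes one Hiragana character in Shift-JIS, reads it back as text, and then reads the raw bytes to show what was stored on disk:

```tcl
set path "sjis-demo.txt"      ;# hypothetical scratch file
set text "\u306f"             ;# Hiragana letter HA

# Write: Tcl converts from its internal form to Shift-JIS on output.
set fd [open $path w]
fconfigure $fd -encoding shiftjis
puts -nonewline $fd $text
close $fd

# Read as text: Tcl converts the Shift-JIS bytes back on input.
set fd [open $path r]
fconfigure $fd -encoding shiftjis
set roundtrip [read $fd]
close $fd

# Read as binary: no conversion, so each byte is returned unchanged.
set fd [open $path r]
fconfigure $fd -translation binary
set raw [read $fd]
close $fd
file delete $path

puts [string length $roundtrip]   ;# 1 (one character)
puts [string length $raw]         ;# 2 (two Shift-JIS bytes)
```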
Note: The Tcl source command always reads files using the system
encoding. For a tip on sourcing files in different encodings, see the
"Sourcing Scripts in Different Encodings" section of this document.
Sourcing Scripts in Different Encodings
The Tcl source command always reads files using the system encoding.
Therefore, Ajuba Solutions recommends that whenever possible, you author
scripts in the native system encoding.
A difficulty arises when distributing scripts internationally, as you
don't necessarily know what the system encoding will be. Fortunately,
most common character encodings include the standard 7-bit ASCII
characters as a subset. Therefore, you are usually safe if your script
contains only 7-bit ASCII characters.
If you need to use an extended character set for your scripts that you
distribute, you can provide a small "bootstrap" script written in
7-bit ASCII. The bootstrap script can then load and execute scripts in
any encoding that you choose.
You can execute a script written in an encoding other than the system
encoding by opening the file, setting the proper encoding using the
fconfigure -encoding command, reading the file into a variable, and then
evaluating the string with the eval command. For example, the following
reads and executes a Tcl script encoded in EUC-JP:
set fd [open "app.tcl" r]
fconfigure $fd -encoding euc-jp
set jpscript [read $fd]
close $fd
eval $jpscript
Note: This technique works only if the file contains actual EUC-JP
encoded characters (for example, you created the file with a EUC-JP text
editor). This technique doesn't work if you build the EUC-JP encoded
characters using the "\x" or octal digit escape sequences. Tcl 8.1
interprets each "\x" or octal digit escape sequence as a single
Unicode character with the upper bits set to 0. For example, if the
script app.tcl above contained the line:
set ha "\xA4\xCF"
then the variable ha would contain two characters, "¤Ï" (Unicode
characters "CURRENCY SIGN" and "LATIN CAPITAL LETTER I WITH DIAERESIS"),
not the Unicode HA character.
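For a bootstrap script, the open, fconfigure, read, and eval sequence shown earlier in this section can be wrapped in a small helper. The procedure name sourceWithEncoding is an assumption made for this sketch and is not part of Tcl. It evaluates the script at global level, which is one reasonable choice among several:

```tcl
# Hypothetical helper: evaluate a script file stored in the given encoding.
proc sourceWithEncoding {path encoding} {
    set fd [open $path r]
    fconfigure $fd -encoding $encoding
    set script [read $fd]
    close $fd
    # Evaluate at global level so the script's variables and
    # procedures are defined there, not inside this helper.
    uplevel #0 $script
}

# Example call (assumes app.tcl exists and is encoded in EUC-JP):
# sourceWithEncoding app.tcl euc-jp
```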
Converting Strings to Different Encodings
You can convert a string to a different encoding using the encoding
convertfrom and encoding convertto commands. The encoding convertfrom
command converts a string from a specified encoding into UTF-8 Unicode
characters; the encoding convertto command converts a string from
UTF-8 Unicode into a specified encoding. In either case, if you omit the
encoding argument, the command uses the current system encoding.
As an example, the following command converts a string representing
the Hiragana letter HA from EUC-JP encoding into a Unicode string:
set ha [encoding convertfrom euc-jp "\xA4\xCF"]
(In Tcl 8.1, the "\x" and octal digit escape sequences specify the lower
8 bits of a Unicode character with the upper 8 bits set to 0. Thus
the string "\xA4\xCF" still specifies two characters in Tcl 8.1,
just as it did in Tcl 8.0; however, Tcl 8.1 stores those characters in
four bytes, whereas Tcl 8.0 stored them in two bytes.)
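The two commands are inverses of each other, which the following sketch checks. It assumes the euc-jp encoding is installed; the variable names are arbitrary:

```tcl
# EUC-JP bytes A4 CF encode the Hiragana letter HA (U+306F).
set ha [encoding convertfrom euc-jp "\xA4\xCF"]
puts [string length $ha]              ;# 1
scan $ha %c code
puts [format "U+%04X" $code]          ;# U+306F

# Convert back; the result holds the two original EUC-JP bytes.
set bytes [encoding convertto euc-jp $ha]
binary scan $bytes H* hex
puts $hex                             ;# a4cf
```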
Fonts, Encodings, and Tk Widgets
Tk widgets that display text now require text strings in Unicode/UTF-8
encoding. Tk automatically handles any encoding conversion necessary
to display the characters in a particular font.
If the master font that you set for a widget doesn't contain a glyph for
a particular Unicode character that you want to display, Tk attempts to
locate a font that does. Where possible, Tk attempts to locate a font
that matches as many characteristics of the widget's master font as
possible (for example, weight, slant, etc.). Once Tk finds a suitable
font, it displays the character in that font. In other words, the widget
uses the master font for all characters it is capable of displaying,
and alternative fonts only as needed.
In some cases, Tk is unable to identify a suitable font, in which case
the widget cannot display the characters. (Instead, the widget
displays a system-dependent fallback character such as "?".) The
process of identifying suitable fonts is complex, and Tk's algorithms
don't always find a font even if one is actually installed on the
system. Therefore, for best results, you should try to select as a
widget's master font one that is capable of handling the characters
you expect to display. For example, "Times" is likely to be a poor
choice if you know that you need to display Japanese or Arabic
characters in a widget.
If you work with text in a variety of character sets, you may need to
search out fonts to represent them. Markus Kuhn has developed a free
6x13 font that supports essentially all the Unicode characters that
can be displayed in a 6x13 glyph. This does not include Japanese,
Chinese, and other Asian languages, but it does cover many others. The
font is available at http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html.
His site also contains many useful links to other sources of fonts and
font information.
Message Catalogs
The new msgcat package provides a set of functions for managing
multilingual user interfaces. It allows you to define strings in a
message catalog, which is independent from your application or package
and which you can edit or localize without modifying the application
source code. The msgcat package is optional, but Ajuba Solutions
recommends using it for all multilingual applications and packages.
The basic principle of the msgcat package is that you create a set of
message files, one for each supported language, containing localized
versions of all the strings your application or package can display.
Then in your application or package, instead of using a string directly,
you call the ::msgcat::mc command to return a localized version of
the string you want.
This document provides only a brief introduction to message catalogs.
The msgcat package provides additional features such as namespace
support and "best match" handling of sublocales. See the msgcat.n
reference page for more information.
Using Message Catalogs
Using message catalogs from within your application or package
requires the following steps:
1. Optionally set the locale using the ::msgcat::mclocale command. If you
   don't call mclocale, the locale defaults to the value of the env(LANG)
   environment variable at the time the msgcat package is loaded. If
   env(LANG) isn't defined, then the locale defaults to "C".
2. Call ::msgcat::mcload to load the appropriate message files. The
   mcload command requires as an argument a directory containing your
   message files.
3. Anywhere in your script that you would typically specify a string to
   display, use the ::msgcat::mc command instead. The mc command takes as
   an argument a source string and returns the translation of that string
   in the current locale.
The following code fragment demonstrates how you could use the msgcat
package in a script:
# Use the default locale as specified by env(LANG).
# You could explicitly set the locale with a command such as
# ::msgcat::mclocale "en_UK"
# Load the messages files. In this example, they are stored
# in a subdirectory named "msgs" which is in the same directory
# as this script.
::msgcat::mcload [file join [file dirname [info script]] msgs]
# Display a welcome message
puts [::msgcat::mc "Welcome to Tcl!"]
In this example, instead of directly displaying the message "Welcome
to Tcl!", the application calls mc to retrieve a localized version of
the string. The string returned by mc depends on the current locale. For
example, in the "es" locale mc could return the Spanish-language
greeting "¡Bienvenido a Tcl!"
If mc finds no translation for the string in the current locale, it executes
the procedure ::msgcat::mcunknown. The default behavior of mcunknown
is to return the original string ("Welcome to Tcl!" in this case), but
you can redefine it to perform any action you want.
Creating Localized Message Files
To use the msgcat package, you need to prepare a set of message files
for your package or application, all contained within the same
directory. The name of each message file is a locale specifier
followed by the extension ".msg" (for example, es.msg for a Spanish
message file or en_UK.msg for a UK English message file).
Each message file contains a series of calls to ::msgcat::mcset to set
the translation strings for that language. The format of the mcset
command is:
::msgcat::mcset locale src-string ?translation-string?
The mcset command defines a locale-specific translation for the given
src-string. If no translation-string argument is present, then the value
of src-string is also used as the locale-specific translation string.
So, if American English is the "source language" for your application,
an en_UK.msg file might contain commands such as:
::msgcat::mcset en_UK "Welcome to Tcl!"
::msgcat::mcset en_UK "Select a color:" "Select a colour:"
Note that no translation string is provided for the first line, so the
resulting "translation" for the en_UK locale is the same as the American
source string, "Welcome to Tcl!" If you omitted this entry in the
message file, then calling mc with the source string "Welcome to Tcl!"
in the en_UK locale would result in mcunknown being called. Although the
default behavior of mcunknown would produce the desired results
(returning "Welcome to Tcl!"), you could run into problems if you
override the behavior of mcunknown. Therefore, it is always safest to
include a mcset mapping for every source string in your application,
even if a particular locale doesn't require a "translation" for that
string.
An equivalent Spanish-language message file, es.msg, would contain:
::msgcat::mcset es "Welcome to Tcl!" "¡Bienvenido a Tcl!"
::msgcat::mcset es "Select a color:" "Elige un color:"
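To try the lookup without creating any files, the same mcset calls can be issued directly in a script. This is a minimal sketch that assumes the msgcat package is installed. It sets the locale explicitly and writes the inverted exclamation mark as \u00a1 so the example does not depend on the script file's encoding. The string "Quit" is an extra source string chosen here to show the fallback:

```tcl
package require msgcat

# Define the Spanish translations in memory (normally done in es.msg).
::msgcat::mcset es "Welcome to Tcl!" "\u00a1Bienvenido a Tcl!"
::msgcat::mcset es "Select a color:" "Elige un color:"

# Select the Spanish locale explicitly.
::msgcat::mclocale es

puts [::msgcat::mc "Welcome to Tcl!"]   ;# the Spanish greeting
puts [::msgcat::mc "Select a color:"]   ;# Elige un color:

# No translation was defined for this string, so the default
# mcunknown behavior returns the source string unchanged.
puts [::msgcat::mc "Quit"]              ;# Quit
```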
Internationalization and the Tcl C APIs
Tcl 8.1 introduces new C APIs to support all new internationalization
features. Tcl 8.1 also introduces new convenience functions for
manipulating Unicode/UTF-8 strings. By using the new APIs in your
applications, you can easily add full Unicode support to your
application. Coupled with Tk's powerful font and layout support, you can
quickly create fully internationalized applications.
When programming with the Tcl C APIs, you should be aware of the
following issues, in addition to the Tcl scripting language
internationalization features:
The Tcl C APIs now require all strings to be passed to functions as
Unicode characters in UTF-8 format. You must convert strings in native
system encodings to UTF-8 before passing them to Tcl C functions.
Similarly, you must convert Tcl UTF-8 strings to the native system
encoding before passing them to system functions. Tcl provides functions
for handling encodings and converting strings from one encoding to
another. See the Tcl_GetEncoding.3 reference page for details.
Because 7-bit ASCII characters have the same encoding in UTF-8 format,
legacy code that uses only 7-bit ASCII characters functions the same
in Tcl 8.1 as it did in Tcl 8.0. Therefore, if you're certain that
your strings contain only 7-bit ASCII characters, no conversion is
required.
Because strings in Tcl are now stored as Unicode characters in UTF-8
format, the number of characters in a string is not necessarily equal to
the number of bytes in a string. In particular, you should no longer
use the standard C string functions such as strlen to count characters
in a string. Similarly, other standard C string functions such as
toupper don't work with Unicode characters. Tcl provides a set of
equivalent Unicode string functions, such as Tcl_NumUtfChars and
Tcl_UtfToUpper, as well as other convenience functions for
manipulating Unicode strings. See the Utf.3 and UtfToUpper.3 reference
pages for details.
Summary: Tcl Internationalization Support at a Glance
The following list is a quick summary of the issues you should be
aware of concerning the new internationalization support introduced in
Tcl 8.1:
Tcl encodes all strings internally as Unicode characters in UTF-8
format.
The introduction of Unicode/UTF-8 encoding requires no changes to legacy
Tcl scripts that use only 7-bit ASCII characters, because UTF-8
characters corresponding to the standard 7-bit ASCII set (up to ASCII
value 0x7F in hexadecimal) have the same byte values in both UTF-8 and
ASCII encoding. Furthermore, because the use of Unicode/UTF-8 encoding
is internal to Tcl, most string handling in legacy Tcl scripts works the
same in Tcl 8.1 as it did in Tcl 8.0.
You can specify a Unicode character by its four-digit, hexadecimal
Unicode code value with the "\uxxxx" escape sequence.
All Tcl string functions properly handle multi-byte UTF-8 characters
as single characters.
Tk widgets that display text accept text string arguments in standard
Unicode/UTF-8 encoding. Tk automatically handles any encoding conversion
necessary to display the characters in a particular font. If the master
font that you set for a widget doesn't contain a glyph (a visual
representation) for a particular Unicode character, Tk attempts to
locate a font that does. Where possible, Tk attempts to locate a font
that matches as many characteristics of the widget's master font as
possible (for example, weight, slant, etc.). In some cases, Tk is unable
to identify a suitable font, even if one is actually installed on the
system. Therefore, for best results, you should try to select as a
widget's master font one that is capable of handling the characters
you expect to display.
The system encoding is the character encoding used by the operating
system. Tcl automatically handles conversions between UTF-8 and the
system encoding when interacting with the operating system.
Tcl usually can determine a reasonable default system encoding based
on the platform and locale settings, but if for some reason it cannot,
it uses ISO 8859-1 as the default system encoding. You can explicitly
set the system encoding used by Tcl with the encoding system command.
By default, Tcl uses the system encoding when reading from and writing
to channels, and converts the text to UTF-8 format. You can change the
character encoding for a channel using the fconfigure -encoding command.
The source command always reads files using the system encoding.
Therefore, Ajuba Solutions recommends that whenever possible, you author
scripts in the native system encoding. Furthermore, most common
character encodings include the standard 7-bit ASCII characters as a
subset, so you are usually safe writing scripts using only 7-bit ASCII
characters. You can execute a script written in a different encoding
by opening the file, setting the proper encoding using the fconfigure
-encoding command, reading the file into a variable, and then evaluating
the string with the eval command.
You can convert a string to a different encoding using the encoding
convertfrom and encoding convertto commands.
Tcl has built-in knowledge of approximately 30 common character
encodings. The encoding names command displays a list of all known
encodings. You can create additional encodings as described in the
Tcl_GetEncoding.3 reference page.
The new msgcat package provides a set of functions for managing
multilingual user interfaces. It allows you to define strings in a
message catalog, which is independent from your application and which
you can edit or localize without modifying the application source code.
See the msgcat.n reference page for more information.
You should also read the "Internationalization and the Tcl C APIs"
section of this document if you use the Tcl APIs in C programs.
※ Source: BBS Shuimu Tsinghua, smth.org [FROM: 202.204.7.234]