
Meeting of the Los Angeles Chapter of ACM

Wednesday, March 3, 2004

"Internationalized Software"

Ray Toal
  Professor of Computer Science, LMU

More and more information is moving to the Internet. Every company's presence is moving beyond its country of origin and becoming international. To truly appeal to an international audience, it becomes necessary to write applications that work in multiple languages and conform to local customs. This means that the data returned will also be in multiple languages.

This talk will present a variety of topics in the area of internationalization and localization, focusing on the technical aspects of developing internationalized software. The presentation will begin with an overview of character sets (such as Unicode and ISO8859-x) and character encodings (such as UTF-8) and the (often misunderstood) differences between them. We'll then show the usual problems with non-internationalized code, and show concrete examples of properly internationalized code in XML, Java and Perl. Some examples, of course, will be from web applications; we'll look at these issues in HTTP, J2EE and the Struts framework.

For those of you who wish to do a little reading on this subject, try:
www.linuxforum.net/chinese/doc/i18n/DOCU_004.HTM

Ray Toal is Professor of Computer Science at Loyola Marymount University where he has taught since 1986. He received his doctorate in Computer Science from UCLA in 1993. His current research interests include data representation schemes, higher order logic, and compiler construction. Ray has also worked as a developer at Citysearch since 1996 where he currently focuses on web-tier J2EE technologies as well as Graphical User Interfaces and thick client internal tools.

~Summary~

LA ACM Chapter March Meeting
Held Wednesday, March 3, 2004

The program was "Internationalized Software" presented by Ray Toal, Professor of Computer Science at Loyola Marymount University. This was a regular meeting of the Los Angeles Chapter of ACM.

Professor Toal started by saying that this would be a practical talk with concrete examples of internationalized (I18n) code. He would try to present examples of everything that needs localization and show how it is done. He would not talk about writing systems, marketing software in multiple countries, upgrading existing non-globalized software, existing products and language packs, etc.

He started out with an example of Java greeter code with "Good Morning" and "Good Evening" hard-coded into print statements. This is wrong: the program should instead obtain a resource bundle that contains the strings and print those, so that the statements are internationalized. The strings are then changed for each specific region to provide Localization (L10n). Examples were provided for English, Spanish and a greeting in an Australian English dialect (G'day). The Russian text had to be entered in code because it can't be entered directly as a Java property. Globalization (G11n) is Internationalization + Localization.
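The greeter idea can be sketched roughly as follows. This is not the code from the talk: the class name, the keys "morning" and "evening", and the file names in the comments are all made up for illustration, and the property text is held in strings only so the sketch runs without extra files. A real program would more likely call ResourceBundle.getBundle with a base name and a locale. The Russian entry uses backslash-u escapes, which is one common way to get non-Latin text into a properties file and is an assumption here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class Greeter {
    // Stand-ins for per-locale property files (hypothetical names):
    // Greetings.properties, Greetings_es.properties, Greetings_en_AU.properties, ...
    static final String ENGLISH = "morning=Good Morning\nevening=Good Evening\n";
    static final String SPANISH = "morning=Buenos d\\u00edas\nevening=Buenas tardes\n";
    static final String AUSTRALIAN = "morning=G'day\nevening=G'day\n";
    // Cyrillic written as backslash-u escapes, which the properties parser decodes.
    static final String RUSSIAN =
        "morning=\\u0414\\u043e\\u0431\\u0440\\u043e\\u0435 \\u0443\\u0442\\u0440\\u043e\n"
      + "evening=\\u0414\\u043e\\u0431\\u0440\\u044b\\u0439 \\u0432\\u0435\\u0447\\u0435\\u0440\n";

    // Parses properties-format text into a resource bundle.
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The code only knows the key; the text itself comes from the bundle.
    static String greet(ResourceBundle bundle, int hourOfDay) {
        return bundle.getString(hourOfDay < 12 ? "morning" : "evening");
    }

    public static void main(String[] args) {
        for (String text : new String[] {ENGLISH, SPANISH, AUSTRALIAN, RUSSIAN}) {
            System.out.println(greet(load(text), 9));
        }
    }
}
```

Adding another language then means adding another block of property text (or another file), with no change to greet.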

A locale is a geographic or political region or community that shares the same language, customs, or cultural conventions. There are three parts to a locale: Language + Country + Variant. Professor Toal provided examples of locales that were available on his laptop. If the requested locale is missing, defaults and a systematic search order are used to find the proper bundle. You can also take the locale from the command line at run time, which is a good idea. Resource files are used as inputs to keep text out of the source code. If text were in the code, human translators, who are usually not programmers, would have to mess with the source code. Expert programmers are used to using input files in this manner anyway, and they appreciate keeping long text strings out of the source code. You can put more than text into a resource file (images, sounds, etc. are good candidates), but then you have to write code. Languages are defined in ISO 639 and countries are defined in ISO 3166.
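A small sketch of the locale parts and the search order, using Java's standard Locale and ResourceBundle.Control classes. The class name and the bundle base name "Greetings" are hypothetical. The list shows only the most-specific-to-least-specific candidates for the requested locale; the default bundle lookup may also try the platform's default locale before the root bundle.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocaleDemo {
    // Lists the locales the default bundle lookup would try, most specific first.
    static List<Locale> candidates(Locale requested) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        // "Greetings" is a hypothetical base name; no such bundle needs to exist here.
        return control.getCandidateLocales("Greetings", requested);
    }

    public static void main(String[] args) {
        // Take the locale from the command line when given, e.g. "en-AU" or "es-MX".
        Locale locale = args.length > 0
            ? Locale.forLanguageTag(args[0])
            : Locale.getDefault();

        System.out.println("language=" + locale.getLanguage()
            + " country=" + locale.getCountry()
            + " variant=" + locale.getVariant());

        // The last candidate is the root locale, which prints as an empty string.
        for (Locale candidate : candidates(locale)) {
            System.out.println("would try: '" + candidate + "'");
        }
    }
}
```

Run with the argument en-AU, the candidates are the Australian English locale, then plain English, then the root locale.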

There are many linguistic and cultural issues to deal with: What characters are letters, numbers, symbols? Are there word breaks? Line breaks? How to use punctuation? In what direction is text written? How, exactly, do you sort? (Uppercase/lowercase? Diacritics? Letter combinations?) Which calendar is being used? What is the first day of the week? Are there months? How many? What's the deal with time zones? Is Daylight Saving Time in use? How do we write dates? Currencies? Numbers? Percentages? How are colors (culturally) interpreted? (E.g. white represents mourning or death in Eastern cultures, but Western cultures use black. Red is purity in India, but danger in the U.S.) An image or icon that is acceptable to one culture might be offensive to another. How does one "input" data from character sets with tens of thousands of characters? A big keyboard?

In message formatting use whole messages, not pieces of messages, to deal with differences in word order. In number formatting there are different radix separators, thousands separators, position of negative sign (if indeed there is a symbol for it), position of currency symbols, etc. In date formatting there are different orders for the day, month and year, different separators, and different names for months and days.

In searching and sorting of strings different languages have different rules. CH and LL are single letters in traditional Spanish. Different languages put the same marked characters in different places.
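A few of these issues can be sketched with Java's standard formatting classes. The class and method names below are invented for illustration and are not taken from the talk. The sketch formats the same number and date for two locales, builds a sentence from one whole pattern so that placeholders can be reordered, and sorts words with a locale-aware Collator instead of raw character codes.

```java
import java.text.Collator;
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FormattingDemo {
    // Radix and thousands separators come from the locale, not from the code.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Field order and month names come from the locale as well.
    static String date(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
            .withLocale(locale)
            .format(date);
    }

    // The whole sentence is one pattern, so a translator can reorder {0} and {1}.
    static String message(String pattern, Object... arguments) {
        return new MessageFormat(pattern, Locale.ROOT).format(arguments);
    }

    // Sorts by the locale's collation rules rather than by character code.
    static List<String> sorted(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(number(1234567.891, Locale.US));
        System.out.println(number(1234567.891, Locale.GERMANY));

        LocalDate meeting = LocalDate.of(2004, 3, 3);
        System.out.println(date(meeting, Locale.US));
        System.out.println(date(meeting, Locale.GERMANY));

        System.out.println(message("{0} invited {1}", "Ana", "Luis"));
        System.out.println(message("{1} was invited by {0}", "Ana", "Luis"));

        List<String> words = Arrays.asList("zebra", "Apple", "\u00e9clair", "banana");
        List<String> byCode = new ArrayList<>(words);
        byCode.sort(null); // natural String order, i.e. by UTF-16 code unit
        System.out.println(byCode);
        System.out.println(sorted(words, Locale.ENGLISH));
    }
}
```

Sorting by character code places the accented word after "zebra", while the collator places it among the other words beginning with "e". The exact date strings depend on the locale data shipped with the Java runtime.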

Then there are Character Sets. A Character is an abstract symbol like Plus Sign or Latin Capital Letter A or Musical Flat Sign. A coded character set, or Codeset, is a Repertoire plus a mapping from non-negative integers to characters. The Codepoint of a character in a codeset is the number associated with it. A Glyph is a picture of a character. The Angstrom Sign and Latin Capital Letter A With Ring Above are different characters but have the same Glyph. Examples of character sets are Unicode, UCS, ASCII and ISO8859-x (x in 1..16). There are a number of Character Encoding Schemes. Data is stored and transmitted in bytes (bits, octets, whatever).
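The distinction between characters, codepoints and glyphs can be illustrated with a short sketch. The class and method names are made up. It prints the Unicode name for a few codepoints and shows that the Angstrom Sign and Latin Capital Letter A With Ring Above are different characters to a program, even though they look the same, until the text is normalized.

```java
import java.text.Normalizer;

public class CharacterDemo {
    // Formats a codepoint as "U+XXXX NAME" using the Unicode character name.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {0x002B, 0x0041, 0x266D, 0x00C5, 0x212B};
        for (int codePoint : samples) {
            System.out.println(describe(codePoint));
        }

        String angstromSign = "\u212B";
        String aWithRing = "\u00C5";

        // Same glyph, different codepoints, so a plain comparison fails.
        System.out.println(angstromSign.equals(aWithRing));

        // Normalization form C maps the Angstrom Sign to A With Ring Above.
        String normalized = Normalizer.normalize(angstromSign, Normalizer.Form.NFC);
        System.out.println(normalized.equals(aWithRing));
    }
}
```

This is one reason searching and comparing text usually needs normalization first.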

Everyone knows how numbers are encoded in bytes (one's complement, two's complement, IEEE-754 single, IEEE-754 double, etc.). How are characters encoded? Lots of ways: direct encoding (for small codesets), UTF-8, UTF-16, UTF-32 and many others. UTF-8 has many advantages: ASCII text is unchanged in it, and non-ASCII characters are never encoded using bytes in the ASCII range.
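The difference between a codepoint and its encoded bytes can be seen in a short sketch. The class and helper names are hypothetical. It prints the bytes produced for the same characters under UTF-8 and UTF-16 (big-endian), and checks the property that every byte of a non-ASCII character in UTF-8 has its high bit set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Returns the encoded bytes of the text as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    // True when no byte of the UTF-8 form falls in the ASCII range 0x00..0x7F.
    static boolean usesOnlyHighBytes(String text) {
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Latin capital A, e with acute accent, euro sign.
        String[] samples = {"A", "\u00e9", "\u20ac"};
        for (String sample : samples) {
            System.out.println("UTF-8:    " + hex(sample, StandardCharsets.UTF_8));
            System.out.println("UTF-16BE: " + hex(sample, StandardCharsets.UTF_16BE));
        }
        System.out.println(usesOnlyHighBytes("\u00e9\u20ac"));
    }
}
```

The letter A stays a single byte 41 in UTF-8, exactly as in ASCII, while the accented e takes two bytes and the euro sign takes three.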

Professor Toal provided a number of examples including Perl and XML. XML uses Unicode natively, and an XML document is nothing but Unicode characters. You can specify a character encoding in the XML declaration. He summarized by saying that internationalization (I18n) is important, often makes your code clearer, and is pretty easy. Lots of things should be localized.
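The role of the encoding in the XML declaration can be sketched with the standard Java XML parser. The class name, method names and the greeting element are made up for illustration. The same text is written once as ISO-8859-1 bytes and once as UTF-8 bytes; because each document declares its own encoding, the parser recovers identical Unicode characters from both.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {
    // Builds a tiny document whose bytes really are in the declared encoding.
    static byte[] encode(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                   + "<greeting>" + text + "</greeting>";
        return xml.getBytes(charset);
    }

    // Parses the bytes; the parser reads the declaration to pick the decoder.
    static String read(byte[] xmlBytes) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
            return document.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String text = "Buenos d\u00edas";
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);

        // The byte counts differ, but the parsed characters are the same.
        System.out.println(latin1.length + " bytes vs " + utf8.length + " bytes");
        System.out.println(read(latin1).equals(read(utf8)));
    }
}
```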

You can find the charts for this talk at:
http://www.technocage.com/~ray/talks/i18n.html

These charts include the examples not provided in this DATA-LINK report. And, of course, without attending our meeting you don't get the benefit of Professor Toal's excellent presentation or the opportunity to ask questions and get answers.

This was the seventh meeting of the LA Chapter year and was attended by 17 persons.
Mike Walsh, LA ACM Secretary
 

Join us on Wednesday, April 7th, for our next meeting featuring Chuck Hains of Xerox speaking on "Digital Halftoning."
Mark your calendar!


The Los Angeles Chapter normally meets the first Wednesday of each month at the Ramada Hotel, 6333 Bristol Parkway, Culver City. The program begins at 8 PM.   From the San Diego Freeway (405) take the Sepulveda/Centinela exit southbound or the Slauson/Sepulveda exit northbound.

6:30 p.m.  Social Time

7:00 p.m. Dinner

8:00 p.m.  Presentation

 

Reservations

To make a reservation, call or e-mail John Halbur, (310) 375-7037, and indicate your choice of entree, by Sunday before the dinner meeting.

There is no charge or reservation required to attend the presentation at 8:00 p.m. Parking is FREE!

For membership information, contact Mike Walsh, (818)785-5056 or follow this link.


Other Affiliated groups

SIGAda   SIGCHI SIGGRAPH  SIGPLAN

****************
LA SIGAda


****************

LA  SIGGRAPH

Please visit our website for meeting dates, and news of upcoming events.

For further details contact the SIGPHONE at (310) 288-1148 or at Los_Angeles_Chapter@siggraph.org, or www.siggraph.org/chapters/los_angeles


****************


 Last revision: 2004 0403 - [ Webmaster ]