Using XML For Localization
By Yves Savourel, Localization Solutions Architect, ENLASO Corporation
XML is one of
the safest, most powerful and flexible ways to store, manipulate,
localize and present data in different languages. With the
vast array of internationalization features and companion
technologies, XML provides
many advantages in translation and localization projects.
Unicode -
While the encoding of a saved document can be anything,
the character set of a parsed XML document is always ISO
10646/Unicode and therefore can include characters in
most of the languages used today. Not only can the content
of the XML files consist of almost any Unicode characters,
but the names of elements and attributes can also be made
of non-ASCII characters. For instance, the document shown
below has Japanese and Russian content, and the element
and attribute names are in Russian.
Listing 1 - Non-ASCII XML document:
<?xml version="1.0"?>
<Собрание версия="2.0">
<Объект id="12">
<НомерОбъекта>12-3456-0050</НомерОбъекта>
<ВНаличии>123</ВНаличии>
<Описание xml:lang="ja">第二発電機</Описание>
<Описание xml:lang="ru">Запасной генератор</Описание>
</Объект>
</Собрание>
Encoding Identification -
One of the main sources of problems during the process of
localizing almost any type of data is related to opening
the files using an incorrect encoding. XML solves this issue
by providing a clear mechanism that always ensures (if applied
properly) the identification of the encoding used to save
the document.
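As a small illustration of this mechanism, the sketch below saves the same text in two encodings and lets the parser rely on the encoding declaration of each document. It assumes Python and its standard xml.etree.ElementTree module; the element name msg and the sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same content saved in two different encodings. Each document
# names its own encoding in the XML declaration.
latin1_doc = ('<?xml version="1.0" encoding="iso-8859-1"?>'
              '<msg>caf\u00e9</msg>').encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="utf-8"?>'
            '<msg>caf\u00e9</msg>').encode("utf-8")

# The parser reads the declaration and decodes the bytes accordingly,
# so both documents yield the same Unicode string.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
```

The two byte sequences differ on disk, but the parsed text is identical because each file identifies its own encoding.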
Escape Mechanism -
XML also offers a safe way to escape extended characters
that are not supported by a given encoding. This allows
the document to contain any characters, even when they are
not supported in the encoding the document is using. The
hexadecimal form (&#xHHH;) and the decimal form (&#DDD;)
both use the Unicode code-point value of the character.
For example, the Greek capital letter Omega (Ω) can be escaped
as &#x3A9; or &#937; if a given document uses an
encoding where the Greek alphabet is not supported.
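The following sketch shows the escape mechanism in both directions. It assumes Python's standard xml.etree.ElementTree module; the element name symbol is invented for the example.

```python
import xml.etree.ElementTree as ET

# Both references resolve to U+03A9, GREEK CAPITAL LETTER OMEGA.
elem = ET.fromstring("<symbol>&#x3A9; &#937;</symbol>")
parsed = elem.text

# When serializing to an encoding that has no Greek letters, the
# serializer falls back to numeric character references.
ascii_bytes = ET.tostring(elem, encoding="us-ascii")

# Reading the escaped form back gives the original characters again.
round_trip = ET.fromstring(ascii_bytes).text
```

The escaped output contains only ASCII bytes, yet no information is lost when the document is parsed again.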
Language Identification -
Since an XML document can store data in different languages,
it is important to be able to identify the language of the
content. The XML namespace offers an attribute for this:
xml:lang. The values of xml:lang are the same as the ones
for the lang attribute in HTML. They are defined by
RFC 3066 (an update to RFC 1766), which allows 2- and 3-letter
language codes, with an optional 2-letter country code.
For instance: xml:lang="ar" indicates Arabic in
general, while xml:lang="ar-IQ" identifies Arabic
for Iraq. Note that, unlike other attribute values, the
values of xml:lang are not case sensitive. Note also that
the current values offered by RFC 3066 do not cover all
needs of localization. For instance, there is currently
no code for Latin-American Spanish; and there is no way
to make a distinction between different scripts such as
Azeri in Cyrillic script and Azeri in Latin script.
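To make the behavior of xml:lang more concrete, here is a minimal sketch of how an application might collect the text for one language. It assumes Python's standard xml.etree.ElementTree module; the sample document and the function names language_matches and collect are hypothetical. The sketch treats values as case-insensitive, lets a general code such as "ar" match a more specific one such as "ar-IQ", and lets child elements inherit the language of their parent.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<items xml:lang="en">
<description>Spare generator</description>
<description xml:lang="ru">Запасной генератор</description>
<description xml:lang="AR-iq">(Arabic text for Iraq)</description>
</items>"""

def language_matches(value, wanted):
    """Case-insensitive match; 'ar' also matches 'ar-IQ'."""
    value, wanted = value.lower(), wanted.lower()
    return value == wanted or value.startswith(wanted + "-")

def collect(elem, wanted, inherited=""):
    """Return the text of all elements whose effective language matches."""
    lang = elem.get(XML_LANG, inherited)  # children inherit xml:lang
    found = []
    if elem.text and elem.text.strip() and language_matches(lang, wanted):
        found.append(elem.text.strip())
    for child in elem:
        found.extend(collect(child, wanted, lang))
    return found

root = ET.fromstring(DOC)
english = collect(root, "en")
russian = collect(root, "RU")
arabic = collect(root, "ar")
```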
Transformation Mechanism -
In addition to the features of
XML, some XML-related technologies provide efficient capabilities
to facilitate the implementation of XML multilingual solutions.
For example, XSLT
is a powerful declarative language to manipulate XML data.
It uses XPath,
the standard mechanism to point to any node in an XML document.
Both XSLT and XPath offer internationalization features
such as language identification, sorting, number formatting,
itemization using various digit sets, and so forth.
Rendering -
Lastly, two other technologies related to XML, XSL
(also called XSL-FO, for XSL Formatting Objects) and CSS
(Cascading Style Sheets), give XML a powerful and well
internationalized framework to render content. Combined with
the possibilities offered by XSLT, you can present any given
XML document in a wide variety of outputs.
Migrating to XML
Obviously, not everyone has data
stored in an XML format today. But there are ways to use
XML even in these circumstances, and to pave the way for
a slow and often painless migration to XML, while still
harvesting the benefits of its use early on.
There are three ways to integrate
XML into your process:
Scenario 1: As a temporary source
The localizer uses a filter to convert
or extract/merge the original format to XML, back and forth.
In this first scenario, nothing changes for the developer
or the author of the documentation. The localization provider
can take advantage of XML to fit the proprietary format
into an existing localization process (including for example,
TM leveraging, terminology management, etc.).
Scenario 2: As an alternate source
The developer/author works in XML and
utilizes a one-way converter to create the file normally
used by the application. This is an attractive solution
if you need to keep most of your process untouched. By developing
one conversion tool to go from your new XML format to your
old proprietary format, you keep the benefits of an existing
process, and the advantages of providing XML data for the
localizer. Converters from XML to another format are usually
not costly to develop; in many cases a simple XSLT template
will do.
Scenario 3: As the original source
The original format is in XML.
This is obviously the best way to integrate XML within your
architecture. One type of document that can be easily migrated
to this is HTML. The official hypertext file format recommended
by the W3C is now XHTML, not HTML. The conversion from HTML
to XHTML can be easily done, even automatically in many
cases.
The most important aspect is to
move the XML format into the developer/author domain. By
doing this, you allow the developers and the authors to
use metadata directly within the original source file, ultimately
giving them much more control over what happens to the data
during the localization.
Taking Advantage of XML
Once you have the source data in
an XML format, you can start taking advantage of it in many
different ways:
- Insertion of metadata related
to localization. For example: comments for the translators,
indication of some length limitation for content, etc.
- Addition of unique identifiers
for given sections of text, so you can reuse
existing translations based on ID matching rather than
text matching. The traditional translation memory mechanism
can be used later on for what is left to translate.
- Using technologies such as
XML Linking, you can set up automatic referencing. For
instance: pulling error messages from a resource file
into a help document. This ensures a single translation
and an always-perfect correspondence between the message
in the UI and the one in the documentation.
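The ID-based reuse mentioned in the list above can be sketched as follows. This is only an illustration in Python with the standard xml.etree.ElementTree module: the messages/msg format, the identifiers, and the function name reuse_by_id are all hypothetical, and the sketch assumes that the source text behind an unchanged ID has not changed either.

```python
import xml.etree.ElementTree as ET

SOURCE = """<messages>
<msg id="m1">File not found.</msg>
<msg id="m2">Disk full.</msg>
<msg id="m3">Printer offline.</msg>
</messages>"""

# Hypothetical translations kept from a previous version, keyed by ID.
previous = {"m1": "Fichier introuvable.", "m2": "Disque plein."}

def reuse_by_id(root, translations):
    """Fill in known translations; return the IDs left to translate."""
    remaining = []
    for msg in root.iter("msg"):
        translated = translations.get(msg.get("id"))
        if translated is not None:
            msg.text = translated
        else:
            remaining.append(msg.get("id"))
    return remaining

root = ET.fromstring(SOURCE)
left_for_tm = reuse_by_id(root, previous)
```

Only the entries returned in left_for_tm would then go through the traditional translation memory process.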
In some cases, you can go one step further
and use existing XML vocabularies instead of inventing your
own. This will increase interoperability for your data.
In past years, several specialized XML applications
have been developed for particular purposes.
Using these formats when appropriate
can bring many additional benefits.
Translating XML Documents
While the support among translation
tools for XML is improving, it is still not perfect and
you may have to work around many problems if the XML schemas
you are using are somewhat complex. Here are some potential
issues:
- Very few tools can deal with
multilingual files. The xml:lang attribute is not recognized
as a special attribute, and is neither taken into account nor updated.
- Do not expect tools to give
you back translated files with CDATA sections as they
were in the original files. Some products don't even handle
CDATA correctly.
- Most tools, with the exception
of very few like GlobalSight System 4, work with an element/attribute
paradigm rather than a node paradigm. In other
words, they can't deal with element content that should
be translated or not depending on conditions such as a
specific value for a given attribute.
One way to work around this last problem
is to slightly modify the original file to provide better
input for tools with limited XML support. For example,
the XML document below has only one element content to be
translated: "Cancel". However, very few tools
will be able to set the necessary conditions for that because
the case is too complex for them: translate only the
content of any <data> element if it has an attribute
type set to "text", and if it is inside a <component>
element that has an attribute type set to "caption".
Listing 2 - Example of UI description
in XML:
<?xml version="1.0"?>
<dialogue xml:lang="en-gb">
<rsrc id="123">
<component id="456" type="image">
<!-- Do not translate -->
<data type="text">images/cancel.gif</data>
<data type="coordinates">12,20,50,14</data>
</component>
<component id="789" type="caption">
<!-- Translate -->
<data type="text">Cancel</data>
<data type="coordinates">12,34,50,14</data>
</component>
</rsrc>
</dialogue>
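For comparison, the condition itself is easy to state as a node-level selection. The sketch below expresses it in Python with the standard xml.etree.ElementTree module, which supports a limited subset of XPath; it is meant only to show what a node paradigm looks like, not how any particular translation tool works.

```python
import xml.etree.ElementTree as ET

UI = """<dialogue xml:lang="en-gb">
<rsrc id="123">
<component id="456" type="image">
<data type="text">images/cancel.gif</data>
<data type="coordinates">12,20,50,14</data>
</component>
<component id="789" type="caption">
<data type="text">Cancel</data>
<data type="coordinates">12,34,50,14</data>
</component>
</rsrc>
</dialogue>"""

root = ET.fromstring(UI)

# Select <data type="text"> only when its parent is a
# <component type="caption">: the condition described above.
path = ".//component[@type='caption']/data[@type='text']"
translatable = [data.text for data in root.findall(path)]
```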
One method to address this type of issue that will work
with any tool is to reduce the conditions into a new temporary
element. For instance, add an element <tbt> (to be
translated) where the content needs to be localized. The
tools can then use a simple condition: translate the
content of any <tbt> element. If the document
has to be validated during the localization process, you
may have to declare it as part of a different namespace,
otherwise you can use it directly and make sure to remove
it after localization.
Listing 3 - Example of UI description
in XML with extra element for translation:
<?xml version="1.0"?>
<dialogue xml:lang="en-gb">
<rsrc id="123">
<component id="456" type="image">
<!-- Do not translate -->
<data type="text">images/cancel.gif</data>
<data type="coordinates">12,20,50,14</data>
</component>
<component id="789" type="caption">
<!-- Translate -->
<data type="text"><tbt>Cancel</tbt></data>
<data type="coordinates">12,34,50,14</data>
</component>
</rsrc>
</dialogue>
A simple XSL transformation template
can be used to automate the changes as shown below.
Listing 4 - XSL Transformation for
adding special translation elements:
<?xml version="1.0" ?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output encoding="utf-8" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template
match="//component[@type='caption']/data[@type='text']">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<tbt><xsl:apply-templates/></tbt>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
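The temporary <tbt> elements have to be removed again once the translation is done. That reverse step could also be written as an XSLT template; the sketch below shows one possible way to do it in Python with the standard xml.etree.ElementTree module. The function names are hypothetical, and the code assumes the wrapper is never the root element.

```python
import xml.etree.ElementTree as ET

def _unwrap(parent, index):
    """Replace parent[index] with its own text and child elements."""
    wrapper = parent[index]
    inner = list(wrapper)

    def append_before(text):
        # Attach text to whatever precedes the wrapper in the parent.
        if not text:
            return
        if index > 0:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + text
        else:
            parent.text = (parent.text or "") + text

    append_before(wrapper.text)
    if inner:
        # Text following the wrapper now follows its last child.
        inner[-1].tail = (inner[-1].tail or "") + (wrapper.tail or "")
    else:
        append_before(wrapper.tail)
    del parent[index]
    for offset, node in enumerate(inner):
        parent.insert(index + offset, node)

def remove_wrappers(root, tag="tbt"):
    """Remove every wrapper element in place, keeping its content."""
    while True:
        hit = next(((p, i) for p in root.iter()
                    for i, c in enumerate(p) if c.tag == tag), None)
        if hit is None:
            return root
        _unwrap(*hit)

translated = ET.fromstring('<data type="text"><tbt>Annuler</tbt></data>')
cleaned = ET.tostring(remove_wrappers(translated), encoding="unicode")
```

The tree is searched again after each removal, which is slow on large files but keeps the sketch short and safe when wrappers are nested.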
In addition to indicating what
element content and/or attribute values need translation,
you also need to provide some other information to the localizer:
- Are there any elements that
have pre-formatted content? (like <pre> in XHTML)
- Are there any elements that
have content other than normal text? (like <script>
in XHTML)
- Which elements should be treated
as inline codes? (like <b> in XHTML)
Just like for other file formats,
there are various strategies for translating XML documents.
The first option is to use an XML-specialized
translation tool that will allow you to edit the text directly
into the original format, as shown below with Trados TagEditor:

As many XML documents are simple
repositories of data that are used in a context different
from the way they are stored, seeing the original file may
not always be a big advantage.
The second option is to use tools
that extract the translatable text into a database and present
it to the localizer in a tabular form. For instance, Catalyst,
from Alchemy Software, uses this mechanism as shown below:

Such applications usually offer
a preview method that reconstructs the original format and
displays it, so the translator can see the data in context.
Obviously the same limitations as in the first option apply
here as well: sometimes XML files are simply not the best
place to see the text in the context it will be used.
Finally, the last option is to prepare
the XML document with a utility that adds a color-coded
RTF layer on top of the content. This way the file can be
opened and translated in a classic word-processor like Word.
This is illustrated below with Wordfast used as the translation
tool:

This method has
the advantage of being more flexible regarding the choice
of translation tool you want to use (or if you do not want
to use any translation tool). It may also be more accommodating
if the original format is not very well supported by translation
tools: you can always add manual formatting to delimit the
translatable parts from the sections to leave untouched.
Regardless of the method utilized,
at some point the localizer has to create some sort of definition
file that stores the localization-related information for the format
to translate: which parts of the content need
translation and which parts need to be left untouched, the
names of the translatable attributes, and so forth.
For example, the following screen
shows how this is done with SDLX from SDL International:

Each tool has its own way of storing
this type of information, but overall, they all use more
or less the same information. At some point, the standardization
of both the information needed and a common format to specify
it would be helpful to everyone.
One of the attractive aspects of
XML is that by applying different style-sheets and/or transformation
templates to the document you can provide different views
of the data. For example: one showing the data as it will
be formatted in the final output, one displaying only items
that have changed, and another showing source and target
side by side for easier comparison, and so on.
Using XSLT allows you to go even
further. You can create templates that validate the document
and display the items with errors. If the XML format you
are using is one of the standard formats such as XLIFF,
you can develop a library of different templates offering
a wide range of functionalities that you can re-use across
different original formats. In the example below, a simple
template provides a comparison between the source and the
target text, flagging any target entries where an ending
exclamation point is missing. You can apply similar methods
to verify (according to each language) other punctuation requirements,
preservation of leading or trailing spaces, missing or extra
variables, etc. This specific example uses a very useful
extension in the Microsoft XSL engine: the ability to call
scripts in the template.
Listing 5 - XSLT template
for verification:
<?xml version="1.0" encoding="iso-8859-1" ?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:rws="http://www.translate.com/xml-verification">
<msxsl:script language="JScript" implements-prefix="rws">
<![CDATA[
var g_nCount = 0;
function ErrorCount () {
return(g_nCount);
}
function CheckLastChar(p_Src, p_Trg, p_Char) {
var sSrc = new String(p_Src.nextNode().text);
if ( sSrc.length<1 ) return("");
var sTrg = new String(p_Trg.nextNode().text);
if ( sTrg.length<1 ) return("Empty Target");
var cTmp = sSrc.charAt(sSrc.length-1);
if ( cTmp == p_Char ) {
if ( cTmp != sTrg.charAt(sTrg.length-1) ) {
g_nCount++;
return("Missing character '" + p_Char +
"' at the end of the target text.");
}
}
return("");
}
]]>
</msxsl:script>
<xsl:template match="text()"/>
<xsl:template match="comment()"/>
<xsl:template match="//alt-trans"/>
<xsl:template match="/xliff">
<html>
<head>
<title>Verification</title>
</head>
<body>
<h1>Verification</h1>
<table border="1" cellspacing="0" cellpadding="3">
<xsl:apply-templates/>
</table>
<p>Number of errors =
<b><xsl:value-of select="rws:ErrorCount()"/></b></p>
</body>
</html>
</xsl:template>
<xsl:template match="//source">
<xsl:variable name="R1"
select="rws:CheckLastChar(.,../target,'!')"/>
<xsl:if test="$R1!=''">
<tr>
<xsl:attribute name="style">background:silver</xsl:attribute>
<td>Error <xsl:value-of select="rws:ErrorCount()"/></td>
<td><xsl:value-of select="$R1"/></td>
</tr>
<tr>
<td>Source:</td>
<td>[<xsl:value-of select="."/>]</td>
</tr>
<tr>
<td>Target:</td>
<td>[<xsl:value-of select="../target"/>]</td>
</tr>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
This type of mechanism allows you to
extend and easily customize the capabilities of any translation
tool. Because most of them offer a function to preview
the translation, the user can observe the results
of the validation directly in his/her working environment, as shown
below for TagEditor:

The Missing Parts
There are still a few pieces that need
to be added to the overall XML framework.
A. Localization Properties
The localization properties are
the information translation tools need in order to correctly
process an XML document. This information needs to be defined
at the document type level. For example: one set of localization
properties for XHTML, another for SVG, another for ebXML,
and so forth. It would be even better if these properties
could be integrated into the XML Schema of each of the document
types. An illustration of such a possible definition file
is shown below.
Listing 6 - Localization properties
for the document in Listing 2:
<?xml version="1.0" ?>
<locprop version="0.1">
<rules name="Example1" root="dialogue">
<element-defaults localize="no"/>
<attribute-defaults localize="no"/>
<rule item="//component[@type='caption']/data[@type='text']"
localize="yes"/>
</rules>
</locprop>
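To show how a tool might consume such a file, here is a rough sketch in Python using the standard xml.etree.ElementTree module. The locprop format is only the illustration given in Listing 6, not an existing standard, and the function name translatable_elements is hypothetical. Because ElementTree accepts only relative paths, the sketch prefixes each rule with a dot, which works for the simple rule used here but not for arbitrary XPath expressions.

```python
import xml.etree.ElementTree as ET

LOCPROP = """<locprop version="0.1">
<rules name="Example1" root="dialogue">
<element-defaults localize="no"/>
<attribute-defaults localize="no"/>
<rule item="//component[@type='caption']/data[@type='text']"
localize="yes"/>
</rules>
</locprop>"""

UI = """<dialogue xml:lang="en-gb">
<rsrc id="123">
<component id="456" type="image">
<data type="text">images/cancel.gif</data>
</component>
<component id="789" type="caption">
<data type="text">Cancel</data>
</component>
</rsrc>
</dialogue>"""

def translatable_elements(doc_root, locprop_root):
    """Apply the default, then each rule in order; return localizable elements."""
    rules = locprop_root.find("rules")
    if rules.get("root") != doc_root.tag:
        raise ValueError("rules do not apply to this document type")
    default = rules.find("element-defaults").get("localize") == "yes"
    decisions = {elem: default for elem in doc_root.iter()}
    for rule in rules.findall("rule"):
        path = "." + rule.get("item")  # make the path relative to the root
        for elem in doc_root.findall(path):
            decisions[elem] = rule.get("localize") == "yes"
    return [elem for elem in doc_root.iter() if decisions[elem]]

selected = translatable_elements(ET.fromstring(UI), ET.fromstring(LOCPROP))
texts = [elem.text for elem in selected]
```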
B. Localization Directives
Localization directives are information
specified at the document instance level to indicate specific
behavior or information for a given part of the content.
For instance: a span of text not to translate in a paragraph
otherwise translatable, a specification of a maximum number
of characters to use for a given field, etc.
Localization directives are to
be used as metadata, using the standard XML namespace mechanism.
This allows you to embed, within the document of your own
document type, common directives that can be understood
by any authoring and translation application. You can imagine
that at some point an author could simply highlight a section
of the text and click a button to make it non-translatable,
or to identify it as a glossary term, etc.
For example, the XHTML document
below includes a few of those possible types
of directives.
Listing 7 - Localization directives
in an XHTML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en" lang="en"
xmlns:loc="urn:the-localization-directives-standard">
<head><title loc:id="100">Title</title></head>
<body>
<h1 id="101">Introduction to <loc:span term="yes">Document
Management</loc:span></h1>
<p id="102">Our company, <loc:span localize="no">Infinite
Wisdom Inc.</loc:span>, provides quality courses on how to
manage your documentation.</p>
</body>
</html>
Such vocabulary could also be re-used
by schema developers to include localization-specific attributes
in their own vocabularies, the same way xml:lang is used
today.
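As an illustration of how an application could read the directives of Listing 7, the sketch below collects the glossary terms and the non-translatable spans. It assumes Python's standard xml.etree.ElementTree module; the namespace URI and the term and localize attributes are the hypothetical ones used in the listing, and the function name read_directives is made up for the example.

```python
import xml.etree.ElementTree as ET

LOC = "urn:the-localization-directives-standard"

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en" lang="en"
xmlns:loc="urn:the-localization-directives-standard">
<head><title loc:id="100">Title</title></head>
<body>
<h1 id="101">Introduction to <loc:span term="yes">Document
Management</loc:span></h1>
<p id="102">Our company, <loc:span localize="no">Infinite
Wisdom Inc.</loc:span>, provides quality courses on how to
manage your documentation.</p>
</body>
</html>"""

def read_directives(root):
    """Return (glossary terms, spans that must not be translated)."""
    terms, protected = [], []
    for span in root.iter("{%s}span" % LOC):
        # Collapse the line breaks of the source into single spaces.
        text = " ".join("".join(span.itertext()).split())
        if span.get("term") == "yes":
            terms.append(text)
        if span.get("localize") == "no":
            protected.append(text)
    return terms, protected

terms, protected = read_directives(ET.fromstring(PAGE))
```

Because the directives live in their own namespace, the same function would work on any other document type that embeds them.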