   1  # This document contains text in Perl "POD" format.
   2  # Use a POD viewer like perldoc or perlman to render it.
   4  =head1 NAME
   6  Locale::Maketext::TPJ13 -- article about software localization
   8  =head1 SYNOPSIS
  10    # This is an article, not a module.
  12  =head1 DESCRIPTION
  14  The following article by Sean M. Burke and Jordan Lachler
  15  first appeared in I<The Perl Journal> #13
  16  and is copyright 1999 The Perl Journal. It appears
  17  courtesy of Jon Orwant and The Perl Journal.  This document may be
  18  distributed under the same terms as Perl itself.
  20  =head1 Localization and Perl: gettext breaks, Maketext fixes
  22  by Sean M. Burke and Jordan Lachler
  24  This article points out cases where gettext (a common system for
  25  localizing software interfaces -- i.e., making them work in the user's
  26  language of choice) fails because of basic differences between human
  27  languages.  This article then describes Maketext, a new system capable
  28  of correctly treating these differences.
  30  =head2 A Localization Horror Story: It Could Happen To You
  32  =over
  34  "There are a number of languages spoken by human beings in this
  35  world."
  37  -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
  38  Identification of Languages"
  40  =back
  42  Imagine that your task for the day is to localize a piece of software
  43  -- and luckily for you, the only output the program emits is two
  44  messages, like this:
  46    I scanned 12 directories.
  48    Your query matched 10 files in 4 directories.
  50  So how hard could that be?  You look at the code that
  51  produces the first item, and it reads:
  53    printf("I scanned %g directories.",
  54           $directory_count);
  56  You think about that, and realize that it doesn't even work right for
  57  English, as it can produce this output:
  59    I scanned 1 directories.
  61  So you rewrite it to read:
  63    printf("I scanned %g %s.",
  64           $directory_count,
  65           $directory_count == 1 ?
  66             "directory" : "directories",
  67    );
  69  ...which does the Right Thing.  (In case you don't recall, "%g" is for
  70  number interpolation (locale-aware when C<use locale> is in effect),
  71  and "%s" is for string interpolation.)
  73  But you still have to localize it for all the languages you're
  74  producing this software for, so you pull Locale::gettext off of CPAN
  75  so you can access the C<gettext> C functions you've heard are standard
  76  for localization tasks.
  78  And you write:
  80    printf(gettext("I scanned %g %s."),
  81           $dir_scan_count,
  82           $dir_scan_count == 1 ?
  83             gettext("directory") : gettext("directories"),
  84    );
  86  But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
  87  that this is not a good idea, since how a single word like "directory"
  88  or "directories" is translated may depend on context -- and this is
  89  true, since in a case language like German or Russian, you may need
  90  these words with a different case ending in the first instance (where the
  91  word is the object of a verb) than in the second instance, which you haven't even
  92  gotten to yet (where the word is the object of a preposition, "in %g
  93  directories") -- assuming these keep the same syntax when translated
  94  into those languages.
  96  So, on the advice of the gettext manual, you rewrite:
  98    printf( $dir_scan_count == 1 ?
  99             gettext("I scanned %g directory.") :
 100             gettext("I scanned %g directories."),
 101           $dir_scan_count );
 103  So, you email your various translators (the boss decides that the
 104  languages du jour are Chinese, Arabic, Russian, and Italian, so you
 105  have one translator for each), asking for translations for "I scanned
 106  %g directory." and "I scanned %g directories.".  When they reply,
 107  you'll put that in the lexicons for gettext to use when it localizes
 108  your software, so that when the user is running under the "zh"
 109  (Chinese) locale, gettext("I scanned %g directory.") will return the
 110  appropriate Chinese text, with a "%g" in there where printf can then
 111  interpolate $dir_scan_count.
 113  Your Chinese translator emails right back -- he says both of these
 114  phrases translate to the same thing in Chinese, because, in linguistic
 115  jargon, Chinese "doesn't have number as a grammatical category" --
 116  whereas English does.  That is, English has grammatical rules that
 117  refer to "number", i.e., whether something is grammatically singular
 118  or plural; and one of these rules is the one that forces nouns to take
 119  a plural suffix (generally "s") when in a plural context, as they are when
 120  they follow a number other than "one" (including, oddly enough, "zero").
 121  Chinese has no such rules, and so has just the one phrase where English
 122  has two.  But, no problem, you can have this one Chinese phrase appear
 123  as the translation for the two English phrases in the "zh" gettext
 124  lexicon for your program.
 126  Emboldened by this, you dive into the second phrase that your software
 127  needs to output: "Your query matched 10 files in 4 directories.".  You notice
 128  that if you want to treat phrases as indivisible, as the gettext
 129  manual wisely advises, you need four cases now, instead of two, to
 130  cover the permutations of singular and plural on the two items,
 131  $directory_count and $file_count.  So you try this:
 133    printf( $file_count == 1 ?
 134      ( $directory_count == 1 ?
 135       gettext("Your query matched %g file in %g directory.") :
 136       gettext("Your query matched %g file in %g directories.") ) :
 137      ( $directory_count == 1 ?
 138       gettext("Your query matched %g files in %g directory.") :
 139       gettext("Your query matched %g files in %g directories.") ),
 140     $file_count, $directory_count,
 141    );
 143  (The case of "1 file in 2 [or more] directories" could, I suppose,
 144  occur in the case of symlinking or something of the sort.)
 146  It occurs to you that this is not the prettiest code you've ever
 147  written, but this seems the way to go.  You mail off to the
 148  translators asking for translations for these four cases.  The
 149  Chinese guy replies with the one phrase that these all translate to in
 150  Chinese, and that phrase has two "%g"s in it, as it should -- but
 151  there's a problem.  He translates it word-for-word back: "In %g
 152  directories contains %g files match your query."  The %g
 153  slots are in an order reverse to what they are in English.  You wonder
 154  how you'll get gettext to handle that.
 156  But you put it aside for the moment, and optimistically hope that the
 157  other translators won't have this problem, and that their languages
 158  will be better behaved -- i.e., that they will be just like English.
 160  But the Arabic translator is the next to write back.  First off, your
 161  code for "I scanned %g directory." or "I scanned %g directories."
 162  assumes there's only singular or plural.  But, to use linguistic
 163  jargon again, Arabic has grammatical number, like English (but unlike
 164  Chinese), but it's a three-term category: singular, dual, and plural.
 165  In other words, the way you say "directory" depends on whether there's
 166  one directory, or I<two> of them, or I<more than two> of them.  Your
 167  test of C<($directory_count == 1)> no longer does the job.  And it means
 168  that where English's grammatical category of number necessitates
 169  only the two permutations of the first sentence based on "directory
 170  [singular]" and "directories [plural]", Arabic has three -- and,
 171  worse, in the second sentence ("Your query matched %g file in %g
 172  directory."), where English has four, Arabic has nine.  You sense
 173  an unwelcome, exponential trend taking shape.
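The growth in the number of strings can be made concrete with a small sketch. The helper name C<number_category> is hypothetical, and the three-way split is only the simplified picture given above, not a full account of Arabic number agreement.

```perl
use strict;
use warnings;

# Hypothetical helper: map a count onto the three-term number category
# described above (one, two, more than two). Zero and other special
# cases are not covered by the text and simply fall through to 'plural'.
sub number_category {
    my ($n) = @_;
    return $n == 1 ? 'singular'
         : $n == 2 ? 'dual'
         :           'plural';
}

# With whole-phrase lookup, one string is needed per combination of
# categories: three for one count, three times three for two counts.
my @categories = qw(singular dual plural);
my @one_slot   = @categories;
my @two_slots  = map { my $first = $_; map { "$first/$_" } @categories }
                 @categories;

printf "%d variants for one count, %d for two counts\n",
    scalar @one_slot, scalar @two_slots;
```

Each further count in a sentence multiplies the number of strings by three again, which is the exponential trend the story refers to.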
 175  Your Italian translator emails you back and says that "I scanned 0
 176  directories" (a possible English output of your program) is stilted,
 177  and if you think that's fine English, that's your problem, but that
 178  I<just will not do> in the language of Dante.  He insists that where
 179  $directory_count is 0, your program should produce the Italian text
 180  for "I I<didn't> scan I<any> directories.".  And ditto for "I didn't
 181  match any files in any directories", although he says the last part
 182  about "in any directories" should probably just be left off.
 184  You wonder how you'll get gettext to handle this; to accommodate the
 185  ways Arabic, Chinese, and Italian deal with numbers in just these few
 186  very simple phrases, you need to write code that will ask gettext for
 187  different queries depending on whether the numerical values in
 188  question are 1, 2, more than 2, or in some cases 0, and you still haven't
 189  figured out the problem with the different word order in Chinese.
 191  Then your Russian translator calls on the phone, to I<personally> tell
 192  you the bad news about how really unpleasant your life is about to
 193  become:
 195  Russian, like German or Latin, is an inflectional language; that is, nouns
 196  and adjectives have to take endings that depend on their case
 197  (e.g., nominative, accusative, or genitive) -- which is roughly a matter of
 198  what role they have in the syntax of the sentence --
 199  as well as on the grammatical gender (i.e., masculine, feminine, neuter)
 200  and number (i.e., singular or plural) of the noun, as well as on the
 201  declension class of the noun.  But unlike with most other inflected languages,
 202  putting a number-phrase (like "ten" or "forty-three", or their Arabic
 203  numeral equivalents) in front of a noun in Russian can change the case and
 204  number that noun is, and therefore the endings you have to put on it.
 206  He elaborates:  In "I scanned %g directories", you'd I<expect>
 207  "directories" to be in the accusative case (since it is the direct
 208  object in the sentence) and the plural number,
 209  except where $directory_count is 1, then you'd expect the singular, of
 210  course.  Just like Latin or German.  I<But!>  Where $directory_count %
 211  10 is 1 ("%" for modulo, remember), assuming $directory_count is an
 212  integer, and except where $directory_count % 100 is 11, "directories"
 213  is forced to become grammatically singular, which means it gets the
 214  ending for the accusative singular...  You begin to visualize the code
 215  it'd take to test for the problem so far, I<and still work for Chinese
 216  and Arabic and Italian>, and how many gettext items that'd take, but
 217  he keeps going...  But where $directory_count % 10 is 2, 3, or 4
 218  (except where $directory_count % 100 is 12, 13, or 14), the word for
 219  "directories" is forced to be genitive singular -- which means another
 220  ending... The room begins to spin around you, slowly at first...  But
 221  with I<all other> integer values, since "directory" is an inanimate
 222  noun, when preceded by a number and in the nominative or accusative
 223  cases (as it is here, just your luck!), it does stay plural, but it is
 224  forced into the genitive case -- yet another ending...  And
 225  you never hear him get to the part about how you're going to run into
 226  similar (but maybe subtly different) problems with other Slavic
 227  languages like Polish, because the floor comes up to meet you, and you
 228  fade into unconsciousness.
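For readers who stayed conscious, the translator's rule can be written down as a short function. This is a minimal sketch of only what he says above; the name C<russian_noun_form> is hypothetical, and the function returns a description of the required form, not an actual inflected Russian word.

```perl
use strict;
use warnings;

# Hypothetical helper: which form of an inanimate noun follows the
# integer $n in an accusative context, per the rule described above.
sub russian_noun_form {
    my ($n) = @_;
    my ($mod10, $mod100) = ($n % 10, $n % 100);

    # 1, 21, 31, ... but not 11, 111, ...
    return 'accusative singular'
        if $mod10 == 1 && $mod100 != 11;

    # 2-4, 22-24, ... but not 12-14, 112-114, ...
    return 'genitive singular'
        if $mod10 >= 2 && $mod10 <= 4
        && !($mod100 >= 12 && $mod100 <= 14);

    # Everything else: 0, 5-20, 25-30, ...
    return 'genitive plural';
}

printf "%3d -> %s\n", $_, russian_noun_form($_)
    for 1, 2, 5, 11, 12, 21, 22, 101, 112;
```

Note that the test is on the last one or two digits, so no fixed set of gettext strings keyed on "1", "2", or "many" can reproduce it.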
 231  The above cautionary tale relates how an attempt at localization can
 232  lead from programmer consternation, to program obfuscation, to a need
 233  for sedation.  But careful evaluation shows that your choice of tools
 234  merely needed further consideration.
 236  =head2 The Linguistic View
 238  =over
 240  "It is more complicated than you think." 
 242  -- The Eighth Networking Truth, from RFC 1925
 244  =back
 246  The field of Linguistics has expended a great deal of effort over the
 247  past century trying to find grammatical patterns which hold across
 248  languages; it's been a constant process
 249  of people making generalizations that should apply to all languages,
 250  only to find out that, all too often, these generalizations fail --
 251  sometimes failing for just a few languages, sometimes whole classes of
 252  languages, and sometimes nearly every language in the world except
 253  English.  Broad statistical trends are evident in what the "average
 254  language" is like as far as what its rules can look like, must look
 255  like, and cannot look like.  But the "average language" is just as
 256  unreal a concept as the "average person" -- it runs up against the
 257  fact no language (or person) is, in fact, average.  The wisdom of past
 258  experience leads us to believe that any given language can do whatever
 259  it wants, in any order, with appeal to any kind of grammatical
 260  categories it wants -- case, number, tense, real or metaphoric
 261  characteristics of the things that words refer to, arbitrary or
 262  predictable classifications of words based on what endings or prefixes
 263  they can take, degree or means of certainty about the truth of
 264  statements expressed, and so on, ad infinitum.
 266  Mercifully, most localization tasks are a matter of finding ways to
 267  translate whole phrases, generally sentences, where the context is
 268  relatively set, and where the only variation in content is I<usually>
 269  in a number being expressed -- as in the example sentences above.
 270  Translating specific, fully-formed sentences is, in practice, fairly
 271  foolproof -- which is good, because that's what's in the phrasebooks
 272  that so many tourists rely on.  Now, a given phrase (whether in a
 273  phrasebook or in a gettext lexicon) in one language I<might> have a
 274  greater or lesser applicability than that phrase's translation into
 275  another language -- for example, strictly speaking, in Arabic, the
 276  "your" in "Your query matched..." would take a different form
 277  depending on whether the user is male or female; so the Arabic
 278  translation "your[feminine] query" is applicable in fewer cases than
 279  the corresponding English phrase, which doesn't distinguish the user's
 280  gender.  (In practice, it's not feasible to have a program know the
 281  user's gender, so the masculine "you" in Arabic is usually used, by
 282  default.)
 284  But in general, such surprises are rare when entire sentences are
 285  being translated, especially when the functional context is restricted
 286  to that of a computer interacting with a user either to convey a fact
 287  or to prompt for a piece of information.  So, for purposes of
 288  localization, translation by phrase (generally by sentence) is both the
 289  simplest and the least problematic.
 291  =head2 Breaking gettext
 293  =over
 295  "It Has To Work."
 297  -- First Networking Truth, RFC 1925
 299  =back
 301  Consider that sentences in a tourist phrasebook are of two types: ones
 302  like "How do I get to the marketplace?" that don't have any blanks to
 303  fill in, and ones like "How much do these ___ cost?", where there's
 304  one or more blanks to fill in (and these are usually linked to a
 305  list of words that you can put in that blank: "fish", "potatoes",
 306  "tomatoes", etc.)  The ones with no blanks are no problem, but the
 307  fill-in-the-blank ones may not be really straightforward. If it's a
 308  Swahili phrasebook, for example, the authors probably didn't bother to
 309  tell you the complicated ways that the verb "cost" changes its
 310  inflectional prefix depending on the noun you're putting in the blank.
 311  The trader in the marketplace will still understand what you're saying if
 312  you say "how much do these potatoes cost?" with the wrong
 313  inflectional prefix on "cost".  After all, I<you> can't speak proper Swahili,
 314  I<you're> just a tourist.  But while tourists can be stupid, computers
 315  are supposed to be smart; the computer should be able to fill in the
 316  blank, and still have the results be grammatical.
 318  In other words, a phrasebook entry takes some values as parameters
 319  (the things that you fill in the blank or blanks), and provides a value
 320  based on these parameters, where the way you get that final value from
 321  the given values can, properly speaking, involve an arbitrarily
 322  complex series of operations.  (In the case of Chinese, it'd be not at
 323  all complex, at least in cases like the examples at the beginning of
 324  this article; whereas in the case of Russian it'd be a rather complex
 325  series of operations.  And in some languages, the
 326  complexity could be spread around differently: while the act of
 327  putting a number-expression in front of a noun phrase might not be
 328  complex by itself, it may change how you have to, for example, inflect
 329  a verb elsewhere in the sentence.  This is what in syntax is called
 330  "long-distance dependencies".)
 332  This talk of parameters and arbitrary complexity is just another way
 333  to say that an entry in a phrasebook is what in a programming language
 334  would be called a "function".  Just so you don't miss it, this is the
 335  crux of this article: I<A phrase is a function; a phrasebook is a
 336  bunch of functions.>
 338  The reason that using gettext runs into walls (as in the above
 339  second-person horror story) is that you're trying to use a string (or
 340  worse, a choice among a bunch of strings) to do what you really need a
 341  function for -- which is futile.  Performing (s)printf interpolation
 342  on the strings which you get back from gettext does allow you to do I<some>
 343  common things passably well... sometimes... sort of; but, to paraphrase
 344  what some people say about C<csh> script programming, "it fools you
 345  into thinking you can use it for real things, but you can't, and you
 346  don't discover this until you've already spent too much time trying,
 347  and by then it's too late."
 349  =head2 Replacing gettext
 351  So, what needs to replace gettext is a system that supports lexicons
 352  of functions instead of lexicons of strings.  An entry in a lexicon
 353  from such a system should I<not> look like this:
 355    "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"
 357  [\xE9 is e-acute in Latin-1.  Some pod renderers would
 358  scream if I used the actual character here. -- SB]
 360  but instead like this, bearing in mind that this is just a first stab:
 362    sub I_found_X1_files_in_X2_directories {
 363      my( $files, $dirs ) = @_[0,1];
 364      $files = sprintf("%g %s", $files,
 365        $files == 1 ? 'fichier' : 'fichiers');
 366      $dirs = sprintf("%g %s", $dirs,
 367        $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
 368      return "J'ai trouv\xE9 $files dans $dirs.";
 369    }
 371  Now, there's no particularly obvious way to store anything but strings
 372  in a gettext lexicon; so it looks like we just have to start over and
 373  make something better, from scratch.  I call my shot at a
 374  gettext-replacement system "Maketext", or, in CPAN terms,
 375  Locale::Maketext.
 377  When designing Maketext, I chose to plan its main features in terms of
 378  "buzzword compliance".  And here are the buzzwords:
 380  =head2 Buzzwords: Abstraction and Encapsulation
 382  The complexity of the language you're trying to output a phrase in is
 383  entirely abstracted inside (and encapsulated within) the Maketext module
 384  for that interface.  When you call:
 386    print $lang->maketext("You have [quant,_1,piece] of new mail.",
 387                         scalar(@messages));
 389  you don't know (and in fact can't easily find out) whether this will
 390  involve lots of figuring, as in Russian (if $lang is a handle to the
 391  Russian module), or relatively little, as in Chinese.  That kind of
 392  abstraction and encapsulation may encourage other pleasant buzzwords
 393  like modularization and stratification, depending on what design
 394  decisions you make.
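As a rough, self-contained illustration of that call, here is a sketch using the Locale::Maketext module as distributed with Perl. The class names C<MyApp::L10N> and C<MyApp::L10N::en> are assumptions made for the example, and the C<_AUTO> lexicon entry is used so that the English key can serve as its own shorthand.

```perl
use strict;
use warnings;

# Hypothetical project base class for all language modules.
{
    package MyApp::L10N;
    use parent 'Locale::Maketext';
}

# English module. _AUTO => 1 lets any key act as its own entry.
{
    package MyApp::L10N::en;
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = ( _AUTO => 1 );
}

package main;

my $lang = MyApp::L10N->get_handle('en')
    or die "no language handle available";

my @messages = ('first', 'second', 'third');
print $lang->maketext("You have [quant,_1,piece] of new mail.",
                      scalar(@messages)), "\n";
```

The calling code never tests the count itself; how the noun is made to agree with the number is left to the C<quant> method of whichever language class the handle belongs to.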
 396  =head2 Buzzword: Isomorphism
 398  "Isomorphism" means "having the same structure or form"; in discussions
 399  of program design, the word takes on the special, specific meaning that
 400  your implementation of a solution to a problem I<has the same
 401  structure> as, say, an informal verbal description of the solution, or
 402  maybe of the problem itself.  Isomorphism is, all things considered,
 403  a good thing -- it's what problem-solving (and solution-implementing)
 404  should look like.
 406  What's wrong with gettext-using code like this...
 408    printf( $file_count == 1 ?
 409      ( $directory_count == 1 ?
 410       "Your query matched %g file in %g directory." :
 411       "Your query matched %g file in %g directories." ) :
 412      ( $directory_count == 1 ?
 413       "Your query matched %g files in %g directory." :
 414       "Your query matched %g files in %g directories." ),
 415     $file_count, $directory_count,
 416    );
 418  is first off that it's not well abstracted -- these ways of testing
 419  for grammatical number (as in the expressions like C<foo == 1 ?
 420  singular_form : plural_form>) should be abstracted to each language
 421  module, since how you get grammatical number is language-specific.
 423  But second off, it's not isomorphic -- the "solution" (i.e., the
 424  phrasebook entries) for Chinese maps from these four English phrases to
 425  the one Chinese phrase that fits for all of them.  In other words, the
 426  informal solution would be "The way to say what you want in Chinese is
 427  with the one phrase 'For your question, in Y directories you would
 428  find X files'" -- and so the implemented solution should be,
 429  isomorphically, just a straightforward way to spit out that one
 430  phrase, with numerals properly interpolated.  It shouldn't have to map
 431  from the complexity of other languages to the simplicity of this one.
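A sketch of what such an isomorphic entry might look like with Locale::Maketext follows. The class names are hypothetical, and, as elsewhere in this article, English words stand in for the Chinese text. The C<[_1]> and C<[_2]> slots refer to the parameters by position, so the translation is free to use them in any order.

```perl
use strict;
use warnings;

# Hypothetical project base class.
{
    package MyApp::L10N;
    use parent 'Locale::Maketext';
}

# Stand-in for the Chinese module: one phrase, parameters reordered.
{
    package MyApp::L10N::zh;
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (
        'Your query matched [_1] files in [_2] directories.'
            => 'For your question, in [_2] directories you would find [_1] files.',
    );
}

package main;

my $zh = MyApp::L10N->get_handle('zh')
    or die "no language handle available";

print $zh->maketext('Your query matched [_1] files in [_2] directories.',
                    10, 4), "\n";
```

There is one entry because the language needs only one phrase; nothing in the calling code or in this module mentions singular or plural.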
 433  =head2 Buzzword: Inheritance
 435  There's a great deal of reuse possible for sharing of phrases between
 436  modules for related dialects, or for sharing of auxiliary functions
 437  between related languages.  (By "auxiliary functions", I mean
 438  functions that don't produce phrase-text, but which, say, return an
 439  answer to "does this number require a plural noun after it?".  Such
 440  auxiliary functions would be used in the internal logic of functions
 441  that actually do produce phrase-text.)
 443  In the case of sharing phrases, consider that you have an interface
 444  already localized for American English (probably by having been
 445  written with that as the native locale, but that's incidental).
 446  Localizing it for UK English should, in practical terms, be just a
 447  matter of running it past a British person with the instructions to
 448  indicate what few phrases would benefit from a change in spelling or
 449  possibly minor rewording.  In that case, you should be able to put in
 450  the UK English localization module I<only> those phrases that are
 451  UK-specific, and for all the rest, I<inherit> from the American
 452  English module.  (And I expect this same situation would apply with
 453  Brazilian and Continental Portuguese, possibly with some I<very>
 454  closely related languages like Czech and Slovak, and possibly with the
 455  slightly different "versions" of written Mandarin Chinese, as I hear exist in
 456  Taiwan and mainland China.)
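Phrase inheritance of this kind can be sketched with ordinary Perl class inheritance, which is how Locale::Maketext finds lexicons. The class names and lexicon keys below are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical project base class.
{
    package MyApp::L10N;
    use parent 'Locale::Maketext';
}

# American English: the full set of phrases.
{
    package MyApp::L10N::en_us;
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (
        'color.prompt' => 'Pick a color.',
        'file.saved'   => 'File saved.',
    );
}

# UK English: only the phrases that differ; the rest are inherited.
{
    package MyApp::L10N::en_gb;
    use parent -norequire, 'MyApp::L10N::en_us';
    our %Lexicon = (
        'color.prompt' => 'Pick a colour.',
    );
}

package main;

my $gb = MyApp::L10N->get_handle('en-GB')
    or die "no language handle available";

print $gb->maketext('color.prompt'), "\n";   # from en_gb
print $gb->maketext('file.saved'),   "\n";   # inherited from en_us
```

The UK module stays as small as the list of differences the British reviewer hands back.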
 458  As to sharing of auxiliary functions, consider the problem of Russian
 459  numbers from the beginning of this article; obviously, you'd want to
 460  write only once the hairy code that, given a numeric value, would
 461  return some specification of which case and number a given quantified
 462  noun should use.  But suppose that you discover, while localizing an
 463  interface for, say, Ukrainian (a Slavic language related to Russian,
 464  spoken by tens of millions of people, many of whom would be relieved to
 465  find that your Web site's or software's interface is available in
 466  their language), that the rules in Ukrainian are the same as in Russian
 467  for quantification, and probably for many other grammatical functions.
 468  While there may well be no phrases in common between Russian and
 469  Ukrainian, you could still choose to have the Ukrainian module inherit
 470  from the Russian module, just for the sake of inheriting all the
 471  various grammatical methods.  Or, probably better organizationally,
 472  you could move those functions to a module called C<_E_Slavic> or
 473  something, which Russian and Ukrainian could inherit useful functions
 474  from, but which would (presumably) provide no lexicon.
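A minimal sketch of that arrangement, in plain Perl classes without any lexicons, might look like this. All class and method names are hypothetical, and the placeholder strings FORM-A, FORM-B, and FORM-C stand for whatever three noun forms a language module would supply.

```perl
use strict;
use warnings;

# Shared grammatical logic; provides no lexicon of its own.
{
    package MyApp::L10N::_E_Slavic;

    sub new { return bless {}, shift }

    # 0 after 1, 21, 31, ... (not 11); 1 after 2-4, 22-24, ...
    # (not 12-14); 2 after everything else.
    sub form_index {
        my ($self, $n) = @_;
        my ($mod10, $mod100) = ($n % 10, $n % 100);
        return 0 if $mod10 == 1 && $mod100 != 11;
        return 1 if $mod10 >= 2 && $mod10 <= 4
                 && !($mod100 >= 12 && $mod100 <= 14);
        return 2;
    }

    # @forms holds three noun forms, in form_index order.
    sub quant {
        my ($self, $n, @forms) = @_;
        return "$n " . $forms[ $self->form_index($n) ];
    }
}

# Both language modules inherit the hairy code instead of copying it.
{
    package MyApp::L10N::ru;
    use parent -norequire, 'MyApp::L10N::_E_Slavic';
}
{
    package MyApp::L10N::uk;
    use parent -norequire, 'MyApp::L10N::_E_Slavic';
}

package main;

for my $class ('MyApp::L10N::ru', 'MyApp::L10N::uk') {
    my $handle = $class->new;
    print "$class: ", $handle->quant(22, 'FORM-A', 'FORM-B', 'FORM-C'), "\n";
}
```

If the two languages later turn out to differ in some detail, either module can override just that method.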
 476  =head2 Buzzword: Concision
 478  Okay, concision isn't a buzzword.  But it should be, so I decree that
 479  as a new buzzword, "concision" means that simple common things should
 480  be expressible in very few lines (or maybe even just a few characters)
 481  of code -- call it a special case of "making simple things easy and
 482  hard things possible", and see also the role it played in the
 483  MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].
 485  Consider our first stab at an entry in our "phrasebook of functions":
 487    sub I_found_X1_files_in_X2_directories {
 488      my( $files, $dirs ) = @_[0,1];
 489      $files = sprintf("%g %s", $files,
 490        $files == 1 ? 'fichier' : 'fichiers');
 491      $dirs = sprintf("%g %s", $dirs,
 492        $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
 493      return "J'ai trouv\xE9 $files dans $dirs.";
 494    }
 496  You may sense that a lexicon (to use a non-committal catch-all term for a
 497  collection of things you know how to say, regardless of whether they're
 498  phrases or words) consisting of functions I<expressed> as above would
 499  make for rather long-winded and repetitive code -- even if you wisely
 500  rewrote this to have quantification (as we call adding a number
 501  expression to a noun phrase) be a function called like:
 503    sub I_found_X1_files_in_X2_directories {
 504      my( $files, $dirs ) = @_[0,1];
 505      $files = quant($files, "fichier");
 506      $dirs =  quant($dirs,  "r\xE9pertoire");
 507      return "J'ai trouv\xE9 $files dans $dirs.";
 508    }
 510  And you may also sense that you do not want to bother your translators
 511  with having to write Perl code -- you'd much rather that they spend
 512  their I<very costly time> on just translation.  And this is to say
 513  nothing of the near impossibility of finding a commercial translator
 514  who would know even simple Perl.
 516  In a first-hack implementation of Maketext, each language-module's
 517  lexicon looked like this:
 519   %Lexicon = (
 520     "I found %g files in %g directories"
 521     => sub {
 522        my( $files, $dirs ) = @_[0,1];
 523        $files = quant($files, "fichier");
 524        $dirs =  quant($dirs,  "r\xE9pertoire");
 525        return "J'ai trouv\xE9 $files dans $dirs.";
 526      },
 527    ... and so on with other phrase => sub mappings ...
 528   );
 530  but I immediately went looking for some more concise way to basically
 531  denote the same phrase-function -- a way that would also serve to
 532  concisely denote I<most> phrase-functions in the lexicon for I<most>
 533  languages.  After much time and even some actual thought, I decided on
 534  this system:
 536  * Where a value in a %Lexicon hash is a contentful string instead of
 537  an anonymous sub (or, conceivably, a coderef), it would be interpreted
 538  as a sort of shorthand expression of what the sub does.  When accessed
 539  for the first time in a session, it is parsed, turned into Perl code,
 540  and then eval'd into an anonymous sub; then that sub replaces the
 541  original string in that lexicon.  (That way, the work of parsing and
 542  evaling the shorthand form for a given phrase is done no more than
 543  once per session.)
 545  * Calls to C<maketext> (as Maketext's main function is called) happen
 546  thru a "language session handle", notionally very much like an IO
 547  handle, in that you open one at the start of the session, and use it
 548  for "sending signals" to an object in order to have it return the text
 549  you want.
 551  So, this:
 553    $lang->maketext("You have [quant,_1,piece] of new mail.",
 554                   scalar(@messages));
 556  basically means this: look in the lexicon for $lang (which may inherit
 557  from any number of other lexicons), and find the function that we
 558  happen to associate with the string "You have [quant,_1,piece] of new
 559  mail" (which is, and should be, a functioning "shorthand" for this
 560  function in the native locale -- English in this case).  If you find
 561  such a function, call it with $lang as its first parameter (as if it
 562  were a method), and then a copy of scalar(@messages) as its second,
 563  and then return that value.  If that function was found, but was in
 564  string shorthand instead of being a fully specified function, parse it
 565  and make it into a function before calling it the first time.
 567  * The shorthand uses code in brackets to indicate method calls that
 568  should be performed.  A full explanation is not in order here, but a
 569  few examples will suffice:
 571    "You have [quant,_1,piece] of new mail."
 573  The above code is shorthand for, and will be interpreted as,
 574  this:
 576    sub {
 577      my $handle = $_[0];
 578      my(@params) = @_;
 579      return join '',
 580        "You have ",
 581        $handle->quant($params[1], 'piece'),
 582        " of new mail.";
 583    }
 585  where "quant" is the name of a method you're using to quantify the
 586  noun "piece" with the number $params[1].
 588  A string with no brackety calls, like this:
 590    "Your search expression was malformed."
 592  is somewhat of a degenerate case, and just gets turned into:
 594    sub { return "Your search expression was malformed." }
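The compile-once behavior described above can be sketched with a toy implementation. This is not Maketext's actual code: it builds closures instead of generating and eval'ing Perl source, it understands only plain C<[method,arg,...]> calls, and every name in it (C<ToyHandle>, C<compile_shorthand>, C<toy_maketext>) is made up for the example.

```perl
use strict;
use warnings;

# Toy lexicon: values start out as shorthand strings.
my %Lexicon = (
    'mail'  => 'You have [quant,_1,piece] of new mail.',
    'error' => 'Your search expression was malformed.',
);

# Stand-in handle class with a cheap English-only quant method.
{
    package ToyHandle;
    sub new   { return bless {}, shift }
    sub quant {
        my ($self, $num, $noun) = @_;
        return $num == 1 ? "$num $noun" : "$num ${noun}s";
    }
}

# Turn a shorthand string into a sub. Literal text is kept as is;
# "[method,arg,...]" becomes a method call on the handle, with _N
# replaced by the Nth parameter passed to the phrase.
sub compile_shorthand {
    my ($text) = @_;
    my @parts;
    for my $chunk (split /(\[[^\]]*\])/, $text) {
        next if $chunk eq '';
        if ($chunk =~ /^\[(.*)\]$/) {
            my ($method, @args) = split /,/, $1;
            push @parts, sub {
                my ($handle, @params) = @_;
                my @real = map { /^_(\d+)$/ ? $params[$1 - 1] : $_ } @args;
                return $handle->$method(@real);
            };
        }
        else {
            my $literal = $chunk;
            push @parts, sub { return $literal };
        }
    }
    return sub { my @in = @_; return join '', map { $_->(@in) } @parts };
}

sub toy_maketext {
    my ($handle, $key, @params) = @_;
    my $entry = $Lexicon{$key};
    die "no entry for '$key'\n" unless defined $entry;
    # Compile on first use, then store the sub in place of the string.
    $entry = $Lexicon{$key} = compile_shorthand($entry) unless ref $entry;
    return $entry->($handle, @params);
}

my $handle = ToyHandle->new;
print toy_maketext($handle, 'mail', 3), "\n";
print toy_maketext($handle, 'mail', 1), "\n";
```

After the first call, the C<mail> entry in the lexicon is a code reference, so later calls skip the parsing step entirely; the C<error> entry stays a string until someone asks for it.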
 596  However, not everything you can write in Perl code can be written in
 597  the above shorthand system -- not by a long shot.  For example, consider
 598  the Italian translator from the beginning of this article, who wanted
 599  the Italian for "I didn't find any files" as a special case, instead
 600  of "I found 0 files".  That couldn't be specified (at least not easily
 601  or simply) in our shorthand system, and it would have to be written
 602  out in full, like this:
 604    sub {  # pretend the English strings are in Italian
 605      my($handle, $files, $dirs) = @_[0,1,2];
 606      return "I didn't find any files" unless $files;
 607      return join '',
 608        "I found ",
 609        $handle->quant($files, 'file'),
 610        " in ",
 611        $handle->quant($dirs,  'directory'),
 612        ".";
 613    }
 615  Next to a lexicon full of shorthand code, that sort of sticks out like a
 616  sore thumb -- but this I<is> a special case, after all; and at least
 617  it's possible, if not as concise as usual.
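To show how such a written-out entry can sit beside shorthand entries, here is a sketch using Locale::Maketext, where a lexicon value may be either a string or a code reference. The class names and keys are assumptions for the example, and, as above, English strings stand in for the Italian.

```perl
use strict;
use warnings;

# Hypothetical project base class.
{
    package MyApp::L10N;
    use parent 'Locale::Maketext';
}

# Stand-in for the Italian module (pretend the strings are Italian).
{
    package MyApp::L10N::it;
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (
        # An ordinary entry in shorthand.
        'Your search expression was malformed.'
            => 'Your search expression was malformed.',

        # The special case, written out in full as a sub.
        'I found [quant,_1,file] in [quant,_2,directory,directories].'
            => sub {
                my ($handle, $files, $dirs) = @_;
                return "I didn't find any files." unless $files;
                return join '',
                    'I found ',
                    $handle->quant($files, 'file'),
                    ' in ',
                    $handle->quant($dirs, 'directory', 'directories'),
                    '.';
            },
    );
}

package main;

my $it = MyApp::L10N->get_handle('it')
    or die "no language handle available";

my $key = 'I found [quant,_1,file] in [quant,_2,directory,directories].';
print $it->maketext($key, 0, 0),  "\n";
print $it->maketext($key, 10, 4), "\n";
```

The caller cannot tell which kind of entry it hit; both are reached through the same C<maketext> call.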
 619  As to how you'd implement the Russian example from the beginning of
 620  the article, well, There's More Than One Way To Do It, but it could be
 621  something like this (using English words for Russian, just so you know
 622  what's going on):
 624    "I [quant,_1,directory,accusative] scanned."
 626  This shifts the burden of complexity off to the quant method.  That
 627  method's parameters are: the numeric value it's going to use to
 628  quantify something; the Russian word it's going to quantify; and the
 629  parameter "accusative", which you're using to mean that this
 630  sentence's syntax wants a noun in the accusative case there, although
 631  that quantification method may have to overrule, for grammatical
 632  reasons you may recall from the beginning of this article.
 634  Now, the Russian quant method here is responsible not only for
 635  implementing the strange logic necessary for figuring out how Russian
 636  number-phrases impose case and number on their noun-phrases, but also
 637  for inflecting the Russian word for "directory".  How that inflection
 638  is to be carried out is no small issue, and among the solutions I've
 639  seen, some (like variations on a simple lookup in a hash where all
 640  possible forms are provided for all necessary words) are
 641  straightforward but I<can> become cumbersome when you need to inflect
 642  more than a few dozen words; and other solutions (like using
 643  algorithms to model the inflections, storing only root forms and
 644  irregularities) I<can> involve more overhead than is justifiable for
 645  all but the largest lexicons.
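The hash-lookup approach could be sketched roughly as follows. This is only an illustration under several assumptions: the class names are hypothetical, the table covers a single noun using transliterated forms of the Russian word for "directory", and the number rule is the simplified last-digit rule (1 takes the requested case in the singular, 2 to 4 take the genitive singular, everything else the genitive plural, with 11 to 14 counted among "everything else").

```perl
use strict;
use warnings;

package BogoQuery::Locale;          # hypothetical project base class
use base 'Locale::Maketext';

package BogoQuery::Locale::ru;      # hypothetical Russian language class
our @ISA     = ('BogoQuery::Locale');
our %Lexicon = (
    # English words stand in for Russian ones, as in the article.
    'scanned' => 'I [quant,_1,directory,accusative] scanned.',
);

# Lookup table of inflected forms (transliterated), one entry per noun.
my %FORMS = (
    directory => {
        accusative_singular => 'katalog',
        genitive_singular   => 'kataloga',
        genitive_plural     => 'katalogov',
    },
);

# Overrides the inherited quant; called as quant($number, $noun, $case).
sub quant {
    my ($handle, $num, $noun, $case) = @_;
    my $forms = $FORMS{$noun} or die "No inflection table for '$noun'";

    my $last_digit = $num % 10;
    my $last_two   = $num % 100;

    my $form;
    if ($last_digit == 1 && $last_two != 11) {
        # 1, 21, 31, ...: the case the sentence asked for, singular.
        $form = $forms->{"${case}_singular"};
    }
    elsif ($last_digit >= 2 && $last_digit <= 4
        && !($last_two >= 12 && $last_two <= 14)) {
        # 2-4, 22-24, ...: the number overrules the sentence's case.
        $form = $forms->{genitive_singular};
    }
    else {
        # 0, 5-20, 25-30, ...: genitive plural.
        $form = $forms->{genitive_plural};
    }
    return "$num $form";
}

package main;

my $lh = BogoQuery::Locale->get_handle('ru')
    or die "Could not get a language handle";

print $lh->maketext('scanned', $_), "\n" for 1, 3, 5, 11, 21;
```

Every additional noun means another entry in the table, which is where this approach starts to become cumbersome.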
 647  Mercifully, this design decision becomes crucial only in the hairiest
 648  of inflected languages, of which Russian is by no means the I<worst> case
 649  scenario, but is worse than most.  Most languages have simpler
 650  inflection systems; for example, in English or Swahili, there are
 651  generally no more than two possible inflected forms for a given noun
 652  ("error/errors"; "kosa/makosa"), and the
 653  rules for producing these forms are fairly simple -- or at least,
 654  simple rules can be formulated that work for most words, and you can
 655  then treat the exceptions as just "irregular", at least relative to
 656  your ad hoc rules.  A simpler inflection system (simpler rules, fewer
 657  forms) means that design decisions are less crucial to maintaining
 658  sanity, whereas the same decisions could incur
 659  overhead-versus-scalability problems in languages like Russian.  It
 660  may I<also> be likely that code (possibly in Perl, as with
 661  Lingua::EN::Inflect, for English nouns) has already
 662  been written for the language in question, whether simple or complex.
 664  Moreover, a third possibility may even be simpler than anything
 665  discussed above: just require that all possible (or at least
 666  applicable) forms be provided in the call to the given language's quant
 667  method, as in:
 669    "I found [quant,_1,file,files]."
 671  That way, quant just has to choose which form it needs, without having
 672  to look up or generate anything.  While possibly not optimal for
 673  Russian, this should work well for most other languages, where
 674  quantification is not as complicated an operation.
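A short sketch of the forms-in-the-call approach, using the C<quant> method that Locale::Maketext provides. The class names are hypothetical, and C<_AUTO> is switched on only so that the example needs no explicit lexicon entries. The optional third form, used when the number is zero, is a feature of the module's default C<quant>.

```perl
use strict;
use warnings;

package BogoQuery::Locale;          # hypothetical project base class
use base 'Locale::Maketext';

package BogoQuery::Locale::en;      # hypothetical English language class
our @ISA     = ('BogoQuery::Locale');
our %Lexicon = (
    _AUTO => 1,    # treat unknown keys as their own bracket-notation text
);

package main;

my $lh = BogoQuery::Locale->get_handle('en')
    or die "Could not get a language handle";

# quant only has to pick between the forms it was handed.
for my $count (1, 7) {
    print $lh->maketext('I found [quant,_1,file,files].', $count), "\n";
}

# A third form, if present, replaces the whole number phrase for zero.
print $lh->maketext('I found [quant,_1,file,files,no files].', 0), "\n";
```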
 676  =head2 The Devil in the Details
 678  There's plenty more to Maketext than described above -- for example,
 679  there's the details of how language tags ("en-US", "i-pwn", "fi",
 680  etc.) or locale IDs ("en_US") interact with actual module naming
 681  ("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the
 682  details of how to record (and possibly negotiate) what character
 683  encoding Maketext will return text in (UTF8? Latin-1? KOI8?).  There's
 684  the interesting fact that Maketext is for localization, but nowhere
 685  actually has a "C<use locale;>" anywhere in it.  For the curious,
 686  there's the somewhat frightening details of how I actually
 687  implement something like data inheritance so that searches across
 688  modules' %Lexicon hashes can parallel how Perl implements method
 689  inheritance.
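A small sketch may help show what that lexicon inheritance looks like from the outside. The package names follow the "BogoQuery/Locale/en_us.pm" naming mentioned above, but the lexicon keys and strings are invented for this example, and the classes are defined in one file only to keep the sketch self-contained.

```perl
use strict;
use warnings;

package BogoQuery::Locale;              # hypothetical project base class
use base 'Locale::Maketext';

package BogoQuery::Locale::en;          # general English
our @ISA     = ('BogoQuery::Locale');
our %Lexicon = (
    'greeting'     => 'Welcome to BogoQuery.',
    'colour_label' => 'Colour',
);

package BogoQuery::Locale::en_us;       # US English, derived from "en"
our @ISA     = ('BogoQuery::Locale::en');
our %Lexicon = (
    'colour_label' => 'Color',          # the only entry that differs
);

package main;

# The language tag "en-US" is mapped to the class name suffix "en_us".
my $lh = BogoQuery::Locale->get_handle('en-US')
    or die "Could not get a language handle";

print ref($lh), "\n";
print $lh->maketext('colour_label'), "\n";    # found in en_us's %Lexicon
print $lh->maketext('greeting'), "\n";        # falls through to en's %Lexicon
```

The lookup follows C<@ISA> much as a method call would, so a regional class only needs to list the entries where it differs from its parent.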
 691  And, most importantly, there's all the practical details of how to
 692  actually go about deriving from Maketext so you can use it for your
 693  interfaces, and the various tools and conventions for starting out and
 694  maintaining individual language modules.
 696  That is all covered in the documentation for Locale::Maketext and the
 697  modules that come with it, available in CPAN.  After having read this
 698  article, which covers the why's of Maketext, the documentation,
 699  which covers the how's of it, should be quite straightforward.
 701  =head2 The Proof in the Pudding: Localizing Web Sites
 703  Maketext and gettext have a notable difference: gettext is in C,
 704  accessible through C library calls, whereas Maketext is in Perl, and
 705  really can't work without a Perl interpreter (although I suppose
 706  something like it could be written for C).  Accidents of history (and
 707  not necessarily lucky ones) have made C++ the most common language for
 708  the implementation of applications like word processors, Web browsers,
 709  and even many in-house applications like custom query systems.  Current
 710  conditions make it somewhat unlikely that the next one of any of these
 711  kinds of applications will be written in Perl, albeit clearly more for
 712  reasons of custom and inertia than out of consideration of what is the
 713  right tool for the job.
 715  However, other accidents of history have made Perl a well-accepted
 716  language for design of server-side programs (generally in CGI form)
 717  for Web site interfaces.  Localization of static pages in Web sites is
 718  trivial, feasible either with simple language-negotiation features in
 719  servers like Apache, or with some kind of server-side inclusions of
 720  language-appropriate text into layout templates.  However, I think
 721  that the localization of Perl-based search systems (or other kinds of
 722  dynamic content) in Web sites, be they public or access-restricted,
 723  is where Maketext will see the greatest use.
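For a CGI program, language selection might look something like the sketch below. When C<get_handle> is called without arguments, Locale::Maketext consults the environment, which under CGI includes the browser's Accept-Language header. The class names, the lexicon entries, and the C<handle_for_request> helper are all hypothetical, and the environment variables are set by hand here only to imitate what a Web server would provide.

```perl
use strict;
use warnings;

package BogoQuery::Locale;          # hypothetical project base class
use base 'Locale::Maketext';

package BogoQuery::Locale::en;      # hypothetical English language class
our @ISA     = ('BogoQuery::Locale');
our %Lexicon = ('welcome' => 'Welcome');

package BogoQuery::Locale::it;      # hypothetical Italian language class
our @ISA     = ('BogoQuery::Locale');
our %Lexicon = ('welcome' => 'Benvenuto');

package main;

# Imitate the environment a Web server sets up for a CGI program,
# then let get_handle work out the language preferences itself.
sub handle_for_request {
    my ($accept_language) = @_;
    local $ENV{REQUEST_METHOD}       = 'GET';
    local $ENV{HTTP_ACCEPT_LANGUAGE} = $accept_language;
    return BogoQuery::Locale->get_handle;
}

# An Italian-preferring browser gets the Italian class.
print handle_for_request('it, en;q=0.5')->maketext('welcome'), "\n";

# No French class exists in this sketch, so the next preference is used.
print handle_for_request('fr, en;q=0.5')->maketext('welcome'), "\n";
```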
 725  I presume that it would be only the exceptional Web site that gets
 726  localized for English I<and> Chinese I<and> Italian I<and> Arabic
 727  I<and> Russian, to recall the languages from the beginning of this
 728  article -- to say nothing of German, Spanish, French, Japanese,
 729  Finnish, and Hindi, to name a few languages that benefit from large
 730  numbers of programmers or Web viewers or both.
 732  However, the ever-increasing internationalization of the Web (whether
 733  measured in terms of amount of content, of numbers of content writers
 734  or programmers, or of size of content audiences) makes it increasingly
 735  likely that the interface to the average Web-based dynamic content
 736  service will be localized for two or maybe three languages.  It is my
 737  hope that Maketext will make that task as simple as possible, and will
 738  remove previous barriers to localization for languages dissimilar to
 739  English.
 741   __END__
 743  Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
 744  from Northwestern University; he specializes in language technology.
 745  Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
 746  Linguistics at the University of New Mexico; he specializes in
 747  morphology and pedagogy of North American native languages.
 749  =head2 References
 751  Alvestrand, Harald Tveit.  1995.  I<RFC 1766: Tags for the
 752  Identification of Languages.>
 753  C<ftp://ftp.isi.edu/in-notes/rfc1766.txt>
 754  [Now see RFC 3066.]
 756  Callon, Ross, editor.  1996.  I<RFC 1925: The Twelve
 757  Networking Truths.>
 758  C<ftp://ftp.isi.edu/in-notes/rfc1925.txt>
 760  Drepper, Ulrich, Peter Miller,
 761  and FranE<ccedil>ois Pinard.  1995-2001.  GNU
 762  C<gettext>.  Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with
 763  extensive docs in the distribution tarball.  [Since
 764  I wrote this article in 1998, I now see that the
 765  gettext docs are now trying more to come to terms with
 766  plurality.  Whether useful conclusions have come from it
 767  is another question altogether. -- SMB, May 2001]
 769  Forbes, Nevill.  1964.  I<Russian Grammar.>  Third Edition, revised
 770  by J. C. Dumbreck.  Oxford University Press.
 772  =cut
 774  #End
