/se3-unattended/var/se3/unattended/install/linuxaux/opt/perl/lib/5.10.0/Encode/ -> Supported.pod (source)

   1  =head1 NAME
   2  
   3  Encode::Supported -- Encodings supported by Encode
   4  
   5  =head1 DESCRIPTION
   6  
   7  =head2 Encoding Names
   8  
   9  Encoding names are case insensitive. White space in names
  10  is ignored.  In addition, an encoding may have aliases.
  11  Each encoding has one "canonical" name.  The "canonical"
  12  name is chosen from the names of the encoding by picking
  13  the first in the following sequence (with a few exceptions).
  14  
  15  =over 2
  16  
  17  =item *
  18  
  19  The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  20  Unlike aliases, canonical names directly reach the method so such
  21  frequently used words like 'utf8' don't need to do alias lookups.
  22  
  23  =item *
  24  
  25  The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
  26  
  27  =item * 
  28  
  29  The name in the IANA registry.
  30  
  31  =item *
  32  
  33  The name used by the organization that defined it.
  34  
  35  =back
  36  
  37  Where the I<de jure> canonical name differs from the one Encode uses,
  38  it is always provided as an alias whenever the encoding is implemented.
  39  So you can safely tell whether a given encoding is implemented just by
  40  passing its canonical name.
  41  
  42  Because of all the alias issues, and because in the general case 
  43  encodings have state, "Encode" uses an encoding object internally 
  44  once an operation is in progress.
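
As a minimal sketch of the lookup described above, the following uses C<Encode::resolve_alias> and C<Encode::find_encoding> to map a name to its canonical form.  The name C<no-such-encoding> is a made-up placeholder chosen for illustration, on the assumption that no alias matches it.

```perl
use strict;
use warnings;
use Encode ();

# resolve_alias() returns the canonical name, or a false value when the
# encoding is unknown, so it doubles as an "is this implemented?" check.
my $canon   = Encode::resolve_alias('latin1');    # alias lookup
my $mixed   = Encode::resolve_alias('Latin1');    # names are case insensitive
my $unknown = Encode::resolve_alias('no-such-encoding');    # hypothetical name

# find_encoding() returns the encoding object that Encode works with
# internally; its name() method reports the canonical name.
my $obj = Encode::find_encoding('ISO-8859-1');

printf "latin1     -> %s\n", $canon;
printf "Latin1     -> %s\n", $mixed;
printf "ISO-8859-1 -> %s\n", $obj->name;
printf "unknown    -> %s\n", $unknown ? $unknown : '(not implemented)';
```

Holding on to the object returned by C<find_encoding> avoids repeating the alias lookup when the same encoding is used many times.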
  45  
  46  =head1 Supported Encodings
  47  
  48  As of Perl 5.8.0, at least the following encodings are recognized.
  49  Note that unless otherwise specified, they are all case insensitive
  50  (via alias) and all occurrences of spaces are replaced with '-'.
  51  In other words, "ISO 8859 1" and "iso-8859-1" are identical.
  52  
  53  Encodings are categorized and implemented in several different modules,
  54  but in most cases you don't have to C<use Encode::XX> to make them
  55  available.  Encode.pm will automatically load those modules on demand.
  56  
  57  =head2 Built-in Encodings
  58  
  59  The following encodings are always available.
  60  
  61    Canonical     Aliases                      Comments & References
  62    ----------------------------------------------------------------
  63    ascii         US-ascii ISO-646-US                         [ECMA]
  64    ascii-ctrl                                      Special Encoding
  65    iso-8859-1    latin1                                       [ISO]
  66    null                                            Special Encoding
  67    utf8          UTF-8                                    [RFC2279]
  68    ----------------------------------------------------------------
  69  
  70  I<null> and I<ascii-ctrl> are special.  "null" fails for every
  71  character, so when you set the fallback mode to PERLQQ, HTMLCREF or
  72  XMLCREF, ALL CHARACTERS will fall back to character references.  Ditto
  73  for "ascii-ctrl", except for control characters.  For fallback modes,
  74  see L<Encode>.
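
The following is a small sketch of that behaviour, assuming the C<FB_PERLQQ>, C<FB_HTMLCREF> and C<FB_XMLCREF> constants documented in L<Encode>.  The sample strings are arbitrary.

```perl
use strict;
use warnings;
use Encode ();

my $text = "A";

# "null" cannot represent "A", so each fallback mode substitutes its
# own style of character reference.
my $perlqq = Encode::encode('null', $text, Encode::FB_PERLQQ());
my $html   = Encode::encode('null', $text, Encode::FB_HTMLCREF());
my $xml    = Encode::encode('null', $text, Encode::FB_XMLCREF());

# "ascii-ctrl" behaves the same way but lets control characters through.
my $line = "A\n";
my $ctrl = Encode::encode('ascii-ctrl', $line, Encode::FB_XMLCREF());

print "PERLQQ:     $perlqq\n";
print "HTMLCREF:   $html\n";
print "XMLCREF:    $xml\n";
print "ascii-ctrl: $ctrl";
```

This makes "null" a convenient way to turn an entire string into character references in one call.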
  75  
  76  =head2 Encode::Unicode -- other Unicode encodings
  77  
  78  Unicode coding schemes other than native utf8 are supported by
  79  Encode::Unicode, which will be autoloaded on demand.
  80  
  81    ----------------------------------------------------------------
  82    UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  83    UCS-2LE                                                     [UC]
  84    UTF-16                                                      [UC]
  85    UTF-16BE                                                    [UC]
  86    UTF-16LE                                                    [UC]
  87    UTF-32                                                      [UC]
  88    UTF-32BE      UCS-4                                         [UC]
  89    UTF-32LE                                                    [UC]
  90    UTF-7                                                  [RFC2152]
  91    ----------------------------------------------------------------
  92  
  93  To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
  94  see L<Encode::Unicode>.
  95  
  96  UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
  97  encoding.  It is implemented separately by Encode::Unicode::UTF7.
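
To give a rough feel for how the variants differ, this sketch encodes a single character under several of the names above and prints the resulting bytes.  The helper C<hex_bytes> is defined here purely for illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

my $char = "\x{3042}";    # HIRAGANA LETTER A

my $be  = hex_bytes(encode('UTF-16BE', $char));    # big endian, no BOM
my $le  = hex_bytes(encode('UTF-16LE', $char));    # little endian, no BOM
my $bom = hex_bytes(encode('UTF-16',   $char));    # BOM, then big endian
my $u32 = hex_bytes(encode('UTF-32BE', $char));    # four bytes per character

print "UTF-16BE: $be\n";
print "UTF-16LE: $le\n";
print "UTF-16:   $bom\n";
print "UTF-32BE: $u32\n";
```

When the byte order is left out, the encoder has to record it in the output, which is why the plain C<UTF-16> result is two bytes longer.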
  98  
  99  =head2 Encode::Byte -- Extended ASCII
 100  
 101  Encode::Byte implements most single-byte encodings except for
 102  Symbols and EBCDIC. The following encodings are based on single-byte
 103  encodings implemented as extended ASCII.  Most of them map
 104  \x80-\xff (upper half) to non-ASCII characters.
 105  
 106  =over 2
 107  
 108  =item ISO-8859 and corresponding vendor mappings
 109  
 110  Since there are so many, they are presented in table format with
 111  languages and corresponding encoding names by vendors.  Note that
 112  the table is sorted in order of ISO-8859 and the corresponding vendor
 113  mappings are slightly different from those of ISO.  See
 114  L<http://czyborra.com/charsets/iso8859.html> for details.
 115  
 116    Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
 117    ----------------------------------------------------------------
 118    N. America    (ASCII)         cp437        AdobeStandardEncoding
 119                                  cp863 (DOSCanadaF)
 120    W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
 121                                                           hp-roman8
 122                                  cp860 (DOSPortuguese)
 123    Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
 124                                                  MacCroatian
 125                                                  MacRomanian
 126                                                  MacRumanian
 127    Latin3[1]     iso-8859-3      
 128    Latin4[2]     iso-8859-4              
 129    Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 130      (See also next section)     cp866           MacUkrainian
 131    Arabic        iso-8859-6      cp864   cp1256  MacArabic
 132                                  cp1006          MacFarsi
 133    Greek         iso-8859-7      cp737   cp1253  MacGreek
 134                                  cp869 (DOSGreek2)
 135    Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 136    Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 137    Nordics       iso-8859-10     cp865
 138                                  cp861           MacIcelandic
 139                                                  MacSami
 140    Thai          iso-8859-11[3]  cp874           MacThai
 141    (iso-8859-12 is nonexistent. Reserved for Indics?)
 142    Baltics       iso-8859-13     cp775   cp1257
 143    Celtics       iso-8859-14
 144    Latin9 [4]    iso-8859-15
 145    Latin10       iso-8859-16
 146    Vietnamese    viscii                  cp1258  MacVietnamese
 147    ----------------------------------------------------------------
 148  
 149    [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
 150    [2] Baltics.  Now on 8859-10, except for Latvian.
 151    [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
 152    [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
 153        letters that are missing from 8859-1 were added.
 154  
 155  All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 156  L<http://czyborra.com/charsets/codepages.html>.
 157  
 158  Macintosh encodings don't seem to be registered in such entities as
 159  IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
 160  1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html> 
 161  for details.
 162  
 163  =item KOI8 - De Facto Standard for the Cyrillic world
 164  
 165  Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 166  popular on the Net.  L<Encode> comes with the following KOI charsets.
 167  For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
 168  
 169    ----------------------------------------------------------------
 170    koi8-f                                        
 171    koi8-r cp878                                           [RFC1489]
 172    koi8-u                                                 [RFC2319]
 173    ----------------------------------------------------------------
 174  
 175  =back
 176  
 177  =head2 gsm0338 - Hentai Latin 1
 178  
 179  GSM0338 is for GSM handsets. Though it shares alphanumerals with
 180  ASCII, control character ranges and other parts are mapped very
 181  differently, mainly to store Greek characters.  There are also escape
 182  sequences (starting with 0x1B) to cover e.g. the Euro sign.  
 183  
 184  This was once handled by L<Encode::Byte> but because of all those
 185  unusual specifications, Encode 2.20 has relocated the support to
 186  L<Encode::GSM0338>. See L<Encode::GSM0338> for details.
 187  
 188  =over 2
 189  
 190  =item gsm0338 support before 2.19
 191  
 192  Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
 193  well-defined and decode() will return an empty string for them.
 194  One possible workaround is
 195  
 196     $gsm =~ s/\x00\z/\x00\x00/;
 197     $uni = decode("gsm0338", $gsm);
 198     $uni .= "\xA0" if $gsm =~ /\x1B\z/;
 199  
 200  Note that the Encode implementation of GSM0338 does not implement the
 201  reuse of Latin capital letters as Greek capital letters (for example,
 202  0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
 203  LETTER ZETA)).
 204  
 205  The GSM0338 is also covered in Encode::Byte even though it is not
 206  an "extended ASCII" encoding.
 207  
 208  =back
 209  
 210  =head2 CJK: Chinese, Japanese, Korean (Multibyte)
 211  
 212  Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
 213  below.  Also note that these are implemented in distinct modules by
 214  country, due to size concerns (simplified Chinese is mapped
 215  to 'CN', continental China, while traditional Chinese is mapped to
 216  'TW', Taiwan).  Please refer to their respective documentation pages.
 217  
 218  =over 2
 219  
 220  =item Encode::CN -- Continental China
 221  
 222    Standard      DOS/Win Macintosh                Comment/Reference
 223    ----------------------------------------------------------------
 224    euc-cn [1]            MacChineseSimp
 225    (gbk)         cp936 [2]
 226    gb12345-raw                      { GB12345 without CES }
 227    gb2312-raw                       { GB2312  without CES }
 228    hz
 229    iso-ir-165
 230    ----------------------------------------------------------------
 231  
 232    [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
 233    [2] gbk is aliased to this.  See L<Microsoft-related naming mess>
 234  
 235  =item Encode::JP -- Japan
 236  
 237    Standard      DOS/Win Macintosh                Comment/Reference
 238    ----------------------------------------------------------------
 239    euc-jp
 240    shiftjis      cp932   macJapanese
 241    7bit-jis
 242    iso-2022-jp                                            [RFC1468]
 243    iso-2022-jp-1                                          [RFC2237]
 244    jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
 245    jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
 246    jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
 247    ----------------------------------------------------------------
 248  
 249  =item Encode::KR -- Korea
 250  
 251    Standard      DOS/Win Macintosh                Comment/Reference
 252    ----------------------------------------------------------------
 253    euc-kr                MacKorean                        [RFC1557]
 254                  cp949 [1]                    
 255    iso-2022-kr                                            [RFC1557]
 256    johab                                  [KS X 1001:1998, Annex 3]
 257    ksc5601-raw                              { KSC5601 without CES }
 258    ----------------------------------------------------------------
 259  
 260    [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
 261    See below.
 262  
 263  =item Encode::TW -- Taiwan
 264  
 265    Standard      DOS/Win Macintosh                Comment/Reference
 266    ----------------------------------------------------------------
 267    big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
 268    big5-hkscs                              
 269    ----------------------------------------------------------------
 270  
 271  =item Encode::HanExtra -- More Chinese via CPAN
 272  
 273  Due to size concerns, the additional Chinese encodings below are
 274  distributed separately on CPAN, under the name Encode::HanExtra.
 275  
 276    Standard      DOS/Win Macintosh                Comment/Reference
 277    ----------------------------------------------------------------
 278    big5ext                                   CMEX's Big5e Extension
 279    big5plus                                  CMEX's Big5+ Extension
 280    cccii         Chinese Character Code for Information Interchange
 281    euc-tw                                  EUC (Extended Unix Code)
 282    gb18030                          GBK with Traditional Characters
 283    ----------------------------------------------------------------
 284  
 285  =item Encode::JIS2K -- JIS X 0213 encodings via CPAN
 286  
 287  Due to size concerns, additional Japanese encodings below are
 288  distributed separately on CPAN, under the name Encode::JIS2K.
 289  
 290    Standard      DOS/Win Macintosh                Comment/Reference
 291    ----------------------------------------------------------------
 292    euc-jisx0213
 293    shiftjisx0213
 294    iso-2022-jp-3
 295    jis0213-1-raw
 296    jis0213-2-raw
 297    ----------------------------------------------------------------
 298  
 299  =back
 300  
 301  =head2 Miscellaneous encodings
 302  
 303  =over 2
 304  
 305  =item Encode::EBCDIC
 306  
 307  See L<perlebcdic> for details.
 308  
 309    ----------------------------------------------------------------
 310    cp37
 311    cp500  
 312    cp875  
 313    cp1026  
 314    cp1047  
 315    posix-bc
 316    ----------------------------------------------------------------
 317  
 318  =item Encode::Symbol
 319  
 320  For symbols and dingbats.
 321  
 322    ----------------------------------------------------------------
 323    symbol
 324    dingbats
 325    MacDingbats
 326    AdobeZdingbat
 327    AdobeSymbol
 328    ----------------------------------------------------------------
 329  
 330  =item Encode::MIME::Header
 331  
 332  Strictly speaking, the MIME header encoding documented in RFC 2047 is
 333  more of an encapsulation than an encoding.  However, supporting it is
 334  imperative in the modern world, so the following are provided.
 335  
 336    ----------------------------------------------------------------
 337    MIME-Header                                            [RFC2047]
 338    MIME-B                                                 [RFC2047]
 339    MIME-Q                                                 [RFC2047]
 340    ----------------------------------------------------------------
 341  
 342  =item Encode::Guess
 343  
 344  This is not the name of an encoding but a utility that lets you pick
 345  the most appropriate encoding for your data out of given I<suspects>.  See
 346  L<Encode::Guess> for details.
 347  
 348  =back
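
As a hedged illustration of how I<suspects> affect Encode::Guess, the sketch below guesses the encoding of a short byte string twice.  It assumes the default suspects described in L<Encode::Guess> (ASCII, UTF-8 and BOM-marked UTF-16/32); the sample bytes are arbitrary.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

my $octets = "caf\xC3\xA9";    # "cafe" with e-acute, as UTF-8 bytes

# With the default suspects only UTF-8 decodes these bytes cleanly,
# so an encoding object comes back.
my $guess = guess_encoding($octets);
my $name  = ref $guess ? $guess->name : "failed: $guess";

# Adding latin1 as a suspect makes the answer ambiguous, because latin1
# accepts any byte sequence.  A plain message string is returned instead
# of an object.
my $ambiguous = guess_encoding($octets, 'latin1');
my $second    = ref $ambiguous ? $ambiguous->name : "failed: $ambiguous";

print "first guess:  $name\n";
print "second guess: $second\n";
```

Checking C<ref> on the return value is the usual way to tell a successful guess from an error message.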
 349  
 350  =head1 Unsupported encodings
 351  
 352  The following encodings are not supported as yet; some because they
 353  are rarely used, some because of technical difficulties.  They may
 354  be supported by external modules via CPAN in the future, however.
 355  
 356  =over 2
 357  
 358  =item ISO-2022-JP-2 [RFC1554]
 359  
 360  Not very popular yet.  Needs Unicode Database or equivalent to
 361  implement encode() (because it includes JIS X 0208/0212, KSC5601, and
 362  GB2312 simultaneously, whose code points in Unicode overlap.  So you
 363  need to look up the database to determine to what character set a given
 364  Unicode character should belong). 
 365  
 366  =item ISO-2022-CN [RFC1922]
 367  
 368  Not very popular.  Needs CNS 11643-1 and -2 which are not available in
 369  this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
 370  Autrijus Tang may add support for this encoding in his module in future.
 371  
 372  =item Various HP-UX encodings
 373  
 374  The following are unsupported due to the lack of mapping data.
 375  
 376    '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 377    '15' - japanese15, korean15, and roi15
 378  
 379  =item Cyrillic encoding ISO-IR-111
 380  
 381  Anton Tagunov doubts its usefulness.
 382  
 383  =item ISO-8859-8-1 [Hebrew]
 384  
 385  None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
 386  MacHebrew are supported only because there were mappings
 387  available at L<http://www.unicode.org/>).  Contributions welcome.
 388  
 389  =item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
 390  
 391  Ditto.
 392  
 393  =item Thai encoding TCVN
 394  
 395  Ditto.
 396  
 397  =item Vietnamese encodings VPS
 398  
 399  Though Jungshik Shin has reported that Mozilla supports this encoding,
 400  it was too late before 5.8.0 for us to add it.  In the future, it
 401  may be available via a separate module.  See
 402  L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
 403  and
 404  L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
 405  if you are interested in helping us.
 406  
 407  =item Various Mac encodings
 408  
 409  The following are unsupported due to the lack of mapping data. 
 410  
 411    MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 412    MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 413    MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 414    MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 415    MacVietnamese
 416  
 417  The rest which are already available are based upon the vendor mappings
 418  at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
 419  
 420  =item (Mac) Indic encodings
 421  
 422  The maps for the following are available at L<http://www.unicode.org/>
 423  but remain unsupported because those encodings need an algorithmic
 424  approach, currently unsupported by F<enc2xs>:
 425  
 426    MacDevanagari
 427    MacGurmukhi
 428    MacGujarati
 429  
 430  For details, please see C<Unicode mapping issues and notes:> at
 431  L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
 432  
 433  I believe this issue is prevalent not only for Mac Indics but also in
 434  other Indic encodings, but the above were the only Indic encoding
 435  maps that I could find at L<http://www.unicode.org/> .
 436  
 437  =back
 438  
 439  =head1 Encoding vs. Charset -- terminology
 440  
 441  We are used to using the terms (character) I<encoding> and I<character
 442  set> interchangeably.  But just as confusing the terms byte and
 443  character is dangerous and the terms should be differentiated when
 444  needed, we need to differentiate I<encoding> and I<character set>.
 445  
 446  To understand that, here is a description of how we make computers
 447  grok our characters.
 448  
 449  =over 2
 450  
 451  =item *
 452  
 453  First we start with which characters to include.  We call this
 454  collection of characters I<character repertoire>.
 455  
 456  =item *
 457  
 458  Then we have to give each character a unique ID so your computer can
 459  tell the difference between 'a' and 'A'.  This itemized character
 460  repertoire is now a I<character set>.
 461  
 462  =item *
 463  
 464  If your computer can grok the character set without further
 465  processing, you can go ahead and use it.  This is called a I<coded
 466  character set> (CCS) or I<raw character encoding>.  ASCII is used this
 467  way in most cases.
 468  
 469  =item *
 470  
 471  But in many cases, especially multi-byte CJK encodings, you have to
 472  tweak a little more.  Your network connection may not accept any data
 473  with the Most Significant Bit set, and your computer may not be able to
 474  tell if a given byte is a whole character or just half of it.  So you
 475  have to I<encode> the character set to use it.
 476  
 477  A I<character encoding scheme> (CES) determines how to encode a given
 478  character set, or a set of multiple character sets.  7bit ISO-2022 is
 479  an example of a CES.  You switch between character sets via I<escape
 480  sequences>.
 481  
 482  =back
 483  
 484  Technically, or mathematically, speaking, a character set encoded in
 485  such a CES that maps character by character may form a CCS.  EUC is such
 486  an example.  The CES of EUC is as follows:
 487  
 488  =over 2
 489  
 490  =item *
 491  
 492  Map ASCII unchanged.
 493  
 494  =item *
 495  
 496  Map a character set that consists of 94 or 96 to the power of N
 497  members by adding 0x80 to each byte.
 498  
 499  =item *
 500  
 501  You can also use 0x8e and 0x8f to indicate that the following sequence of
 502  characters belongs to yet another character set.  0x80 is added to each
 503  following byte.
 504  
 505  =back
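
The second rule can be checked by hand.  This sketch takes the raw JIS X 0208 code for one character (via the C<jis0208-raw> encoding listed under Encode::JP), adds 0x80 to each byte, and compares the result with what C<euc-jp> produces.  The choice of character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # HIRAGANA LETTER A, a member of JIS X 0208

# jis0208-raw is the bare coded character set, with no CES applied.
my $raw = encode('jis0208-raw', $char);

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = pack 'C*', map { $_ + 0x80 } unpack 'C*', $raw;

# Let Encode apply the same CES.
my $euc = encode('euc-jp', $char);

printf "raw:     %s\n", unpack 'H*', $raw;
printf "by hand: %s\n", unpack 'H*', $by_hand;
printf "euc-jp:  %s\n", unpack 'H*', $euc;
```

Because every byte of a multibyte character ends up with its high bit set, such characters can never be mistaken for ASCII.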
 506  
 507  By carefully looking at the encoded byte sequence, you can find that
 508  each byte sequence forms a unique number.  In that sense, EUC is a CCS
 509  generated by the CES above from up to four CCS (complicated?).  UTF-8
 510  falls into this category.  See L<perlunicode/"UTF-8"> to find out how
 511  UTF-8 maps Unicode to a byte sequence.
 512  
 513  You may also have found out by now why 7bit ISO-2022 cannot comprise
 514  a CCS.  If you look at a byte sequence \x21\x21, you can't tell if
 515  it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
 516  so you have no trouble differentiating between "!!" and S<"  ">.
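
The ambiguity is easy to reproduce.  In this sketch the same two payload bytes are decoded as C<iso-2022-jp> with and without the escape sequences that select JIS X 0208, and the two characters are then encoded as C<euc-jp> for comparison.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same two payload bytes, \x21\x21, with and without escape sequences.
my $plain   = decode('iso-2022-jp', "\x21\x21");             # ASCII state
my $shifted = decode('iso-2022-jp', "\e\$B\x21\x21\e(B");    # JIS X 0208 state

# EUC-JP keeps the two cases apart in the bytes themselves.
my $euc_bang  = encode('euc-jp', '!!');
my $euc_space = encode('euc-jp', "\x{3000}");    # IDEOGRAPHIC SPACE

printf "ISO-2022-JP plain:   U+%04X U+%04X\n", map { ord($_) } split(//, $plain);
printf "ISO-2022-JP shifted: U+%04X\n", ord($shifted);
printf "EUC-JP '!!':         %s\n", unpack('H*', $euc_bang);
printf "EUC-JP U+3000:       %s\n", unpack('H*', $euc_space);
```

In the 7-bit form the meaning of a byte depends on the escape sequences that came before it, so a byte sequence on its own does not identify a character.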
 517  
 518  =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 519  
 520  This section tries to classify the supported encodings by their 
 521  applicability for information exchange over the Internet and to 
 522  choose the most suitable aliases to name them in the context of 
 523  such communication.
 524  
 525  =over 2
 526  
 527  =item * 
 528  
 529  To (en|de)code encodings marked by C<(**)>, you need 
 530  C<Encode::HanExtra>, available from CPAN.
 531  
 532  =back
 533  
 534  Encoding names
 535  
 536    US-ASCII    UTF-8    ISO-8859-*  KOI8-R
 537    Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
 538    EUC-KR      Big5     GB2312
 539  
 540  are registered with IANA as preferred MIME names and may
 541  be used over the Internet.
 542  
 543  C<Shift_JIS> was made official by JIS X 0208:1997.
 544  L<Microsoft-related naming mess> gives details.
 545  
 546  C<GB2312> is the IANA name for C<EUC-CN>.
 547  See L<Microsoft-related naming mess> for details.
 548  
 549  C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
 550  with Encode. See L<Encode::CN> for details.
 551  
 552    EUC-CN
 553    KOI8-U        [RFC2319]
 554  
 555  have not been registered with IANA (as of March 2002) but
 556  seem to be supported by major web browsers. 
 557  The IANA name for C<EUC-CN> is C<GB2312>.
 558  
 559    KS_C_5601-1987
 560  
 561  is heavily misused.
 562  See L<Microsoft-related naming mess> for details.
 563  
 564  C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
 565  with Encode. See L<Encode::KR> for details.
 566  
 567    UTF-16 UTF-16BE UTF-16LE
 568  
 569  are IANA-registered C<charset>s. See [RFC 2781] for details.
 570  Jungshik Shin reports that UTF-16 with a BOM is well accepted
 571  by MS IE 5/6 and NS 4/6. Beware however that
 572  
 573  =over 2
 574  
 575  =item *
 576  
 577  C<UTF-16> support in any software you're going to be
 578  using/interoperating with has probably been less tested
 579  than C<UTF-8> support
 580  
 581  =item *
 582  
 583  C<UTF-8> coded data seamlessly passes traditional
 584  command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
 585  data is likely to cause confusion (with its zero bytes,
 586  for example)
 587  
 588  =item *
 589  
 590  it is beyond the power of words to describe the way HTML browsers
 591  encode non-C<ASCII> form data. To get a general impression, visit
 592  L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
 593  While encoding of form data has stabilized for C<UTF-8> encoded pages
 594  (at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
 595  expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
 596  pages!
 597  
 598  =back
 599  
 600  The rule of thumb is to use C<UTF-8> unless you know what
 601  you're doing and unless you really benefit from using C<UTF-16>.
 602  
 603    ISO-IR-165    [RFC1345]
 604    VISCII
 605    GB 12345
 606    GB 18030 (**)  (see links below)
 607    EUC-TW   (**)
 608  
 609  are totally valid encodings but not registered at IANA.
 610  The names under which they are listed here are probably the
 611  most widely-known names for these encodings and are recommended
 612  names.
 613  
 614    BIG5PLUS (**)
 615  
 616  is a proprietary name. 
 617  
 618  =head2 Microsoft-related naming mess
 619  
 620  Microsoft products misuse the following names:
 621  
 622  =over 2
 623  
 624  =item KS_C_5601-1987
 625  
 626  Microsoft extension to C<EUC-KR>.
 627  
 628  Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
 629  
 630  See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
 631  for details.
 632  
 633  Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
 634  misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
 635  C<ksc5601-raw>.
 636  
 637  See L<Encode::KR> for details.
 638  
 639  =item GB2312
 640  
 641  Microsoft extension to C<EUC-CN>.
 642  
 643  Proper names: C<CP936>, C<GBK>.
 644  
 645  C<GB2312> has been registered in the C<EUC-CN> meaning at
 646  IANA. This has partially repaired the situation: Microsoft's 
 647  C<GB2312> has become a superset of the official C<GB2312>.
 648  
 649  Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 650  IANA registration. C<cp936> is supported separately.
 651  I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 652  
 653  See L<Encode::CN> for details.
 654  
 655  =item Big5
 656  
 657  Microsoft extension to C<Big5>.
 658  
 659  Proper name: C<CP950>.
 660  
 661  Encode separately supports C<Big5> and C<cp950>.
 662  
 663  =item Shift_JIS
 664  
 665  Microsoft's understanding of C<Shift_JIS>.
 666  
 667  JIS has not endorsed the full Microsoft standard, however.
 668  The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 669  character sets, while Microsoft has always used C<Shift_JIS>
 670  to encode a wider character repertoire. See C<IANA> registration for
 671  C<Windows-31J>.
 672  
 673  As a historical predecessor, Microsoft's variant
 674  probably has more rights for the name, though it may be objected
 675  that Microsoft shouldn't have used JIS as part of the name
 676  in the first place.
 677  
 678  Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
 679  provided as an alias by Encode): C<Windows-31J>.
 680  
 681  Encode separately supports C<Shift_JIS> and C<cp932>.
 682  
 683  =back
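
The alias choices described in this section can be inspected directly.  This sketch resolves a handful of the contested names with C<Encode::resolve_alias>; the list of names is just a sample drawn from the text above.

```perl
use strict;
use warnings;
use Encode ();

# Names as they appear in the wild, in the order they are printed.
my @names = ('KS_C_5601-1987', 'GB2312', 'GBK', 'Big5', 'Shift_JIS', 'Windows-31J');

# scalar() keeps the key/value pairs aligned even if a name fails to
# resolve, since resolve_alias() then returns nothing in list context.
my %canonical = map { $_ => scalar Encode::resolve_alias($_) } @names;

for my $name (@names) {
    printf "%-15s -> %s\n", $name, $canonical{$name} // '(unknown)';
}
```

Resolving a name before use is a cheap way to find out whether a charset label taken from a mail or HTTP header will be treated as the standard encoding or as a vendor code page.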
 684  
 685  =head1 Glossary
 686  
 687  =over 2
 688  
 689  =item character repertoire
 690  
 691  A collection of unique characters.  A I<character> set in the strictest
 692  sense. At this stage, characters are not numbered.
 693  
 694  =item coded character set (CCS)
 695  
 696  A character set that is mapped in a way computers can use directly.
 697  Many character encodings, including EUC, fall in this category.
 698  
 699  =item character encoding scheme (CES)
 700  
 701  An algorithm to map a character set to a byte sequence.  You don't
 702  have to be able to tell which character set a given byte sequence
 703  belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 704  example of being both a CCS and CES.
 705  
 706  =item charset (in MIME context)
 707  
 708  Has long been used to mean C<encoding>, that is, CES.
 709  
 710  While the word combination C<character set> has lost this meaning
 711  in MIME context since [RFC 2130], the C<charset> abbreviation has
 712  retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
 713  
 714   This document uses the term "charset" to mean a set of rules for
 715   mapping from a sequence of octets to a sequence of characters, such
 716   as the combination of a coded character set and a character encoding
 717   scheme; this is also what is used as an identifier in MIME "charset="
 718   parameters, and registered in the IANA charset registry ...  (Note
 719   that this is NOT a term used by other standards bodies, such as ISO).
 720   [RFC 2277]
 721  
 722  =item EUC
 723  
 724  Extended Unix Code.  See ISO-2022.
 725  
 726  =item ISO-2022
 727  
 728  A CES that was carefully designed to coexist with ASCII.  There are
 729  7-bit and 8-bit versions.
 730  
 731  The 7 bit version switches character set via escape sequence so it
 732  cannot form a CCS.  Since this is more difficult to handle in programs
 733  than the 8 bit version, the 7 bit version is not very popular except for
 734  iso-2022-jp, the I<de facto> standard CES for e-mails.
 735  
 736  The 8 bit version can form a CCS.  EUC and ISO-8859 are two examples
 737  thereof.  Pre-5.6 perl could use them as string literals.
 738  
 739  =item UCS
 740  
 741  Short for I<Universal Character Set>.  When you say just UCS, it means
 742  I<Unicode>.
 743  
 744  =item UCS-2
 745  
 746  ISO/IEC 10646 encoding form: Universal Character Set coded in two
 747  octets.
 748  
 749  =item Unicode
 750  
 751  A character set that aims to include all character repertoires of the
 752  world.  Many character sets in various national as well as industrial
 753  standards have become, in a way, just subsets of Unicode.
 754  
 755  =item UTF
 756  
 757  Short for I<Unicode Transformation Format>.  Determines how to map a
 758  Unicode character into a byte sequence.
 759  
 760  =item UTF-16
 761  
 762  A UTF in 16-bit encoding.  Can either be in big endian or little
 763  endian.  The big endian version is called UTF-16BE (equal to UCS-2 + 
 764  surrogate support) and the little endian version is called UTF-16LE.
 765  
 766  =back
 767  
 768  =head1 See Also
 769  
 770  L<Encode>, 
 771  L<Encode::Byte>, 
 772  L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 773  L<Encode::EBCDIC>, L<Encode::Symbol>,
 774  L<Encode::MIME::Header>, L<Encode::Guess>
 775  
 776  =head1 References
 777  
 778  =over 2
 779  
 780  =item ECMA
 781  
 782  European Computer Manufacturers Association
 783  L<http://www.ecma.ch>
 784  
 785  =over 2
 786  
 787  =item ECMA-035 (eq C<ISO-2022>)
 788  
 789  L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> 
 790  
 791  The specification of ISO-2022 is available from the link above.
 792  
 793  =back
 794  
 795  =item IANA
 796  
 797  Internet Assigned Numbers Authority
 798  L<http://www.iana.org/>
 799  
 800  =over 2
 801  
 802  =item Assigned Charset Names by IANA
 803  
 804  L<http://www.iana.org/assignments/character-sets>
 805  
 806  Most of the C<canonical names> in Encode derive from this list
 807  so you can directly apply the string you have extracted from MIME
 808  header of mails and web pages.
 809  
 810  =back
 811  
 812  =item ISO
 813  
 814  International Organization for Standardization
 815  L<http://www.iso.ch/>
 816  
 817  =item RFC
 818  
 819  Request For Comments -- need I say more?
 820  L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
 821  L<http://www.faqs.org/rfcs/>
 822  
 823  =item UC
 824  
 825  Unicode Consortium
 826  L<http://www.unicode.org/>
 827  
 828  =over 2
 829  
 830  =item Unicode Glossary
 831  
 832  L<http://www.unicode.org/glossary/>
 833  
 834  The glossary of this document is based upon this site.
 835  
 836  =back
 837  
 838  =back
 839  
 840  =head2 Other Notable Sites
 841  
 842  =over 2
 843  
 844  =item czyborra.com
 845  
 846  L<http://czyborra.com/>
 847  
 848  Contains a lot of useful information, especially gory details of ISO
 849  vs. vendor mappings.
 850  
 851  =item CJK.inf
 852  
 853  L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 854  
 855  Somewhat obsolete (last update in 1996), but still useful.  Also try
 856  
 857  L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 858  
 859  You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
 860  
 861  =item Jungshik Shin's Hangul FAQ
 862  
 863  L<http://jshin.net/faq>
 864  
 865  And especially its subject 8.
 866  
 867  L<http://jshin.net/faq/qa8.html>
 868  
 869  A comprehensive overview of the Korean (C<KS *>) standards.
 870  
 871  =item debian.org: "Introduction to i18n"
 872  
 873  A brief description for most of the mentioned CJK encodings is
 874  contained in
 875  L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
 876  
 877  =back
 878  
 879  =head2 Offline sources
 880  
 881  =over 2
 882  
 883  =item C<CJKV Information Processing> by Ken Lunde
 884  
 885  CJKV Information Processing
 886  1999 O'Reilly & Associates, ISBN : 1-56592-224-7
 887  
 888  The modern successor of C<CJK.inf>.
 889  
 890  Features a comprehensive coverage of CJKV character sets and
 891  encodings along with many other issues faced by anyone trying
 892  to better support CJKV languages/scripts in all the areas of
 893  information processing.
 894  
 895  To purchase this book, visit
 896  L<http://www.oreilly.com/catalog/cjkvinfo/>
 897  or your favourite bookstore.
 898  
 899  =back
 900  
 901  =cut

