MIME::Charset (3)
Leading comments
Automatically generated by Pod::Man 2.28 (Pod::Simple 3.29) Standard preamble: ========================================================================
NAME
MIME::Charset - Charset Information for MIMESYNOPSIS
use MIME::Charset: $charset = MIME::Charset->new("euc-jp");
Getting charset information:
$benc = $charset->body_encoding; # e.g. "Q" $cset = $charset->as_string; # e.g. "US-ASCII" $henc = $charset->header_encoding; # e.g. "S" $cset = $charset->output_charset; # e.g. "ISO-2022-JP"
Translating text data:
($text, $charset, $encoding) = $charset->header_encode( "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa". "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef", Charset => 'euc-jp'); # ...returns e.g. (<converted>, "ISO-2022-JP", "B"). ($text, $charset, $encoding) = $charset->body_encode( "Collectioneur path\xe9tiquement ". "\xe9clectique de d\xe9chets", Charset => 'latin1'); # ...returns e.g. (<original>, "ISO-8859-1", "QUOTED-PRINTABLE"). $len = $charset->encoded_header_len( "Perl\xe8\xa8\x80\xe8\xaa\x9e", Charset => 'utf-8', Encoding => "b"); # ...returns e.g. 28.
Manipulating module defaults:
MIME::Charset::alias("csEUCKR", "euc-kr"); MIME::Charset::default("iso-8859-1"); MIME::Charset::fallback("us-ascii");
Non-OO functions (may be deprecated in near future):
use MIME::Charset qw(:info); $benc = body_encoding("iso-8859-2"); # "Q" $cset = canonical_charset("ANSI X3.4-1968"); # "US-ASCII" $henc = header_encoding("utf-8"); # "S" $cset = output_charset("shift_jis"); # "ISO-2022-JP" use MIME::Charset qw(:trans); ($text, $charset, $encoding) = header_encode( "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa". "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef", "euc-jp"); # ...returns (<converted>, "ISO-2022-JP", "B"); ($text, $charset, $encoding) = body_encode( "Collectioneur path\xe9tiquement ". "\xe9clectique de d\xe9chets", "latin1"); # ...returns (<original>, "ISO-8859-1", "QUOTED-PRINTABLE"); $len = encoded_header_len( "Perl\xe8\xa8\x80\xe8\xaa\x9e", "b", "utf-8"); # 28
DESCRIPTION
MIME::Charset provides information about character sets used forDefinitions
The charset is ``character set'' used inThe encoding is that used in
Constructor
- $charset = MIME::Charset->new([CHARSET[,OPTS]])
-
Create charset object.
OPTSmay accept following key-value pair.NOTE: When Unicode/multibyte support is disabled (see ``USE_ENCODE''), conversion will not be performed. So this option do not have any effects.
-
- Mapping => MAPTYPE
- Whether to extend mappings actually used for charset names or not. "EXTENDED" uses extended mappings. "STANDARD" uses standardized strict mappings. Default is "EXTENDED".
- Mapping =>
-
Getting Information of Charsets
- $charset->body_encoding
- body_encoding CHARSET
-
Get recommended transfer-encoding of CHARSETfor message body.
Returned value will be one of "B" (
BASE64), "Q" (QUOTED-PRINTABLE), "S" (shorter one of either) or "undef" (might not be transfer-encoded; either 7BIT or 8BIT). This may not be same as encoding for message header. - $charset->as_string
- canonical_charset CHARSET
- Get canonical name for charset.
- $charset->decoder
- Get ``Encode::Encoding'' object to decode strings to Unicode by charset. If charset is not specified or not known by this module, undef will be returned.
- $charset->dup
- Get a copy of charset object.
- $charset->encoder([CHARSET])
-
Get ``Encode::Encoding'' object to encode Unicode string using compatible
charset recommended to be used for messages on Internet.
If optional
CHARSETis specified, replace encoder (and output charset name) of $charset object with those ofCHARSET,therefore, $charset object will be a converter between original charset and newCHARSET. - $charset->header_encoding
- header_encoding CHARSET
-
Get recommended encoding scheme of CHARSETfor message header.
Returned value will be one of "B", "Q", "S" (shorter one of either) or "undef" (might not be encoded). This may not be same as encoding for message body.
- $charset->output_charset
- output_charset CHARSET
-
Get a charset which is compatible with given CHARSETand is recommended to be used forMIMEmessages on Internet (if it is known by this module).
When Unicode/multibyte support is disabled (see ``
USE_ENCODE''), this function will simply return the result of ``canonical_charset''.
Translating Text Data
- $charset->body_encode(STRING[,OPTS])
- body_encode STRING, CHARSET[,OPTS]
-
Get converted (if needed) data of STRINGand recommended transfer-encoding of that data for message body.CHARSETis the charset by whichSTRINGis encoded.OPTSmay accept following key-value pairs.NOTE: When Unicode/multibyte support is disabled (see ``USE_ENCODE''), conversion will not be performed. So these options do not have any effects.
-
- Detect7bit => YESNO
-
Try auto-detecting 7-bit charset when CHARSETis not given. Default is "YES".
- Replacement => REPLACEMENT
- Specifies error handling scheme. See ``Error Handling''.
- Detect7bit =>
-
3-item list of (converted string, charset for output, transfer-encoding) will be returned. Transfer-encoding will be either "BASE64", "QUOTED-PRINTABLE", "7BIT" or "8BIT". If charset for output could not be determined and converted string contains non-ASCII byte(s), charset for output will be "undef" and transfer-encoding will be "BASE64". Charset for output will be "US-ASCII" if and only if string does not contain any non-ASCII bytes.
-
- $charset->decode(STRING[,CHECK])
-
Decode STRINGto Unicode.
Note: When Unicode/multibyte support is disabled (see ``
USE_ENCODE''), this function will die. - detect_7bit_charset STRING
-
Guess 7-bit charset that may encode a string STRING.IfSTRINGcontains any 8-bit bytes, "undef" will be returned. Otherwise, Default Charset will be returned for unknown charset.
- $charset->encode(STRING[,CHECK])
-
Encode STRING(Unicode or non-Unicode) using compatible charset recommended to be used for messages on Internet (if this module knows it). Note that string will be decoded to Unicode then encoded even if compatible charset was equal to original charset.
Note: When Unicode/multibyte support is disabled (see ``
USE_ENCODE''), this function will die. - $charset->encoded_header_len(STRING[,ENCODING])
- encoded_header_len STRING, ENCODING, CHARSET
-
Get length of encoded STRINGfor message header (without folding).ENCODINGmay be one of "B", "Q" or "S" (shorter one of either "B" or "Q").
- $charset->header_encode(STRING[,OPTS])
- header_encode STRING, CHARSET[,OPTS]
-
Get converted (if needed) data of STRINGand recommended encoding scheme of that data for message headers.CHARSETis the charset by whichSTRINGis encoded.OPTSmay accept following key-value pairs.NOTE: When Unicode/multibyte support is disabled (see ``USE_ENCODE''), conversion will not be performed. So these options do not have any effects.
-
- Detect7bit => YESNO
-
Try auto-detecting 7-bit charset when CHARSETis not given. Default is "YES".
- Replacement => REPLACEMENT
- Specifies error handling scheme. See ``Error Handling''.
- Detect7bit =>
-
3-item list of (converted string, charset for output, encoding scheme) will be returned. Encoding scheme will be either "B", "Q" or "undef" (might not be encoded). If charset for output could not be determined and converted string contains non-ASCII byte(s), charset for output will be "8BIT" (this is not charset name but a special value to represent unencodable data) and encoding scheme will be "undef" (should not be encoded). Charset for output will be "US-ASCII" if and only if string does not contain any non-ASCII bytes.
-
- $charset->undecode(STRING[,CHECK])
-
Encode Unicode string STRINGto byte string by input charset of $charset. This is equivalent to "$charset->decoder->encode()".
Note: When Unicode/multibyte support is disabled (see ``
USE_ENCODE''), this function will die.
Manipulating Module Defaults
- alias ALIAS[,CHARSET]
-
Get/set charset alias for canonical names determined by
``canonical_charset''.
If
CHARSETis given and isn't false,ALIASwill be assigned as an alias ofCHARSET.Otherwise, alias won't be changed. In both cases, current charset name thatALIASis assigned will be returned. - default [CHARSET]
-
Get/set default charset.
Default charset is used by this module when charset context is unknown. Modules using this module are recommended to use this charset when charset context is unknown or implicit default is expected. By default, it is "US-ASCII".
If
CHARSETis given and isn't false, it will be set to default charset. Otherwise, default charset won't be changed. In both cases, current default charset will be returned.NOTE: Default charset should not be changed. - fallback [CHARSET]
-
Get/set fallback charset.
Fallback charset is used by this module when conversion by given charset is failed and "FALLBACK" error handling scheme is specified. Modules using this module may use this charset as last resort of charset for conversion. By default, it is "UTF-8".
If
CHARSETis given and isn't false, it will be set to fallback charset. IfCHARSETis "NONE", fallback charset will be undefined. Otherwise, fallback charset won't be changed. In any cases, current fallback charset will be returned.NOTE: It is useful that "US-ASCII" is specified as fallback charset, since result of conversion will be readable without charset information. - recommended CHARSET[,HEADERENC, BODYENC[,ENCCHARSET]]
-
Get/set charset profiles.
If optional arguments are given and any of them are not false, profiles for
CHARSETwill be set by those arguments. Otherwise, profiles won't be changed. In both cases, current profiles forCHARSETwill be returned as 3-item list of (HEADERENC, BODYENC, ENCCHARSET).HEADERENCis recommended encoding scheme for message header. It may be one of "B", "Q", "S" (shorter one of either) or "undef" (might not be encoded).BODYENCis recommended transfer-encoding for message body. It may be one of "B", "Q", "S" (shorter one of either) or "undef" (might not be transfer-encoded).ENCCHARSETis a charset which is compatible with givenCHARSETand is recommended to be used forMIMEmessages on Internet. If conversion is not needed (or this module doesn't know appropriate charset),ENCCHARSETis "undef".NOTE: This function in the future releases can accept more optional arguments (for example, properties to handle character widths, line folding behavior, ...). So format of returned value may probably be changed. Use ``header_encoding'', ``body_encoding'' or ``output_charset'' to get particular profile.
Constants
- USE_ENCODE
- Unicode/multibyte support flag. Non-empty string will be set when Unicode and multibyte support is enabled. Currently, this flag will be non-empty on Perl 5.7.3 or later and empty string on earlier versions of Perl.
Error Handling
``body_encode'' and ``header_encode'' accept following "Replacement" options:- DEFAULT
- Put a substitution character in place of a malformed character. For UCM-based encodings, <subchar> will be used.
- FALLBACK
- Try "DEFAULT" scheme using fallback charset (see ``fallback''). When fallback charset is undefined and conversion causes error, code will die on error with an error message.
- CROAK
- Code will die on error immediately with an error message. Therefore, you should trap the fatal error with eval{} unless you really want to let it die on error. Synonym is "STRICT".
- PERLQQ
- HTMLCREF
- XMLCREF
- Use "FB_PERLQQ", "FB_HTMLCREF" or "FB_XMLCREF" scheme defined by Encode module.
- numeric values
- Numeric values are also allowed. For more details see ``Handling Malformed Data'' in Encode.
If error handling scheme is not specified or unknown scheme is specified, "DEFAULT" will be assumed.
Configuration File
Built-in defaults for option parameters can be overridden by configuration file: MIME/Charset/Defaults.pm. For more details read MIME/Charset/Defaults.pm.sample.VERSION
Consult $VERSION variable.Development versions of this module may be found at <hatuka.nezumi.nu/repos/MIME-Charset>.
Incompatible Changes
- Release 1.001
-
-
- *
-
new() method returns an object when CHARSETargument is not specified.
-
- Release 1.005
-
-
- *
-
Restrict characters in encoded-word according to RFC 2047section 5 (3). This also affects return value of encoded_header_len() method.
-
- Release 1.008.2
-
-
- *
- body_encoding() method may also returns "S".
- *
-
Return value of body_encode() method for UTF-8may include "QUOTED-PRINTABLE" encoding item that in earlier versions was fixed to "BASE64".
-