========================================
codecs -- String encoding and decoding
========================================

.. module:: codecs
    :synopsis: String encoding and decoding.

:Purpose: Encoders and decoders for converting text between different representations.
:Available In: 2.1 and later

The :mod:`codecs` module provides stream and file interfaces for
transcoding data in your program.  It is most commonly used to work
with Unicode text, but other encodings are also available for other
purposes.

Unicode Primer
==============

CPython 2.x supports two types of strings for working with text data.
Old-style :class:`str` instances use a single 8-bit byte to represent
each character of the string using its ASCII code.  In contrast,
:class:`unicode` strings are managed internally as a sequence of
Unicode *code points*.  The code point values are saved as a sequence
of 2 or 4 bytes each, depending on the options given when Python was
compiled.  Both :class:`unicode` and :class:`str` are derived from a
common base class, and support a similar API.

When :class:`unicode` strings are output, they are encoded using one
of several standard schemes so that the sequence of bytes can be
reconstructed as the same string later.  The bytes of the encoded
value are not necessarily the same as the code point values, and the
encoding defines a way to translate between the two sets of values.
Reading Unicode data also requires knowing the encoding so that the
incoming bytes can be converted to the internal representation used
by the :class:`unicode` class.

The most common encodings for Western languages are ``UTF-8`` and
``UTF-16``, which use sequences of one- and two-byte values,
respectively, to represent each character.  Other encodings can be
more efficient for storing languages where most of the characters are
represented by code points that do not fit into two bytes.

.. seealso::

    For more introductory information about Unicode, refer to the
    list of references at the end of this section.  The Python
    `Unicode HOWTO`_ is especially helpful.

Encodings
---------

The best way to understand encodings is to look at the different
series of bytes produced by encoding the same string in different
ways.  The examples below use this function to format the byte string
to make it easier to read.

.. include:: codecs_to_hex.py
    :literal:
    :start-after: #end_pymotw_header

The function uses :mod:`binascii` to get a hexadecimal representation
of the input byte string, then inserts a space between every *nbytes*
bytes before returning the value.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_to_hex.py'))
.. }}}

::

    $ python codecs_to_hex.py

    61 62 63 64 65 66
    6162 6364 6566

.. {{{end}}}

The first encoding example begins by printing the text ``'pi: π'``
using the raw representation of the :class:`unicode` class.  The
``π`` character is replaced with the expression for the Unicode code
point, ``\u03c0``.  The next two lines encode the string as UTF-8 and
UTF-16 respectively, and show the hexadecimal values resulting from
the encoding.

.. include:: codecs_encodings.py
    :literal:
    :start-after: #end_pymotw_header

The result of encoding a :class:`unicode` string is a :class:`str`
object.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encodings.py'))
.. }}}

::

    $ python codecs_encodings.py

    Raw   : u'pi: \u03c0'
    UTF-8 : 70 69 3a 20 cf 80
    UTF-16: fffe 7000 6900 3a00 2000 c003

.. {{{end}}}

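The distinction between code points and bytes also shows up in the
length of the data.  This quick check (a sketch added here for
illustration, not one of the example files) compares the number of
code points in the string with the number of bytes in two encoded
forms:

::

    # A unicode string is measured in code points, but its encoded
    # form is measured in bytes.
    text = u'pi: \u03c0'

    print len(text)                   # 5 code points
    print len(text.encode('utf-8'))   # 6 bytes: pi needs two bytes
    print len(text.encode('utf-16'))  # 12 bytes: BOM + five 2-byte units
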
Given a sequence of encoded bytes as a :class:`str` instance, the
:func:`decode` method translates them to code points and returns the
sequence as a :class:`unicode` instance.

.. include:: codecs_decode.py
    :literal:
    :start-after: #end_pymotw_header

The choice of encoding used does not change the output type.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode.py'))
.. }}}

::

    $ python codecs_decode.py

    Original : u'pi: \u03c0'
    Encoded  : 70 69 3a 20 cf 80
    Decoded  : u'pi: \u03c0'

.. {{{end}}}

.. note::

    The default encoding is set during the interpreter start-up
    process, when :mod:`site` is loaded.  Refer to
    :ref:`sys-unicode-defaults` for a description of the default
    encoding settings accessible via :mod:`sys`.

Working with Files
==================

Encoding and decoding strings is especially important when dealing
with I/O operations.  Whether you are writing to a file, socket, or
other stream, you will want to ensure that the data is using the
proper encoding.  In general, all text data needs to be decoded from
its byte representation as it is read, and encoded from the internal
values to a specific representation as it is written.  Your program
can explicitly encode and decode data, but depending on the encoding
used it can be non-trivial to determine whether you have read enough
bytes in order to fully decode the data.  :mod:`codecs` provides
classes that manage the data encoding and decoding for you, so you
don't have to create your own.

The simplest interface provided by :mod:`codecs` is a replacement for
the built-in :func:`open` function.  The new version works just like
the built-in, but adds two new arguments to specify the encoding and
desired error handling technique.

.. include:: codecs_open_write.py
    :literal:
    :start-after: #end_pymotw_header

Starting with a :class:`unicode` string with the code point for π,
this example saves the text to a file using an encoding specified on
the command line.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_open_write.py utf-8'))
.. cog.out(run_script(cog.inFile, 'codecs_open_write.py utf-16', include_prefix=False))
.. cog.out(run_script(cog.inFile, 'codecs_open_write.py utf-32', include_prefix=False))
.. }}}

::

    $ python codecs_open_write.py utf-8

    Writing to utf-8.txt
    File contents:
    70 69 3a 20 cf 80

    $ python codecs_open_write.py utf-16

    Writing to utf-16.txt
    File contents:
    fffe 7000 6900 3a00 2000 c003

    $ python codecs_open_write.py utf-32

    Writing to utf-32.txt
    File contents:
    fffe0000 70000000 69000000 3a000000 20000000 c0030000

.. {{{end}}}

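For comparison, here is a minimal sketch of the work the replacement
:func:`open` does for you on the output side (the file name here is
hypothetical): encode the text explicitly, then write the resulting
bytes with the built-in function.

::

    # Roughly what codecs.open() automates when writing: encode the
    # unicode text, then write the resulting bytes to a binary file.
    text = u'pi: \u03c0'

    with open('utf-8-manual.txt', 'wb') as f:
        f.write(text.encode('utf-8'))

Note that :func:`codecs.open` always opens the file in binary mode,
so no newline translation is applied to the encoded bytes.
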
Reading the data with :func:`open` is straightforward, with one
catch: you must know the encoding in advance, in order to set up the
decoder correctly.  Some data formats, such as XML, let you specify
the encoding as part of the file, but usually it is up to the
application to manage.  :mod:`codecs` simply takes the encoding as an
argument and assumes it is correct.

.. include:: codecs_open_read.py
    :literal:
    :start-after: #end_pymotw_header

This example reads the files created by the previous program, and
prints the representation of the resulting :class:`unicode` object to
the console.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_open_read.py utf-8'))
.. cog.out(run_script(cog.inFile, 'codecs_open_read.py utf-16', include_prefix=False))
.. cog.out(run_script(cog.inFile, 'codecs_open_read.py utf-32', include_prefix=False))
.. }}}

::

    $ python codecs_open_read.py utf-8

    Reading from utf-8.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-16

    Reading from utf-16.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-32

    Reading from utf-32.txt
    u'pi: \u03c0'

.. {{{end}}}

Byte Order
==========

Multi-byte encodings such as UTF-16 and UTF-32 pose a problem when
transferring the data between different computer systems, either by
copying the file directly or with network communication.  Different
systems use different ordering of the high and low order bytes.  This
characteristic of the data, known as its *endianness*, depends on
factors such as the hardware architecture and choices made by the
operating system and application developer.  There isn't always a way
to know in advance what byte order to use for a given set of data, so
the multi-byte encodings include a *byte-order marker* (BOM) as the
first few bytes of encoded output.  For example, UTF-16 is defined in
such a way that 0xFFFE and 0xFEFF are not valid characters, and can
be used to indicate the byte order.  :mod:`codecs` defines constants
for the byte order markers used by UTF-16 and UTF-32.

.. include:: codecs_bom.py
    :literal:
    :start-after: #end_pymotw_header

``BOM``, ``BOM_UTF16``, and ``BOM_UTF32`` are automatically set to
the appropriate big-endian or little-endian values depending on the
current system's native byte order.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_bom.py'))
.. }}}

::

    $ python codecs_bom.py

    BOM          : fffe
    BOM_BE       : feff
    BOM_LE       : fffe
    BOM_UTF8     : efbb bf
    BOM_UTF16    : fffe
    BOM_UTF16_BE : feff
    BOM_UTF16_LE : fffe
    BOM_UTF32    : fffe 0000
    BOM_UTF32_BE : 0000 feff
    BOM_UTF32_LE : fffe 0000

.. {{{end}}}

Byte ordering is detected and handled automatically by the decoders
in :mod:`codecs`, but you can also choose an explicit ordering for
the encoding.

.. include:: codecs_bom_create_file.py
    :literal:
    :start-after: #end_pymotw_header

``codecs_bom_create_file.py`` figures out the native byte ordering,
then uses the alternate form explicitly so the next example can
demonstrate auto-detection while reading.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_bom_create_file.py'))
.. }}}

::

    $ python codecs_bom_create_file.py

    Native order  : fffe
    Selected order: feff
    utf_16_be     : 0070 0069 003a 0020 03c0

.. {{{end}}}

``codecs_bom_detection.py`` does not specify a byte order when
opening the file, so the decoder uses the BOM value in the first two
bytes of the file to determine it.

.. include:: codecs_bom_detection.py
    :literal:
    :start-after: #end_pymotw_header

Since the first two bytes of the file are used for byte order
detection, they are not included in the data returned by
:func:`read`.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_bom_detection.py'))
.. }}}

::

    $ python codecs_bom_detection.py

    Raw    : feff 0070 0069 003a 0020 03c0
    Decoded: u'pi: \u03c0'

.. {{{end}}}

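When you need to examine a byte stream yourself, the BOM constants
can be compared against the beginning of the data directly.  This
sketch assumes the ``utf-16.txt`` file created by
``codecs_open_write.py`` above:

::

    # Identify the byte order of raw UTF-16 data by checking its
    # first two bytes against the codecs BOM constants.
    import codecs

    raw = open('utf-16.txt', 'rb').read()

    if raw.startswith(codecs.BOM_UTF16_LE):
        print 'little-endian'
    elif raw.startswith(codecs.BOM_UTF16_BE):
        print 'big-endian'
    else:
        print 'no BOM found'
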
Error Handling
==============

The previous sections pointed out the need to know the encoding being
used when reading and writing Unicode files.  Setting the encoding
correctly is important for two reasons.  If the encoding is
configured incorrectly while reading from a file, the data will be
interpreted incorrectly and may be corrupted or simply fail to
decode.  Not all Unicode characters can be represented in all
encodings, so if the wrong encoding is used while writing an error
will be generated and data may be lost.

:mod:`codecs` uses the same five error handling options that are
provided by the :func:`encode` method of :class:`unicode` and the
:func:`decode` method of :class:`str`.

=====================  ========================================================================
Error Mode             Description
=====================  ========================================================================
``strict``             Raises an exception if the data cannot be converted.
``replace``            Substitutes a special marker character for data that cannot be encoded.
``ignore``             Skips the data.
``xmlcharrefreplace``  XML character reference (encoding only)
``backslashreplace``   escape sequence (encoding only)
=====================  ========================================================================

Encoding Errors
---------------

The most common error condition is receiving a
:class:`UnicodeEncodeError` when writing Unicode data to an ASCII
output stream, such as a regular file or ``sys.stdout``.  This sample
program can be used to experiment with the different error handling
modes.

.. include:: codecs_encode_error.py
    :literal:
    :start-after: #end_pymotw_header

While ``strict`` mode is safest for ensuring your application
explicitly sets the correct encoding for all I/O operations, it can
lead to program crashes when an exception is raised.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py strict'))
.. }}}

::

    $ python codecs_encode_error.py strict

    ERROR: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

.. {{{end}}}

Some of the other error modes are more flexible.  For example,
``replace`` ensures that no error is raised, at the expense of
possibly losing data that cannot be converted to the requested
encoding.  The Unicode character for pi still cannot be encoded in
ASCII, but instead of raising an exception the character is replaced
with ``?`` in the output.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py replace'))
.. }}}

::

    $ python codecs_encode_error.py replace

    File contents: 'pi: ?'

.. {{{end}}}

To skip over problem data entirely, use ``ignore``.  Any data that
cannot be encoded is simply discarded.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py ignore'))
.. }}}

::

    $ python codecs_encode_error.py ignore

    File contents: 'pi: '

.. {{{end}}}

There are two lossless error handling options, both of which replace
the character with an alternate representation defined by a standard
separate from the encoding.

``xmlcharrefreplace`` uses an XML character reference as a substitute
(the list of character references is specified in the W3C *XML Entity
Definitions for Characters*).

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py xmlcharrefreplace'))
.. }}}

::

    $ python codecs_encode_error.py xmlcharrefreplace

    File contents: 'pi: &#960;'

.. {{{end}}}

The other lossless error handling scheme is ``backslashreplace``,
which produces an output format like the value you get when you print
the :func:`repr` of a :class:`unicode` object.  Unicode characters
are replaced with ``\u`` followed by the hexadecimal value of the
code point.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py backslashreplace'))
.. }}}

::

    $ python codecs_encode_error.py backslashreplace

    File contents: 'pi: \\u03c0'

.. {{{end}}}

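The same error handler names can be passed directly to the
:func:`encode` method, which is a convenient way to experiment
without creating a file.  A minimal sketch, with the expected results
shown in the comments:

::

    # Each error handler changes what encode() does with the one
    # character that has no ASCII representation.
    text = u'pi: \u03c0'

    print repr(text.encode('ascii', 'replace'))            # 'pi: ?'
    print repr(text.encode('ascii', 'ignore'))             # 'pi: '
    print repr(text.encode('ascii', 'xmlcharrefreplace'))  # 'pi: &#960;'
    print repr(text.encode('ascii', 'backslashreplace'))   # 'pi: \\u03c0'
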
Decoding Errors
---------------

It is also possible to see errors when decoding data, especially if
the wrong encoding is used.

.. include:: codecs_decode_error.py
    :literal:
    :start-after: #end_pymotw_header

As with encoding, ``strict`` error handling mode raises an exception
if the byte stream cannot be properly decoded.  In this case, a
:class:`UnicodeDecodeError` results from trying to convert part of
the UTF-16 BOM to a character using the UTF-8 decoder.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode_error.py strict'))
.. }}}

::

    $ python codecs_decode_error.py strict

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    ERROR: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

.. {{{end}}}

Switching to ``ignore`` causes the decoder to skip over the invalid
bytes.  The result is still not quite what is expected, though, since
it includes embedded null bytes.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode_error.py ignore'))
.. }}}

::

    $ python codecs_decode_error.py ignore

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'p\x00i\x00:\x00 \x00\x03'

.. {{{end}}}

In ``replace`` mode invalid bytes are replaced with ``\uFFFD``, the
official Unicode replacement character, which looks like a diamond
with a black background containing a white question mark (|?|).

.. |?| unicode:: 0xFFFD

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode_error.py replace'))
.. }}}

::

    $ python codecs_decode_error.py replace

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'\ufffd\ufffdp\x00i\x00:\x00 \x00\ufffd\x03'

.. {{{end}}}

Standard Input and Output Streams
=================================

The most common cause of :class:`UnicodeEncodeError` exceptions is
code that tries to print :class:`unicode` data to the console or a
Unix pipeline when ``sys.stdout`` is not configured with an encoding.

.. include:: codecs_stdout.py
    :literal:
    :start-after: #end_pymotw_header

Problems with the default encoding of the standard I/O channels can
be difficult to debug because the program works as expected when the
output goes to the console, but causes encoding errors when it is
used as part of a pipeline and the output includes Unicode characters
above the ASCII range.  This difference in behavior is caused by
Python's initialization code, which sets the default encoding for
each standard I/O channel *only if* the channel is connected to a
terminal (:func:`isatty` returns ``True``).  If there is no terminal,
Python assumes the program will configure the encoding explicitly,
and leaves the I/O channel alone.

.. Do not use cog, since it never has a TTY.

::

    $ python codecs_stdout.py
    Default encoding: utf-8
    TTY: True
    pi: π

    $ python codecs_stdout.py | cat -
    Default encoding: None
    TTY: False
    Traceback (most recent call last):
      File "codecs_stdout.py", line 18, in <module>
        print text
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0'
    in position 4: ordinal not in range(128)

To explicitly set the encoding on the standard output channel, use
:func:`getwriter` to get a stream encoder class for a specific
encoding.  Instantiate the class, passing ``sys.stdout`` as the only
argument.

.. include:: codecs_stdout_wrapped.py
    :literal:
    :start-after: #end_pymotw_header

Writing to the wrapped version of ``sys.stdout`` passes the Unicode
text through an encoder before sending the encoded bytes to stdout.
Replacing ``sys.stdout`` means that any code used by your application
that prints to standard output will be able to take advantage of the
encoding writer.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_stdout_wrapped.py'))
.. }}}

::

    $ python codecs_stdout_wrapped.py

    Via write: pi: π
    Via print: pi: π

.. {{{end}}}

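The wrapping itself takes only a couple of lines.  A minimal sketch,
assuming UTF-8 output is acceptable for the pipeline:

::

    # Replace sys.stdout with a StreamWriter that encodes unicode
    # strings as UTF-8 on the way out, terminal or not.
    import codecs
    import sys

    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    print u'pi: \u03c0'
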
The next problem to solve is how to know which encoding should be
used.  The proper encoding varies based on location, language, and
user or system configuration, so hard-coding a fixed value is not a
good idea.  It would also be annoying for a user to need to pass
explicit arguments to every program setting the input and output
encodings.  Fortunately, there is a global way to get a reasonable
default encoding, using :mod:`locale`.

.. include:: codecs_stdout_locale.py
    :literal:
    :start-after: #end_pymotw_header

:func:`getdefaultlocale` returns the language and preferred encoding
based on the system and user configuration settings in a form that
can be used with :func:`getwriter`.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_stdout_locale.py'))
.. }}}

::

    $ python codecs_stdout_locale.py

    Locale encoding : UTF-8
    With wrapped stdout: pi: π

.. {{{end}}}

The encoding also needs to be set up when working with ``sys.stdin``.
Use :func:`getreader` to get a reader capable of decoding the input
bytes.

.. include:: codecs_stdin.py
    :literal:
    :start-after: #end_pymotw_header

Reading from the wrapped handle returns :class:`unicode` objects
instead of :class:`str` instances.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_stdout_locale.py | python codecs_stdin.py'))
.. }}}

::

    $ python codecs_stdout_locale.py | python codecs_stdin.py

    From stdin: u'Locale encoding : UTF-8\nWith wrapped stdout: pi: \u03c0\n'

.. {{{end}}}

Network Communication
=====================

Network sockets are also byte-streams, and so Unicode data must be
encoded into bytes before it is written to a socket.

.. include:: codecs_socket_fail.py
    :literal:
    :start-after: #end_pymotw_header

You could encode the data explicitly, before sending it, but miss one
call to :func:`send` and your program would fail with an encoding
error.

.. Do not re-run this example every time, since it sometimes generates
.. errors within the thread that distracts from the unicode error.
.. cog.out(run_script(cog.inFile, 'codecs_socket_fail.py', ignore_error=True))

::

    $ python codecs_socket_fail.py
    Traceback (most recent call last):
      File "codecs_socket_fail.py", line 43, in <module>
        len_sent = s.send(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0'
    in position 4: ordinal not in range(128)

By using :func:`makefile` to get a file-like handle for the socket,
and then wrapping that with a stream-based reader or writer, you will
be able to pass Unicode strings and know they are encoded on the way
in to and out of the socket.

.. include:: codecs_socket.py
    :literal:
    :start-after: #end_pymotw_header

This example uses :class:`PassThrough` to show that the data is
encoded before being sent, and the response is decoded after it is
received in the client.

.. Do not re-run this example every time, since it sometimes generates
.. errors within the thread that distracts from the unicode error.
.. cog.out(run_script(cog.inFile, 'codecs_socket.py'))

::

    $ python codecs_socket.py

    Sending : u'pi: \u03c0'
    Writing : 'pi: \xcf\x80'
    Reading : 'pi: \xcf\x80'
    Received: u'pi: \u03c0'

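The wrapping pattern is the same one used for the standard streams.
Here is a minimal sketch of the client side, where ``sock`` stands in
for an already-connected socket object:

::

    # Wrap a connected socket so unicode strings are encoded to
    # UTF-8 on write and decoded back on read.  'sock' is assumed
    # to be connected already.
    import codecs

    reader = codecs.getreader('utf-8')(sock.makefile('r'))
    writer = codecs.getwriter('utf-8')(sock.makefile('w'))

    writer.write(u'pi: \u03c0')
    writer.flush()
    response = reader.read()   # returns unicode, not str
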
Encoding Translation
====================

Although most applications will work with :class:`unicode` data
internally, decoding or encoding it as part of an I/O operation,
there are times when changing a file's encoding without holding on to
that intermediate data format is useful.  :func:`EncodedFile` takes
an open file handle using one encoding and wraps it with a class that
translates the data to another encoding as the I/O occurs.

.. include:: codecs_encodedfile.py
    :literal:
    :start-after: #end_pymotw_header

This example shows reading from and writing to separate handles
returned by :func:`EncodedFile`.  No matter whether the handle is
used for reading or writing, the *file_encoding* always refers to the
encoding in use by the open file handle passed as the first argument,
and the *data_encoding* value refers to the encoding in use by the
data passing through the :func:`read` and :func:`write` calls.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encodedfile.py'))
.. }}}

::

    $ python codecs_encodedfile.py

    Start as UTF-8 : 70 69 3a 20 cf 80
    Encoded to UTF-16: fffe 7000 6900 3a00 2000 c003
    Back to UTF-8 : 70 69 3a 20 cf 80

.. {{{end}}}

Non-Unicode Encodings
=====================

Although most of the earlier examples use Unicode encodings,
:mod:`codecs` can be used for many other data translations.  For
example, Python includes codecs for working with base-64, bzip2,
ROT-13, ZIP, and other data formats.

.. include:: codecs_rot13.py
    :literal:
    :start-after: #end_pymotw_header

Any transformation that can be expressed as a function taking a
single input argument and returning a byte or Unicode string can be
registered as a codec.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_rot13.py'))
.. }}}

::

    $ python codecs_rot13.py

    Original: abcdefghijklmnopqrstuvwxyz
    ROT-13  : nopqrstuvwxyzabcdefghijklm

.. {{{end}}}

Using :mod:`codecs` to wrap a data stream provides a simpler
interface than working directly with :mod:`zlib`.

.. include:: codecs_zlib.py
    :literal:
    :start-after: #end_pymotw_header

Not all of the compression or encoding systems support reading a
portion of the data through the stream interface using
:func:`readline` or :func:`read` because they need to find the end of
a compressed segment to expand it.  If your program cannot hold the
entire uncompressed data set in memory, use the incremental access
features of the compression library instead of :mod:`codecs`.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_zlib.py'))
.. }}}

::

    $ python codecs_zlib.py

    Original length : 1350
    ZIP compressed : 48
    Read first line : 'abcdefghijklmnopqrstuvwxyz\n'
    Uncompressed : 1350
    Same : True

.. {{{end}}}

Incremental Encoding
====================

Some of the encodings provided, especially ``bz2`` and ``zlib``, may
dramatically change the length of the data stream as they work on it.
For large data sets, these encodings operate better incrementally,
working on one small chunk of data at a time.  The
:class:`IncrementalEncoder` and :class:`IncrementalDecoder` API is
designed for this purpose.

.. include:: codecs_incremental_bz2.py
    :literal:
    :start-after: #end_pymotw_header

Each time data is passed to the encoder or decoder its internal state
is updated.  When the state is consistent (as defined by the codec),
data is returned and the state resets.  Until that point, calls to
:func:`encode` or :func:`decode` will not return any data.  When the
last bit of data is passed in, the argument *final* should be set to
``True`` so the codec knows to flush any remaining buffered data.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_incremental_bz2.py', break_lines_at=69))
.. }}}

::

    $ python codecs_incremental_bz2.py

    Text length : 27
    Repetitions : 50
    Expected len: 1350
    Encoding:.................................................
    Encoded : 99 bytes
    Total encoded length: 99
    Decoding:............................................................
    ............................
    Decoded : 1350 characters
    Decoding:..........
    Total uncompressed length: 1350

.. {{{end}}}

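The incremental classes can also be fetched by codec name through the
registry.  This sketch, which mirrors the numbers used above, feeds
the bz2 encoder one chunk at a time and flushes on the last call:

::

    # Pass data through an incremental codec a chunk at a time.
    import codecs

    encoder = codecs.getincrementalencoder('bz2')()
    decoder = codecs.getincrementaldecoder('bz2')()

    chunks = ['abcdefghijklmnopqrstuvwxyz\n'] * 50   # 27 * 50 = 1350

    compressed = ''
    for i, chunk in enumerate(chunks):
        final = (i == len(chunks) - 1)   # flush buffered data at the end
        compressed += encoder.encode(chunk, final)

    print 'Compressed :', len(compressed)
    print 'Round trip :', len(decoder.decode(compressed, True))
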
Defining Your Own Encoding
==========================

Since Python comes with a large number of standard codecs already, it
is unlikely that you will need to define your own.  If you do, there
are several base classes in :mod:`codecs` to make the process easier.

The first step is to understand the nature of the transformation
described by the encoding.  For example, an "invertcaps" encoding
converts uppercase letters to lowercase, and lowercase letters to
uppercase.  Here is a simple definition of an encoding function that
performs this transformation on an input string:

.. include:: codecs_invertcaps.py
    :literal:
    :start-after: #end_pymotw_header

In this case, the encoder and decoder are the same function (as with
``ROT-13``).

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps.py'))
.. }}}

::

    $ python codecs_invertcaps.py

    abc.DEF
    ABC.def

.. {{{end}}}

Although it is easy to understand, this implementation is not
efficient, especially for very large text strings.  Fortunately,
:mod:`codecs` includes some helper functions for creating *character
map* based codecs such as invertcaps.  A character map encoding is
made up of two dictionaries.  The *encoding map* converts character
values from the input string to byte values in the output and the
*decoding map* goes the other way.  Create your decoding map first,
and then use :func:`make_encoding_map` to convert it to an encoding
map.  The C functions :func:`charmap_encode` and
:func:`charmap_decode` use the maps to convert their input data
efficiently.

.. include:: codecs_invertcaps_charmap.py
    :literal:
    :start-after: #end_pymotw_header

Although the encoding and decoding maps for invertcaps are the same,
that may not always be the case.  :func:`make_encoding_map` detects
situations where more than one input character is encoded to the same
output byte and replaces the encoding value with ``None`` to mark the
encoding as undefined.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps_charmap.py'))
.. }}}

::

    $ python codecs_invertcaps_charmap.py

    ('ABC.def', 7)
    (u'ABC.def', 7)
    True

.. {{{end}}}

The character map encoder and decoder support all of the standard
error handling methods described earlier, so you do not need to do
any extra work to comply with that part of the API.

.. include:: codecs_invertcaps_error.py
    :literal:
    :start-after: #end_pymotw_header

Because the Unicode code point for ``π`` is not in the encoding map,
the strict error handling mode raises an exception.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps_error.py', ignore_error=True, break_lines_at=69))
.. }}}

::

    $ python codecs_invertcaps_error.py

    ignore : ('PI: ', 5)
    replace: ('PI: ?', 5)
    strict : 'charmap' codec can't encode character u'\u03c0' in
    position 4: character maps to <undefined>

.. {{{end}}}

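For reference, here is one way the maps might be constructed with the
helpers named above.  This sketch covers only the ASCII letters and
leaves every other byte value alone:

::

    # Build the invertcaps decoding map from an identity mapping,
    # then derive the encoding map from it.
    import codecs
    import string

    decoding_map = codecs.make_identity_dict(range(256))
    for lower, upper in zip(string.ascii_lowercase,
                            string.ascii_uppercase):
        decoding_map[ord(lower)] = ord(upper)
        decoding_map[ord(upper)] = ord(lower)

    encoding_map = codecs.make_encoding_map(decoding_map)

    print codecs.charmap_encode(u'abc.DEF', 'strict', encoding_map)
    # -> ('ABC.def', 7)
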
Once the encoding and decoding maps are defined, you need to set up a
few additional classes and register the encoding.  :func:`register`
adds a search function to the registry so that when a user wants to
use your encoding :mod:`codecs` can locate it.  The search function
must take a single string argument with the name of the encoding, and
return a :class:`CodecInfo` object if it knows the encoding, or
``None`` if it does not.

.. include:: codecs_register.py
    :literal:
    :start-after: #end_pymotw_header

You can register multiple search functions, and each will be called
in turn until one returns a :class:`CodecInfo` or the list is
exhausted.  The internal search function registered by :mod:`codecs`
knows how to load the standard codecs such as UTF-8 from
:mod:`encodings`, so those names will never be passed to your search
function.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_register.py'))
.. }}}

::

    $ python codecs_register.py

    UTF-8: <codecs.CodecInfo object for encoding utf-8 at 0x...>
    search1: Searching for: no-such-encoding
    search2: Searching for: no-such-encoding
    ERROR: unknown encoding: no-such-encoding

.. {{{end}}}

The :class:`CodecInfo` instance returned by the search function tells
:mod:`codecs` how to encode and decode using all of the different
mechanisms supported: stateless, incremental, and stream.
:mod:`codecs` includes base classes that make setting up a character
map encoding easy.  This example puts all of the pieces together to
register a search function that returns a :class:`CodecInfo` instance
configured for the invertcaps codec.

.. include:: codecs_invertcaps_register.py
    :literal:
    :start-after: #end_pymotw_header

The stateless encoder/decoder base class is :class:`Codec`.  Override
:func:`encode` and :func:`decode` with your implementation (in this
case, calling :func:`charmap_encode` and :func:`charmap_decode`
respectively).  Each method must return a tuple containing the
transformed data and the number of input bytes or characters
consumed.  Conveniently, :func:`charmap_encode` and
:func:`charmap_decode` already return that information.

:class:`IncrementalEncoder` and :class:`IncrementalDecoder` serve as
base classes for the incremental interfaces.  The :func:`encode` and
:func:`decode` methods of the incremental classes are defined in such
a way that they only return the actual transformed data.  Any
information about buffering is maintained as internal state.  The
invertcaps encoding does not need to buffer data (it uses a
one-to-one mapping).  For encodings that produce a different amount
of output depending on the data being processed, such as compression
algorithms, :class:`BufferedIncrementalEncoder` and
:class:`BufferedIncrementalDecoder` are more appropriate base
classes, since they manage the unprocessed portion of the input for
you.

:class:`StreamReader` and :class:`StreamWriter` need :func:`encode`
and :func:`decode` methods, too, and since they are expected to
return the same value as the version from :class:`Codec` you can use
multiple inheritance for the implementation.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps_register.py'))
.. }}}

::

    $ python codecs_invertcaps_register.py

    Encoder converted "abc.DEF" to "ABC.def", consuming 7 characters
    StreamWriter for stdout: ABC.def
    IncrementalDecoder converted "ABC.def" to "abc.DEF"

.. {{{end}}}

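To make the registration step concrete, here is a stripped-down
sketch of a search function.  The names ``invertcaps_encode`` and
``invertcaps_decode`` stand in for the stateless functions described
above:

::

    # Minimal registration: a search function that recognizes one
    # name and returns a CodecInfo built from stateless callables.
    import codecs

    def search(name):
        if name != 'invertcaps':
            return None
        return codecs.CodecInfo(
            name='invertcaps',
            encode=invertcaps_encode,   # assumed defined as above
            decode=invertcaps_decode,
            )

    codecs.register(search)

    # Once registered, the codec works by name:
    # u'abc.DEF'.encode('invertcaps') -> 'ABC.def'
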
.. seealso::

    :mod:`codecs`
        The standard library documentation for this module.

    :mod:`locale`
        Accessing and managing the localization-based configuration
        settings and behaviors.

    :mod:`io`
        The :mod:`io` module includes file and stream wrappers that
        handle encoding and decoding, too.

    :mod:`SocketServer`
        For a more detailed example of an echo server, see the
        :mod:`SocketServer` module.

    :mod:`encodings`
        Package in the standard library containing the
        encoder/decoder implementations provided by Python.

    `Unicode HOWTO`_
        The official guide for using Unicode with Python 2.x.

    *Python Unicode Objects*
        Fredrik Lundh's article about using non-ASCII character sets
        in Python 2.0.

    *How to Use UTF-8 with Python*
        Evan Jones' quick guide to working with Unicode, including
        XML data and the Byte-Order Marker.

    *On the Goodness of Unicode*
        Introduction to internationalization and Unicode by Tim Bray.

    *On Character Strings*
        A look at the history of string processing in programming
        languages, by Tim Bray.

    *Characters vs. Bytes*
        Part one of Tim Bray's "essay on modern character string
        processing for computer programmers."  This installment
        covers in-memory representation of text in formats other than
        ASCII bytes.

    *The Absolute Minimum Every Software Developer Absolutely,
    Positively Must Know About Unicode and Character Sets (No
    Excuses!)*
        An introduction to Unicode by Joel Spolsky.

    *Endianness*
        Explanation of endianness in Wikipedia.

.. _Unicode HOWTO: http://docs.python.org/howto/unicode