========================================
codecs -- String encoding and decoding
========================================

.. module:: codecs
    :synopsis: String encoding and decoding.

:Purpose: Encoders and decoders for converting text between different representations.
:Available In: 2.1 and later

The :mod:`codecs` module provides stream and file interfaces for
transcoding data in your program.  It is most commonly used to work
with Unicode text, but other encodings are also available for other
purposes.

Unicode Primer
==============

CPython 2.x supports two types of strings for working with text data.
Old-style :class:`str` instances use a single 8-bit byte to represent
each character of the string using its ASCII code.  In contrast,
:class:`unicode` strings are managed internally as a sequence of
Unicode *code points*.  The code point values are saved as a sequence
of 2 or 4 bytes each, depending on the options given when Python was
compiled.  Both :class:`unicode` and :class:`str` are derived from a
common base class, and support a similar API.

When :class:`unicode` strings are output, they are encoded using one
of several standard schemes so that the sequence of bytes can be
reconstructed as the same string later.  The bytes of the encoded
value are not necessarily the same as the code point values, and the
encoding defines a way to translate between the two sets of values.
Reading Unicode data also requires knowing the encoding so that the
incoming bytes can be converted to the internal representation used
by the :class:`unicode` class.

The most common encodings for Western languages are ``UTF-8`` and
``UTF-16``, which use sequences of one- and two-byte values,
respectively, to represent each character.  Other encodings can be
more efficient for storing languages where most of the characters are
represented by code points that do not fit into two bytes.

.. seealso::

    For more introductory information about Unicode, refer to the
    list of references at the end of this section.  The Python
    `Unicode HOWTO`_ is especially helpful.

Encodings
---------

The best way to understand encodings is to look at the different
series of bytes produced by encoding the same string in different
ways.  The examples below use this function to format the byte string
to make it easier to read.

.. include:: codecs_to_hex.py
    :literal:
    :start-after: #end_pymotw_header

The function uses :mod:`binascii` to get a hexadecimal representation
of the input byte string, then inserts a space between every *nbytes*
bytes before returning the value.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_to_hex.py'))
.. }}}

::

    $ python codecs_to_hex.py

    61 62 63 64 65 66
    6162 6364 6566

.. {{{end}}}

The first encoding example begins by printing the text ``'pi: π'``
using the raw representation of the :class:`unicode` class.  The
``π`` character is replaced with the expression for the Unicode code
point, ``\u03c0``.  The next two lines encode the string as UTF-8 and
UTF-16 respectively, and show the hexadecimal values resulting from
the encoding.

.. include:: codecs_encodings.py
    :literal:
    :start-after: #end_pymotw_header

The result of encoding a :class:`unicode` string is a :class:`str`
object.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encodings.py'))
.. }}}

::

    $ python codecs_encodings.py

    Raw   : u'pi: \u03c0'
    UTF-8 : 70 69 3a 20 cf 80
    UTF-16: fffe 7000 6900 3a00 2000 c003

.. {{{end}}}

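The distinction between code points and bytes also shows up in the
length of the data.  This quick check (a sketch added here for
illustration, not one of the example files) compares the number of
code points in the string with the number of bytes in two encoded
forms:

::

    # A unicode string is measured in code points, but its encoded
    # form is measured in bytes.
    text = u'pi: \u03c0'

    print len(text)                   # 5 code points
    print len(text.encode('utf-8'))   # 6 bytes: pi needs two bytes
    print len(text.encode('utf-16'))  # 12 bytes: BOM + five 2-byte units
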
Given a sequence of encoded bytes as a :class:`str` instance, the
:func:`decode` method translates them to code points and returns the
sequence as a :class:`unicode` instance.

.. include:: codecs_decode.py
    :literal:
    :start-after: #end_pymotw_header

The choice of encoding used does not change the output type.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode.py'))
.. }}}

::

    $ python codecs_decode.py

    Original : u'pi: \u03c0'
    Encoded  : 70 69 3a 20 cf 80
    Decoded  : u'pi: \u03c0'

.. {{{end}}}

.. note::

    The default encoding is set during the interpreter start-up
    process, when :mod:`site` is loaded.  Refer to
    :ref:`sys-unicode-defaults` for a description of the default
    encoding settings accessible via :mod:`sys`.

Working with Files
==================

Encoding and decoding strings is especially important when dealing
with I/O operations.  Whether you are writing to a file, socket, or
other stream, you will want to ensure that the data is using the
proper encoding.  In general, all text data needs to be decoded from
its byte representation as it is read, and encoded from the internal
values to a specific representation as it is written.  Your program
can explicitly encode and decode data, but depending on the encoding
used it can be non-trivial to determine whether you have read enough
bytes in order to fully decode the data.  :mod:`codecs` provides
classes that manage the data encoding and decoding for you, so you
don't have to create your own.

The simplest interface provided by :mod:`codecs` is a replacement for
the built-in :func:`open` function.  The new version works just like
the built-in, but adds two new arguments to specify the encoding and
desired error handling technique.

.. include:: codecs_open_write.py
    :literal:
    :start-after: #end_pymotw_header

Starting with a :class:`unicode` string with the code point for π,
this example saves the text to a file using an encoding specified on
the command line.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_open_write.py utf-8'))
.. cog.out(run_script(cog.inFile, 'codecs_open_write.py utf-16', include_prefix=False))
.. cog.out(run_script(cog.inFile, 'codecs_open_write.py utf-32', include_prefix=False))
.. }}}

::

    $ python codecs_open_write.py utf-8

    Writing to utf-8.txt
    File contents:
    70 69 3a 20 cf 80

    $ python codecs_open_write.py utf-16

    Writing to utf-16.txt
    File contents:
    fffe 7000 6900 3a00 2000 c003

    $ python codecs_open_write.py utf-32

    Writing to utf-32.txt
    File contents:
    fffe0000 70000000 69000000 3a000000 20000000 c0030000

.. {{{end}}}

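For comparison, here is a minimal sketch of the work the replacement
:func:`open` does for you on the output side (the file name here is
hypothetical): encode the text explicitly, then write the resulting
bytes with the built-in function.

::

    # Roughly what codecs.open() automates when writing: encode the
    # unicode text, then write the resulting bytes to a binary file.
    text = u'pi: \u03c0'

    with open('utf-8-manual.txt', 'wb') as f:
        f.write(text.encode('utf-8'))

Note that :func:`codecs.open` always opens the file in binary mode,
so no newline translation is applied to the encoded bytes.
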
Reading the data with :func:`open` is straightforward, with one
catch: you must know the encoding in advance, in order to set up the
decoder correctly.  Some data formats, such as XML, let you specify
the encoding as part of the file, but usually it is up to the
application to manage.  :mod:`codecs` simply takes the encoding as an
argument and assumes it is correct.

.. include:: codecs_open_read.py
    :literal:
    :start-after: #end_pymotw_header

This example reads the files created by the previous program, and
prints the representation of the resulting :class:`unicode` object to
the console.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_open_read.py utf-8'))
.. cog.out(run_script(cog.inFile, 'codecs_open_read.py utf-16', include_prefix=False))
.. cog.out(run_script(cog.inFile, 'codecs_open_read.py utf-32', include_prefix=False))
.. }}}

::

    $ python codecs_open_read.py utf-8

    Reading from utf-8.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-16

    Reading from utf-16.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-32

    Reading from utf-32.txt
    u'pi: \u03c0'

.. {{{end}}}

Byte Order
==========

Multi-byte encodings such as UTF-16 and UTF-32 pose a problem when
transferring the data between different computer systems, either by
copying the file directly or with network communication.  Different
systems use different ordering of the high and low order bytes.  This
characteristic of the data, known as its *endianness*, depends on
factors such as the hardware architecture and choices made by the
operating system and application developer.  There isn't always a way
to know in advance what byte order to use for a given set of data, so
the multi-byte encodings include a *byte-order marker* (BOM) as the
first few bytes of encoded output.  For example, UTF-16 is defined in
such a way that 0xFFFE and 0xFEFF are not valid characters, and can
be used to indicate the byte order.  :mod:`codecs` defines constants
for the byte order markers used by UTF-16 and UTF-32.

.. include:: codecs_bom.py
    :literal:
    :start-after: #end_pymotw_header

``BOM``, ``BOM_UTF16``, and ``BOM_UTF32`` are automatically set to
the appropriate big-endian or little-endian values depending on the
current system's native byte order.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_bom.py'))
.. }}}

::

    $ python codecs_bom.py

    BOM          : fffe
    BOM_BE       : feff
    BOM_LE       : fffe
    BOM_UTF8     : efbb bf
    BOM_UTF16    : fffe
    BOM_UTF16_BE : feff
    BOM_UTF16_LE : fffe
    BOM_UTF32    : fffe 0000
    BOM_UTF32_BE : 0000 feff
    BOM_UTF32_LE : fffe 0000

.. {{{end}}}

Byte ordering is detected and handled automatically by the decoders
in :mod:`codecs`, but you can also choose an explicit ordering for
the encoding.

.. include:: codecs_bom_create_file.py
    :literal:
    :start-after: #end_pymotw_header

``codecs_bom_create_file.py`` figures out the native byte ordering,
then uses the alternate form explicitly so the next example can
demonstrate auto-detection while reading.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_bom_create_file.py'))
.. }}}

::

    $ python codecs_bom_create_file.py

    Native order  : fffe
    Selected order: feff
    utf_16_be     : 0070 0069 003a 0020 03c0

.. {{{end}}}

``codecs_bom_detection.py`` does not specify a byte order when
opening the file, so the decoder uses the BOM value in the first two
bytes of the file to determine it.

.. include:: codecs_bom_detection.py
    :literal:
    :start-after: #end_pymotw_header

Since the first two bytes of the file are used for byte order
detection, they are not included in the data returned by
:func:`read`.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_bom_detection.py'))
.. }}}

::

    $ python codecs_bom_detection.py

    Raw    : feff 0070 0069 003a 0020 03c0
    Decoded: u'pi: \u03c0'

.. {{{end}}}

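When you need to examine a byte stream yourself, the BOM constants
can be compared against the beginning of the data directly.  This
sketch assumes the ``utf-16.txt`` file created by
``codecs_open_write.py`` above:

::

    # Identify the byte order of raw UTF-16 data by checking its
    # first two bytes against the codecs BOM constants.
    import codecs

    raw = open('utf-16.txt', 'rb').read()

    if raw.startswith(codecs.BOM_UTF16_LE):
        print 'little-endian'
    elif raw.startswith(codecs.BOM_UTF16_BE):
        print 'big-endian'
    else:
        print 'no BOM found'
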
Error Handling
==============

The previous sections pointed out the need to know the encoding being
used when reading and writing Unicode files.  Setting the encoding
correctly is important for two reasons.  If the encoding is
configured incorrectly while reading from a file, the data will be
interpreted incorrectly and may be corrupted or simply fail to
decode.  Not all Unicode characters can be represented in all
encodings, so if the wrong encoding is used while writing an error
will be generated and data may be lost.

:mod:`codecs` uses the same five error handling options that are
provided by the :func:`encode` method of :class:`unicode` and the
:func:`decode` method of :class:`str`.

=====================  ========================================================================
Error Mode             Description
=====================  ========================================================================
``strict``             Raises an exception if the data cannot be converted.
``replace``            Substitutes a special marker character for data that cannot be encoded.
``ignore``             Skips the data.
``xmlcharrefreplace``  XML character reference (encoding only)
``backslashreplace``   escape sequence (encoding only)
=====================  ========================================================================

Encoding Errors
---------------

The most common error condition is receiving a
:class:`UnicodeEncodeError` when writing Unicode data to an ASCII
output stream, such as a regular file or ``sys.stdout``.  This sample
program can be used to experiment with the different error handling
modes.

.. include:: codecs_encode_error.py
    :literal:
    :start-after: #end_pymotw_header

While ``strict`` mode is safest for ensuring your application
explicitly sets the correct encoding for all I/O operations, it can
lead to program crashes when an exception is raised.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py strict'))
.. }}}

::

    $ python codecs_encode_error.py strict

    ERROR: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

.. {{{end}}}

Some of the other error modes are more flexible.  For example,
``replace`` ensures that no error is raised, at the expense of
possibly losing data that cannot be converted to the requested
encoding.  The Unicode character for pi still cannot be encoded in
ASCII, but instead of raising an exception the character is replaced
with ``?`` in the output.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py replace'))
.. }}}

::

    $ python codecs_encode_error.py replace

    File contents: 'pi: ?'

.. {{{end}}}

To skip over problem data entirely, use ``ignore``.  Any data that
cannot be encoded is simply discarded.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py ignore'))
.. }}}

::

    $ python codecs_encode_error.py ignore

    File contents: 'pi: '

.. {{{end}}}

There are two lossless error handling options, both of which replace
the character with an alternate representation defined by a standard
separate from the encoding.

``xmlcharrefreplace`` uses an XML character reference as a substitute
(the list of character references is specified in the W3C *XML Entity
Definitions for Characters*).

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py xmlcharrefreplace'))
.. }}}

::

    $ python codecs_encode_error.py xmlcharrefreplace

    File contents: 'pi: &#960;'

.. {{{end}}}

The other lossless error handling scheme is ``backslashreplace``,
which produces an output format like the value you get when you print
the :func:`repr` of a :class:`unicode` object.  Unicode characters
are replaced with ``\u`` followed by the hexadecimal value of the
code point.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encode_error.py backslashreplace'))
.. }}}

::

    $ python codecs_encode_error.py backslashreplace

    File contents: 'pi: \\u03c0'

.. {{{end}}}

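The same error handler names can be passed directly to the
:func:`encode` method, which is a convenient way to experiment
without creating a file.  A minimal sketch, with the expected results
shown in the comments:

::

    # Each error handler changes what encode() does with the one
    # character that has no ASCII representation.
    text = u'pi: \u03c0'

    print repr(text.encode('ascii', 'replace'))            # 'pi: ?'
    print repr(text.encode('ascii', 'ignore'))             # 'pi: '
    print repr(text.encode('ascii', 'xmlcharrefreplace'))  # 'pi: &#960;'
    print repr(text.encode('ascii', 'backslashreplace'))   # 'pi: \\u03c0'
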
Decoding Errors
---------------

It is also possible to see errors when decoding data, especially if
the wrong encoding is used.

.. include:: codecs_decode_error.py
    :literal:
    :start-after: #end_pymotw_header

As with encoding, ``strict`` error handling mode raises an exception
if the byte stream cannot be properly decoded.  In this case, a
:class:`UnicodeDecodeError` results from trying to convert part of
the UTF-16 BOM to a character using the UTF-8 decoder.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode_error.py strict'))
.. }}}

::

    $ python codecs_decode_error.py strict

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    ERROR: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

.. {{{end}}}

Switching to ``ignore`` causes the decoder to skip over the invalid
bytes.  The result is still not quite what is expected, though, since
it includes embedded null bytes.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode_error.py ignore'))
.. }}}

::

    $ python codecs_decode_error.py ignore

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'p\x00i\x00:\x00 \x00\x03'

.. {{{end}}}

In ``replace`` mode invalid bytes are replaced with ``\uFFFD``, the
official Unicode replacement character, which looks like a diamond
with a black background containing a white question mark (|?|).

.. |?| unicode:: 0xFFFD

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_decode_error.py replace'))
.. }}}

::

    $ python codecs_decode_error.py replace

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'\ufffd\ufffdp\x00i\x00:\x00 \x00\ufffd\x03'

.. {{{end}}}

Standard Input and Output Streams
=================================

The most common cause of :class:`UnicodeEncodeError` exceptions is
code that tries to print :class:`unicode` data to the console or a
Unix pipeline when ``sys.stdout`` is not configured with an encoding.

.. include:: codecs_stdout.py
    :literal:
    :start-after: #end_pymotw_header

Problems with the default encoding of the standard I/O channels can
be difficult to debug because the program works as expected when the
output goes to the console, but causes encoding errors when it is
used as part of a pipeline and the output includes Unicode characters
above the ASCII range.  This difference in behavior is caused by
Python's initialization code, which sets the default encoding for
each standard I/O channel *only if* the channel is connected to a
terminal (:func:`isatty` returns ``True``).  If there is no terminal,
Python assumes the program will configure the encoding explicitly,
and leaves the I/O channel alone.

.. Do not use cog, since it never has a TTY.

::

    $ python codecs_stdout.py
    Default encoding: utf-8
    TTY: True
    pi: π

    $ python codecs_stdout.py | cat -
    Default encoding: None
    TTY: False
    Traceback (most recent call last):
      File "codecs_stdout.py", line 18, in <module>
        print text
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0'
    in position 4: ordinal not in range(128)

To explicitly set the encoding on the standard output channel, use
:func:`getwriter` to get a stream encoder class for a specific
encoding.  Instantiate the class, passing ``sys.stdout`` as the only
argument.

.. include:: codecs_stdout_wrapped.py
    :literal:
    :start-after: #end_pymotw_header

Writing to the wrapped version of ``sys.stdout`` passes the Unicode
text through an encoder before sending the encoded bytes to stdout.
Replacing ``sys.stdout`` means that any code used by your application
that prints to standard output will be able to take advantage of the
encoding writer.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_stdout_wrapped.py'))
.. }}}

::

    $ python codecs_stdout_wrapped.py

    Via write: pi: π
    Via print: pi: π

.. {{{end}}}

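The wrapping itself takes only a couple of lines.  A minimal sketch,
assuming UTF-8 output is acceptable for the pipeline:

::

    # Replace sys.stdout with a StreamWriter that encodes unicode
    # strings as UTF-8 on the way out, terminal or not.
    import codecs
    import sys

    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    print u'pi: \u03c0'
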
The next problem to solve is how to know which encoding should be
used.  The proper encoding varies based on location, language, and
user or system configuration, so hard-coding a fixed value is not a
good idea.  It would also be annoying for a user to need to pass
explicit arguments to every program setting the input and output
encodings.  Fortunately, there is a global way to get a reasonable
default encoding, using :mod:`locale`.

.. include:: codecs_stdout_locale.py
    :literal:
    :start-after: #end_pymotw_header

:func:`getdefaultlocale` returns the language and preferred encoding
based on the system and user configuration settings in a form that
can be used with :func:`getwriter`.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_stdout_locale.py'))
.. }}}

::

    $ python codecs_stdout_locale.py

    Locale encoding : UTF-8
    With wrapped stdout: pi: π

.. {{{end}}}

The encoding also needs to be set up when working with ``sys.stdin``.
Use :func:`getreader` to get a reader capable of decoding the input
bytes.

.. include:: codecs_stdin.py
    :literal:
    :start-after: #end_pymotw_header

Reading from the wrapped handle returns :class:`unicode` objects
instead of :class:`str` instances.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_stdout_locale.py | python codecs_stdin.py'))
.. }}}

::

    $ python codecs_stdout_locale.py | python codecs_stdin.py

    From stdin: u'Locale encoding : UTF-8\nWith wrapped stdout: pi: \u03c0\n'

.. {{{end}}}

Network Communication
=====================

Network sockets are also byte-streams, and so Unicode data must be
encoded into bytes before it is written to a socket.

.. include:: codecs_socket_fail.py
    :literal:
    :start-after: #end_pymotw_header

You could encode the data explicitly, before sending it, but miss one
call to :func:`send` and your program would fail with an encoding
error.

.. Do not re-run this example every time, since it sometimes generates
.. errors within the thread that distracts from the unicode error.
.. cog.out(run_script(cog.inFile, 'codecs_socket_fail.py', ignore_error=True))

::

    $ python codecs_socket_fail.py
    Traceback (most recent call last):
      File "codecs_socket_fail.py", line 43, in <module>
        len_sent = s.send(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0'
    in position 4: ordinal not in range(128)

By using :func:`makefile` to get a file-like handle for the socket,
and then wrapping that with a stream-based reader or writer, you will
be able to pass Unicode strings and know they are encoded on the way
in to and out of the socket.

.. include:: codecs_socket.py
    :literal:
    :start-after: #end_pymotw_header

This example uses :class:`PassThrough` to show that the data is
encoded before being sent, and the response is decoded after it is
received in the client.

.. Do not re-run this example every time, since it sometimes generates
.. errors within the thread that distracts from the unicode error.
.. cog.out(run_script(cog.inFile, 'codecs_socket.py'))

::

    $ python codecs_socket.py

    Sending : u'pi: \u03c0'
    Writing : 'pi: \xcf\x80'
    Reading : 'pi: \xcf\x80'
    Received: u'pi: \u03c0'

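The wrapping pattern is the same one used for the standard streams.
Here is a minimal sketch of the client side, where ``sock`` stands in
for an already-connected socket object:

::

    # Wrap a connected socket so unicode strings are encoded to
    # UTF-8 on write and decoded back on read.  'sock' is assumed
    # to be connected already.
    import codecs

    reader = codecs.getreader('utf-8')(sock.makefile('r'))
    writer = codecs.getwriter('utf-8')(sock.makefile('w'))

    writer.write(u'pi: \u03c0')
    writer.flush()
    response = reader.read()   # returns unicode, not str
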
Encoding Translation
====================

Although most applications will work with :class:`unicode` data
internally, decoding or encoding it as part of an I/O operation,
there are times when changing a file's encoding without holding on to
that intermediate data format is useful.  :func:`EncodedFile` takes
an open file handle using one encoding and wraps it with a class that
translates the data to another encoding as the I/O occurs.

.. include:: codecs_encodedfile.py
    :literal:
    :start-after: #end_pymotw_header

This example shows reading from and writing to separate handles
returned by :func:`EncodedFile`.  No matter whether the handle is
used for reading or writing, the *file_encoding* always refers to the
encoding in use by the open file handle passed as the first argument,
and the *data_encoding* value refers to the encoding in use by the
data passing through the :func:`read` and :func:`write` calls.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_encodedfile.py'))
.. }}}

::

    $ python codecs_encodedfile.py

    Start as UTF-8 : 70 69 3a 20 cf 80
    Encoded to UTF-16: fffe 7000 6900 3a00 2000 c003
    Back to UTF-8 : 70 69 3a 20 cf 80

.. {{{end}}}

Non-Unicode Encodings
=====================

Although most of the earlier examples use Unicode encodings,
:mod:`codecs` can be used for many other data translations.  For
example, Python includes codecs for working with base-64, bzip2,
ROT-13, ZIP, and other data formats.

.. include:: codecs_rot13.py
    :literal:
    :start-after: #end_pymotw_header

Any transformation that can be expressed as a function taking a
single input argument and returning a byte or Unicode string can be
registered as a codec.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_rot13.py'))
.. }}}

::

    $ python codecs_rot13.py

    Original: abcdefghijklmnopqrstuvwxyz
    ROT-13  : nopqrstuvwxyzabcdefghijklm

.. {{{end}}}

Using :mod:`codecs` to wrap a data stream provides a simpler
interface than working directly with :mod:`zlib`.

.. include:: codecs_zlib.py
    :literal:
    :start-after: #end_pymotw_header

Not all of the compression or encoding systems support reading a
portion of the data through the stream interface using
:func:`readline` or :func:`read` because they need to find the end of
a compressed segment to expand it.  If your program cannot hold the
entire uncompressed data set in memory, use the incremental access
features of the compression library instead of :mod:`codecs`.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_zlib.py'))
.. }}}

::

    $ python codecs_zlib.py

    Original length : 1350
    ZIP compressed : 48
    Read first line : 'abcdefghijklmnopqrstuvwxyz\n'
    Uncompressed : 1350
    Same : True

.. {{{end}}}

Incremental Encoding
====================

Some of the encodings provided, especially ``bz2`` and ``zlib``, may
dramatically change the length of the data stream as they work on it.
For large data sets, these encodings operate better incrementally,
working on one small chunk of data at a time.  The
:class:`IncrementalEncoder` and :class:`IncrementalDecoder` API is
designed for this purpose.

.. include:: codecs_incremental_bz2.py
    :literal:
    :start-after: #end_pymotw_header

Each time data is passed to the encoder or decoder its internal state
is updated.  When the state is consistent (as defined by the codec),
data is returned and the state resets.  Until that point, calls to
:func:`encode` or :func:`decode` will not return any data.  When the
last bit of data is passed in, the argument *final* should be set to
``True`` so the codec knows to flush any remaining buffered data.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_incremental_bz2.py', break_lines_at=69))
.. }}}

::

    $ python codecs_incremental_bz2.py

    Text length : 27
    Repetitions : 50
    Expected len: 1350
    Encoding:.................................................
    Encoded : 99 bytes
    Total encoded length: 99
    Decoding:............................................................
    ............................
    Decoded : 1350 characters
    Decoding:..........
    Total uncompressed length: 1350

.. {{{end}}}

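The incremental classes can also be fetched by codec name through the
registry.  This sketch, which mirrors the numbers used above, feeds
the bz2 encoder one chunk at a time and flushes on the last call:

::

    # Pass data through an incremental codec a chunk at a time.
    import codecs

    encoder = codecs.getincrementalencoder('bz2')()
    decoder = codecs.getincrementaldecoder('bz2')()

    chunks = ['abcdefghijklmnopqrstuvwxyz\n'] * 50   # 27 * 50 = 1350

    compressed = ''
    for i, chunk in enumerate(chunks):
        final = (i == len(chunks) - 1)   # flush buffered data at the end
        compressed += encoder.encode(chunk, final)

    print 'Compressed :', len(compressed)
    print 'Round trip :', len(decoder.decode(compressed, True))
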
Defining Your Own Encoding
==========================

Since Python comes with a large number of standard codecs already, it
is unlikely that you will need to define your own.  If you do, there
are several base classes in :mod:`codecs` to make the process easier.

The first step is to understand the nature of the transformation
described by the encoding.  For example, an "invertcaps" encoding
converts uppercase letters to lowercase, and lowercase letters to
uppercase.  Here is a simple definition of an encoding function that
performs this transformation on an input string:

.. include:: codecs_invertcaps.py
    :literal:
    :start-after: #end_pymotw_header

In this case, the encoder and decoder are the same function (as with
``ROT-13``).

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps.py'))
.. }}}

::

    $ python codecs_invertcaps.py

    abc.DEF
    ABC.def

.. {{{end}}}

Although it is easy to understand, this implementation is not
efficient, especially for very large text strings.  Fortunately,
:mod:`codecs` includes some helper functions for creating *character
map* based codecs such as invertcaps.  A character map encoding is
made up of two dictionaries.  The *encoding map* converts character
values from the input string to byte values in the output and the
*decoding map* goes the other way.  Create your decoding map first,
and then use :func:`make_encoding_map` to convert it to an encoding
map.  The C functions :func:`charmap_encode` and
:func:`charmap_decode` use the maps to convert their input data
efficiently.

.. include:: codecs_invertcaps_charmap.py
    :literal:
    :start-after: #end_pymotw_header

Although the encoding and decoding maps for invertcaps are the same,
that may not always be the case.  :func:`make_encoding_map` detects
situations where more than one input character is encoded to the same
output byte and replaces the encoding value with ``None`` to mark the
encoding as undefined.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps_charmap.py'))
.. }}}

::

    $ python codecs_invertcaps_charmap.py

    ('ABC.def', 7)
    (u'ABC.def', 7)
    True

.. {{{end}}}

The character map encoder and decoder support all of the standard
error handling methods described earlier, so you do not need to do
any extra work to comply with that part of the API.

.. include:: codecs_invertcaps_error.py
    :literal:
    :start-after: #end_pymotw_header

Because the Unicode code point for ``π`` is not in the encoding map,
the strict error handling mode raises an exception.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps_error.py', ignore_error=True, break_lines_at=69))
.. }}}

::

    $ python codecs_invertcaps_error.py

    ignore : ('PI: ', 5)
    replace: ('PI: ?', 5)
    strict : 'charmap' codec can't encode character u'\u03c0' in
    position 4: character maps to <undefined>

.. {{{end}}}

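For reference, here is one way the maps might be constructed with the
helpers named above.  This sketch covers only the ASCII letters and
leaves every other byte value alone:

::

    # Build the invertcaps decoding map from an identity mapping,
    # then derive the encoding map from it.
    import codecs
    import string

    decoding_map = codecs.make_identity_dict(range(256))
    for lower, upper in zip(string.ascii_lowercase,
                            string.ascii_uppercase):
        decoding_map[ord(lower)] = ord(upper)
        decoding_map[ord(upper)] = ord(lower)

    encoding_map = codecs.make_encoding_map(decoding_map)

    print codecs.charmap_encode(u'abc.DEF', 'strict', encoding_map)
    # -> ('ABC.def', 7)
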
Once the encoding and decoding maps are defined, you need to set up a
few additional classes and register the encoding.  :func:`register`
adds a search function to the registry so that when a user wants to
use your encoding :mod:`codecs` can locate it.  The search function
must take a single string argument with the name of the encoding, and
return a :class:`CodecInfo` object if it knows the encoding, or
``None`` if it does not.

.. include:: codecs_register.py
    :literal:
    :start-after: #end_pymotw_header

You can register multiple search functions, and each will be called
in turn until one returns a :class:`CodecInfo` or the list is
exhausted.  The internal search function registered by :mod:`codecs`
knows how to load the standard codecs such as UTF-8 from
:mod:`encodings`, so those names will never be passed to your search
function.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_register.py'))
.. }}}

::

    $ python codecs_register.py

    UTF-8: <codecs.CodecInfo object for encoding utf-8 at 0x...>
    search1: Searching for: no-such-encoding
    search2: Searching for: no-such-encoding
    ERROR: unknown encoding: no-such-encoding

.. {{{end}}}

The :class:`CodecInfo` instance returned by the search function tells
:mod:`codecs` how to encode and decode using all of the different
mechanisms supported: stateless, incremental, and stream.
:mod:`codecs` includes base classes that make setting up a character
map encoding easy.  This example puts all of the pieces together to
register a search function that returns a :class:`CodecInfo` instance
configured for the invertcaps codec.

.. include:: codecs_invertcaps_register.py
    :literal:
    :start-after: #end_pymotw_header

The stateless encoder/decoder base class is :class:`Codec`.  Override
:func:`encode` and :func:`decode` with your implementation (in this
case, calling :func:`charmap_encode` and :func:`charmap_decode`
respectively).  Each method must return a tuple containing the
transformed data and the number of input bytes or characters
consumed.  Conveniently, :func:`charmap_encode` and
:func:`charmap_decode` already return that information.

:class:`IncrementalEncoder` and :class:`IncrementalDecoder` serve as
base classes for the incremental interfaces.  The :func:`encode` and
:func:`decode` methods of the incremental classes are defined in such
a way that they only return the actual transformed data.  Any
information about buffering is maintained as internal state.  The
invertcaps encoding does not need to buffer data (it uses a
one-to-one mapping).  For encodings that produce a different amount
of output depending on the data being processed, such as compression
algorithms, :class:`BufferedIncrementalEncoder` and
:class:`BufferedIncrementalDecoder` are more appropriate base
classes, since they manage the unprocessed portion of the input for
you.

:class:`StreamReader` and :class:`StreamWriter` need :func:`encode`
and :func:`decode` methods, too, and since they are expected to
return the same value as the version from :class:`Codec` you can use
multiple inheritance for the implementation.

.. {{{cog
.. cog.out(run_script(cog.inFile, 'codecs_invertcaps_register.py'))
.. }}}

::

    $ python codecs_invertcaps_register.py

    Encoder converted "abc.DEF" to "ABC.def", consuming 7 characters
    StreamWriter for stdout: ABC.def
    IncrementalDecoder converted "ABC.def" to "abc.DEF"

.. {{{end}}}

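To make the registration step concrete, here is a stripped-down
sketch of a search function.  The names ``invertcaps_encode`` and
``invertcaps_decode`` stand in for the stateless functions described
above:

::

    # Minimal registration: a search function that recognizes one
    # name and returns a CodecInfo built from stateless callables.
    import codecs

    def search(name):
        if name != 'invertcaps':
            return None
        return codecs.CodecInfo(
            name='invertcaps',
            encode=invertcaps_encode,   # assumed defined as above
            decode=invertcaps_decode,
            )

    codecs.register(search)

    # Once registered, the codec works by name:
    # u'abc.DEF'.encode('invertcaps') -> 'ABC.def'
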
.. seealso::

    :mod:`codecs`
        The standard library documentation for this module.

    :mod:`locale`
        Accessing and managing the localization-based configuration
        settings and behaviors.

    :mod:`io`
        The :mod:`io` module includes file and stream wrappers that
        handle encoding and decoding, too.

    :mod:`SocketServer`
        For a more detailed example of an echo server, see the
        :mod:`SocketServer` module.

    :mod:`encodings`
        Package in the standard library containing the
        encoder/decoder implementations provided by Python.

    `Unicode HOWTO`_
        The official guide for using Unicode with Python 2.x.

    *Python Unicode Objects*
        Fredrik Lundh's article about using non-ASCII character sets
        in Python 2.0.

    *How to Use UTF-8 with Python*
        Evan Jones' quick guide to working with Unicode, including
        XML data and the Byte-Order Marker.

    *On the Goodness of Unicode*
        Introduction to internationalization and Unicode by Tim Bray.

    *On Character Strings*
        A look at the history of string processing in programming
        languages, by Tim Bray.

    *Characters vs. Bytes*
        Part one of Tim Bray's "essay on modern character string
        processing for computer programmers."  This installment
        covers in-memory representation of text in formats other than
        ASCII bytes.

    *The Absolute Minimum Every Software Developer Absolutely,
    Positively Must Know About Unicode and Character Sets (No
    Excuses!)*
        An introduction to Unicode by Joel Spolsky.

    *Endianness*
        Explanation of endianness in Wikipedia.

.. _Unicode HOWTO: http://docs.python.org/howto/unicode