bz2 – bzip2 compression

Purpose: bzip2 compression
Available In: 2.3 and later

The bz2 module is an interface for the bzip2 library, used to compress data for storage or transmission. It provides three APIs:

  • “one shot” compression/decompression functions for operating on a blob of data
  • iterative compression/decompression objects for working with streams of data
  • a file-like class that supports reading and writing as with an uncompressed file

One-shot Operations in Memory

The simplest way to work with bz2 requires holding all of the data to be compressed or decompressed in memory, and then using compress() and decompress().

import bz2
import binascii

original_data = 'This is the original text.'
print 'Original     :', len(original_data), original_data

compressed = bz2.compress(original_data)
print 'Compressed   :', len(compressed), binascii.hexlify(compressed)

decompressed = bz2.decompress(compressed)
print 'Decompressed :', len(decompressed), decompressed

$ python bz2_memory.py
Original     : 26 This is the original text.
Compressed   : 62 425a683931415926535916be35a600000293804001040022e59c402000314c000111e93d434da223028cf9e73148cae0a0d6ed7f17724538509016be35a6
Decompressed : 26 This is the original text.

Notice that for short text, the compressed version can be significantly longer than the original: here 26 bytes of input become 62 bytes of output. The actual results depend on the input data, but for short strings it is interesting to observe the compression overhead.

import bz2

original_data = 'This is the original text.'

fmt = '%15s  %15s'
print fmt % ('len(data)', 'len(compressed)')
print fmt % ('-' * 15, '-' * 15)

for i in xrange(20):
    data = original_data * i
    compressed = bz2.compress(data)
    print fmt % (len(data), len(compressed)), '*' if len(data) < len(compressed) else ''

$ python bz2_lengths.py
      len(data)  len(compressed)
---------------  ---------------
              0               14 *
             26               62 *
             52               68 *
             78               70
            104               72
            130               77
            156               77
            182               73
            208               75
            234               80
            260               80
            286               81
            312               80
            338               81
            364               81
            390               76
            416               78
            442               84
            468               84
            494               87
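
How much the output shrinks depends heavily on the input. The following is a minimal sketch, not part of the original examples, that compares a highly repetitive input with random bytes from os.urandom(). The size of 1000 bytes and the variable names are assumptions chosen for illustration. It uses bytes literals and print() with a single argument so that it should run under both Python 2.6+ and Python 3.

```python
import bz2
import os

size = 1000                       # arbitrary size chosen for illustration
repetitive = b'a' * size          # highly compressible input
random_data = os.urandom(size)    # effectively incompressible input

results = {}
for label, data in [('repetitive', repetitive), ('random', random_data)]:
    compressed = bz2.compress(data)
    results[label] = len(compressed)
    # Columns: label, input length, compressed length
    print('%-10s  %5d  %5d' % (label, len(data), len(compressed)))
```

The repetitive input should come out far smaller than the random input, which bzip2 cannot meaningfully reduce.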

Working with Streams

The in-memory approach is impractical for large inputs, since it requires holding both the entire compressed and uncompressed data sets in memory at the same time. The alternative is to use BZ2Compressor and BZ2Decompressor objects to work with streams of data, so that the entire data set does not have to fit into memory.
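
Before adding sockets to the picture, the basic pattern can be sketched in isolation. This is a simplified illustration under stated assumptions: the list of chunks stands in for data arriving from a file or network, and the 16-byte slice size used when decompressing is arbitrary. The code uses bytes literals and print() with one argument so that it should work under Python 2.6+ and Python 3.

```python
import bz2

# Hypothetical input: any iterable of byte strings would work here.
chunks = [b'This is the original text. ' * 4 for _ in range(50)]

compressor = bz2.BZ2Compressor()
compressed_parts = []
for chunk in chunks:
    # compress() may return an empty string while the data is buffered
    out = compressor.compress(chunk)
    if out:
        compressed_parts.append(out)
# flush() finishes the stream and returns whatever is still buffered
compressed_parts.append(compressor.flush())
compressed = b''.join(compressed_parts)

decompressor = bz2.BZ2Decompressor()
restored_parts = []
for i in range(0, len(compressed), 16):
    # Feed the compressed stream back in small slices
    restored_parts.append(decompressor.decompress(compressed[i:i + 16]))
restored = b''.join(restored_parts)

print(restored == b''.join(chunks))
```

Forgetting the call to flush() is the usual mistake with this pattern, since most or all of the output may still be held in the compressor's buffer.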

The simple server below responds to requests consisting of filenames by writing a compressed version of the file to the socket used to communicate with the client. It has some artificial chunking in place to illustrate the buffering behavior that happens when the data passed to compress() or decompress() doesn’t result in a complete block of compressed or uncompressed output.

Warning

This implementation has obvious security implications. Do not run it on a server on the open internet or in any environment where security might be an issue.

import bz2
import logging
import SocketServer
import binascii

BLOCK_SIZE = 32

class Bz2RequestHandler(SocketServer.BaseRequestHandler):

    logger = logging.getLogger('Server')
    
    def handle(self):
        compressor = bz2.BZ2Compressor()
        
        # Find out what file the client wants
        filename = self.request.recv(1024)
        self.logger.debug('client asked for: "%s"', filename)
        
        # Send chunks of the file as they are compressed
        with open(filename, 'rb') as input:
            while True:            
                block = input.read(BLOCK_SIZE)
                if not block:
                    break
                self.logger.debug('RAW "%s"', block)
                compressed = compressor.compress(block)
                if compressed:
                    self.logger.debug('SENDING "%s"', binascii.hexlify(compressed))
                    self.request.send(compressed)
                else:
                    self.logger.debug('BUFFERING')
        
        # Send any data being buffered by the compressor
        remaining = compressor.flush()
        while remaining:
            to_send = remaining[:BLOCK_SIZE]
            remaining = remaining[BLOCK_SIZE:]
            self.logger.debug('FLUSHING "%s"', binascii.hexlify(to_send))
            self.request.send(to_send)
        return


if __name__ == '__main__':
    import socket
    import threading
    from cStringIO import StringIO

    logging.basicConfig(level=logging.DEBUG,
                        format='%(name)s: %(message)s',
                        )
    logger = logging.getLogger('Client')

    # Set up a server, running in a separate thread
    address = ('localhost', 0) # let the kernel give us a port
    server = SocketServer.TCPServer(address, Bz2RequestHandler)
    ip, port = server.server_address # find out what port we were given

    t = threading.Thread(target=server.serve_forever)
    t.setDaemon(True)
    t.start()

    # Connect to the server
    logger.info('Contacting server on %s:%s', ip, port)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((ip, port))

    # Ask for a file
    requested_file = 'lorem.txt'
    logger.debug('sending filename: "%s"', requested_file)
    len_sent = s.send(requested_file)

    # Receive a response
    buffer = StringIO()
    decompressor = bz2.BZ2Decompressor()
    while True:
        response = s.recv(BLOCK_SIZE)
        if not response:
            break
        logger.debug('READ "%s"', binascii.hexlify(response))

        # Feed each chunk to the decompressor, which buffers
        # internally until it has output to return.
        decompressed = decompressor.decompress(response)
        if decompressed:
            logger.debug('DECOMPRESSED "%s"', decompressed)
            buffer.write(decompressed)
        else:
            logger.debug('BUFFERING')

    full_response = buffer.getvalue()
    lorem = open('lorem.txt', 'rt').read()
    logger.debug('response matches file contents: %s', full_response == lorem)

    # Clean up
    s.close()
    server.socket.close()

$ python bz2_server.py
Client: Contacting server on 127.0.0.1:54092
Client: sending filename: "lorem.txt"
Server: client asked for: "lorem.txt"
Server: RAW "Lorem ipsum dolor sit amet, cons"
Server: BUFFERING
Server: RAW "ectetuer adipiscing elit. Donec
"
Server: BUFFERING
Server: RAW "egestas, enim et consectetuer ul"
Server: BUFFERING
Server: RAW "lamcorper, lectus ligula rutrum "
Server: BUFFERING
Server: RAW "leo, a
elementum elit tortor eu "
Server: BUFFERING
Server: RAW "quam. Duis tincidunt nisi ut ant"
Server: BUFFERING
Server: RAW "e. Nulla
facilisi. Sed tristique"
Server: BUFFERING
Server: RAW " eros eu libero. Pellentesque ve"
Server: BUFFERING
Server: RAW "l arcu. Vivamus
purus orci, iacu"
Server: BUFFERING
Server: RAW "lis ac, suscipit sit amet, pulvi"
Server: BUFFERING
Server: RAW "nar eu,
lacus. Praesent placerat"
Server: BUFFERING
Server: RAW " tortor sed nisl. Nunc blandit d"
Server: BUFFERING
Server: RAW "iam egestas
dui. Pellentesque ha"
Server: BUFFERING
Server: RAW "bitant morbi tristique senectus "
Server: BUFFERING
Server: RAW "et netus et
malesuada fames ac t"
Server: BUFFERING
Server: RAW "urpis egestas. Aliquam viverra f"
Server: BUFFERING
Server: RAW "ringilla
leo. Nulla feugiat augu"
Server: BUFFERING
Server: RAW "e eleifend nulla. Vivamus mauris"
Server: BUFFERING
Server: RAW ". Vivamus sed
mauris in nibh pla"
Server: BUFFERING
Server: RAW "cerat egestas. Suspendisse poten"
Server: BUFFERING
Server: RAW "ti. Mauris massa. Ut
eget velit "
Server: BUFFERING
Server: RAW "auctor tortor blandit sollicitud"
Server: BUFFERING
Server: RAW "in. Suspendisse imperdiet
justo."
Server: BUFFERING
Server: RAW "
"
Server: BUFFERING
Server: FLUSHING "425a68393141592653590fd264ff00004357800010400524074b003ff7ff0040"
Server: FLUSHING "01dd936c1834269926d4d13d232640341a986935343534f5000018d311846980"
Client: READ "425a68393141592653590fd264ff00004357800010400524074b003ff7ff0040"
Server: FLUSHING "0001299084530d35434f51ea1ea13fce3df02cb7cde200b67bb8fca353727a30"
Client: BUFFERING
Server: FLUSHING "fe67cdcdd2307c455a3964fad491e9350de1a66b9458a40876613e7575a9d2de"
Client: READ "01dd936c1834269926d4d13d232640341a986935343534f5000018d311846980"
Server: FLUSHING "db28ab492d5893b99616ebae68b8a61294a48ba5d0a6c428f59ad9eb72e0c40f"
Client: BUFFERING
Server: FLUSHING "f449c4f64c35ad8a27caa2bbd9e35214df63183393aa35919a4f1573615c6ae3"
Client: READ "0001299084530d35434f51ea1ea13fce3df02cb7cde200b67bb8fca353727a30"
Server: FLUSHING "611f18917467ad690abb4cb67a3a5f1fd36c2511d105836a0fed317be03702ba"
Client: BUFFERING
Server: FLUSHING "394984c68a595d1cc2f5219a1ada69b6d6863cf5bd925f36626046d68c3a9921"
Client: READ "fe67cdcdd2307c455a3964fad491e9350de1a66b9458a40876613e7575a9d2de"
Server: FLUSHING "3103445c9d2438d03b5a675dfdc74e3bed98e8b72dec76c923afa395eb5ce61b"
Client: BUFFERING
Server: FLUSHING "50cfc0ccaaa726b293a50edc28b551261dd09a24aba682972bc75f1fae4c4765"
Client: READ "db28ab492d5893b99616ebae68b8a61294a48ba5d0a6c428f59ad9eb72e0c40f"
Server: FLUSHING "f3b7eeea36e771e577350970dab4baf07750ccf96494df9e63a9454b7133be1d"
Client: BUFFERING
Server: FLUSHING "ee330da50a869eea59f73319b18959262860897dafdc965ac4b79944c4cc3341"
Client: READ "f449c4f64c35ad8a27caa2bbd9e35214df63183393aa35919a4f1573615c6ae3"
Server: FLUSHING "5b23816d45912c8860f40ea930646fc8adbc48040cbb6cd4fc222f8c66d58256"
Client: BUFFERING
Server: FLUSHING "d508d8eb4f43986b9203e13f8bb9229c284807e9327f80"
Client: READ "611f18917467ad690abb4cb67a3a5f1fd36c2511d105836a0fed317be03702ba"
Client: BUFFERING
Client: READ "394984c68a595d1cc2f5219a1ada69b6d6863cf5bd925f36626046d68c3a9921"
Client: BUFFERING
Client: READ "3103445c9d2438d03b5a675dfdc74e3bed98e8b72dec76c923afa395eb5ce61b"
Client: BUFFERING
Client: READ "50cfc0ccaaa726b293a50edc28b551261dd09a24aba682972bc75f1fae4c4765"
Client: BUFFERING
Client: READ "f3b7eeea36e771e577350970dab4baf07750ccf96494df9e63a9454b7133be1d"
Client: BUFFERING
Client: READ "ee330da50a869eea59f73319b18959262860897dafdc965ac4b79944c4cc3341"
Client: BUFFERING
Client: READ "5b23816d45912c8860f40ea930646fc8adbc48040cbb6cd4fc222f8c66d58256"
Client: BUFFERING
Client: READ "d508d8eb4f43986b9203e13f8bb9229c284807e9327f80"
Client: DECOMPRESSED "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo, a
elementum elit tortor eu quam. Duis tincidunt nisi ut ante. Nulla
facilisi. Sed tristique eros eu libero. Pellentesque vel arcu. Vivamus
purus orci, iaculis ac, suscipit sit amet, pulvinar eu,
lacus. Praesent placerat tortor sed nisl. Nunc blandit diam egestas
dui. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Aliquam viverra fringilla
leo. Nulla feugiat augue eleifend nulla. Vivamus mauris. Vivamus sed
mauris in nibh placerat egestas. Suspendisse potenti. Mauris massa. Ut
eget velit auctor tortor blandit sollicitudin. Suspendisse imperdiet
justo.
"
Client: response matches file contents: True

Mixed Content Streams

BZ2Decompressor can also be used in situations where compressed and uncompressed data are mixed together. Once the end of the compressed stream has been reached, the unused_data attribute contains any data that followed it and was not consumed.

import bz2

lorem = open('lorem.txt', 'rt').read()
compressed = bz2.compress(lorem)
combined = compressed + lorem

decompressor = bz2.BZ2Decompressor()
decompressed = decompressor.decompress(combined)

print 'Decompressed matches lorem:', decompressed == lorem
print 'Unused data matches lorem :', decompressor.unused_data == lorem

$ python bz2_mixed.py
Decompressed matches lorem: True
Unused data matches lorem : True
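
When the mixed data arrives in pieces rather than as one string, the reader has to notice when the compressed stream ends. The sketch below shows one possible way to do that and is not the only approach. The payload, the trailer, and the 16-byte chunk size are hypothetical values chosen for illustration. It relies on two behaviors of BZ2Decompressor: unused_data becomes non-empty when a chunk runs past the end of the stream, and decompress() raises EOFError if called again after the stream has ended.

```python
import bz2

payload = b'compressed part ' * 20
trailer = b'PLAIN TRAILER'    # hypothetical uncompressed data after the stream
combined = bz2.compress(payload) + trailer

CHUNK = 16                    # arbitrary chunk size for illustration
decompressor = bz2.BZ2Decompressor()
pieces = []
position = 0
while position < len(combined):
    chunk = combined[position:position + CHUNK]
    try:
        pieces.append(decompressor.decompress(chunk))
    except EOFError:
        # The stream ended exactly at the previous chunk boundary,
        # so this chunk is entirely trailing data.
        break
    position += CHUNK
    if decompressor.unused_data:
        # This chunk contained the end of the stream plus extra bytes.
        break

decompressed = b''.join(pieces)
leftover = decompressor.unused_data + combined[position:]

print(decompressed == payload)
print(leftover == trailer)
```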

Writing Compressed Files

BZ2File can be used to write to and read from bzip2-compressed files using the usual methods for writing and reading data. To write data into a compressed file, open the file with mode 'w' (the example below passes 'wb').

import bz2
import os

output = bz2.BZ2File('example.txt.bz2', 'wb')
try:
    output.write('Contents of the example file go here.\n')
finally:
    output.close()

os.system('file example.txt.bz2')

$ python bz2_file_write.py
example.txt.bz2: bzip2 compressed data, block size = 900k

Different compression levels can be used by passing a compresslevel argument. Valid values range from 1 to 9, inclusive. Lower values are faster and result in less compression. Higher values are slower and compress more, up to a point.

import bz2
import os

data = open('lorem.txt', 'r').read() * 1024
print 'Input contains %d bytes' % len(data)

for i in xrange(1, 10):
    filename = 'compress-level-%s.bz2' % i
    output = bz2.BZ2File(filename, 'wb', compresslevel=i)
    try:
        output.write(data)
    finally:
        output.close()
    os.system('cksum %s' % filename)

The center column of numbers in the output is the size, in bytes, of each file produced. For this input data, higher compression levels do not always pay off in decreased storage space: levels 4 and 5 produce files of the same size, as do levels 8 and 9. Results will vary for other inputs.

$ python bz2_file_compresslevel.py
3018243926 8771 compress-level-1.bz2
1942389165 4949 compress-level-2.bz2
2596054176 3708 compress-level-3.bz2
1491394456 2705 compress-level-4.bz2
1425874420 2705 compress-level-5.bz2
2232840816 2574 compress-level-6.bz2
447681641 2394 compress-level-7.bz2
3699654768 1137 compress-level-8.bz2
3103658384 1137 compress-level-9.bz2
Input contains 754688 bytes
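
The same comparison can be made without writing files, because bz2.compress() also accepts a compression level as its second argument. The following is a minimal sketch using a made-up, highly repetitive input, so the sizes it prints say little about real data. The sample text and the sizes dictionary are assumptions introduced for illustration.

```python
import bz2

# Hypothetical sample input; substitute real data for meaningful numbers.
data = b'The quick brown fox jumps over the lazy dog.\n' * 2000

sizes = {}
for level in range(1, 10):
    # The second positional argument is the compression level (1-9)
    sizes[level] = len(bz2.compress(data, level))
    print('%d  %6d' % (level, sizes[level]))
```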

A BZ2File instance also includes a writelines() method that can be used to write a sequence of strings.

import bz2
import itertools
import os

output = bz2.BZ2File('example_lines.txt.bz2', 'wb')
try:
    output.writelines(itertools.repeat('The same line, over and over.\n', 10))
finally:
    output.close()

os.system('bzcat example_lines.txt.bz2')

$ python bz2_file_writelines.py
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.

Reading Compressed Files

To read data back from a previously compressed file, open the file with mode 'r' (the example below passes 'rb').

import bz2

input_file = bz2.BZ2File('example.txt.bz2', 'rb')
try:
    print input_file.read()
finally:
    input_file.close()

This example reads the file written by bz2_file_write.py from the previous section.

$ python bz2_file_read.py
Contents of the example file go here.
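
Calling read() with no arguments loads the entire uncompressed content at once. For larger files it may be preferable to iterate over the file object, which yields one decompressed line at a time. The sketch below is an illustration under stated assumptions: it writes its own small scratch file in a temporary directory (the file name and contents are hypothetical) so that it does not depend on the earlier examples.

```python
import bz2
import os
import tempfile

# Hypothetical scratch file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), 'lines.txt.bz2')

output = bz2.BZ2File(path, 'wb')
try:
    output.writelines([b'first\n', b'second\n', b'third\n'])
finally:
    output.close()

lines = []
input_file = bz2.BZ2File(path, 'rb')
try:
    # Iterating over the file yields one decompressed line at a time
    for line in input_file:
        lines.append(line.rstrip())
finally:
    input_file.close()

print(len(lines))
```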

While reading a file, it is also possible to seek and read only part of the data.

import bz2

input_file = bz2.BZ2File('example.txt.bz2', 'rb')
try:
    print 'Entire file:'
    all_data = input_file.read()
    print all_data
    
    expected = all_data[5:15]
    
    # rewind to beginning
    input_file.seek(0)
    
    # move ahead 5 bytes
    input_file.seek(5)
    print 'Starting at position 5 for 10 bytes:'
    partial = input_file.read(10)
    print partial
    
    print
    print expected == partial
finally:
    input_file.close()

The seek() position is relative to the uncompressed data, so the caller does not even need to know that the data file is compressed.

$ python bz2_file_seek.py
Entire file:
Contents of the example file go here.

Starting at position 5 for 10 bytes:
nts of the

True
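
Because positions refer to the uncompressed data, tell() and relative seeks behave as they would on an ordinary file. The following is a minimal sketch that recreates the example file under a hypothetical name in a temporary directory, reads ten bytes, steps back, and reads them again. The standard library documentation notes that seeking in bz2 files is emulated and can be slow, so this pattern is probably best kept to small files or occasional use.

```python
import bz2
import os
import tempfile

# Hypothetical scratch file with the same contents as the earlier example
path = os.path.join(tempfile.mkdtemp(), 'seek.txt.bz2')
output = bz2.BZ2File(path, 'wb')
try:
    output.write(b'Contents of the example file go here.\n')
finally:
    output.close()

input_file = bz2.BZ2File(path, 'rb')
try:
    input_file.seek(5)
    first = input_file.read(10)
    position = input_file.tell()         # offset in the uncompressed data
    input_file.seek(-10, os.SEEK_CUR)    # step back over the bytes just read
    again = input_file.read(10)
finally:
    input_file.close()

print(position)
print(first == again)
```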

See also

bz2
The standard library documentation for this module.
bzip2.org
The home page for bzip2.
zlib
The zlib module for GNU zip compression.
gzip
A file-like interface to GNU zip compressed files.
SocketServer
Base classes for creating your own network servers.