Thursday, November 2

codecs.StreamRecoder example

The documentation for the (very useful) Python codecs module is quite terse in places. In particular, I didn't find the StreamRecoder docs especially helpful, and there are no examples. What I needed was an on-the-fly conversion from a file encoded in Latin-1 (which makes a good catch-all when you aren't sure of the encoding, since every byte value decodes to some character) into UTF-8. I was using this file as input to the database, via psycopg2's copy_from() function, which requires a file-like object. To further complicate matters, the incoming file was gzip compressed, but that's not too important for this example.

Given an input file-like object (infile), here is the incantation to recode it from Latin-1 to UTF-8.
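    import codecs

    recoder = codecs.StreamRecoder(infile,
        codecs.getencoder('utf-8'),   # encodes data returned by recoder.read()
        codecs.getdecoder('utf-8'),   # decodes data passed to recoder.write()
        codecs.getreader('latin-1'),  # decodes bytes read from infile
        codecs.getwriter('latin-1'))  # encodes data written to infile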
The resulting recoder is a file-like object which, when read, returns UTF-8 encoded strings. When written to, it accepts UTF-8 and writes the data to the underlying file in Latin-1, though in my case I was only reading from the file.
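
As a quick sanity check, here is a minimal, self-contained sketch of the same recoding applied to a gzip-compressed input, assuming Python 3. The helper name latin1_to_utf8 and the sample data are made up for illustration, and the input is built in memory rather than read from a real file. The psycopg2 copy_from() step is left out, since it needs a live database.

```python
import codecs
import gzip
import io


def latin1_to_utf8(infile):
    """Wrap a binary file-like object so that reads yield UTF-8 bytes.

    Hypothetical helper; it only packages the StreamRecoder call shown above.
    """
    return codecs.StreamRecoder(
        infile,
        codecs.getencoder('utf-8'),   # encodes data returned by read()
        codecs.getdecoder('utf-8'),   # decodes data passed to write()
        codecs.getreader('latin-1'),  # decodes bytes read from infile
        codecs.getwriter('latin-1'),  # encodes data written to infile
    )


# Made-up sample row: two accented words, encoded as Latin-1 and gzipped in memory.
text = 'caf\xe9\tna\xefve\n'
raw = text.encode('latin-1')
compressed = io.BytesIO(gzip.compress(raw))

# GzipFile decompresses on the fly; the recoder converts what it reads to UTF-8.
with gzip.GzipFile(fileobj=compressed, mode='rb') as gz:
    recoder = latin1_to_utf8(gz)
    data = recoder.read()

print(data.decode('utf-8'))
```

The bytes in data are the UTF-8 encoding of the original text, so the accented characters take two bytes each instead of one. In a real run, recoder would be the object handed to copy_from() instead of being read directly.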