Thursday, August 28

Faking read() support for psycopg's copy_from()

This has been a burr in my side for a long time. If you've ever tried to use copy_from() in psycopg2 from a source other than a real file, you've probably run into this issue: copy_from() requires a file-like object that supports both readline() and read() (!) methods! This is not an issue if the source of the data is already coming from a file or pipe, but what if you are synthesizing the data line-by-line? Like, I don't know, parsing a huge honking log file and trying to insert it en masse into Postgres? Then what?

Well, I've implemented a semi-ridiculous workaround class that implements read() by buffering up a bunch of lines until it's got enough to shovel off to psycopg2, but of course, lines never exactly fill the requested number of bytes, so you've got to deal with that and you're left with a pretty damn messy, and stupid read() method. Ugh.

Reading the psycopg2 source (ain't open source grand? ;^) made me realize a couple of things. One: readline() is only called with older protocols (i.e., older postgres versions) and would never be used in my case. So really only read() was even used at all. Ok, that's interesting, but read() is the problem-child here, so that doesn't soothe the irritation.

The second thing I saw from the code was that psycopg doesn't care how many bytes read() returns at all. As long as it returns some bytes it's happy. So that made me realize that read() could just return the same thing as readline() and it should still work. I fired up vim and punched in some test code to verify:

import psycopg2

class ReadFaker:

def __init__(self, data):
self.iter = iter(data)

def readline(self, size=None):
try:
line = self.iter.next()
except StopIteration:
return ''
else:
return "%s\t%s\n" % line

read = readline

fakefile = ReadFaker([
('foo', 'bar'),
('baz', 'spam'),
('bar', 'foo'),
('foo', 'foo'),
])

db = psycopg2.connect(database='test')
c = db.cursor()
c.copy_from(fakefile, 'foo')
db.commit()

Then I created a test table with two text columns in a local database named test. Voila! It works!

Of course to be proper, my readline() method shouldn't ignore its input size argument, but that's probably more trouble then its worth practically.

Someday I think this use-case should be handled better by psycopg2, i.e., you should just be able to pass it an iterable object of equal-length sequences for each record. Then it can compose the textual representation to throw off to Postgres. There are complexities there around complex types like dates and times, but even if it only handled textual and numeric types and made you deal with the others it would be an improvement. Maybe a "spare time" project for me some day 8^)

Labels: , , ,

1 Comments:

Blogger Yevgen Antymyrov said...

Thanks for the tip about read(). I honestly started implementing it.

8:18 AM  

Post a Comment

<< Home