Thursday, April 27

Slony 1.1.5 subscribe set + autovacuum == deadlock

Slony has been pretty helpful in our migration to PostgreSQL 8.1. Since you can run different versions of Postgres on different nodes of the same cluster, we were able to bring up 8.1 on the primary and vet it for a few days while leaving the secondary on 7.4. That way, if we ran into any showstoppers (hey, even with lots of QA, production can often throw you a curve), we could do a switchover and be back on 7.4 in a few minutes.

Well, that scenario never came to pass, and now I am upgrading the secondary to 8.1 as well. After a rather slow experience reloading the primary after its upgrade, I opted to rebuild the secondary from scratch with a fresh subscribe rather than dumping it, reloading it and letting it sync.

Everything was going swimmingly and the largest table had been copied over when suddenly the load hit a deadlock and aborted. Bummer. Troubleshooting deadlock errors is a bit like coming upon an accident scene and trying to figure out what the drivers were thinking before the crash. Only in this case the crash has been rolled back, at least one of the vehicles has disappeared and all you have is a couple of terse notes to go on.

Looking at the logs, I did see connections coming in from some other servers that I didn't expect. I decided to block all connections except local ones and those from the primary (via pg_hba.conf). Things were still trying to connect at regular intervals, but now they were failing. Exactly what they were is a bit of a mystery to me, since the secondary should only receive periodic connections at predictable times. This is when a bit of good luck came along, ironically in the form of another deadlock, this time much sooner than before. I say this was good luck because I might well have waited another hour (yeah, there's a lot of data) before it blew up.
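For anyone who wants to double-check a lockdown like this, here is a rough Python sketch that audits the text of a pg_hba.conf and lists any rule that would let in a client other than the ones you expect. The function name and the example addresses are made up for illustration, and it only understands the CIDR form of `host` lines; anything else is flagged for a manual look rather than guessed at.

```python
import ipaddress


def unexpected_hba_rules(hba_text, allowed_networks):
    """Return pg_hba.conf rules that admit clients outside allowed_networks.

    Hypothetical helper: allowed_networks is a list of CIDR strings such as
    the loopback address and the primary's address.
    """
    allowed = [ipaddress.ip_network(n) for n in allowed_networks]
    flagged = []
    for raw in hba_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if fields[0] == "local":
            continue  # Unix-socket rule, local by definition
        # Assumed shape: TYPE DATABASE USER CIDR-ADDRESS METHOD [OPTION]
        if len(fields) < 5 or "/" not in fields[3]:
            flagged.append(line)  # not in CIDR form; needs a manual look
            continue
        if fields[4] == "reject":
            continue  # a reject rule admits nobody
        try:
            net = ipaddress.ip_network(fields[3], strict=False)
        except ValueError:
            flagged.append(line)  # unparseable address; needs a manual look
            continue
        # Keep the rule only if it sits entirely inside an allowed network.
        if not any(net.version == a.version and net.subnet_of(a) for a in allowed):
            flagged.append(line)
    return flagged
```

Feed it the contents of the file plus the loopback and primary addresses; an empty list back means every remaining rule is either local, a reject, or confined to the hosts you named.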

I was scratching my head trying to figure out what was connecting. The pg logs were not particularly helpful, even though I log a good bit of detail about the clients for this purpose. It dawned on me that it was probably the autovacuum daemon making the rounds. Now, I had just recreated this same secondary a few days ago under 7.4 with autovacuum (same version of Slony) and there was no deadlock. However, I know for a fact that the old autovac config was much less aggressive than the one I have in 8.1 (especially given the key vacuum throttling available now), so I imagine it either never found anything to do, or there was some other behavioral difference in the new version that bit me.

So, long story short, I turned off autovac during the secondary subscribe and so far so good, it hasn't fallen over. It's still chugging away on one of the big-ass tables, but here's hoping it comes around.
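If you want a sanity check before kicking off a subscribe, something like the following Python sketch reads the text of postgresql.conf and tells you whether the integrated 8.1 autovacuum is switched on there. The function name is hypothetical, the parsing is deliberately simple, and it only looks at the file, not at what the running server actually has loaded.

```python
import re

# Matches "autovacuum = value" or "autovacuum value", but not
# related settings such as autovacuum_naptime.
_AUTOVACUUM = re.compile(r"^autovacuum\s*(?:=|\s)\s*'?(\w+)'?", re.IGNORECASE)
_TRUE = {"on", "true", "yes", "1"}
_FALSE = {"off", "false", "no", "0"}


def autovacuum_enabled(conf_text, default=False):
    """Return the autovacuum setting found in postgresql.conf text.

    Hypothetical helper. default=False assumes the setting is off when the
    file does not mention it; pass a different default if yours differs.
    """
    enabled = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        match = _AUTOVACUUM.match(line)
        if not match:
            continue
        value = match.group(1).lower()
        if value in _TRUE:
            enabled = True
        elif value in _FALSE:
            enabled = False
        else:
            raise ValueError("unrecognized autovacuum value: " + value)
        # Keep scanning: a later line overrides an earlier one.
    return enabled
```

A commented-out `#autovacuum = on` line is ignored, and when the setting appears more than once the last one is the one reported.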

Moral: Make sure nothing else can mess with the secondary while it subscribes. Disable outside connections and turn off autovacuum.

In truth, this whole problem is really a known bug in Slony 1.1.5. I'm told that Slony 1.2 grabs its locks early to block out everyone else at the get-go. Here's hoping that 1.2 hits prime time soon, though I suspect that, since it is a feature release, integrating it will take us longer to sort out. Nevertheless, I'm happy to see so much effort going into Slony. It's a good, reliable system, not without a gotcha or three, but well suited to its purpose.
