Our project runs on Django Project engine and as with other open-source projects we rely on, we did contribute bug fixes and features to it on a number of occasions. Recently we stumbled on a quite cryptic error that was initially caused by a series of hardware failures on the primary database server.

At some point our worker processes started throwing the following exceptions in one of get_or_create() calls responsible for adding a cookie name to the database:

django.db.utils.IntegrityError: duplicate key value violates unique constraint "cookies_cookiename_name_c804f9c09c135ab_uniq"
DETAIL:  Key (name, type)=(ASP.NET_SessionId, 1) already exists.

The thing is, this type of errors should never happen in this particular method at all — getorcreate is specifically designed to first check if a record exists, and if not, only then add it. Here the first part is obviously failing and only at the second stage database blocks insertion of a duplicate.

Another symptom (which we saw later on) was this error, which gave us hint on what's wrong:

webcookies.register_site.models.Url.MultipleObjectsReturned: get() returned more than one Url -- it returned 2!

Index corruption

After a long investigation (not much on StackOverflow either), some code rewrites and simulation in raw SQL we found the root cause: the main index on that particular table was corrupted and as result returned no records even if the record was present. And because the index was there, PostgreSQL was always using it for fast SELECT to check for presence, but then denied the write into the actual table — at this is where get_or_create() was hitting the wall.

After the root cause was identified, the solution was more or less simple — REINDEX all database indexes, find out that there was one more table with duplicates (which prevents REINDEX from completing), write some Python code to find, relink and delete duplicates and successfully complete the reindex operation.

Lessons learned:

  • Always run smartmontools on your critical storage servers
  • Once you experience any stability issues, immediately enable fsync
  • PostgreSQL does all it physically can to ensure data integrity at any stage
Fully automated RESTful API is now available. Subscribe for your free trial today!