Wednesday, December 2, 2009

The Definitive Guide to POSKeyError

Trying to get PloneChix off the ground during the holidays has been a tough start so I figured we could get the blog a good head start with a much needed post for anyone who has had the hair pulling, spirit breaking chance to work with a POSKeyError. Happen upon one of these bad boys and your life is filled with countless threads, code snippits, and half finished reference articles. Here is a shot at explaining the source of the problem to start with (prevention), and giving a consolidated page of tools to tackle the beast that is POSKeyError head on (treatment).

Getting Started
If you are on this page, there is a good chance you are pooping yourself right now because the database isn't responding, things aren't happy, systems are failing. Let's take this moment to make a copy of everything in its messed up state and while that copies - breathe. Here's your worst case scenario: everything is screwed and you have to restore from backups. I've dealt with several different KeyErrors and never had to restore but that doesn't mean you won't.

In fact, if you have recent enough backups to where you can restore with minimal data loss, do it. This is not a fast process (set aside at least 3 hours, eons more if you don't have giant cojones). You will not likely fix the source of the problem but at least your dbs will be back online. If you have any scheduled cron jobs for packing the databases or taking backups make sure they are disabled since there is a good chance that one of those could re-trigger the error.

So start copying and if your db is huge go tell someone to start pulling that restore from tape. Breathe.

Types of Key Errors
In my experience, there are two types of key errors - POSKeyError and [connection] KeyError. It's hard to pinpoint which one you have sometimes because they can mask as one another but once you get into debugging you'll know right away.

In theory, POSKeyError is short for POSitional Key Error. It's likely to happen if the database gets corrupted, you blow out the file system storage space, your iscsi drive flip out, etc... It may happen when you cross reference database (more on that later) but I've only seen a more traditional KeyError in that case. Your trackback probably looks something like this*:

Traceback (most recent call last):
  File "/usr/local/lib/python/ZEO/zrpc/connection.py", line 421, in handle_request
    ret = meth(*args)
  File "/usr/local/lib/python/ZODB/FileStorage/FileStorage.py", line 625, in modifiedInVersion
    pos = self._lookup_pos(oid)
  File "/usr/local/lib/python/ZODB/FileStorage/FileStorage.py", line 514, in _lookup_pos
    raise POSKeyError(oid)
POSKeyError: 0x172024 

Notice that this is referencing the object that its looking up and the error is not happening in a connection handling phase. This type of error may or may not cause your whole site to fart, depending on where the corrupted object is.  For example, if the error is in your portal_skins/custom folder then you are screwed since aquisition is probably picking it up for every site.

A quick digression on the 0x172024 "thing". This is the oid (object id, in the zodb) of the object that is the offender. For this to be useful in any database manipulation, we will most certainly need to convert it using the p64 function in ZODB.utils. A quick docstring excerpt of two very useful functions we'll use later to convert this oid to an 8-byte string*:

def p64(v):
    """Pack an integer or long into a 8-byte string"""
def u64(v):
    """Unpack an 8-byte string into a 64-bit long integer."""

The other KeyError (which manifests as POSKeyError and just KeyError, but I'll refer to as just KeyError for clarity) usually has some sort of connection traceback info in the error message like so:

2009-12-01 16:51:18,056 ERROR @/opt/Zope-2.9/lib/python/ZODB/Connection.py/Connection.py Connection 737  : Couldn't load state for 0x01
Traceback (most recent call last):
 File "/opt/Zope-2.9/lib/python/ZODB/Connection.py", line 732, in setstate
   self._setstate(obj)
 File "/opt/Zope-2.9/lib/python/ZODB/Connection.py", line 786, in _setstate
   self._reader.setGhostState(obj, p)
 File "/opt/Zope-2.9/lib/python/ZODB/serialize.py", line 604, in setGhostState
   state = self.getState(pickle)
 File "/opt/Zope-2.9/lib/python/ZODB/serialize.py", line 597, in getState
   return unpickler.load()
 File "/opt/Zope-2.9/lib/python/ZODB/serialize.py", line 479, in _persistent_load
   return self.loaders[reference_type](self, *args)
 File "/opt/Zope-2.9/lib/python/ZODB/serialize.py", line 540, in load_multi_oid
   conn = self._conn.get_connection(database_name)
 File "/opt/Zope-2.9/lib/python/ZODB/Connection.py", line 305, in get_connection
   new_con = self._db.databases[database_name].open(
KeyError: 'mysite'

Ewe. This ones a little more mucky. It may or may not have an oid. These connection KeyErrors happen when you have multiple database mounts and one database is weak referencing another. !@#$ -> what the hell does that mean, right? Let's say you have two Plone Sites, foo and bar, both mounted onto your main db. You are messing around one day and you customize logo.jpg for foo. But you messed up and that was actually meant to be for bar so being a good lazy dev you just use plone's sweet cut and paste functionality to move logo.jpg from foo into bar. Everything is working perfect (and this is indeed "legal"). This will continue to work, until the fateful day that you pack foo and then BAMN! It's key error time.

When you cut and pasted logo.jpg from foo to bar, you didn't actually create a new object.  You created a weak reference to foo from bar.  As long as that reference wasn't garbage collected in foo, you were golden. When you run pack, it destroys references to the object. The official explanation from ZODB/cross-database-references.txt

    Garbage collection is done on a database by database basis.
    If an object on a database only has references to it from other
    databases, then the object will be garbage collected when its
    database is packed.  The cross-database references to it will be
    broken.

What is bar to do now? It can't find the logo.jpg object, and since every page in your site has a logo you are screwed.

I've seen this happen for non cut/paste reasons as well. If you get the error after a fresh restart on one zope and the others are running fine, then something happened in the mounting of your databases that messed up the stream. I don't know the exact details of why, but I've seen rapid restarts of a zope instance invoke this hella error. Do not restart the other zopes if you can avoid it and use those to start salvaging data just in case.  If you are lucky, just go into the undo at the root of the site and undo the last couple of transactions, especially if you see a suspicious addMount transaction. That might be all you need!

Verifying
Let's get to fixing then shall we! First things first, if you know you had some kind of activity on the disk that may have corrupted the data, we can easily verify by running fstest.py:

 /usr/local/bin/fstest.py /srv/zeo/zeo0/Data.fs

If nothing is corrupted, it will return nothing. Bad data will return something like

/srv/zeo/zeo0/Data.fs truncated possibly because of damaged records at 1670660850

Yuck! But at least we know what's going on. If this is indeed the case, you may want to start with @spamsch's great tutorial on dealing with transaction corruption and then come back here if that doesn't work.

To check for bad references, we can  use fsrefs.py:

 /usr/local/bin/fsrefs.py /srv/zeo/zeo0/Data.fs

Unfortunately I lost my example of what a bad ref error looks like but its similar to fstest. Most importantly it will tell you the oid of the item that is causing things to break. If you're error message has no oid in it, you will absolutely need this!

I have not tried it, but it may be worth checking out zc.zodbgc and its multi-zodb-check-refs script.

If you are certain its a [connection] key error and nothing shows up, you can try to manually search for it from within the [zopectl] debug console:

from ZODB.POSException import POSKeyError
def error_finder(folder, exception=POSKeyError, stop_on_first=None):
    """ start at the given folderish object.
    If stop_on_first is true, quit after one exception;
    otherwise, keep going through the whole tree."""
    for id, next_item in folder.objectItems():
        try:
            print id
            next_item.getId()
        except exception:
            print `exception`, "in folder",
            print  '/'.join(folder.getPhysicalpath()),     
            print "at id:", id
            if stop_on_first:
                raise "done"   # hack to break out of recursion
        else:
            # no error, recurse if it's objectManagerish
            if hasattr(next_item.aq_base, 'objectItems'):
                error_finder(next_item, exception, stop_on_first)

error_finder(app.mysite, stop_on_first=True)
Fix 'Er Up
The goal here is simple: delete the offending object. This can be tricky since its hard to delete something that doesn't exist. This can all done from the [zopectl] debug console. If you think this is a connection key error and you are working from zopectl debug, first and foremost modify your config file to get rid of any unnecessary mounts. Broken objects are better than key errors and if it allows you to recover certain parts then even better.

Once in the debug prompt we can find out which object that oid represents:

from ZODB.utils import p64
o = root._p_jar[p64(0x172024)
print o.getId()
print o.bobobase_modification_time()
print o.meta_type
print o.__ac_local_roles__

Hopefully you now know the id of the offending object and lets try to delete it

from AccessControl.SecurityManagement import newSecurityManager 
admin = app.acl_users.getUserById('admin') 
admin = admin.__of__(app.acl_users)
newSecurityManager(None, admin) 
from Testing import makerequest
req=makerequest.makerequest(app.mysite.bad_folder)
req.manage_delObjects(['bad_folder'])
import transaction
transaction.commit()

If that worked and your stuff now loads - lucky you. If all you got was another pos key error, try to delete the parent. If the deletion strategy doesn't work, the next strategy is to fill the void of that object with a fake one, and then delete it. For example:

app.mysite._setOb('bad_folder', newFolderInstance)

If neither of the above work, you are probably looking at a connection Key Error. Feel free to try what is described here, but I have no idea what that code looks like so I won't mention it. Instead let's leave the zope environment and go directly to manipulating the ZODB by hand.

In the python shell, we can create a connection to the database and manually inspect the objects. We can try manually delete there, and go back in time transaction by transaction. You are probably going to have to poke (use dir!) around so I'm just going to give some getting started code.  To manually open the db:

from ZODB import FileStorage, DB
from ZODB.utils import u64
storage = FileStorage.FileStorage('Data.fs')
 
# don't use this code if you don't have to, especially
# if you are having a connection error. You'll know because
# app won't let you list its objects without throwing
# the db_connection error 
db = DB(storage)
connection = db.open()
root = connection.root()
app = root['Application']

To view the contents of 'app', you must use list to get your keys in sanely readable form:

list(app.keys()) # returns child nodes of the app

If at any point listing the keys fails, you are getting warmer to finding the broken object. Try to delete things if possible, remembering to commit the transaction. Sometimes thats enough to get by. Also note that since you are no longer in a zope environment, all of the objects will be listed as "broken". Don't get obsessed with this - its just because you don't have the proper modules loaded. These are not your key error items and don't get obsessed with fixing them unless you need to traverse in.

Another route to try is undoing transactions one at a time. Since its perfectly explained in the docs, I won't replicate it but this can do wonders. Remember to commit frequently and take the time to check all the time.

If you are trying to find a lost object, take advantage of the iterators in zodb. For example, to iterate through the pickle of each object:

it = storage.iterator()
found = False
while not found:
    try:
        item  = it.next().next()
        data = item.data
        poo = "%s"%data
        if poo.count('broken_object_id'):
            print "OID: ", u64(item.oid)
            print "TID: ", u64(item.tid)
            maybeThisIsWhatIWant = root._p_jar[item.oid]
            print maybeThisIsWhatIWant

If your db screwed the pooch and lost a references, maybe you can use this id to map it back. At this point the options are endless and you will have to be creative so use a copy, take a risk, and jump in to coding with ZODB. Remember: the goal is to delete or restore the reference.

If something looks incorrect, there is a chance it is since this material exists only in the bermuda triangle so please leave a comment and I'll make sure to update or clarify.

Prevention
With multi-reference, the best thing to do is to NOT cut and paste between mounts. I know its easier said than done. Just remember, copy-paste-delete. In a followup discussion, @davisagli mentioned that in ZODB 3.9 you can disallow cross-references by setting the allow-implicit-cross-referenced. If you're looking into mounting and you have a newer version of plone I would enable that by default.

[UPDATE] I have reason to believe that the key error from mounting can come from a multi zope setup where the config files are not lined up with each other. I've been able to recreate this on my local server where 2 zopes share zodbs, including the main db, but where one zope has another mount configured. Going to the manage screen and viewing the mount form in each instance show different results but there is also a transaction that occurs in the background, which you can see in the undo log. After just listing the mount points in each instance, the error triggers on the other instance. The simple fix to this is to undo the last transaction, which is labeled "addMountForm". I think anyways...

Happy Hunting Ploners!

 

Other Useful Links

* No DBs were harmed in the generation of these traces. They are 100% real and, yes, you are allowed to feel empathetic to my pain of having to deal with all of them. These are just a few of my favorites. Some day I'll put together a "best-of" compilation, under the influence of course.
* I don't think this applies to Python 3 because it handles unicode differently. I think @davisagli mentioned that but I can't recall at the moment why.

6 comments:

  1. Really great article.
    You can make it more known in the Plone Community by getting listed on http://planet.plone.org

    - WouterVH

    ReplyDelete
  2. The $64M question of course is how to prevent cut'n'paste across mounted storages. This is a loaded gun lying around in full view in every Zope with mounted storages.

    ReplyDelete
  3. @Jean as far as I know, the new versions of zeo have an option to disallow in the config file. Here is the changeset - not sure what the release version is though: https://mail.zope.org/pipermail/zodb-checkins/2009-April/011762.html

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
  4. Great article! The advice on how to disallow cross database referencing with mounted storage has saved me a lot of trouble. This was the source of 99% of our (still rare) POSKeyErrors and we have not had any on content since then.

    I did get an error on users recently, however. Of note, the error_finder method above will not find these since they are not "content" in some folderish container. My problem was I would get this error any time it tried to access data from portal_memberdata (acl_users itself seemed fine), such as on the users overview controlpanel. I could find which users were bad with this:
    acl=p.acl_users
    bads=[]
    rolemakers = acl.plugins.listPlugins(IRolesPlugin)
    for user in acl.getUserIds():
    _user=p.portal_membership.getMemberById(user)
    for rolemaker_id, rolemaker in rolemakers:
    try:
    rolemaker.getRolesForPrincipal(_user)
    except:
    bads.append(user)

    Deleting the users from acl_users wasn't enough, because on recreation it picked up the portal_memberdata data that was corrupted. To delete that, I just ran its pruneMemberDataContents method in a debugger after deleting from acl_users.

    Unfortunately, I am clueless as to the cause.

    ReplyDelete
  5. Awesome Artical Really i have searching this type of valuable information From a lot of days i found satisfaction when Read your blog Thanks for giving this type blog

    ReplyDelete