Cyrus-imapd lock_flock patch to work around .seen file locking problems

A while back I emailed the cyrus mailing list about a locking problem we were having with Cyrus 2.0.16 on RedHat 7.0 (kernel 2.2.19-7.0.8).  I spent a lot of time exploring blocked imapd processes with gdb and finally came to the conclusion that this appeared to be a kernel or glibc problem, not cyrus, because processes were blocking trying to get a lock on .seen files that they already had locked.  I even added code to attempt to unlock the files first before trying to lock and it did not resolve the problem.   Finally, I came up with this hack.

Warning This patch has only been tested by me on RedHat 7.0 Kernel 2.2.19-7.0.8enterprise, although it ought to work on any platform that uses flock.   Other folks have installed this on 2.1.3 and I have included an untested patch that should work for this.   Thanks to Thomas Jarosch for tips on improving the patching process.

Cyrus 2.0.16 Users:

The revised source is available at http://servercc.oakton.edu/~jwade/cyrus/cyrus-imapd-2.0.16/lock_flock.c
This should replace cyrus-imapd-2.0.16/lib/lock_flock.c

A diff (for patching) is available at http://servercc.oakton.edu/~jwade/cyrus/cyrus-imapd-2.0.16/cyrus-imapd-2.0.16-flock.patch

To patch, copy this to your cyrus-imapd-2.0.16 source directory and

patch -p1 < cyrus-imapd-2.0.16-flock.patch

Cyrus 2.1.3 Users:  (untested by me)

The revised source is available at http://servercc.oakton.edu/~jwade/cyrus/cyrus-imapd-2.1.3/lock_flock.c
This should replace cyrus-imapd-2.1.3/lib/lock_flock.c

A diff (for patching) is available at http://servercc.oakton.edu/~jwade/cyrus/cyrus-imapd-2.1.3/cyrus-imapd-2.1.3-flock.patch

To patch, copy this to your cyrus-imapd-2.1.3 source directory and

patch -p1 < cyrus-imapd-2.1.3-flock.patch

How it works

Basically this modifies the lock_reopen function to call flock() in a non-blocking manner with a timeout parameter.   It will try once, try again immediately then sleep with a quadraticaly increasing delay until it reaches some maximum delay as #defined by MAXTIME,    With max time set to 99 seconds, this gives a total delay of 1+4+9+16+25+36+49+64+81 = 285 seconds (4.75 minutes)     So that you can see what it is doing, if you have sysloging turned on at the debug level, it will log a "lock_reopen-blocked: sleeping" message to syslog.  If 285 seconds elapses and it can't get a lock, it returns an error and the imapd process that is hung up exits and puts and error in the syslog.

Many of the folks experiencing the problem have reduced maxtime to values as low as 17  (0+1+4+9+16=30 second total delay)  and have been happy,  I would recommend starting with a higher value and checking to see how long legitimate locks actually take (by reviewing the sleeping messages in your syslog.)

This happens a couple of times a week for us, and it has ceased to be an issue (we only know about it from the logs.)   That being said, it is still a hack.

What it fixes (and how to tell if it might work for you)

To see if this fix is designed to resolve your problem, see below.

The symptoms we saw were not tied to the mailboxes.db or any of the Berkeley db databases.  Instead, I would have processes that would lock
up trying to access a user's mailbox.   Subsequent attempts to access the mailbox would block waiting for the first process to release the
locks on the various header and seen files in the mailbox.   Finally, incoming mail would die because all of the lmtpd processes would get
blocked trying to deliver to the locked mailbox.

An easy way to see if you are experiencing this is to use gdb to debug one of the processes and then do a bt (backtrace) if you see that the
process is in the middle of an flock call.   To be absolutely sure follow the complete steps below.

First, check to see which processes are accessing the user's seen file:
[root@borg /root]# lsof | grep jwade.seen

Then use gdb to get a backtrace of each of these processes.    Here is a
sample gdb backtrace.   Just substitute the process id of your suspect
imapd process in place of 24876

[root@borg /root]#gdb /usr/cyrus/bin/imapd 24876
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "i386-redhat-linux"...
/root/24876: No such file or directory.
Attaching to program: /usr/cyrus/bin/imapd, Pid 24876
Reading symbols from /usr/local/lib/libsasl.so.7...done.
Loaded symbols for /usr/local/lib/libsasl.so.7
<snip, lots of symbol loading cut out here>
Reading symbols from /lib/libnss_nis.so.2...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0x4028a8a1 in __flock () from /lib/libc.so.6
(gdb) bt
#0  0x4028a8a1 in __flock () from /lib/libc.so.6
#1  0x8079caf in lock_reopen (fd=22,
    filename=0x810fa08 "/var/imap/user/p/jwade.seen", sbuf=0xbfffebe0,
    failaction=0xbfffebdc) at lock_flock.c:84
#2  0x807b8c7 in starttxn_or_refetch (db=0x8111298, mytid=0x8111288)
    at cyrusdb_flat.c:211
#3  0x807ba27 in myfetch (db=0x8111298, key=0x8104328 "0e461b1e389ccb86",
    keylen=16, data=0xbfffecb4, datalen=0xbfffecb8, mytid=0x8111288)
    at cyrusdb_flat.c:270
#4  0x806f943 in seen_readit (seendb=0x8111278, lastreadptr=0xbfffed38,
    lastuidptr=0xbfffed3c, lastchangeptr=0xbfffed40,seenuidsptr=0xbfffed44,
    rw=1) at seen_db.c:268
#5  0x806fb01 in seen_lockread (seendb=0x8111278,lastreadptr=0xbfffed38,
    lastuidptr=0xbfffed3c, lastchangeptr=0xbfffed40,seenuidsptr=0xbfffed44)
    at seen_db.c:332
#6  0x8060daa in append_addseen (mailbox=0xbfffedb0,
    userid=0xbffff05c "jwade", msgrange=0x810f930 "271") at append.c:896
#7  0x805f929 in append_commit (as=0xbfffedb0, uidvalidity=0xbfffeda4,
    start=0xbfffeda8, num=0xbfffedac) at append.c:240
#8  0x8059a2f in index_copy (mailbox=0x80ffe20, sequence=0x810fbf8 "11352",
    usinguid=1, name=0xbffff2c0 "user.jwade.FaMiLy", copyuidp=0xbffff2bc)
    at index.c:1226
#9  0x80531cf in cmd_copy (tag=0x810fae8 "47", sequence=0x810fbf8 "11352",
    name=0x810fc68 "INBOX.FaMiLy", usinguid=1) at imapd.c:3192
#10 0x804d85c in cmdloop () at imapd.c:791
#11 0x804cf34 in service_main (argc=1, argv=0xbffffe14, envp=0xbffffe1c)
    at imapd.c:546
#12 0x804bce6 in main (argc=1, argv=0xbffffe14, envp=0xbffffe1c)
    at service.c:285
#13 0x401d2f31 in __libc_start_main (main=0x804b8f8 <main>, argc=1,
    ubp_av=0xbffffe14, init=0x804a9e4 <_init>, fini=0x807e84c <_fini>,
    rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbffffe0c)
    at ../sysdeps/generic/libc-start.c:129
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /usr/cyrus/bin/imapd, Pid 24876

The key thing here is that this process was in the lock_reopen function trying to lock the .seen file.  (see the section highlighted in red)

 If this is your problem, my patch might help.   If you use this successfully (or if you have any problems), send me an email: jwade@oakton.edu

Enjoy,

John Wade
Director of Systems and Network Services
Oakton Community College
email:  jwade@oakton.edu
http://servercc.oakton.edu/~jwade/