When a server application I am supporting starts to misbehave, there are a few tools I reach for to perform a quick diagnosis. Some of these are little known to DBAs who usually work within the database, so here is a quick overview of some of them.
In the first example, a Shareplex process appears to have hung. Killing it with kill -9
and restarting it just results in it hanging again (I don’t do that myself, but that is the first thing most people would try). Instead let’s see what it is doing. First find the process ID, then strace
it:
$ strace -p 14010 Process 14010 attached - interrupt to quit [ Process PID=14010 runs in 32 bit mode. ] fcntl64(7, 0xe /* F_??? */
This tells you a lot if you know how to interpret it. It is waiting for the completion of a fcntl
call, on file descriptor 7, to do operation 0xE (14 in hex). We can see what is connected to the file descriptor:
$ ls -lh /proc/14010/fd/7 lrwx------ 1 oracle dba 64 May 27 12:17 /proc/14010/fd/7 -> /shareplex/var/data/statusdb.lck
And what is 0xE in fcntl.h
(which can be found with the locate
command on most systems)?
#define F_SETLKW64 14 /* Set record locking info (blocking). */
So we are waiting forever to lock a file, which another process already has locked. The way to find out who is with the fuser
command:
$ fuser /shareplex/var/data/statusdb.lck $
No-one… That’s strange. Well not really: this is an NFS volume, so if there is a lock, yet no process locking it on the Unix side, then it must be a stale lock on the NetApp. I had a sysadmin clear it, and problem solved. Not a Shareplex problem after all! Another tool that works well with strace
is lsof
, which lists open files. Let’s say I have another hung process, and I strace
it and see:
$ strace -p 23771 Process 23771 attached - interrupt to quit read(11,
I look to see what is on FD 11:
$ ls -lh /proc/23771/fd/11 lr-x------ 1 oracle dba 64 May 27 13:29 /proc/23771/fd/11 -> pipe:[17004658]
It’s not a “file” at all, but in Unix everything is a file, so:
$ lsof|grep 17004658 sqlplus 23771 oracle 11r FIFO 0,7 17004658 pipe oracle 23774 oracle 12w FIFO 0,7 17004658 pipe
It is connected to Oracle on that FD, waiting to read, and now I can simply query on the Oracle side to see what is is stuck on, using the Oracle PID:
SQL> select wait_class, event from v$session where paddr in (select addr from v$process where spid=23774); WAIT_CLASS EVENT --------------- ----------------------------------- Application enq: TX - row lock contention
Another tool I often reach for is gdb
, this can be used to interrogate a running process, but it is also used for finding out why a process has crashed, using bt
for backtrace:
$ gdb $SP_SYS_VARDIR/.app-modules/sp_ocap /var/crash/oracle/core (gdb) bt #0 0xf710e265 in OlogConv4 (x=0x0) at ../src/olog/olog.c:242 #1 0x0807f45e in ol_idx (r=0x0, x=0xffd9a8a0) at ../src/ocap/ol_p.c:2498 #2 0x08093f0e in ol_olr (r=0x0, o=0x9924278) at ../src/ocap/ol_p.c:9103 #3 0x08058147 in fetch_nonops_log_record (result=0xffd9ccfc) at ../src/ocap/sp_ocap.c:3880 #4 0x0805b0c2 in fetch_log_record (result=0xffd9ccfc) at ../src/ocap/sp_ocap.c:4996 #5 0x08060365 in main (argc=2, argv=0xffd9cd94) at ../src/ocap/sp_ocap.c:7178
It is nice to be able to tell a developer on which exact line of code their product crashed, even if you don’t have the source – I know that from personal experience of the other side!
Nice post Guy. How are you doing? I heard that Jaco is going back to Betfair as Oracle DBA Manager.
Thanks 🙂 Yep, he started back at BF a couple of weeks ago. Unai has transferred to San Francisco. How’re things at BarCap?
I guess you have seen BF all over the news over the last few weeks…
I stumbled onto your blog via your linkedin profile.
Yeah heard about Betfair CEO leaving the company. Well, there is post vacant for you then.
Things are fine at BarCap. They are merging their Retail Bank with BarCap.
Heh, there are plenty of middle managers leaving according to the Grauniad, tho’ their reporting is not entirely accurate. You will see from this blog that my interests are off on a tangent from just Oracle…
I’d heard that the govt. were planning to legally enforce separation of retail/commercial and investment banking, like the old Glass-Steagal act in the US
Pingback: Extprocs in OCaml | So I decided to take my work back underground