Unix diagnostic tools for DBAs

When a server application I am supporting starts to misbehave, there are a few tools I reach for to perform a quick diagnosis. Some of these are little known to DBAs who usually work within the database, so here is a quick overview of some of them.

In the first example, a Shareplex process appears to have hung. Killing it with kill -9 and restarting it just results in it hanging again (I don’t do that myself, but that is the first thing most people would try). Instead let’s see what it is doing. First find the process ID, then strace it:

$ strace -p 14010
Process 14010 attached - interrupt to quit
[ Process PID=14010 runs in 32 bit mode. ]
fcntl64(7, 0xe /* F_??? */

This tells you a lot if you know how to interpret it. It is waiting for the completion of a fcntl call, on file descriptor 7, to do operation 0xE (14 in hex). We can see what is connected to the file descriptor:

$ ls -lh /proc/14010/fd/7
lrwx------ 1 oracle dba 64 May 27 12:17 /proc/14010/fd/7 -> /shareplex/var/data/statusdb.lck

And what is 0xE in fcntl.h (which can be found with the locate command on most systems)?

#define F_SETLKW64      14      /* Set record locking info (blocking).  */

So we are waiting forever to lock a file, which another process already has locked. The way to find out who is with the fuser command:

$ fuser /shareplex/var/data/statusdb.lck
$

No-one… That’s strange. Well not really: this is an NFS volume, so if there is a lock, yet no process locking it on the Unix side, then it must be a stale lock on the NetApp. I had a sysadmin clear it, and problem solved. Not a Shareplex problem after all! Another tool that works well with strace is lsof, which lists open files. Let’s say I have another hung process, and I strace it and see:

$ strace -p 23771
Process 23771 attached - interrupt to quit
read(11,

I look to see what is on FD 11:

$ ls -lh /proc/23771/fd/11
lr-x------  1 oracle dba 64 May 27 13:29 /proc/23771/fd/11 -> pipe:[17004658]

It’s not a “file” at all, but in Unix everything is a file, so:

$ lsof|grep 17004658
sqlplus   23771   oracle   11r     FIFO        0,7               17004658 pipe
oracle    23774   oracle   12w     FIFO        0,7               17004658 pipe

It is connected to Oracle on that FD, waiting to read, and now I can simply query on the Oracle side to see what is is stuck on, using the Oracle PID:

SQL> select wait_class, event from v$session where paddr in (select addr from v$process where spid=23774);
WAIT_CLASS      EVENT
--------------- -----------------------------------
Application     enq: TX - row lock contention

Another tool I often reach for is gdb, this can be used to interrogate a running process, but it is also used for finding out why a process has crashed, using bt for backtrace:

$ gdb  $SP_SYS_VARDIR/.app-modules/sp_ocap /var/crash/oracle/core
(gdb) bt
#0  0xf710e265 in OlogConv4 (x=0x0) at ../src/olog/olog.c:242
#1  0x0807f45e in ol_idx (r=0x0, x=0xffd9a8a0) at ../src/ocap/ol_p.c:2498
#2  0x08093f0e in ol_olr (r=0x0, o=0x9924278) at ../src/ocap/ol_p.c:9103
#3  0x08058147 in fetch_nonops_log_record (result=0xffd9ccfc) at ../src/ocap/sp_ocap.c:3880
#4  0x0805b0c2 in fetch_log_record (result=0xffd9ccfc) at ../src/ocap/sp_ocap.c:4996
#5  0x08060365 in main (argc=2, argv=0xffd9cd94) at ../src/ocap/sp_ocap.c:7178

It is nice to be able to tell a developer on which exact line of code their product crashed, even if you don’t have the source – I know that from personal experience of the other side!

About Gaius

Jus' a good ol' boy, never meanin' no harm
This entry was posted in Linux, Oracle, Shareplex. Bookmark the permalink.

5 Responses to Unix diagnostic tools for DBAs

  1. Vishal Gupta says:

    Nice post Guy. How are you doing? I heard that Jaco is going back to Betfair as Oracle DBA Manager.

    • Gaius says:

      Thanks 🙂 Yep, he started back at BF a couple of weeks ago. Unai has transferred to San Francisco. How’re things at BarCap?

      I guess you have seen BF all over the news over the last few weeks…

  2. Vishal Gupta says:

    I stumbled onto your blog via your linkedin profile.

    Yeah heard about Betfair CEO leaving the company. Well, there is post vacant for you then.

    Things are fine at BarCap. They are merging their Retail Bank with BarCap.

    • Gaius says:

      Heh, there are plenty of middle managers leaving according to the Grauniad, tho’ their reporting is not entirely accurate. You will see from this blog that my interests are off on a tangent from just Oracle…

      I’d heard that the govt. were planning to legally enforce separation of retail/commercial and investment banking, like the old Glass-Steagal act in the US

  3. Pingback: Extprocs in OCaml | So I decided to take my work back underground

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s