Absolute BSD - The Ultimate Guide To FreeBSD (2002).pdf
The problem here is that savecore needs to use a kernel file to build the dump image. By default, savecore uses the booting kernel. If you are booting off a different kernel after a panic, you must run savecore manually to tell it where to find the proper kernel file. Interrupt the boot during the initial countdown, and boot into single-user mode with this command:

...............................................................................................

ok boot -s

...............................................................................................

When the system gives you a command prompt, fsck your system first. After a panic, the disks are almost always dirty:

...............................................................................................

# fsck -p

...............................................................................................

(This can take several minutes on a modern (huge) disk.)

Once fsck finishes, mount the filesystem where you keep your kernel core files:

...............................................................................................

# mount /var

...............................................................................................

Finally, save your kernel core using the proper kernel file, telling savecore which kernel file to use with the -N flag. If your panicked kernel is /kernel.bad, use something like this:

...............................................................................................

# savecore -N /kernel.bad /var/crash

...............................................................................................

You can, of course, use additional savecore options like -v and -z in a manual core dump.

Using the Dump

If you're a kernel developer, this is where you stop listening to me and rely upon your own debugging experience. If you're a new systems administrator, though, you probably don't know enough about C and kernel internals to have any real hope of debugging a complicated kernel issue. As such, we'll focus on extracting enough information to give a developer a good shot at identifying the problem.

If you look at /var/crash after a dumped panic, you'll see the files kernel.0 and vmcore.0. (Each subsequent crash dump will get a consecutively higher number, such as kernel.1 and vmcore.1.) The vmcore.0 file is the actual memory dump, while the kernel.0 file is a copy of the crashed kernel. The kernel.0 file isn't useful for what we're doing, but keep it just in case. The vmcore.0 file is vital.

Once you actually have a crash, you might copy your debugging kernel to /var/crash/kernel.debug.0 to keep dumps in sync with their kernels.
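As a small illustration of keeping dumps and kernels paired, the sketch below derives the debug-kernel file name that belongs to a given memory dump. It assumes the kernel.debug.N naming suggested above; the function name debug_kernel_for is hypothetical and not part of FreeBSD.

```shell
#!/bin/sh
# Print the debug-kernel file name that pairs with a numbered vmcore file.
# Assumes the kernel.debug.N convention from the text; debug_kernel_for
# is a hypothetical helper name.
debug_kernel_for() {
    n=${1##*.}                      # text after the last dot: the dump number
    case $n in
        ''|*[!0-9]*)
            echo "not a numbered vmcore file: $1" >&2
            return 1
            ;;
    esac
    echo "$(dirname "$1")/kernel.debug.$n"
}
```

For example, debug_kernel_for /var/crash/vmcore.0 prints /var/crash/kernel.debug.0, which can then serve as the destination when you copy your debugging kernel.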


Note The rest of this process is an excellent opportunity to use script(1).

Now start the gdb debugger. Gdb takes three arguments: a -k to configure the debugger appropriately for kernel work, the name of a file containing the kernel with symbols, and the name of the memory dump:

...............................................................................................

# gdb -k kernel.debug.0 vmcore.0

...............................................................................................

Once you do that, gdb will spit out its copyright information, the panic message, and a copy of the memory-dumping process. We've seen an example of a panic earlier, so I won't repeat it now; what's new is the debugger prompt you get back at the end of all this:

...............................................................................................

(kgdb)

...............................................................................................

You've now gotten further than any number of people who have system panics. Pat yourself on the back. To find out exactly where the panic happened, type where and hit ENTER.

...............................................................................................

(kgdb) where
#0 dumpsys () at ../../../kern/kern_shutdown.c:505
#1 0xc0143119 in db_fncall (dummy1=0, dummy2=0, dummy3=0, dummy4=0xe0b749a4 " \0048\200%")
    at ../../../ddb/db_command.c:551
#2 0xc0142f33 in db_command (last_cmdp=0xc0313724, cmd_table=0xc0313544, aux_cmd_tablep=0xc030df2c, aux_cmd_tablep_end=0xc030df30)
    at ../../../ddb/db_command.c:348
#3 0xc0142fff in db_command_loop () at ../../../ddb/db_command.c:474
#4 0xc0145393 in db_trap (type=12, code=0) at ../../../ddb/db_trap.c:72
#5 0xc02ad0f6 in kdb_trap (type=12, code=0, regs=0xe0b74af4)
    at ../../../i386/i386/db_interface.c:161
#6 0xc02ba004 in trap_fatal (frame=0xe0b74af4, eva=40) at ../../../i386/i386/trap.c:846
#7 0xc02b9d71 in trap_pfault (frame=0xe0b74af4, usermode=0, eva=40) at ../../../i386/i386/trap.c:765
#8 0xc02b9907 in trap (frame={tf_fs = 24, tf_es = 16, tf_ds = 16, tf_edi = 0, tf_esi = 0,
      tf_ebp = -524858548, tf_isp = -524858592, tf_ebx = -525288192, tf_edx = 0,
      tf_ecx = 1000000000, tf_eax = 0, tf_trapno = 12, tf_err = 0, tf_eip = -1071645917,
      tf_cs = 8, tf_eflags = 66182, tf_esp = -1070136512, tf_ss = 0})
    at ../../../i386/i386/trap.c:433
#9 0xc01ffb23 in vcount (vp=0xe0b0bd00) at ../../../kern/vfs_subr.c:2301
#10 0xc01a5e58 in spec_close (ap=0xe0b74b94)
    at ../../../fs/specfs/spec_vnops.c:591
#11 0xc01a55f1 in spec_vnoperate (ap=0xe0b74b94) at ../../../fs/specfs/spec_vnops.c:121
#12 0xc0207454 in vn_close (vp=0xe0b0bd00, flags=3, cred=0xc32cce00, td=0xe0a8d360) at vnode_if.h:183
#13 0xc0207fab in vn_closefile (fp=0xc3369080, td=0xe0a8d360) at ../../../kern/vfs_vnops.c:757
#14 0xc01b1d50 in fdrop_locked (fp=0xc3369080, td=0xe0a8d360) at ../../../sys/file.h:230
#15 0xc01b155a in fdrop (fp=0xc3369080, td=0xe0a8d360) at ../../../kern/kern_descrip.c:1538
#16 0xc01b152d in closef (fp=0xc3369080, td=0xe0a8d360) at ../../../kern/kern_descrip.c:1524
#17 0xc01b114e in fdfree (td=0xe0a8d360) at ../../../kern/kern_descrip.c:1345
#18 0xc01b5173 in exit1 (td=0xe0a8d360, rv=256)
    at ../../../kern/kern_exit.c:199
#19 0xc01b4ec2 in sys_exit (td=0xe0a8d360, uap=0xe0b74d20) at ../../../kern/kern_exit.c:109
#20 0xc02ba2b7 in syscall (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 135227560,
      tf_esi = 0, tf_ebp = -1077941020, tf_isp = -524857996, tf_ebx = -1, tf_edx = 135044144,
      tf_ecx = -1077942116, tf_eax = 1, tf_trapno = 12, tf_err = 2, tf_eip = 134865696,
      tf_cs = 31, tf_eflags = 663, tf_esp = -1077941064, tf_ss = 47})
    at ../../../i386/i386/trap.c:1049
#21 0xc02ae06d in syscall_with_err_pushed ()
#22 0x80503a5 in ?? ()
#23 0x807024a in ?? ()
#24 0xbfbfffb4 in ?? ()
#25 0x807daaf in ?? ()
#26 0x807d6eb in ?? ()
#27 0x80630c1 in ?? ()
#28 0x8062fed in ?? ()
#29 0x805ea4c in ?? ()
#30 0x8065949 in ?? ()
#31 0x806544d in ?? ()
#32 0x806dc17 in ?? ()
#33 0x80616b7 in ?? ()
#34 0x80613f0 in ?? ()
#35 0x8048135 in ?? ()
(kgdb)

 

...............................................................................................

Whoa! This is some pretty intense stuff. If you copied this and the output of uname -a into an email and sent it to hackers@FreeBSD.org, various developers would take note and help you out. They'd probably write you back and tell you other things to type at the kgdb prompt, but you'd definitely get developer attention. You'd be well on your way to getting the problem solved, and helping the FreeBSD folks squash a bug.
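If you saved the gdb session to a file (with script(1), for instance), the two pieces of that email can be assembled in one step. The following is only a sketch: make_report is a hypothetical helper name, and the file you pass it is assumed to hold the backtrace you captured.

```shell
#!/bin/sh
# Combine the output of uname -a with a saved backtrace into one
# plain-text report on standard output.
# make_report is a hypothetical helper, not a FreeBSD tool.
make_report() {
    bt=$1
    if [ ! -r "$bt" ]; then
        echo "cannot read $bt" >&2
        return 1
    fi
    echo "System (uname -a):"
    uname -a
    echo
    echo "Backtrace (from gdb where):"
    cat "$bt"
}
```

Running make_report backtrace.txt > report.txt leaves a file you can paste into the message body.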

Advanced Kernel Debugging

If you're not familiar with programming, nobody would blame you if you stopped here, but dig we must. So, without further ado, let's see what we can learn from the debug message, and try to figure out some things to include in that first email. Without being intimate with the kernel, you can't solve the problem yourself, but you might be able to help narrow things down a little.

By gathering the information you can before sending an email, you short-circuit a round or two of email. (If you've used email support in a crisis, you know just how valuable this is!) Without being a kernel hacker, you can't know which tidbit of knowledge is most important, so you need to include everything you can glean from the output.

The first thing to realize is that the debugger backtrace lists the chain of function calls the kernel was in the middle of, most recent first. Line #1 is the last thing the kernel did before dumping the system entirely in line #0. (When someone says "before" or "after," they're almost certainly talking about chronological order and not the order things appear in the debugger.)

In a panic, the kernel will call either a function called trap or (if you have INVARIANTS in your kernel) one called panic. You'll see variants on trap and panic, such as db_trap, but you just want the plain, old unadorned trap or panic. Look through your gdb output for either of these functions. In the previous example, there's a trap in line #8. We see other types of trap on lines #4 through #7, but no plain, straightforward trap statements. These other traps are helper functions called by trap to try to figure out exactly what happened and what to do about it.
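The search just described can be sketched as a filter over a saved backtrace: find the frame whose function name is exactly trap or panic, then print the frame listed after it, which is what the kernel was doing when things went wrong. This is a rough sketch; it assumes each frame starts on its own line, as gdb prints it, and find_faulting_frame is a hypothetical name.

```shell
#!/bin/sh
# Read a gdb backtrace on standard input and print the frame that follows
# the plain "trap" or "panic" frame. Helper variants such as db_trap or
# trap_fatal are skipped because the function name must match exactly.
# find_faulting_frame is a hypothetical name, not a FreeBSD tool.
find_faulting_frame() {
    awk '
        # Frame lines look like "#8 0xc02b9907 in trap (...)" or
        # "#0 dumpsys () at ..."; continuation lines do not start with "#".
        /^#[0-9]+ / {
            if (found) { print; exit }
            name = ($3 == "in") ? $4 : $2
            if (name == "trap" || name == "panic") found = 1
        }
    '
}
```

Feeding it the example backtrace (find_faulting_frame < backtrace.txt) would print the line for frame #9.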

Whatever the kernel was doing right before line #8 triggered the panic. In line #9, we see:

...............................................................................................

#9 0xc01ffb23 in vcount (vp=0xe0b0bd00) at ../../../kern/vfs_subr.c:2301

...............................................................................................

The hex numbers don't mean much, but we see in this panic something called vcount. If you try man vcount, you'll see that vcount(9) is a standard kernel function; section 9 of the manual documents kernel interfaces, not system calls. The panic occurred while executing code that was compiled from line 2301 of the file /usr/src/sys/kern/vfs_subr.c. (All paths in these dumps should be under the kernel source directory, usually /usr/src/sys.) This gives a developer a very good idea of where to look for this problem.
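The path translation in that paragraph is mechanical, so here is a minimal sketch of it. It assumes that every leading ../ in a backtrace path simply climbs back to the kernel source directory, and it defaults to /usr/src/sys as the text suggests; kernel_src_path is a hypothetical name.

```shell
#!/bin/sh
# Turn a path as printed in a backtrace (../../../kern/vfs_subr.c) into a
# path under the kernel source directory. The default of /usr/src/sys is
# the usual location named in the text; pass a second argument to override.
# kernel_src_path is a hypothetical helper name.
kernel_src_path() {
    src=${2:-/usr/src/sys}
    rel=$1
    # Strip every leading "../" component.
    while [ "${rel#../}" != "$rel" ]; do
        rel=${rel#../}
    done
    echo "$src/$rel"
}
```

With the frame above, kernel_src_path ../../../kern/vfs_subr.c prints /usr/src/sys/kern/vfs_subr.c. A path with no ../ prefix, such as vnode_if.h in line #12, may refer to a file generated during the kernel build, so the result is only a starting point there.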

Examining Lines

Let's look at line #9 in more detail. Use the up command and the number of lines you want to move:

...............................................................................................

(kgdb) up 9

#9 0xc01ffb23 in vcount (vp=0xe0b0bd00) at ../../../kern/vfs_subr.c:2301
2301 SLIST_FOREACH(vq, &vp->v_rdev->si_hlist, v_specnext)
(kgdb)

...............................................................................................

Here we see the actual line of vfs_subr.c that was compiled into the panicking code. You don't need to know what SLIST_FOREACH is (it's a macro, by the way). Getting this far is pretty good, but there's still a little more information you can squeeze out of this dump without knowing exactly how the kernel works.
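If you want to see the lines around the one gdb printed, you can also read them straight from the source file. The sketch below prints a line with a few lines of context and marks the requested one. It assumes your source tree matches the kernel that crashed; show_context and its default of three context lines are hypothetical choices.

```shell
#!/bin/sh
# Print a source line with a few lines of context on either side.
# Usage: show_context FILE LINE [CONTEXT]
# show_context and the default CONTEXT of 3 are hypothetical choices.
show_context() {
    file=$1
    line=$2
    ctx=${3:-3}
    start=$((line - ctx))
    if [ "$start" -lt 1 ]; then start=1; fi
    end=$((line + ctx))
    # Prefix each line with its number; mark the requested line with ">".
    awk -v s="$start" -v e="$end" -v l="$line" '
        NR > e { exit }
        NR >= s { printf "%s%6d  %s\n", (NR == l ? ">" : " "), NR, $0 }
    ' "$file"
}
```

For the frame above, that would be show_context /usr/src/sys/kern/vfs_subr.c 2301.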

Examining Variables

If you have some minor programming experience, you'd probably suspect that the terms in the parentheses after SLIST_FOREACH are variables, and you'd be right. Each of those variables has a range of acceptable values, and someone familiar with the code would recognize the legitimate ones. By printing out the contents of each variable, we can jump-start the debugging process. (Tell gdb to print a variable's contents with the p command, giving the variable name as an argument.)

Let's look at the middle variable, vp:

...............................................................................................

(kgdb) p vp

$2 = (struct vnode *) 0xe0b0bd00

(kgdb)

...............................................................................................

The (struct vnode *) bit tells us that this is a pointer to a data structure. You can show its contents by putting an asterisk in front of the variable name, like so:

...............................................................................................

(kgdb) p *vp
$3 = {v_flag = 8, v_usecount = 2, v_writecount = 1, v_holdcnt = 0,
  v_id = 6985, v_mount = 0x0, v_op = 0xc2d52a00, v_freelist = {tqe_next = 0x0,
    tqe_prev = 0xe083de1c}, v_nmntvnodes = {tqe_next = 0xe0b0b700,
    tqe_prev = 0xe0b0c024}, v_cleanblkhd = {tqh_first = 0x0,
    tqh_last = 0xe0b0bd2c}, v_dirtyblkhd = {tqh_first = 0x0,
    tqh_last = 0xe0b0bd34}, v_synclist = {le_next = 0x0, le_prev = 0x0},
  v_numoutput = 0, v_type = VBAD, v_un = {vu_mountedhere = 0x0,
    vu_socket = 0x0, vu_spec = {vu_specinfo = 0x0, vu_specnext = {
        sle_next = 0x0}}, vu_fifoinfo = 0x0}, v_lastw = 0, v_cstart = 0,
  v_lasta = 0, v_clen = 0, v_object = 0x0, v_interlock = {mtx_object = {
      lo_class = 0xc0335c60, lo_name = 0xc02ef5c1 "vnode interlock",
      lo_flags = 196608, lo_list = {stqe_next = 0x0}, lo_witness = 0x0},
    mtx_lock = 4, mtx_recurse = 0, mtx_blocked = {tqh_first = 0x0,
      tqh_last = 0xe0b0bd84}, mtx_contested = {le_next = 0x0, le_prev = 0x0},
    tsp = {tv_sec = 3584, tv_nsec = 101067509},
    file = 0xc02ef50a "../../../kern/vfs_subr.c", line = 1726,
    has_trace_time = 0}, v_lock = {lk_interlock = 0xc036e320,
    lk_flags = 16777216, lk_sharecount = 0, lk_waitcount = 0,
    lk_exclusivecount = 0, lk_prio = 80, lk_wmesg = 0xc02ef5d1 "vnlock",
    lk_timo = 6, lk_lockholder = -1}, v_vnlock = 0x0, v_tag = VT_NON,
  v_data = 0x0, v_cache_src = {lh_first = 0x0}, v_cache_dst = {
    tqh_first = 0x0, tqh_last = 0xe0b0bdd8}, v_dd = 0xe0b0bd00, v_ddid = 0,
  v_pollinfo = 0x0,
(kgdb)

...............................................................................................

Note For those of you who are learning C, this is an excellent example of how it's easier to hand around a pointer than the object it references.

An interested developer can dig through this to see what's going on. Let's look at the first variable, vq, and try to get similar information from it:

...............................................................................................

(kgdb) p vq

$4 = (struct vnode *) 0x0

(kgdb)

...............................................................................................

This isn't exactly a problem, but we're stuck. A pointer equal to 0x0 is a null pointer. There are many legitimate reasons for having a null pointer, but there isn't anything in it for us to view. Feel free to try, however; you really can't hurt the dump any further by using gdb.

...............................................................................................

(kgdb) p *vq

Cannot access memory at address 0x0.

(kgdb)

...............................................................................................

You've probably heard the words "null pointer" in close proximity to the word "panic." Without digging into the kernel code, you can't assume that this particular null pointer caused the panic. In fact, in this particular panic, the null pointer is perfectly legitimate; the kernel panicked trying to decide what value to assign to this newly allocated pointer.[1]
