Unix crash analysis cookbook

Digital Internal Use Only

this page is under construction

This is not a step-by-step guide to crash analysis. It's just a bag of tricks that I've found useful. This page changes as I learn new tricks. If you know any useful crash analysis techniques, please let me know and I can add them to this page. Thanks.

This page is maintained by Leon Strauss
Comments welcome

----------------------------------------------------------------------------
contents

* getting started
  o overview
  o full and partial dumps
  o documentation
  o header files
  o version numbers
  o address notation
  o memory map
* dbx
  o setting up dbx
  o displaying data
  o identifying datastructures
* general information about the system
  o swap space
  o memory usage
* debugging
  o processes and threads
  o kernel memory fault
  o simple_lock panic
  o semaphores
  o shared memory
  o message information
  o mount table
  o cam peripheral devices
  o i/o configuration
  o flck file locking information
  o files inodes vnodes
  o unexplored stuff
* kdbx
* forced crash
* examining sources
* other tools
* miscellaneous hints

----------------------------------------------------------------------------
overview

* crashes
  The first step is to identify the thread of code that was executing when the system crashed. Then identify the routines on the stack that led to this point. In a multi-cpu system, there may be more than one running thread. Determining why the crash occurred requires detailed analysis of the code, and some navigation through kernel data structures. With any luck, someone has already done this for you and generated a patch. If not, you'll have to gather as much information as you can, and open a CLD. The hard part is knowing what to look for, or finding what you know you're looking for. Either way, you're a long way from home ...
  See the kernel memory fault example

* hangs
  Analysing forced crashes from system hangs is similar to analysing crashes, but may require analysis of several threads. Start with the processes and threads in the RUN state, to see what was executing at the time of the hang. Then proceed as for a crash.

----------------------------------------------------------------------------
full and partial dumps

* By default, partial dumps are enabled.
* To enable full dumps ...
  o either   >>> set boot_osflags d
  o or       (dbx) a partial_dump = 0
* To check whether partial or full dumps are enabled
  (dbx) p partial_dump
  0 means full dumps
  1 means partial dumps

----------------------------------------------------------------------------
documentation

* Kernel Debugging Guide is more complete than ...
* Kernel Debugging Manual

header files

* /usr/include/sys/proc.h                 p_ fields
* /usr/include/sys/user.h                 uu_ fields, utask, uthread
* /usr/include/kern/task.h
* /usr/include/kern/thread.h
* /usr/include/machine/pcb.h
* /usr/sys/include/mach/kern_return.h     KERN_ kernel return codes
* /sys/conf/param.c

----------------------------------------------------------------------------
version numbers

* from the shell   uname -a
* from dbx         p utsname
* Digital UNIX version history with pointers to other pages (some restrictions apply)
* UNIX Version to Revision conversion table with Patch IDs

----------------------------------------------------------------------------
address notation

Notation  Address Type  Replaces  Example
v         virtual       ffffffff  v0xNNNNNNNN = 0xffffffffNNNNNNNN
e         virtual       fffffffe  e0xNNNNNNNN = 0xfffffffeNNNNNNNN
k         kseg          fffffc00  k0xNNNNNNNN = 0xfffffc00NNNNNNNN
u         user space    00000000  u0xNNNNNNNN = 0x00000000NNNNNNNN
?         -             -         ?0xNNNNNNNN   Unrecognized or random type

----------------------------------------------------------------------------
memory map

                      -------------------
ffff ffff ffff ffff   v0xffffffff            virtual
                      reserved for kernel
ffff fc00 0000 0000   k0x00000000            kseg
                      -------------------
ffff fbff ffff ffff
                      not accessible
0000 0400 0000 0000
                      -------------------
0000 03ff ffff ffff   dynamic loader
                      shared libraries
                      mappable by program    e.g. SysV shared libraries
                      heap (grows up)        sbrk and break
                      .bss                   uninitialised data with size > value specified by -G option
                      .sbss                  data <= -G (default = 8)
                      .got                   global offset table   <- $gp
                      .sdata                 small data
                      .data                  data initialised for data section
                      .rdata                 data initialised for rdata section
0000 0001 4000 0000
                      .text                  (program code)
0000 0001 2000 0000
                      -------------------
0000 0001 1fff ffff   stack (grows towards 0)   <- $sp
                      mappable by program       e.g. SysV shared libraries
0000 0000 0001 0000   u0x00010000            user space
                      -------------------
0000 0000 0000 ffff
                      not accessible
0000 0000 0000 0000
                      -------------------

----------------------------------------------------------------------------
dbx

setting up dbx

set $hexints=1      display integers in hex
set $page=0         suppress paging of long displays
set $hexstrings=1   gets past structures with uninitialised string pointers
set $printdata=1    display register contents when displaying instructions

saving your dbx session in a file

There are two ways to do this.

* run dbx from inside a script(1) session from the shell
  This saves all your output (including shell sessions started from the dbx sh command) and all your input exactly as you typed it.
* run the following commands from the (dbx) prompt
  set $rimode=1   save recorded input and output to the same file
  ri FILENAME     save input dbx commands to FILENAME
  ro FILENAME     save output from dbx to FILENAME
  status          show the status of recorded input and/or output
  Abbreviated commands will be expanded and dbx shell sessions will not be saved.
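The address notation table earlier on this page boils down to simple arithmetic: each one-letter prefix stands for the upper 32 bits of a 64-bit address. The following is a minimal Python sketch of that expansion, handy when comparing shorthand addresses with the full addresses that dbx prints. The names expand_address and PREFIX_HIGH_BITS are made up for illustration and are not part of dbx or kdbx.

```python
# Upper 32 bits implied by each shorthand prefix, taken from the
# address notation table. These names are illustrative only; they
# are not dbx or kdbx identifiers.
PREFIX_HIGH_BITS = {
    "v": 0xFFFFFFFF,  # virtual
    "e": 0xFFFFFFFE,  # virtual
    "k": 0xFFFFFC00,  # kseg
    "u": 0x00000000,  # user space
}


def expand_address(short):
    """Expand shorthand such as 'k0x01279300' to a full 64-bit address."""
    prefix, low = short[0], short[1:]
    if prefix not in PREFIX_HIGH_BITS:
        # Covers the '?' notation (unrecognized or random type).
        raise ValueError("unrecognized address type: " + short)
    low_bits = int(low, 16)  # base 16 accepts the leading '0x'
    if low_bits > 0xFFFFFFFF:
        raise ValueError("low part does not fit in 32 bits: " + short)
    return (PREFIX_HIGH_BITS[prefix] << 32) | low_bits
```

For example, hex(expand_address("k0x01279300")) gives the same full address as the ofile entry 0xfffffc0001279300 used in the kdbx file example on this page.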
displaying data

px ADDRESS    displays ADDRESS in hex
pd ADDRESS    displays ADDRESS in decimal
po ADDRESS    displays ADDRESS in octal
ADDRESS/X     displays ADDRESS and contents in hex
ADDRESS/10X   displays ADDRESS and 10 following addresses in hex
ADDRESS?10X   displays ADDRESS and 10 preceding addresses in hex

identifying datastructures

dbx can help you find data type and datastructure definitions. For example, suppose you encounter a data field of type ino_t somewhere in a data structure. You can determine its true data type as follows ...

(dbx) whatis ino_t
typedef uint_t ino_t;
(dbx) whatis uint_t
typedef unsigned int;
(dbx) whatis unsigned int
unsigned int

In this example, ino_t resolves to an unsigned int.

dbx can also display complex datastructures, and show you the header file where that datastructure is defined. For example, suppose you want to know about the uthread structure ...

(dbx) whatis uthread
struct uthread {
    struct unameicache {
        int nc_prevoffset;
        :
        :

This displays the whole structure of uthread. To see what the fields actually represent, you have to dig through the header files. The whereis command can sometimes help. This example returns something resembling a filespec /sys/include/machine/pcb.stack_layout.uthread. In this case, uthread is not defined in /sys/include/machine/pcb.h, but it can be tracked down from other header files included in /sys/include/machine/pcb.h.

(dbx) whereis uthread
.uthread .stack_layout.uthread .uthread .uthread .uthread /sys/include/machine/pcb.stack_layout.uthread .uthread
(dbx) sh
# ls /sys/include/machine/pcb*
/sys/include/machine/pcb.h
# grep uthread /sys/include/machine/pcb.h
:
struct uthread uthread;
:
# grep '\.h' /sys/include/machine/pcb.h
#include
#include
#include
# grep uthread /sys/include/sys/user.h
struct uthread {
struct np_uthread *np_uthread;

See examining sources for more about finding data definitions.
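The manual grep steps above can be sketched as a small script that walks a header tree and reports where a structure is actually defined, as opposed to merely used. This is a rough Python sketch under stated assumptions: the function name find_struct_definitions is hypothetical, and the pattern only matches definitions written as "struct NAME {" on a single line, so a definition with the brace on the following line would be missed.

```python
import os
import re


def find_struct_definitions(root, name):
    """Return (path, line number) pairs where 'struct NAME {' appears.

    Hypothetical helper for illustration. It only looks at *.h files
    and only matches when the opening brace is on the same line.
    """
    pattern = re.compile(r"\bstruct\s+" + re.escape(name) + r"\s*\{")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".h"):
                continue
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps odd bytes in old headers from
            # aborting the search.
            with open(path, errors="replace") as header:
                for lineno, line in enumerate(header, start=1):
                    if pattern.search(line):
                        hits.append((path, lineno))
    return hits
```

Calling find_struct_definitions("/sys/include", "uthread") would be the equivalent of the grep sequence above. A declaration such as "struct uthread uthread;" is deliberately not reported, because it uses the structure rather than defining it.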
---------------------------------------------------------------------------- general information about the system p utsname get system version etc p hostname hostname p *pmsgbuf message buffer kps PIDs and command px kernel_memory_fault_data fault_va, fault_pc, fault_ra etc p cpus_in_box number of processors present in the system p ncpus highest available cpu number plus one p machine_slot[paniccpu].cpu_panic_thread thread that panicked p machine_slot { [0] struct { is_cpu = 0x1 cpu_type = 0xf cpu_subtype = 0x8 running = 0x1 cpu_ticks = { [0] 0x3098aad [1] 0x579da [2] 0x423d03b [3] 0x6894345f [4] 0x336171 } clock_freq = 0x400 error_restart = 0x0 cpu_panicstr = (nil) cpu_panic_thread = (nil) } : : p sched_tick uptime in seconds * /usr/sys/include/mach/machine.h * /usr/sys/include/dec/binlog/binlog.h p machine_info struct { major_version = 0x1 minor_version = 0x0 max_cpus = 0x10 avail_cpus = 0x1 memory_size = 0x4e00000 } /usr/sys/include/mach/machine.h swap space p vm_swap_lazy 0=disabled, !0=enabled, /sbin/swapdefault not found p vm_swap_eager 0=disabled, !0=enabled, /sbin/swapdefault exists p vm_total_swap_space total swap space p vm_swap_space free swap space memory usage p vm_page_free_count How many pages are free? p vm_page_wire_count wired page count p physmem size of physical memory /usr/sys/include/vm/vm_page.h See also per-process memory usage ---------------------------------------------------------------------------- debugging p savedefp[28] current pc from the exception frame savedefp[28]/i current instruction from the exception frame p savedefp[23] return address from the exception frame p savedefp the whole exception frame (saved register values) ADDRESS/i instruction and routine name at this address /usr/include/machine/reg.h lists register definitions t stack trace func ROUTINE set context to this routine file which source file contains this routine dump display variables in the current procedure dump . 
display variables in all procedures on the stack dump PROCEDURE display variables in procedure PROCEDURE p $pid current pid p $tid current thread kps PIDs and command tlist list threads thread_boot is most likely to call panic() thread_block threads are unlikely to call panic() On the running system, ps -elfm lists all threads of all processes set $pid=NEW-PID set context to new process set $tid=NEW-TID set context to new thread tset THREAD set context to THREAD tset machine_slot[paniccpu].cpu_panic_thread set context to panicking thread ADDRESS?5i 5 instructions leading up to ADDRESS ADDRESS/5i 5 instructions starting at ADDRESS processes and threads p maxusers maximum number of users p wait_queue queue of threads, each waiting for exactly one event Isolate a particular thread ($tid) within a particular process ($pid), then use these dbx commands to display information about that thread or process. If the thread is in the active_threads array, information can also be retrieved from there. For example, the following commands refer to the same data ... * thread address p (struct thread *) $tid 0xfffffc00045dab80 p active_threads[0] 0xfffffc00045dab80 * thread state p (*(struct thread *) $tid).state 4 p active_threads[0].state 4 * utask structure p *(*(struct thread *) $tid).stack.uthread.utask p *active_threads[0].stack.uthread.utask The active_threads array is indexed on the CPU number, for example active_threads[0], active_threads[paniccpu], etc. If the system panicked, paniccpu will contain the number of the panicking CPU. Otherwise, paniccpu = -1 and active_threads[paniccpu] will be meaningless. p active_threads[paniccpu].state state of active thread p *active_threads[paniccpu].task active task structure p active_threads[paniccpu].task.procfs active procfs structure p *active_threads[paniccpu].stack.uthread.utask active utask structure main process and thread datastructures set context to tset THREAD THREAD single thread of control. 
retains thread p (*(struct thread *) $tid) processor context when thread is not active thread.h address space task p *(*(struct thread *) $tid).task description, thread queue head, ports, etc task.h stack is too big to display, pointers to pcb, stack fails with internal stack uthread, utask, proc, overflow etc ... pcb.h registers, stack pcb p (*(struct thread *) pointers etc $tid).stack.pcb pcb.h thread-related information carried uthread p (*(struct thread *) over from Unix user $tid).stack.uthread structure user.h task-related information carried utask p *(*(struct thread *) over from Unix user $tid).stack.uthread.utask structure user.h signals, process group proc p *(*(struct thread *) links, credentials, $tid).stack.uthread.proc resource usage proc.h process and thread data tset THREAD set context to THREAD p *(*(struct thread *) $tid).stack.uthread.np_uthread non-paged uthread data (saved r0, signals, etc) p (*(struct thread *) $tid).stack.uthread.utask current task structure p (*(struct thread *) $tid).stack.uthread.utask.uu_file_state.uf_ofile open file list p (*(struct thread *) $tid).stack.uthread.utask.uu_comm current command p (*(struct thread *) $tid).stack.uthread.utask.uu_logname user's login name p (*(*(struct thread *) $tid).stack.uthread.proc).p_pid current pid p (*(*(struct thread *) $tid).stack.uthread.proc).p_stat process state p (*(struct thread *) $tid).stack.pcb the pcb p (*(struct thread *) $tid).stack.pcb.pcb_regs all the registers p (*(struct thread *) $tid).stack.pcb.pcb_regs[N] register N p (*(struct thread *) $tid).stack.pcb.pcb_ksp kernel stack pointer p (*(struct thread *) $tid).stack.pcb.pcb_usp user stack pointer p (*(struct thread *) $tid).stack.pcb.pcb_current_cpu cpu on which this thread is executing process and thread state /usr/include/sys/proc.h lists process state definitions p (*(*(struct thread *) $tid).stack.uthread.proc).p_stat process state #define SSLEEP 1 /* awaiting an event */ #define SWAIT 2 /* (abandoned state) */ 
#define SRUN 3 /* running */ #define SIDL 4 /* intermediate state in process creation */ #define SZOMB 5 /* intermediate state in process termination */ #define SSTOP 6 /* process being traced */ p (*(*(struct thread *) $tid).stack.uthread.proc).p_flag /usr/include/sys/proc.h lists the values of p_flag /usr/sys/include/kern/thread.h lists thread state definitions p (*(struct thread *) $tid).state state of current thread #define TH_WAIT 0x01 /* thread is queued for waiting */ #define TH_SUSP 0x02 /* thread has been asked to stop */ #define TH_RUN 0x04 /* thread is running or on runq */ #define TH_SWAPPED 0x08 /* thread is swapped out */ #define TH_IDLE 0x10 /* thread is an idle thread */ per-process memory usage /usr/sys/include/vm/vm_map.h /usr/sys/include/mach/vm_statistics.h pd (* (struct thread *) $tid).task.map.vm_size virtual size (VSZ) pd (* (struct thread *) $tid).task.map.vm_pmap.stats Physical map stats struct { resident_count = 447 total pages mapped (RSS) max_resident_count = 0 maximum number of pages mapped wired_count = 0 number of pages wired resident_text = 8 number of resident code pages? 
} pd (* (struct thread *) $tid).task.map.vm_pmap.stats.resident_count (RSS) pd (* (struct thread *) $tid).task.map.vm_maximum maximum size pd (* (struct thread *) $tid).task.map.vm_pagefaults Accumulated pagefault nameidata /usr/include/sys/namei.h px * (* (struct thread *) $tid).stack.uthread.uu_nd.ni_utnd.utnd_cdir : v_type = VDIR v_tag = VT_NFS : vnode of current directory px * (* (struct thread *) $tid).stack.uthread.uu_nd.ni_dvp struct { v_lock = struct { : v_type = VDIR v_tag = VT_UFS : vnode of intermediate directory kernel memory fault example of analysing a kernel memory fault * display the kernel memory fault (dbx) px kernel_memory_fault_data struct { fault_va = 0x20 fault_pc = 0xfffffc0000282e4c fault_ra = 0xfffffc0000282e0c fault_sp = 0xffffffffd1817080 access = 0x0 status = 0x0 cpunum = 0x2 count = 0x1 pcb = 0xffffffffd1817a58 thread = 0xffffffff8211eb40 task = 0xffffffff81d5eb80 proc = 0xffffffff81d5ed90 } * set context to the thread (dbx) tset kernel_memory_fault_data.thread OR (dbx) tset 0xffffffff8211eb40 stopped at [stop_secondary_cpu:353 ,0xfffffc00004c658c] Source not available * display the stack trace (dbx) t > 0 stop_secondary_cpu() ["../../../../src/kernel/arch/alpha/cpu.c":352, 0xfffffc00004c6588] 1 panic(s = 0xfffffc00005cc7a8 = "event_timeout: panic request") ["../../../../src/kernel/bsd/] 2 event_timeout(func = 0xfffffc000042e930, arg = 0xfffffc000065a130, timeout = 0x1) ["../../..] 3 xcpu_puts(s = 0xffffffffd1816bd8, prfbufp = 0xfffffc000065a130) ["../../../../src/kernel/bsd] 4 printf(va_alist = 0xfffffc00005bcba0) ["../../../../src/kernel/bsd/subr_prf.c":351, 0xfffffc] 5 panic(s = 0xfffffc00005cf110 = "kernel memory fault") ["../../../../src/kernel/bsd/subr_prf.] 
6 trap() ["../../../../src/kernel/arch/alpha/trap.c":1281, 0xfffffc00004db178] 7 _XentMM(0x0, 0xfffffc0000282e4c, 0xfffffc0000605a90, 0x5, 0x3863c0) ["../../../../src/kernel] 8 procfs_ioctl(vp = 0xfffffc0000441f1c, com = 0x40404625, data = 0xffffffffaf8030d8 = "^0\200\] 9 vn_ioctl(0x30, 0x40404625, 0xffffffffaf607000, 0xffffffff00000000, 0xfffffc0000285568) ["../] 10 procfs_ioctl_interface(p = 0xffffffff81d5ed90, args = 0xffffffffd18178c8, retval = 0xfffffff] 11 ioctl_base(0xffffffff81d5ed90, 0xffffffffd18178c8, 0xffffffffd18178b8, 0x0, 0xfffffc00004d98] 12 ioctl(0xffffffffd18178b8, 0x0, 0xfffffc00004d9888, 0xfffffc00004dac1c, 0xfffffc00004d92f8) [] 13 syscall(0x12000ca14, 0x8, 0x1608, 0x41, 0x36) ["../../../../src/kernel/arch/alpha/syscall_tr] 14 _Xsyscall(0x8, 0x3ff800d5ad8, 0x14000b060, 0x3, 0x40404625) ["../../../../src/kernel/arch/al] * list the instruction stream leading up to the failing instruction at fault_pc (dbx) kernel_memory_fault_data.fault_pc?10i OR (dbx) 0xfffffc0000282e4c?10i [procfs_ioctl:4967, 0xfffffc0000282e28] ldq r22, 16(r0) [procfs_ioctl:4967, 0xfffffc0000282e2c] stq r22, 0(r2) [procfs_ioctl:4969, 0xfffffc0000282e30] ldq r5, 40(r0) [procfs_ioctl:4969, 0xfffffc0000282e34] ldq_u r23, 46(r5) [procfs_ioctl:4969, 0xfffffc0000282e38] lda r28, 46(r5) [procfs_ioctl:4969, 0xfffffc0000282e3c] extwl r23, r28, r23 [procfs_ioctl:4969, 0xfffffc0000282e40] cmpeq r23, 0x8, r7 [procfs_ioctl:4969, 0xfffffc0000282e44] beq r7, 0xfffffc0000282e64 [procfs_ioctl:4970, 0xfffffc0000282e48] ldq r8, 48(r0) [procfs_ioctl:4970, 0xfffffc0000282e4c] ldq r25, 32(r8) * In this example, procfs_ioctl is the name of the routine which failed. o 4970 is the number of the line of code which failed. o The failing instruction is ldq r25, 32(r8) o The operand 32(r8) means offset 32 from the address in register 8. o From the kernel memory fault data, the fault virtual address fault_va = 0x20 which is 32 decimal. o If r8 contained the value 0, then 32(r8) would be 32. 
  o Attempting to access address 32, which is an illegal location, triggers a kernel memory fault.
  o Examine the contents of r8 - it is indeed 0.
    (dbx) p $r8
    0

* The question now is how or why r8 came to contain 0. The listing shows that the instruction immediately before the failing one, ldq r8, 48(r0), loaded r8 from offset 48 of whatever r0 points to, so that field held 0. To find out why, we could either search further backwards through the code stream to see what r0 points to, or examine the source code for procfs_ioctl, or start with the running process and work forwards.
* This is the login name of the process
  (dbx) p (*(struct thread *) $tid).stack.uthread.utask.uu_logname
  "arcarms"
* This is the program running
  (dbx) p (*(struct thread *) $tid).stack.uthread.utask.uu_comm
  "aaf_cai.exe"
* This is the state of the process
  (dbx) p (*(*(struct thread *) $tid).stack.uthread.proc).p_stat
  0x3
* p_stat = 3 means the process is in the R (running or runnable) state

simple_lock panic

example of analysing a "simple_lock: time limit exceeded" panic

* Display the preserved message buffer
  (dbx) p *pmsgbuf
  :
  simple_lock: time limit exceeded
  pc of caller: 0xfffffc000046fc78
  pc of caller: 0xfffffc000047015c
  lock address: 0xfffffc0000658f88
  lock address: 0xfffffc0000658f88
  current lock state: 0x000000000047015d (cpu=0,pc=0xfffffc000047015c,busy)
  current lock state: 0x000000000047015d (cpu=0,pc=0xfffffc000047015c,busy)
  panic (cpu 9): simple_lock: time limit exceeded
* This shows the PCs of two threads contending for the lock.
* Use these PCs to display the routines that are accessing the lock ...
  (dbx) 0xfffffc000046fc78/i
  [lock_write:417, 0xfffffc000046fc78] bsr r26, simple_lock(line 262)
  (dbx) 0xfffffc000047015c/i
  [lock_done:516, 0xfffffc000047015c] bsr r26, simple_lock(line 262)
* In this case, the routines are lock_write and lock_done
* pmsgbuf also gives the address of the lock and some state information.
* Here is the structure of the lock ...
(dbx) p * (struct lock *) 0xfffffc0000658f88 struct { l_lock = struct { sl_data = 0x47015d sl_info = 0x0 sl_cpuid = 0x0 sl_lifms = 0x0 } l_caller = 0x2442a0 l_wait_writers = 0x0 l_readers = 0x0 l_flags = ' ' l_lifms = '\200' l_info = 0x0 l_lastlocker = 0xfffffc00397bcb80 } * The fields in the lock structure are defined in /usr/sys/include/kern/lock.h o l_lastlocker represents the thread that last acquired the lock. o l_caller is the lower half of the calling PC of the locker. In other words, l_caller = 0x2442a0 represents the PC 0xfffffc00002442a0, so we should be able to see the code thread leading up to this point ... * Set thread to l_lastlocker, and display the code thread ... (dbx) tset 0xfffffc00397bcb80 (dbx) 0xfffffc00002442a0?10i [waitf:1468, 0xfffffc000024427c] bsr r26, simple_unlock(line 263) [waitf:1469, 0xfffffc0000244280] ldq r16, 96(r10) [waitf:1469, 0xfffffc0000244284] bsr r26, crfree(line 1166) [waitf:1470, 0xfffffc0000244288] lda r16, 416(r10) [waitf:1470, 0xfffffc000024428c] bsr r26, uarea_lock_terminate(line ) [waitf:1476, 0xfffffc0000244290] bis r9, r9, r16 [waitf:1476, 0xfffffc0000244294] bsr r26, pid_free(line 500) [waitf:1478, 0xfffffc0000244298] ldah r16, 3(gp) [waitf:1478, 0xfffffc000024429c] lda r16, -4840(r16) [waitf:1478, 0xfffffc00002442a0] bsr r26, ulock_write(line 273) * This thread is running in the process which ran this command ... (dbx) p (*(struct thread *) $tid).stack.uthread.utask.uu_comm "init" * The process that last acquired the lock is not necessarily the process that crashed the system. cpu_panic_thread is different from l_lastlocker ... 
(dbx) tset machine_slot[paniccpu].cpu_panic_thread thread 0xfffffc00099deb80 stopped at [stop_secondary_cpu:364 ,0xfffffc00004deb2c] Source not available (dbx) t > 0 stop_secondary_cpu() ["../../../../src/kernel/arch/alpha/cpu.c":363, 0xfffffc00004deb28] 1 panic(s = 0xfffffc00005f42e8 = "event_timeout: panic request") ["../../../../src/kernel/bsd/subr_prf.c":669, 0xfffffc000044228c] 2 event_timeout(func = 0xfffffc00004424e0, arg = 0xfffffc0000675ce0, timeout = 0xffffffffff7fc000) ["../../../../src/kernel/arch/alpha/cpu.c":719, 0xff 3 xcpu_puts(s = 0xffffffffa907f330, prfbufp = 0xfffffc0000675ce0) ["../../../../src/kernel/bsd/subr_prf.c":810, 0xfffffc0000442544] 4 printf(va_alist = 0xfffffc00005e4370) ["../../../../src/kernel/bsd/subr_prf.c":355, 0xfffffc0000441894] 5 panic(s = 0xfffffc00005e7088 = "simple_lock: time limit exceeded") ["../../../../src/kernel/bsd/subr_prf.c":719, 0xfffffc00004423fc] 6 simple_lock_fault(slp = 0xfffffc0000658f88, state = 0xfffffc00099df210, caller = 0xfffffc000046fc78, arg = (nil), fmt = (nil), error = 0xfffffc00005e 7 simple_lock_time_violation(slp = 0xfffffc000046fc78, state = 0x0, caller = (nil)) ["../../../../src/kernel/kern/lock.c":1863, 0xfffffc0000473138] 8 lock_write(l = 0xfffffc000020d2f8) ["../../../../src/kernel/kern/lock.c":417, 0xfffffc000046fc78] 9 waitf(0xfffffc00099de210, 0xfffffc000a5ef6c0, 0xffffffffa907f8b8, 0xffffffffa907f8c8, 0xfffffc00004f2e28) ["../../../../src/kernel/bsd/kern_exit.c":1 10 wait4(0xffffffffa907f8b8, 0xffffffffa907f8c8, 0xfffffc00004f2e28, 0x8000000001, 0xfffffc00004f2898) ["../../../../src/kernel/bsd/kern_exit.c":1220, 0 11 syscall(0x10430, 0x1, 0x1, 0x0, 0x7) ["../../../../src/kernel/arch/alpha/syscall_trap.c":519, 0xfffffc00004f2894] 12 _Xsyscall(0x8, 0x12002c138, 0x14000ae50, 0xffffffffffffffff, 0x11fffa7c8) ["../../../../src/kernel/arch/alpha/locore.s":1094, 0xfffffc00004e1fe4] (dbx) p (*(struct thread *) $tid).stack.uthread.utask.uu_logname "" (dbx) p (*(struct thread *) 
$tid).stack.uthread.utask.uu_comm "rcmgr" (dbx) pd (*(*(struct thread *) $tid).stack.uthread.proc).p_pid 218 (dbx) p (*(*(struct thread *) $tid).stack.uthread.proc).p_stat 0x3 (dbx) p (*(struct thread *) $tid).state 0x4 For whatever reason, one process "init" hung on to the lock for too long, and the other process "rcmgr" timed out. semaphores p seminfo struct { semmni = 0x10 semmsl = 0x19 semopm = 0xa semume = 0xa semvmx = 0x7fff semaem = 0x4000 sema = 0x1 } /usr/sys/include/sys/sem.h shared memory p shminfo struct { shmmax = 0x400000 shmmin = 0x1 shmmni = 0x80 shmseg = 0x20 } /usr/sys/include/sys/shm.h message parameters p msginfo struct { msgmax = 8192 msgmnb = 32768 msgmni = 64 msgtql = 40 msg = 0 } /usr/sys/include/sys/msg.h mount table BUG ALERT! the following example fails with DBX Fault: Segmentation fault under V3.2c and T4.0 (rev 345). see QAR 46123 or digital_unix note 5356. p rootfs.m_stat.f_mntonname ; p rootfs.m_stat.f_mntfromname "/" "/dev/rz0a" set $NXT=rootfs.m_next ; p $NXT.m_stat.f_mntonname ; p $NXT.m_stat.f_mntfromname "/mnt" "hntsmn:/archive1" Repeat the following command until you loop around to / again set $NXT=$NXT.m_next ; p $NXT.m_stat.f_mntonname ; p $NXT.m_stat.f_mntfromname "/dumps/mnt1" "/dumps1@hntsmn" CAM peripheral devices * /usr/sys/include/io/cam/pdrv.h * /usr/sys/include/io/common/devio.h p pdrv_unit_table unit table p *pdrv_unit_table[N].pu_device p *pdrv_unit_table[N].pu_device.pd_dev_desc device description p pdrv_unit_table[N].pu_device.pd_dev_desc.dd_pv_name Product ID and vendor string p pdrv_unit_table[N].pu_device.pd_dev_desc.dd_dev_name Device name - see devio.h i/o configuration /sys/BINARY/ioconf.c (dbx) whatis bus_list bus_list[3] of struct bus { u_long * bus_mbox; : (dbx) px bus_list[0] struct { bus_mbox = (nil) nxt_bus = (nil) ctlr_list = 0xfffffc00005ec680 : (dbx) px *(struct controller *) bus_list[0].ctlr_list struct { ctlr_mbox = (nil) nxt_ctlr = 0xfffffc00005ec7e8 : (dbx) whatis device_list device_list[6] of 
struct device {
    struct device * nxt_dev;
    struct controller * ctlr_hd;
    char * dev_type;
    char * dev_name;
    int logunit;
    int unit;
    char * ctlr_name;
    int ctlr_num;
    int alive;
    private[8] of void *;
    conn_priv[8] of void *;
    rsvd[8] of void *;
} ;
(dbx) px device_list[0]
struct {
    nxt_dev = 0xfffffc00005ed008
    ctlr_hd = 0xfffffc00005ec3b0
    dev_type = 0xfffffc00005ed4e0 = "disk"
    dev_name = 0xfffffc00005f6da8 = "rz"
    logunit = 0x0
    unit = 0x0
    ctlr_name = 0xfffffc00005ecef0 = "scsi"
    ctlr_num = 0x0
    alive = 0x1
    :

flck file lock information

p flckinfo
struct {
    recmax = 0xe
    filmax = 0xe
    reccnt = 0x3
    filcnt = 0x3
    rectot = 0x6b03
    filtot = 0x6abd
    flckinfo_lock = struct {
        sl_data = 0x0
        sl_info = 0x0
        sl_cpuid = 0x0
        sl_lifms = 0x0
    }
}
/usr/sys/include/sys/flock.h

files, inodes, vnodes

/usr/sys/include/sys/vnode.h
/usr/sys/include/ufs/inode.h
/usr/sys/include/sys/file.h
/usr/sys/include/sys/mount.h
/usr/sys/include/ufs/fs.h

using kdbx to identify files

* list all open files for all processes
  (kdbx) ofile
  Proc=0xfffffc000107f210 pid= 124
  ofile[ 0]=0xfffffc0001279300
  ofile[ 1]=0xfffffc0001279300
  :
  :
* display F_data for each file Addr
  (kdbx) file
  Addr        Type Ref Msg Fileops F_data      Cred        Offset Flags
  =========== ==== === === ======= =========== =========== ====== =====
  [Process ID: 124]
  k0x01279300 file   3   0 vnops   k0x0173ce00 k0x04fbe100      0 r w
  :
  :
* display the file structure for a given file Addr
* the f_data field should match F_data
  (kdbx) p *(struct file *) 0xfffffc0001279300
  struct {
      :
      :
      f_data = 0xfffffc000173ce00 = ""
      :
      :
* use f_data to display the vnode structure
  (kdbx) p *(struct vnode *) 0xfffffc000173ce00
  struct {
      v_lock = struct {
      :
      :
      v_data = " "
  }
* The inode address is calculated by adding 0xb0 to the vnode address. 0xb0 is the size of the vnode structure. vnode.v_data and vnode.v_lock are the last and first fields respectively in a vnode. This may be version-dependent, so always check!
  (kdbx) px &vnode.v_data - &vnode.v_lock
  0xb0
* display the inode structure
  (kdbx) p (unsigned int *) (0xfffffc000173ce00 + 0xb0)
  0xfffffc000173ceb0
  (kdbx) p *(struct inode *) 0xfffffc000173ceb0
  struct {
      i_chain = {
          [0] 0xfffffc0000223e20
          [1] 0xfffffc0001fe34b0
      }
      i_vnode = 0xfffffc000173ce00
      i_devvp = 0xfffffc0004ddb600
      i_flag = 0x40000
      i_dev = 0x800000
      i_number = 0x607
      i_fs = 0xffffffff844cc000
      :
      :
* from the i_dev and i_number fields, we can identify the device
  (kdbx) filename 0x800000 0x607
  dev_t: 0x00800000 (8388608)
  Major/Minor: 8, 0
  Device name: /dev/rz0a

using dbx to identify files

This is the same procedure as for the kdbx example, just a different format ...

(dbx) p (*(struct thread *) $tid).stack.uthread.utask.uu_file_state.uf_ofile
{
    [0] 0xfffffc000168e280
    [1] 0xfffffc000168e280
    :
    [N] 0xFILEADDRESS
(dbx) p *(struct file *) 0xFILEADDRESS
    :
    f_data = 0xfffffc0003d9dc00 = ""         0xF_DATA
    :
(dbx) p *(struct vnode *) 0xF_DATA
    :
    v_mount = 0xfffffc0005ffb000             0xV_MOUNT
(dbx) p *(struct mount *) 0xV_MOUNT
    :
    m_stat = struct {
        :
        f_mntonname = "/"
        f_mntfromname = "/dev/rz0a"
        :
(dbx) px &vnode.v_data - &vnode.v_lock
0xb0                                         0xVNODE_SIZE
(dbx) p (unsigned int *) (0xb0 + 0xfffffc0003d9dc00)
0xfffffc0003d9dcb0                           0xINODE = 0xVNODE_SIZE + 0xF_DATA
(dbx) p *(struct inode *) 0xfffffc0003d9dcb0
    :
    i_vnode = 0xfffffc0003d9dc00             should point back to F_DATA
    i_devvp = 0xfffffc0005ffb200             vnode for block i/o
    i_flag = 0x40020
    i_dev = 0x800000
    i_number = 0x606
    i_fs = 0xffffffff81155000                0xI_FS
    :
    i_din = struct {
        :
        di_atime = 0x3245eb76                access time
        di_mtime = 0x2fb2b309                modification time
        di_ctime = 0x310f500f                inode change time
(dbx) p *(struct fs *) 0xI_FS
    :
    fs_fsmnt = "/"

unexplored stuff ...
/usr/sys/include/sys/config.h int max_vnodes; /* max vnodes in the system */ int min_free_vnodes; /* low water mark for free vnodes */ (dbx) p sthinfo struct { st_rdinit = 0xfffffc000057c990 st_wrinit = 0xfffffc000057c9c8 st_muxrinit = (nil) st_muxwinit = (nil) } /usr/sys/include/sys/vnode.h extern int nvnode; /* number of slots in the table */ /sys/conf/param.c int nvnode = NVNODE; /*for historic reasons; not a limit on dyn. vnodes*/ int min_free_vnodes = NVNODE; /* low water mark for free vnodes */ int min_free_vnodes = MIN_FREE_VNODES; /* defined in param.h */ int max_vnodes = MAX_VNODES; /* max vnodes is now a percentage of memory */ (dbx) pd free_vnodes, total_vnodes (dbx) pd vnode_stats struct { vn_allocations = 614680 : (dbx) pd stats_ioretry struct { st_already_being_flushed = 4 : p *cpusw p cons_sw p *master_processor back to the contents ---------------------------------------------------------------------------- kdbx trace list every thread on the system (long output!) thread Thread Addr Task Addr Proc Addr Event pcb state | | | .--------------------------' : v proc Addr PID PPID PGRP UID NICE SIGCATCH P_SIG Event Flags : v pcb Thread Addr Addr pcb ksp usp pc ps sp ptbr pcb_physaddr r9 r10 r11 r12 r13 r14 r15 task Task Addr Ref Threads Map Swap_state Utask Addr Proc Addr Pid list_action "struct thread *" thread_list.next 0 active_threads[paniccpu] p %i,%c,%c.state array_action "struct thread *" 10 active_threads[paniccpu] p %i,%c.stack.uthread.utask.uu_comm back to the contents ---------------------------------------------------------------------------- forced crash An incomplete and possibly inaccurate guide to forcing a crash. Refer to the hardware documentation, STARS, etc for authoritative information. * either ... processor what to do 2100 press HALT button (not RESET) >>> crash 3000 >>> crash 4000 >>> crash 7000 ^P note the PC for later analysis >>> crash * or ... 
  o from the shell prompt,
    nm -x /vmunix | grep '^start'
    The system should display something like this ...
    start | 0xfffffc0000238950 | T | 0x00000000000008
  o make a note of the ADDRESS (in this example 0xfffffc0000238950)
  o press the RESET button
  o >>> set radix 16
  o 3000            >>> start ADDRESS+4
    4000 and 7000   >>> deposit pc ADDRESS+4
                    >>> continue
* or ...
  o get the ADDRESS of doadump
    + either   nm -x /vmunix | grep '^doadump'
    + note ADDRESS
    + or       echo 'px doadump;quit' | dbx -k /vmunix
    + note ADDRESS
    + these two methods may return slightly different addresses. Either one should work.
  o press the RESET button
  o >>> set radix 16
  o 3000            >>> start ADDRESS
    4000 and 7000   >>> deposit pc ADDRESS
                    >>> continue
                    >>> boot

----------------------------------------------------------------------------
examining sources

cscope(1) is a useful tool for navigating the operating system sources, but first you have to create a reference file. This can take around half an hour, and consumes about 80 megabytes of disk space, so you don't want to do this unless it's necessary.

/usr/local/tools/cscope.unix.kernel contains a script called MKreffile which creates reference files. But first check in /usr/local/tools/cscope.unix.kernel to see if a reference file already exists.

For the record, here's how it's done ...

* create a list of .c and .h files ...
  find /usr/sde/osf1/build/v40supportos/src \( -name '*.c' -o -name '*.h' \) -print > v40supportos.list
* create the reference file ...
  /usr/opt/svr4/usr/bin/cscope -b -i v40supportos.list
* this creates cscope.out which can be renamed to something more meaningful like v40supportos.cscope.out
* when you invoke cscope ...
  /usr/opt/svr4/usr/bin/cscope -f v40supportos.cscope.out -i v40supportos.list
  it may take a few minutes to load everything into cscope.
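If find(1) is awkward to use on a particular source tree, the file list that cscope -i reads (one path per line) can also be built with a few lines of Python. This is only a sketch: the helper names list_sources and write_file_list are made up here, and unlike the find command above it lists regular files only and sorts the result.

```python
import os


def list_sources(root, suffixes=(".c", ".h")):
    """Return sorted paths of files under root ending in one of suffixes.

    Hypothetical helper; roughly what the find command above prints.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(suffixes):
                paths.append(os.path.join(dirpath, filename))
    return sorted(paths)


def write_file_list(root, list_path):
    """Write one source path per line to list_path; return the count."""
    paths = list_sources(root)
    with open(list_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)
```

For the example above, the call would be write_file_list("/usr/sde/osf1/build/v40supportos/src", "v40supportos.list"), after which the cscope -b -i step is unchanged.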
----------------------------------------------------------------------------
other tools

* cda
  Pointers to location of cda: A Crash Dump Analyzer for Digital Unix, developed mainly to analyse virtual memory problems. Look in /usr/local/tools/cda for local copies.
* machine check analysis - /usr/local/tools/MCHK_OSF.V32

----------------------------------------------------------------------------
miscellaneous hints

crude search of all .h files for definition of VARIABLE

csh> foreach i ( `find /src/OSF/OSC400/src -name '*.h' -print` )
? grep VARIABLE $i && echo $i
? end

Other good places to search ...
/usr/sys
/usr/include

----------------------------------------------------------------------------
when all else fails ...

If you can't identify the problem, issue this command ...

(dbx) why did this system crash?
why did this system crash?
    ^ syntax error

... and tell the customer that the problem was a "syntax error" ;-)