glenda.party

term% ls -F

index.txt

term% pwd

$home/manuals/unix_v7/8/crash

term% cat index.txt

CRASH(8)                    System Manager's Manual                   CRASH(8)

NAME
       crash - what to do when the system crashes

DESCRIPTION
       This  section  gives  at  least a few clues about how to proceed if the
       system crashes.  It can't pretend to be complete.

       Bringing it back up.  If the reason for the crash is not  evident  (see
       below for guidance on ‘evident’) you may want to try to dump the system
       if you feel up to debugging.  At the moment a dump can be taken only on
       magtape.  With a tape mounted and ready, stop the machine, load address
       44,  and  start.   This  should write a copy of all of core on the tape
       with an EOF mark.  Caution: Any error is taken to mean the end of  core
       has been reached.  This means that you must be sure the ring is in, the
       tape  is  ready, and the tape is clean and new.  If the dump fails, you
       can try again, but some of the registers will be lost.  See  below  for
       what to do with the tape.

       In  restarting  after  a crash, always bring up the system single-user.
       This is accomplished by following the directions in boot(8) as modified
       for your particular installation; a single-user system is indicated  by
       having a particular value in the switches (173030 unless you've changed
       init)  as  the  system starts executing.  When it is running, perform a
       dcheck and icheck(1) on all file systems which could have been  in  use
       at  the  time  of  the  crash.  If any serious file system problems are
       found, they should be repaired.  When you are satisfied with the health
       of your disks, check and set the date if necessary, then come up multi-
       user.  This is most easily accomplished  by  changing  the  single-user
       value  in the switches to something else, then logging out by typing an
       EOT.

       To even boot UNIX at all, three files (and the directories leading  to
       them) must be intact.  First, the initialization program /etc/init must
       be  present  and  executable.   If it is not, the CPU will loop in user
       mode at location 6.  For init to work correctly, /dev/tty8 and  /bin/sh
       must  be  present.   If  either does not exist, the symptom is best de‐
       scribed as thrashing.  Init will go into a  fork/exec  loop  trying  to
       create a Shell with proper standard input and output.

       If  you  cannot  get  the system to boot, a runnable system must be ob‐
       tained from a backup medium.  The root file system may then be doctored
       as a mounted file system as described below.  If there are any problems
       with the root file system, it is probably prudent to  go  to  a  backup
       system to avoid working on a mounted file system.

       Repairing disks.  The first rule to keep in mind is that an addled disk
       should be treated gently; it shouldn't be mounted unless necessary, and
       if  it  is  very  valuable yet in quite bad shape, perhaps it should be
       dumped before trying surgery on it.  This is an area  where  experience
       and informed courage count for much.

       The  problems  reported by icheck typically fall into two kinds.  There
       can be problems with the free list: duplicates in  the  free  list,  or
       free  blocks  also  in files.  These can be cured easily with an icheck
       -s.  If the same block appears in more than one file or if a file  con‐
       tains bad blocks, the files should be deleted, and the free list recon‐
       structed.   The  best way to delete such a file is to use clri(1), then
       remove its directory entries.  If any of the affected files  is  really
       precious, you can try to copy it to another device first.

       Dcheck  may  report files which have more directory entries than links.
       Such situations are potentially dangerous;  clri  discusses  a  special
       case  of the problem.  All the directory entries for the file should be
       removed.  If on the other hand there are more links than directory  en‐
       tries,  there is no danger of spreading infection, but merely some disk
       space that is lost for use.  It is sufficient to copy the file  (if  it
       has  any  entries  and is useful) then use clri on its inode and remove
       any directory entries that do exist.

       Finally, there may be inodes reported by dcheck that have 0 links and 0
       entries.  These occur on the root device when  the  system  is  stopped
       with  pipes  open, and on other file systems when the system stops with
       files that have been deleted while still open.  A clri  will  free  the
       inode, and an icheck -s will recover any missing blocks.
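
       The three dcheck cases above can be summarized as  a  small  decision
       function.   This  is only a sketch in Python for illustration; the name
       dcheck_action and its arguments are hypothetical and are  not  part  of
       dcheck or clri.

```python
def dcheck_action(links, entries):
    """Suggest the repair described above for one inode reported by dcheck.

    links   -- link count recorded in the inode
    entries -- number of directory entries found for that inode
    """
    if links == 0 and entries == 0:
        # Left over from open pipes or files deleted while still open.
        return "clri the inode, then icheck -s to recover missing blocks"
    if entries > links:
        # Potentially dangerous case.
        return "remove all directory entries for the file"
    if links > entries:
        # No danger, only lost disk space.
        return "copy the file if useful, clri the inode, remove its entries"
    return "consistent, nothing to do"
```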

       Why  did it crash?  UNIX types a message on the console typewriter when
       it voluntarily crashes.  Here is the current  list  of  such  messages,
       with  enough information to provide a hope at least of the remedy.  The
       message has the form ‘panic: ...’, possibly accompanied by other infor‐
       mation.  Left unstated in all cases is the possibility that hardware or
       software error produced the message in some unexpected way.

       blkdev
            The getblk routine was called with a nonexistent major  device  as
            argument.  Definitely hardware or software error.

       devtab
            Null  device  table entry for the major device used as argument to
            getblk.  Definitely hardware or software error.

       iinit
            An I/O error reading the super-block for the root file system dur‐
            ing initialization.

       out of inodes
            A mounted file system has no more i-nodes when  creating  a  file.
            Unfortunately the message does not name the device; icheck  should
            tell you which one.

       no fs
            A  device  has  disappeared  from the mounted-device table.  Defi‐
            nitely hardware or software error.

       no imt
            Like ‘no fs’, but produced elsewhere.

       no inodes
            The in-core  inode  table  is  full.   Try  increasing  NINODE  in
            param.h.  Shouldn't be a panic, just a user error.

       no clock
            During initialization, neither the line nor programmable clock was
            found to exist.

       swap error
            An  unrecoverable  I/O error during a swap.  Really shouldn't be a
            panic, but it is hard to fix.

       unlink - iget
            The directory containing a file  being  deleted  can't  be  found.
            Hardware or software.

       out of swap space
            A  program  needs  to  be  swapped  out, and there is no more swap
            space.  It has to be increased.  This really shouldn't be a panic,
            but there is no easy fix.

       out of text
            A pure procedure program is being executed, and the table for such
            things is full.  This shouldn't be a panic.

       trap
            An unexpected trap has occurred within the system.  This is accom‐
            panied by three numbers: a ‘ka6’, which is  the  contents  of  the
            segmentation  register for the area in which the system's stack is
            kept; ‘aps’, which is the location where the hardware  stored  the
            program  status  word during the trap; and a ‘trap type’ which en‐
            codes which trap occurred.  The trap types are:

       0         bus error
       1         illegal instruction
       2         BPT/trace
       3         IOT
       4         power fail
       5         EMT
       6         recursive system call (TRAP instruction)
       7         11/70 cache parity, or programmed interrupt
       10        floating point trap
       11        segmentation violation

       In some of these cases it is possible for octal 20 to be added into the
       trap type; this indicates that the processor was in user mode when  the
       trap occurred.  If you wish to examine the stack after such a trap, ei‐
       ther  dump the system, or use the console switches to examine core; the
       required address mapping is described below.
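
       As an illustration, the trap type can be decoded mechanically.  The
       sketch below is in Python and assumes the number is printed in octal,
       as the table (which runs 7, 10, 11) suggests.  The names TRAP_TYPES,
       USER_MODE and decode_trap are hypothetical.

```python
# Trap type names as listed in the table above (keys written in octal).
TRAP_TYPES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added into the trap type when the trap came from user mode


def decode_trap(printed):
    """Decode a trap type given as the octal string shown on the console.

    Returns (name, from_user).  Assumption: the value is octal text.
    """
    value = int(printed, 8)
    from_user = bool(value & USER_MODE)
    name = TRAP_TYPES.get(value & ~USER_MODE, "unknown trap type")
    return name, from_user
```

       For example, decode_trap("31") reports a segmentation violation that
       occurred in user mode, since octal 31 is octal 11 plus octal 20.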

       Interpreting dumps.  All file system problems should be taken  care  of
       before  attempting  to look at dumps.  The dump should be read into the
       file /usr/sys/core; cp(1) will do.  At this point, you  should  execute
       ps  -alxk  and who to print the process table and the users who were on
       at the time of the crash.  You should dump (od(1)) the first 30  bytes
       of  /usr/sys/core.   Starting  at location 4, the registers R0, R1, R2,
       R3, R4, R5, SP and KDSA6 (KISA6 for 11/40s) are stored.   If  the  dump
       had  to  be restarted, R0 will not be correct.  Next, take the value of
       KA6 (location 022(8) in  the  dump)  multiplied  by  0100(8)  and  dump
       01000(8) bytes starting from there.  This is the per-process data asso‐
       ciated  with the process running at the time of the crash.  Relabel the
       addresses 140000 to 141776.   R5  is  C's  frame  or  display  pointer.
       Stored  at (R5) is the old R5 pointing to the previous stack frame.  At
       (R5)+2 is the saved PC of the calling procedure.   Trace  this  calling
       chain until you obtain an R5 value of 141756, which is where the user's
       R5 is stored.  If the chain is broken, you have to look for a plausible
       R5,  PC  pair  and continue from there.  Each PC should be looked up in
       the system's name list using adb(1) and its ‘:’ command, to get  a  re‐
       verse calling order.  In most cases this procedure will give an idea of
       what  is  wrong.  A more complete discussion of system debugging is im‐
       possible here.
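
       The arithmetic above can be sketched in Python, working on a copy of
       the dump on a modern machine.  This is an illustration under stated
       assumptions, not a supported tool: each register is taken to be one
       16-bit little-endian word, and the names read_registers, u_area_offset
       and walk_frames are hypothetical.  The caller supplies the bytes of the
       per-process area, relabeled to begin at address 140000.

```python
import struct

REG_NAMES = ("R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6")
REG_BASE = 0o4       # registers are stored starting at location 4
U_BASE = 0o140000    # address the per-process area is relabeled to
USER_R5 = 0o141756   # where the text says the user's R5 is stored


def read_registers(core):
    """Return the eight saved registers as a name -> value dict."""
    # Assumption: each register is one 16-bit little-endian word.
    values = struct.unpack_from("<8H", core, REG_BASE)
    return dict(zip(REG_NAMES, values))


def u_area_offset(ka6):
    """Byte offset of the per-process data in the dump: KA6 times octal 100."""
    return ka6 * 0o100


def walk_frames(u_area, r5):
    """Follow the chain of saved R5 values and return the saved PCs in order.

    Stops on reaching USER_R5, or early if the chain is broken (an odd,
    repeated, or out-of-range R5), in which case a plausible R5, PC pair
    has to be found by hand.
    """
    pcs = []
    seen = set()
    while r5 != USER_R5:
        off = r5 - U_BASE
        if r5 in seen or r5 % 2 or off < 0 or off + 4 > len(u_area):
            break  # broken chain
        seen.add(r5)
        # (R5) holds the old R5; (R5)+2 holds the caller's saved PC.
        old_r5, pc = struct.unpack_from("<2H", u_area, off)
        pcs.append(pc)
        r5 = old_r5
    return pcs
```

       Each PC returned would still have to be looked up in the system's name
       list, as described above.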

SEE ALSO
       clri(1), icheck(1), dcheck(1), boot(8)

                                                                      CRASH(8)