DOC2TXT(1)                  General Commands Manual                 DOC2TXT(1)

NAME
       doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
       - extract printable text from Microsoft documents

SYNOPSIS
       doc2txt [ file.doc ]
       doc2ps [ file.doc ]
       wdoc2txt [ file.doc ]
       xls2txt [ file.xls ]
       aux/olefs [ -m mtpt ] file.doc
       aux/mswordstrings mtpt/WordDocument
       aux/msexceltables  [  -qaDnt  ]  [  -d delim ] [ -c column-range ] [ -w
       worksheet-range ] mtpt/Workbook

DESCRIPTION
       Doc2txt is an rc(1) script that uses olefs and mswordstrings to extract
       the printable text from the body of a Microsoft Word document and write
       it on the standard output.  Doc2ps is  similar,  but  emits  PostScript
       corresponding  to  the  document.   Wdoc2txt is similar to doc2txt, but
       uses plumb(1) to send the output  to  a  new  acme(1)  window  instead.
       Xls2txt performs a similar function for Microsoft Excel documents.

       Microsoft Office documents are stored in OLE (Object Linking and
       Embedding) format, which is a scaled-down version of Microsoft's FAT
       file system.  Olefs presents the contents of an MS Office document as
       a file system on mtpt, which defaults to /mnt/doc.  Mswordstrings or
       msexceltables may then be used to parse the files inside, extracting
       a text stream.  Msexceltables accepts the following options to control
       the formatting of its output:

       -a     Attempts conversion of non-tabular sheets in the workbook
              (charts).

       -d delim
              Sets the inter-field delimiter to the string delim, by default a
              single space.

       -D     Enables debugging output.

       -c range
              Range is a comma-separated list of column numbers and ranges;
              a range is written as two numbers separated by a dash, as in
              3-5.  Processing is limited to the columns named; by default
              all columns are output.

       -n     Disables field padding to column width.

       -q     Disables quoting of textual fields (see quote(2)).

       -t     Truncates fields to the column width.

       -w range
              Range is a comma-separated list of worksheet numbers and
              ranges, in the same syntax as the -c option above; only the
              sheets named are output.  Suppressed chart pages are always
              included in the sheet count.
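
       The range syntax shared by -c and -w can be illustrated with a short
       Python sketch.  It is not the msexceltables source: the function name
       parse_ranges is hypothetical, and treating ranges as inclusive and
       rejecting descending ranges are assumptions this page does not
       confirm.

```python
def parse_ranges(spec):
    """Expand a spec such as '1,7,9-14' into a sorted list of integers.

    Assumptions (not confirmed by the manual page): ranges include both
    endpoints, and a descending range such as '5-3' is an error.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError("empty element in range list")
        if "-" in part:
            # A range: two numbers separated by a dash.
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError("descending range: " + part)
            selected.update(range(low, high + 1))
        else:
            # A single column or worksheet number.
            selected.add(int(part))
    return sorted(selected)


if __name__ == "__main__":
    print(parse_ranges("1,7,9-14"))  # the -w argument from the EXAMPLE section
    print(parse_ranges("3-5"))       # the -c argument from the EXAMPLE section
```

       Under these assumptions, 1,7,9-14 selects eight sheets and 3-5
       selects three columns.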

EXAMPLE
       Extract pieces of an MS Excel spreadsheet.
              aux/olefs report.xls
              aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
              unmount /mnt/doc
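
       How -d, -n and -t shape each output row can be pictured with the
       following Python sketch.  It is an illustration only: format_row and
       its parameters are hypothetical names, the sample data is invented,
       and left-justified padding is an assumption.  Quoting (disabled by -q
       in the example above) is not modelled.

```python
def format_row(fields, widths, delim=" ", pad=True, truncate=False):
    """Join one row of text fields in the manner the options describe.

    delim    -- inter-field delimiter (-d); a single space by default
    pad      -- pad each field to its column width; -n turns this off
    truncate -- cut each field to its column width (-t)
    Left-justified padding is an assumption of this sketch.
    """
    cells = []
    for text, width in zip(fields, widths):
        if truncate:
            text = text[:width]
        if pad:
            text = text.ljust(width)
        cells.append(text)
    return delim.join(cells)


if __name__ == "__main__":
    row = ["alpha", "42", "gamma-delta"]   # invented sample data
    widths = [8, 4, 6]                      # invented column widths
    print(repr(format_row(row, widths)))                  # defaults
    print(repr(format_row(row, widths, truncate=True)))   # as with -t
    print(repr(format_row(row, widths, delim="@", pad=False)))  # as with -n -d '@'
```

       With padding disabled and '@' as the delimiter, as in the example
       above, the sample row is written as alpha@42@gamma-delta.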

SOURCE
       /rc/bin
              doc2txt, doc2ps, wdoc2txt, and xls2txt

       /sys/src/cmd/aux
              the others

SEE ALSO
       strings(1)
       "Microsoft Word 97 Binary File Format", at Microsoft's developer
       (MSDN) home page.
       "LAOLA Binary Structures", http://user.cs.tu-berlin.de/~schwartz/pmh
       "OpenOffice.Org's Excel Documentation",
       http://sc.openoffice.org/excelfileformat.pdf

                                                                    DOC2TXT(1)