DOC2TXT(1)                  General Commands Manual                 DOC2TXT(1)

NAME
       doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
       - extract printable text from Microsoft documents

SYNOPSIS
       doc2txt [ file.doc ]
       doc2ps [ file.doc ]
       wdoc2txt [ file.doc ]
       xls2txt [ file.xls ]
       aux/olefs [ -m mtpt ] file.doc
       aux/mswordstrings mtpt/WordDocument
       aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ]
              [ -w worksheet-range ] mtpt/Workbook

DESCRIPTION
       Doc2txt is an rc(1) script that uses olefs and mswordstrings to
       extract the printable text from the body of a Microsoft Word document
       and write it on the standard output. Doc2ps is similar, but emits
       PostScript corresponding to the document. Wdoc2txt is similar to
       doc2txt, but uses plumb(1) to send the output to a new acme(1) window
       instead. Xls2txt performs a similar function for Microsoft Excel
       documents.

       Microsoft Office documents are stored in OLE (Object Linking and
       Embedding) format, which is a scaled-down version of Microsoft's FAT
       file system. Olefs presents the contents of an MS Office document as
       a file system on mtpt, which defaults to /mnt/doc. Mswordstrings or
       msexceltables may then be used to parse the files inside, extracting
       a text stream.

       Msexceltables may be given options to control the formatting of its
       output.

       -a     Attempt conversion of non-tabular sheets in the workbook
              (charts).

       -d delim
              Set the inter-field delimiter to the string delim, by default
              a single space.

       -D     Enable debugging output.

       -c range
              Range is a comma-separated list of column numbers and ranges.
              Ranges are separated by dashes. Limit processing to just those
              columns named; by default all columns are output.

       -n     Disable field padding to column width.

       -q     Disable quoting of textual fields (see quote(2)).

       -t     Truncate fields to the column width.

       -w range
              Range is a comma-separated list of worksheet numbers and
              ranges, in the same syntax as the -c option above; it limits
              output to the named sheets.
              Suppressed chart pages are always included in the sheet count.

EXAMPLE
       Extract pieces of an MS Excel spreadsheet.

              aux/olefs report.xls
              aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
              unmount /mnt/doc

SOURCE
       /rc/bin             doc2txt, doc2ps, wdoc2txt, and xls2txt
       /sys/src/cmd/aux    the others

SEE ALSO
       strings(1)
       ``Microsoft Word 97 Binary File Format'', at Microsoft's developer
       (MSDN) home page.
       ``LAOLA Binary Structures'', http://user.cs.tu-berlin.de/~schwartz/pmh
       ``OpenOffice.Org's Excel Documentation'',
       http://sc.openoffice.org/excelfileformat.pdf

DOC2TXT(1)
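
The range syntax accepted by the -c and -w options (comma-separated numbers and dash-separated ranges, as in 1,7,9-14) can be illustrated with a short Python sketch. This is not the parser used by msexceltables; the function name parse_ranges is hypothetical, and the sketch assumes that ranges are inclusive at both ends, which the text above does not state explicitly.

```python
def parse_ranges(spec):
    """Expand a range list such as "1,7,9-14" into a sorted list of ints.

    Assumption: a range "a-b" includes both a and b.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError("descending range: " + part)
            selected.update(range(low, high + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


if __name__ == "__main__":
    # The worksheet and column selections from the EXAMPLE section.
    print(parse_ranges("1,7,9-14"))  # [1, 7, 9, 10, 11, 12, 13, 14]
    print(parse_ranges("3-5"))       # [3, 4, 5]
```

Under these assumptions, the EXAMPLE invocation selects eight worksheets and three columns.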