Copyright (c) 2001-2003 Leon Bottou, Yann Le Cun, Patrick Haffner, Copyright (c) 2001 AT&T Corp., and Lizardtech, Inc. This is free documentation; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. The GNU General Public License's references to "object code" and "executables" are to be interpreted as the output of any document formatti...
NAME
djvutxt - Extract the hidden text from DjVu documents.
SYNOPSIS
djvutxt [options] inputdjvufile [outputtxtfile]
DESCRIPTION
Program djvutxt decodes the hidden text layer of the DjVu document inputdjvufile and writes it to the file outputtxtfile, or to the standard output when no output file is given. The hidden text layer is usually generated with optical character recognition (OCR) software.
Without options -detail and -escape, this program simply outputs the UTF-8 text. Option -detail causes the program to output S-expressions describing the text and its location. Option -escape uses C-style escape sequences to represent non-printable and non-ASCII characters.
OPTIONS
-page=pagespec
Specify which pages should be processed. When this option is not specified, the text of all pages of the document is concatenated into the output file. The page specification pagespec contains one or more comma-separated page ranges. A page range is either a single page number or two page numbers separated by a dash. For instance, specification 1-10 outputs pages 1 to 10, and specification 1,3,99999-4 outputs pages 1 and 3, followed by all the document pages in reverse order, from the last page down to page 4.
-detail=keyword
This option causes djvutxt to output S-expressions specifying the position of the text in the page. See the manual page djvused(1) for a description of the output format. Argument keyword specifies the finest level of detail for which text location is reported. The recognized values are: page, column, region, para, line, word, and char. All other values are interpreted as char.
-escape
Output C-style escape sequences for all non-ASCII or non-printable UTF-8 characters and for the backslash character.
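The page specification and the option syntax described above can be illustrated with a small Python sketch. It is not part of djvutxt and only approximates its behavior: the helper names expand_pagespec, effective_detail and build_djvutxt_command are hypothetical, the clamping of out-of-range page numbers to the last page is inferred from the 1,3,99999-4 example, and the option spellings -page=, -detail= and -escape are assumed to be accepted as written.

```python
DETAIL_LEVELS = ("page", "column", "region", "para", "line", "word", "char")


def expand_pagespec(spec: str, page_count: int) -> list[int]:
    """Expand a page specification such as "1,3,99999-4" into page numbers.

    Assumption: page numbers outside 1..page_count are clamped, which is
    how the "99999-4" example appears to reach the last page.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    pages = []
    for part in spec.split(","):
        first, dash, last = part.strip().partition("-")
        start = int(first)
        end = int(last) if dash else start
        start = min(max(start, 1), page_count)
        end = min(max(end, 1), page_count)
        # A range whose first page exceeds its last page runs in reverse.
        step = 1 if start <= end else -1
        pages.extend(range(start, end + step, step))
    return pages


def effective_detail(keyword: str) -> str:
    """Return the detail level used; unrecognized keywords count as "char"."""
    return keyword if keyword in DETAIL_LEVELS else "char"


def build_djvutxt_command(input_file, output_file=None, pagespec=None,
                          detail=None, escape=False):
    """Build an argument list for djvutxt (hypothetical helper)."""
    cmd = ["djvutxt"]
    if pagespec is not None:
        cmd.append(f"-page={pagespec}")
    if detail is not None:
        cmd.append(f"-detail={detail}")
    if escape:
        cmd.append("-escape")
    cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd
```

The argument list returned by build_djvutxt_command can be passed to subprocess.run on a system where djvutxt is installed. The file names used with it, such as doc.djvu, are placeholders.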