lucene-java-user mailing list archives

From "Jose Galiana" <>
Subject RE: PDF Text Stripper
Date Wed, 10 Jul 2002 07:59:32 GMT

I've used JPedal ( ). It is distributed under the LGPL license and
extracts raw text, among other uses.

I wrote code to extract text using the Etymon PJ library. For PDFs with
proprietary fonts, I needed to create a cross table to translate Unicode to
ASCII, because Distiller inserts only a subset of the Unicode table for each
proprietary font.
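
A minimal sketch of such a cross table, in Java. The class name, the
private-use code points, and the '?' fallback are all hypothetical, chosen
only for illustration. None of this comes from Etymon PJ or JPedal, and a
real table would have to be built per font from the codes that actually
appear in the PDF.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: not part of Etymon PJ, JPedal, or Lucene.
class FontCrossTable {
    // Maps a font-specific character code to its plain ASCII equivalent.
    private final Map<Character, Character> table = new HashMap<>();

    // Registers one entry of the cross table.
    void put(char fontCode, char ascii) {
        table.put(fontCode, ascii);
    }

    // Translates extracted text character by character.
    // Mapped codes are replaced, plain ASCII passes through unchanged,
    // and anything else becomes '?' (an assumed fallback).
    String translate(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character mapped = table.get(c);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else if (c < 128) {
                out.append(c);
            } else {
                out.append('?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FontCrossTable t = new FontCrossTable();
        // Made-up codes standing in for a proprietary font's subset.
        t.put('\uF048', 'H');
        t.put('\uF069', 'i');
        System.out.println(t.translate("\uF048\uF069!"));
    }
}
```

The table is kept per font because each proprietary font carries its own
subset, so the same code may mean different characters in different fonts.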

JPedal has no problem with those fonts and extracts all the text as XML,
suitable for use with Lucene.

-----Original Message-----
From: Ben Litchfield []
Sent: Tuesday, July 09, 2002 16:48
Subject: PDF Text Stripper


I have written a PDF library that can be used to strip text from PDF
documents.  It is released under LGPL so have fun.

There is one class which can be used to easily index PDF documents.
pdfparser.searchengine.lucene.LucenePDFDocument  has a getDocument
method which will take a PDF file and return a Lucene Document which you
can add to an index.

If you would like to see the quality of the text extraction you can run
pdfparser.Main from the command line which will take a PDF document and
write a txt file.

I am looking for any input that you might have.  Please mail me if you
have any bugs or feature requests.

The library can be retrieved from

-Ben Litchfield

To unsubscribe, e-mail:
For additional commands, e-mail:
