lucene-java-user mailing list archives

From xx28 <>
Subject RE: [ANN] PDFBox 0.6.0
Date Thu, 06 Mar 2003 14:30:47 GMT

I downloaded PDFBox and installed it, and I can use:
 java org.pdfbox.Main <PDF-file> <output-text-file>
to convert a .pdf file to a text file.

Then I tried to integrate it with Lucene. I modified the following code in

else if(file.getPath().endsWith(".pdf")) {
        Document doc =  LucenePDFDocument.getDocument(file);
        System.out.println("adding " + "pdf files");

It compiled without errors under ant (ant wardemo). However, when I tested:
java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

It seems that the new code was still not picked up, and the .pdf files
were still not indexed.

Did I miss something here?
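
One way to narrow this down is to check, outside of Lucene, which files would reach the new .pdf branch at all. The following is only a minimal sketch under stated assumptions, not the demo's actual code: the class name ExtensionDispatch and the method branchFor are hypothetical, and the .html/.htm/.txt branch is an assumption about what the existing else-if chain accepts. Note that the snippet above uses a case-sensitive endsWith(".pdf"), so a file named REPORT.PDF would not match it; the sketch lowercases the path first.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper: reports which branch of an extension-based
// if / else-if chain each file under a directory would take.
class ExtensionDispatch {

    // Returns a label for the branch a path would take.
    // The html/htm/txt list is an assumption, not taken from the demo source.
    static String branchFor(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm") || lower.endsWith(".txt")) {
            return "html-or-text";
        } else if (lower.endsWith(".pdf")) {
            return "pdf";
        }
        return "skipped";
    }

    // Recursively prints the chosen branch for every regular file under 'file'.
    static void walk(File file) {
        if (file.isDirectory()) {
            String[] names = file.list();
            if (names != null) {
                for (String name : names) {
                    walk(new File(file, name));
                }
            }
        } else {
            System.out.println(branchFor(file.getPath()) + "  " + file.getPath());
        }
    }

    public static void main(String[] args) {
        // Defaults to the current directory when no argument is given.
        walk(new File(args.length > 0 ? args[0] : "."));
    }
}
```

Running it as "java ExtensionDispatch <directory>" over the same directory that is being indexed lists every file with the branch it would take. If the .pdf files show up as "pdf" here but "adding pdf files" never prints during indexing, the modified class is probably not the one being loaded at run time.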



>===== Original Message From Lucene Users List =====
>I would like to announce the next release of PDFBox.  PDFBox allows for
>PDF documents to be indexed using lucene through a simple interface.
>Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
>which will extract all text and PDF document summary properties as lucene
>You can obtain the latest release from
>Please send all bug reports to me and attach the PDF document when
>RELEASE 0.6.0
>-Massive improvements to memory footprint.
>-Must call close() on the COSDocument(LucenePDFDocument does this for you)
>-Really fixed the bug where small documents were not being indexed.
>-Fixed bug where no whitespace existed between obj and start of object.
>    Exception in thread "main" expected='obj'
>    actual='obj<</Pro
>-Fixed issue with spacing where textLineMatrix was not being copied
> properly
>-Fixed 'bug' where parsing would fail with some pdfs with double endobj
> definitions
>-Added PDF document summary fields to the lucene document
>Thank you,
>Ben Litchfield