nutch-dev mailing list archives

From Andy Liu <andyliu1...@gmail.com>
Subject Re: PDF Parsing Revisited
Date Tue, 29 Mar 2005 20:52:45 GMT
We've been using pdftotext / parse-ext also.  It works well.  

We also ended up using pdftotext's -htmlmeta option so we could parse out
the PDF's title from the resulting HTML.  In some cases, where the
title cannot be parsed out of the PDF file, we use anchor text as the
page's title instead.
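
The title fallback described above could be sketched roughly as follows.
This is an illustration only, not the code we actually run: the names
TitleParser and choose_title are hypothetical, and it assumes the HTML
produced by pdftotext -htmlmeta has already been read into a string and
that the anchor text is supplied by the caller.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def choose_title(html, anchor_text):
    """Return the title found in the HTML, or the anchor text if there is none.

    Assumption: `html` is the output of pdftotext -htmlmeta, and
    `anchor_text` is the text of a link pointing at the PDF.
    """
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title or anchor_text.strip()
```

The HTML string would typically come from running something like
"pdftotext -htmlmeta file.pdf -" and capturing standard output. An empty or
missing title element falls through to the anchor text.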

On Tue, 29 Mar 2005 12:33:40 -0800, Doug Cutting <cutting@nutch.org> wrote:
> I'm currently using xpdf's pdftotext program to parse pdf, via the
> parse-ext plugin.  It seems much faster than PDFBox.
> 
> To try it, copy the attached plugin.xml file to
> 
>    build/plugins/parse-ext/plugin.xml
> 
> then copy the attached parse-pdf.sh script to
> 
>    bin/parse-pdf.sh
> 
> and make it executable
> 
>    chmod +x bin/parse-pdf.sh
> 
> finally, include the parse-ext plugin in your nutch-site.xml.
> 
> What do you think?
> 
> Doug
> 
> 
>
