nutch-dev mailing list archives

From "kranthi reddy" <>
Subject Re: Crawler Data
Date Tue, 27 May 2008 19:07:02 GMT
If you are trying to extract data from web pages, then you need to work on the
"parse-html" plugin code.
In the "src/plugin" directory you can also work on parsers for other formats,
such as PDF and MS Excel.
Parsing is the place to hook in, because pages are parsed before they are
indexed. So if you parse them in a different way and extract the data you
need, you can then index them using that extracted data.
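The parse-first, index-later idea above can be illustrated with a standalone sketch. This is not Nutch's actual Parser plugin API; it is just the extraction step in plain Java (the class name, the regex, and the sample HTML are all illustrative assumptions):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch: pull the <title> text out of raw HTML -- the kind of
// field a custom parse step would extract before the page is indexed.
// NOTE: a real Nutch parse plugin implements Nutch's Parser extension
// point; this regex approach is only a toy illustration of the idea.
public class TitleExtractor {

    // Case-insensitive match for the first <title>...</title> element;
    // DOTALL lets the title span line breaks.
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the extracted title, or an empty string if none is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Nutch Crawl Results</title>"
                    + "</head><body>...</body></html>";
        // The extracted value is what you would hand to the indexer.
        System.out.println(extractTitle(html));  // prints "Nutch Crawl Results"
    }
}
```

In a real plugin, this extracted field would be attached to the parse output so the indexing step can store it alongside (or instead of) the default parsed text.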

On Tue, May 27, 2008 at 5:16 PM, Jorge Conejero Jarque <>

> I would like to build an application using the Nutch API that can extract
> data from pages before they are indexed, so that I can apply some kind of
> modification or processing to them; I think this could become something
> useful and interesting.
> The problem is that I cannot find information on how to use the crawl
> part of the Nutch API.
> I only find tutorials that are run from the console with Cygwin and that
> only cover configuration and the creation of an index.
> I would appreciate your help, with some examples.
> Thanks.
> Regards.
> Jorge Conejero Jarque
> Dpto. Java Technology Group
> GPM Factoría Internet
> 923100300
