nutch-dev mailing list archives

From "kranthi reddy" <kranthili2...@gmail.com>
Subject Re: Crawler Data
Date Tue, 27 May 2008 19:07:02 GMT
Hi,
If you are trying to extract data from web pages, then you need to work on the
"parse-html" code.
In the "src/plugin" directory you can also work on other formats like "pdf" and
"msexcel", etc.
You need to work at the parsing stage because pages are parsed before they are
indexed. So if you parse them in a different way and extract the data you need,
you can then index them using the extracted data.
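To illustrate the idea, here is a minimal, self-contained Java sketch of the kind of extraction logic you might put inside a custom parse plugin. The class and method names here are purely illustrative, not the real Nutch API; in an actual plugin you would implement Nutch's parse extension point (e.g. alongside parse-html under "src/plugin") and feed the extracted fields into the index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: extract a field (the <title> text) from raw HTML,
// the way a custom parse step might before the page reaches the indexer.
// This is a standalone sketch, not Nutch plugin code.
public class TitleExtractor {

    // Simple regex for the title element; case-insensitive, spans newlines.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed <title> text, or an empty string if none is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Nutch Crawl Demo</title></head>"
                + "<body>hello</body></html>";
        System.out.println(extractTitle(html)); // prints "Nutch Crawl Demo"
    }
}
```

In a real plugin you would run logic like this during parsing and store the result as parse metadata, so that the indexing step can pick it up as its own field.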
bye
kranthi

On Tue, May 27, 2008 at 5:16 PM, Jorge Conejero Jarque <jconejero@gpm.es>
wrote:

> I would like to build an application using the Nutch API that can extract
> data from the pages before they are indexed, and apply some kind of
> modification or processing to them, because that could become something
> useful and interesting.
>
> The problem is that I cannot find information on how to use the crawl part
> of the Nutch API.
>
> I only find tutorials that are run from the console with Cygwin and that
> only explain configuration and how to create an index.
>
> It would help if you could give me some examples.
> Thanks.
>
> Best regards.
>
> Jorge Conejero Jarque
> Dpto. Java Technology Group
> GPM Factoría Internet
> 923100300
> http://www.gpm.es
>
>
>
