nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Selective/Configurable HTML Parsing?
Date Tue, 16 Oct 2007 19:35:33 GMT
Sagar Vibhute wrote:
> Hi,
> I need some help with understanding how the HTML parser works in nutch. I
> have to write a plugin which while crawling text will help me identify
> certain words/phrases that will be pre-specified.
> e.g.: I might want to index pages with a specific in case they have the name Jimi
> Hendrix occurring on them.
> In such a case, how do I write an extension that allows me to check for the
> occurrence of a certain word on the page? Meaning, where do I start? I have
> read the html parser code in the nutch source files, to an extent I could
> understand it. Is there a text-library/dictionary that nutch uses while it
> parses the page content? I read the documentation on neko parser, but am
> still not able to understand it completely.

You should take a look at the HtmlParseFilter interface; this is what
you need to implement as a plugin. The plugin receives the parsed HTML
document, so you can either traverse the document's DOM tree or analyze
the plain text extracted from the document.
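
As a rough illustration of the DOM-traversal route, here is a minimal
sketch that uses only the JDK's org.w3c.dom API. It is not code from the
Nutch source. The names PhraseFinder, collectText, containsPhrase and
parseXml are hypothetical, and the tiny XML parser call only stands in
for the DOM that Nutch would hand to your filter. In a real plugin, logic
like containsPhrase would presumably be called from the filter method of
your HtmlParseFilter implementation, whose exact signature depends on the
Nutch version you build against.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: checks a DOM tree for a phrase.
class PhraseFinder {

    // Depth-first walk that appends every text node, followed by a space
    // so that words from adjacent elements are not glued together.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Collapse runs of whitespace and lower-case, so matching ignores
    // case and line breaks.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // True if the text under root contains the phrase.
    static boolean containsPhrase(Node root, String phrase) {
        StringBuilder sb = new StringBuilder();
        collectText(root, sb);
        return normalize(sb.toString()).contains(normalize(phrase));
    }

    // Stand-in for the DOM a parser would supply (assumes well-formed input).
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml(
            "<html><body><p>Jimi <b>Hendrix</b> live</p></body></html>");
        System.out.println(containsPhrase(doc, "jimi hendrix")); // prints true
        System.out.println(containsPhrase(doc, "Janis Joplin")); // prints false
    }
}
```

Because a space is added after every text node, a word split across
inline tags (e.g. "Hen<b>drix</b>") would not match; if that matters for
your pages, running the same check against the extracted plain text may
be the simpler option.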

See also the documentation on the Wiki about how to write Nutch plugins.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
