nutch-dev mailing list archives

From "Sagar Vibhute"
Subject Selective/Configurable HTML Parsing?
Date Tue, 16 Oct 2007 06:57:28 GMT

I need some help understanding how the HTML parser works in Nutch. I
have to write a plugin that, while crawling, identifies certain
pre-specified words or phrases in the page text.

E.g., I might want to index pages with a specific in case they have the name Jimi
Hendrix occurring on them.

In such a case, how do I write an extension that lets me check for the
occurrence of a certain word on the page? In other words, where do I start?
I have read the HTML parser code in the Nutch source files and understood it
to some extent. Is there a text library or dictionary that Nutch uses while it
parses the page content? I have also read the documentation for the Neko
parser, but I still do not understand it completely.
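
To make the goal concrete, here is a minimal sketch of the kind of check described above, written in plain Java and independent of Nutch. The class name PhraseMatcher and the method findPhrases are hypothetical, not part of any Nutch API; the sketch assumes the page text has already been extracted as a String and that each phrase starts and ends with a letter or digit, so that word-boundary matching makes sense.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
class PhraseMatcher {

    // Returns the phrases that occur in the text as whole words, ignoring case.
    // Assumes every phrase begins and ends with a word character.
    static Set<String> findPhrases(String text, List<String> phrases) {
        Set<String> found = new LinkedHashSet<>();
        for (String phrase : phrases) {
            // Pattern.quote treats the phrase literally; \b anchors it to word boundaries.
            Pattern p = Pattern.compile(
                    "\\b" + Pattern.quote(phrase) + "\\b",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a crawled page.
        String pageText = "Live recordings of Jimi Hendrix at Woodstock.";
        System.out.println(findPhrases(pageText, List.of("Jimi Hendrix", "Janis Joplin")));
    }
}
```

How such a check would be wired into a Nutch plugin is the open part of the question; the sketch only covers the matching step.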

- Sagar
