nutch-dev mailing list archives

From Fredrik Andersson <fidde.anders...@gmail.com>
Subject Iterating spidered pages
Date Tue, 05 Jul 2005 08:58:33 GMT
Hi!

I'm new to this list, so hello to you all.

Here's the gig - I have crawled and indexed a bunch of pages. The HTML
parser used in Nutch only parses out the title, text, metadata and
outlinks. Is there any way to extend this set of attributes
post-crawling (i.e., without rewriting HtmlParser.java)? I'd like to
iterate over all the crawled pages, access their raw data, parse out
some chunk of text and save it as a detail field or similar.
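For what it's worth, the "parse out some chunk of text" step could look roughly like the sketch below. This is a generic illustration only, not the Nutch API: how you actually read raw page data back out of a segment depends on your Nutch version, so the extraction logic here just operates on a raw HTML string however you obtain it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DetailExtractor {
    // Pull the first <h1>...</h1> chunk out of raw HTML as a "detail" field.
    // The raw HTML string would come from wherever your crawl stores page
    // content; this class only shows the post-crawl extraction step.
    static String extractDetail(String rawHtml) {
        Pattern p = Pattern.compile("<h1[^>]*>(.*?)</h1>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(rawHtml);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>  Hello Nutch  </h1>"
                + "<p>body text</p></body></html>";
        System.out.println(extractDetail(page)); // prints "Hello Nutch"
    }
}
```

The saved string could then be written into whatever per-page "detail" storage you add on top of the index.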

I haven't really got the hang of all the connections in the API yet,
so forgive a poor guy for being a newbie.

Big thanks in advance,
Fredrik
