nutch-dev mailing list archives

From Matt Kangas
Subject Re: API for injecting content into Nutch?
Date Mon, 26 Sep 2005 19:48:17 GMT
Dave, you don't want to "inject" anything per se, at least according
to Nutch terminology. Instead, you'll want to create your own synthetic
crawler. Nutch's crawler outputs one "segment file" (a directory of
files, actually) per crawler pass. It is this segment that is
processed by the "nutch index" stage.

So, create a program that iterates through your content and writes it
to a segment file, simulating the crawler's output. Just read the
crawler's source to see how it uses
org.apache.nutch.segment.SegmentWriter and mimic that. Then follow
the rest of the tutorial as if your segment files had fallen out of
the real crawler.
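
The overall shape of such a program might look roughly like the sketch
below. It is only an outline under stated assumptions: SegmentSink,
InMemorySink, and Page are hypothetical names invented for illustration
and are not part of Nutch. In a real program the sink would be replaced
by calls to org.apache.nutch.segment.SegmentWriter, whose actual
constructor and write signature should be taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SyntheticCrawler {

    /** One item to index: URL, page content, metadata, and outgoing links. */
    static final class Page {
        final String url;
        final String content;
        final Map<String, String> metadata;
        final List<String> outlinks;

        Page(String url, String content,
             Map<String, String> metadata, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    /** Hypothetical stand-in for SegmentWriter; NOT the real Nutch API. */
    interface SegmentSink {
        void write(Page page);
        void close();
    }

    /** In-memory sink so the sketch runs without Nutch on the classpath. */
    static final class InMemorySink implements SegmentSink {
        final List<Page> written = new ArrayList<>();
        boolean closed = false;

        @Override
        public void write(Page page) {
            if (closed) {
                throw new IllegalStateException("segment already closed");
            }
            written.add(page);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /**
     * Writes every page to the sink, as one crawler pass would, and
     * closes the sink afterwards even if a write fails.
     * Returns the number of pages written.
     */
    static int writeSegment(Iterable<Page> pages, SegmentSink sink) {
        int count = 0;
        try {
            for (Page page : pages) {
                sink.write(page);
                count++;
            }
        } finally {
            sink.close();
        }
        return count;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
            new Page("http://example.com/a", "First page body",
                     Map.of("author", "dave"), List.of("http://example.com/b")),
            new Page("http://example.com/b", "Second page body",
                     Map.of(), List.of()));
        InMemorySink sink = new InMemorySink();
        int n = writeSegment(pages, sink);
        System.out.println("wrote " + n + " pages");
    }
}
```

Keeping the sink behind a small interface means the iteration logic can
be tested without Nutch, and only the sink has to change once the real
SegmentWriter calls are filled in.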


On Sep 26, 2005, at 2:32 PM, Goldschmidt, Dave wrote:

> Hello,
> Is there an API of some sort for injecting content into Nutch *without*
> using Nutch's crawler?  Or does anyone have ideas as to how to approach
> this problem?  I.e. given a URL, a page of content, metadata about the
> page, links, etc., how can I inject this into Nutch without Nutch
> performing the crawl?
> Thanks in advance for your ideas and insights,
> DaveG

Matt Kangas
