nutch-dev mailing list archives

From Jay Lorenzo <jay.lore...@gmail.com>
Subject Re: Automating workflow using ndfs
Date Fri, 02 Sep 2005 05:22:37 GMT
Thanks, that's good information - it sounds like I need to take a closer 
look at index deployment to see what the best solution is for automating 
index management.
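
To check my understanding of the rotation you describe below: once a merged
index has been copied out of NDFS to local disk by whatever means, I picture
the search-node side looking roughly like the sketch that follows. This is
only a sketch; the directory layout, the "current" symlink convention, and
the IndexRotator name are all made up for illustration, and refreshing the
.del files of the other indexes isn't shown.

// Rough sketch of rotating a freshly deployed index into service on a
// search node. Assumes the index directory has already been copied out
// of NDFS to local disk; names and layout are illustrative only.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class IndexRotator {

    public static void rotateIn(Path stagedIndex, Path liveLink) throws IOException {
        // Point a temporary symlink at the freshly copied index directory...
        Path tmpLink = liveLink.resolveSibling(liveLink.getFileName() + ".tmp");
        Files.deleteIfExists(tmpLink);
        Files.createSymbolicLink(tmpLink, stagedIndex.toAbsolutePath());

        // ...then swap it over the live link in one rename, so searchers
        // pick up the new index without re-deploying the older ones.
        Files.move(tmpLink, liveLink, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // e.g. java IndexRotator /search/staged/index-20050902 /search/indexes/current
        rotateIn(Paths.get(args[0]), Paths.get(args[1]));
    }
}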

The initial email was more about understanding the envisioned workflow for 
automating the creation of those indexes in an NDFS system, i.e., what 
choices are available for automating the 
fetchlist->crawl->updateDb->index 
part of the equation when you have one node hosting a webdb and a number of 
nodes crawling and indexing.
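
For concreteness, the kind of driver I'm imagining for one cycle is roughly 
the following, just shelling out to the command-line tools in sequence. The 
bin/nutch subcommand names, arguments, and db/segments layout here are 
assumptions on my part and probably need adjusting for the real tools:

// Rough sketch of a sequential driver for one fetchlist->crawl->updateDb->index
// cycle. Subcommand names, arguments, and directory layout are illustrative
// placeholders, not the real tool invocations.
import java.io.IOException;

public class CrawlCycleDriver {

    private static void run(String... command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.inheritIO();                              // show the tool's output directly
        int exit = pb.start().waitFor();
        if (exit != 0) {                             // stop the pipeline on the first failure
            throw new IOException("step failed: " + String.join(" ", command));
        }
    }

    public static void main(String[] args) throws Exception {
        String db = "db";                            // webdb location (placeholder)
        String segment = "segments/" + System.currentTimeMillis();

        run("bin/nutch", "generate", db, segment);   // produce the fetchlist
        run("bin/nutch", "fetch", segment);          // crawl it
        run("bin/nutch", "updatedb", db, segment);   // fold discovered links into the webdb
        run("bin/nutch", "index", segment);          // index the fetched segment
    }
}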

If I use a message-based system, I assume I would create new fetchlists at 
given locations in NDFS and message the fetchers to tell them where to find 
the fetchlists. Once a fetchlist is crawled, I then need to update the webdb 
with the links discovered during the crawl.
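
Reduced to a single JVM, the coordination I'm picturing looks something like 
the sketch below, with an in-memory queue standing in for whatever messaging 
layer would actually carry the notifications; the runFetch/runUpdateDb 
helpers and the NDFS paths are hypothetical:

// Sketch of message-based coordination: a coordinator enqueues NDFS
// fetchlist paths, fetcher nodes consume them, and completions are reported
// back so the webdb update can run. An in-memory queue stands in for a real
// messaging system; runFetch/runUpdateDb are hypothetical placeholders.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FetchCoordinator {

    private final BlockingQueue<String> fetchlists = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> completed = new LinkedBlockingQueue<>();

    // Coordinator side: announce a new fetchlist stored somewhere in NDFS.
    public void announceFetchlist(String ndfsPath) {
        fetchlists.add(ndfsPath);
    }

    // Fetcher side: each crawl node runs a loop like this.
    public void fetcherLoop() throws InterruptedException {
        while (true) {
            String path = fetchlists.take();   // block until a fetchlist is assigned
            runFetch(path);                    // run the fetcher over that fetchlist
            completed.add(path);               // report the crawled segment back
        }
    }

    // Coordinator side: once segments come back, fold their links into the webdb.
    public void updateLoop() throws InterruptedException {
        while (true) {
            String segment = completed.take();
            runUpdateDb(segment);              // run the webdb update step
        }
    }

    private void runFetch(String ndfsPath) { /* placeholder */ }
    private void runUpdateDb(String segment) { /* placeholder */ }
}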

Maybe this is too complex a solution, but my sense is that map-reduce 
systems still need some way to manage the workflow and control that has to 
happen if you want to create pipelines that generate indexes.

Thanks,

Jay Lorenzo

On 8/31/05, Doug Cutting <cutting@nutch.org> wrote:
> 
> I assume that in most NDFS-based configurations the production search
> system will not run out of NDFS. Rather, indexes will be created
> offline for a deployment (i.e., merging things to create an index per
> search node), then copied out of NDFS to the local filesystem on a
> production search node and placed in production. This can be done
> incrementally, where new indexes are deployed without re-deploying old
> indexes. In this scenario, new indexes are rotated in, replacing old
> indexes, and the .del file for every index is updated to reflect
> deduping. There is no code yet which implements this.
> 
> Is this what you were asking?
> 
> Doug
> 
> 
> Jay Lorenzo wrote:
> > I'm pretty new to nutch, but in reading through the mailing lists and other
> > papers, I don't think I've really seen any discussion on using ndfs with
> > respect to automating the end-to-end workflow for data that is going to be
> > searched (fetch->index->merge->search).
> >
> > The few crawler designs I'm familiar with typically have spiders
> > (fetchers) and
> > indexers on the same box. Once pages are crawled and indexed, the indexes
> > are pipelined to merge/query boxes to complete the workflow.
> >
> > When I look at the nutch design and ndfs, I'm assuming the design intent
> > for a 'pure ndfs' workflow is for the webdb to generate segments on an ndfs
> > partition, and once the updating of the webdb is completed, the segments
> > are processed 'on-disk' by the subsequent
> > fetcher/index/merge/query mechanisms. Is this a correct assumption?
> >
> > Automating this kind of continuous workflow usually is dependent on the
> > implementation of some kind of control mechanism to assure that the
> > correct sequence of operations is performed.
> >
> > Are there any recommendations on the best way to automate this
> > workflow when using ndfs? I've prototyped a continuous workflow system
> > using a traditional pipeline model with per-stage work queues, and I see
> > how that could be applied to a clustered filesystem like ndfs, but I'm
> > curious to hear what the design intent or best practice is envisioned
> > for automating ndfs based implementations.
> >
> >
> > Thanks,
> >
> > Jay
> >
>
