manifoldcf-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: Re-sending docs to output connector
Date Tue, 24 May 2011 21:01:05 GMT
The "Refetch all ingested documents" button works, but with Web crawling the problem is
that re-feeding will take almost as long as a fresh crawl.

The solutions could be:
A) Add a stand-alone cache in front of Solr
B) Add a caching proxy in front of MCF - will allow speedy re-crawl (but clunky to administer)
C) Extend MCF with an optional item cache. This could allow a "refeed from cache" button somewhere...

The cache in C could be realized externally to MCF, e.g. as a CouchDB cluster. To enable it,
you'd add the CouchDB access info to properties.xml.
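As a rough sketch only: no such properties exist in ManifoldCF today, but enabling an external item cache from properties.xml might look something like this (the property names below are invented for illustration):

```xml
<configuration>
  <!-- Hypothetical properties for an optional item cache (option C).
       These names are NOT real ManifoldCF properties. -->
  <property name="org.apache.manifoldcf.itemcache.enabled" value="true"/>
  <property name="org.apache.manifoldcf.itemcache.couchdb.url" value="http://localhost:5984/mcf-itemcache"/>
</configuration>
```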

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. mai 2011, at 15.11, Karl Wright wrote:

> ManifoldCF is designed to deal with the problem of repeated or
> continuous crawling, doing only what is needed on subsequent crawls.
> It is thus a true incremental crawler.  But in order for this to work
> for you, you need to let ManifoldCF do its job of keeping track of
> what documents (and what document versions) have been handed to the
> output connection.  For the situation where you change something in
> Solr, the ManifoldCF solution to that is the "refetch all ingested
> documents" button in the Crawler UI.  This is on the view page for the
> output connection.  Clicking that button will cause ManifoldCF to
> re-index all documents - but will also require ManifoldCF to recrawl
> them, because ManifoldCF does not keep copies of the documents it
> crawls anywhere.
> 
> If you need to avoid recrawling at all costs when you change Solr
> configurations, you may well need to put some sort of software of your
> own devising between ManifoldCF and Solr.  You basically want to
> develop a content repository which ManifoldCF outputs to, and which
> can then be scanned to feed your Solr instance.  I actually proposed
> this
> design for a Solr "guaranteed delivery" mechanism, because until Solr
> commits a document it can still be lost if the Solr instance is shut
> down.  Clearly something like this is needed, and it would likely
> solve your problem too.  The main issue, though, is that it would need
> to be integrated with Solr itself, because you'd really want it to
> pick up where it left off if Solr is cycled etc.  In my opinion this
> functionality really can't function as part of ManifoldCF for that
> reason.
> 
> Karl
> 
> On Tue, May 24, 2011 at 8:57 AM, Jan Høydahl <jan.asf@cominvent.com> wrote:
>> Hi,
>> 
>> Is there an easy way to separate fetching from ingestion?
>> I'd like to first run a crawl for several days, and then feed it to my Solr output as fast as possible.
>> Also, after schema changes in Solr, there is a need to re-feed all docs.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>> 

