manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Re-sending docs to output connector
Date Tue, 24 May 2011 13:11:41 GMT
ManifoldCF is designed to deal with the problem of repeated or
continuous crawling, doing only what is needed on subsequent crawls.
It is thus a true incremental crawler.  But in order for this to work
for you, you need to let ManifoldCF do its job of keeping track of
what documents (and what document versions) have been handed to the
output connection.  If you change something in Solr, the ManifoldCF
answer is the "refetch all ingested documents" button in the Crawler
UI, on the view page for the output connection.  Clicking that button
will cause ManifoldCF to re-index all documents, but it will also
require ManifoldCF to recrawl them, because ManifoldCF does not keep
copies of the documents it crawls anywhere.
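
The version tracking described above can be illustrated with a small
sketch.  This is not ManifoldCF's actual data model or API; the names
IngestTracker, crawl, and forget_all are hypothetical, and the
repository is modelled as a plain dict.  It only shows why an unchanged
document is skipped on a later crawl, and why forgetting what was
ingested forces every document to be fetched and sent again.

```python
# Hypothetical sketch of incremental ingestion; not ManifoldCF code.

class IngestTracker:
    """Remembers which version of each document went to the output."""

    def __init__(self):
        self._sent = {}  # document id -> version string last sent

    def needs_indexing(self, doc_id, version):
        # True for new documents and for documents whose version changed.
        return self._sent.get(doc_id) != version

    def record(self, doc_id, version):
        self._sent[doc_id] = version

    def forget_all(self):
        # Rough analogue of asking for everything to be re-ingested:
        # the next crawl treats every document as new.
        self._sent.clear()


def crawl(tracker, repository, send):
    """Send only new or changed documents; return the ids that were sent.

    repository maps doc_id -> (version, content); send(doc_id, content)
    stands in for the output connection.
    """
    sent = []
    for doc_id, (version, content) in repository.items():
        if tracker.needs_indexing(doc_id, version):
            send(doc_id, content)
            tracker.record(doc_id, version)
            sent.append(doc_id)
    return sent
```

Note that the tracker stores versions only, not content, so after
forget_all() the crawl has to read every document from the repository
again before it can be re-sent.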

If you need to avoid recrawling at all costs when you change Solr
configurations, you may well need to put some sort of software of your
own devising between ManifoldCF and Solr.  You basically want to
develop a content repository that ManifoldCF outputs to and that can
then be scanned to feed your Solr instance.  I actually proposed this
design for a Solr "guaranteed delivery" mechanism, because until Solr
commits a document it can still be lost if the Solr instance is shut
down.  Clearly something like this is needed, and it would likely
solve your problem too.  The main issue, though, is that it would need
to be integrated with Solr itself, because you'd really want it to
pick up where it left off if Solr is cycled etc.  In my opinion this
functionality really can't live inside ManifoldCF for that
reason.
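
A minimal sketch of such an intermediate store, under stated
assumptions: documents are staged in a local SQLite file, and a
checkpoint records the last sequence number the target has confirmed.
StagingStore, put, feed, and rewind are hypothetical names, and the
send callback stands in for whatever would actually post a batch to
Solr and wait for its commit.  No real ManifoldCF or Solr API is used,
and this does not address the integration with Solr itself.

```python
import sqlite3


class StagingStore:
    """Holds crawled documents so they can be fed, and re-fed, later."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS docs ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
            "doc_id TEXT UNIQUE, content TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoint ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "last_seq INTEGER NOT NULL)"
        )
        self.db.execute(
            "INSERT OR IGNORE INTO checkpoint (id, last_seq) VALUES (1, 0)"
        )
        self.db.commit()

    def put(self, doc_id, content):
        # Delete and re-insert so an updated document gets a fresh,
        # higher sequence number and is picked up by the next feed.
        self.db.execute("DELETE FROM docs WHERE doc_id = ?", (doc_id,))
        self.db.execute(
            "INSERT INTO docs (doc_id, content) VALUES (?, ?)",
            (doc_id, content),
        )
        self.db.commit()

    def feed(self, send, batch_size=100):
        """Send everything after the checkpoint; return the count sent.

        send(batch) must return only once the target has durably
        committed the batch.  If it raises, the checkpoint stays put
        and the next feed() resumes with the same batch.
        """
        total = 0
        while True:
            (last,) = self.db.execute(
                "SELECT last_seq FROM checkpoint WHERE id = 1"
            ).fetchone()
            rows = self.db.execute(
                "SELECT seq, doc_id, content FROM docs "
                "WHERE seq > ? ORDER BY seq LIMIT ?",
                (last, batch_size),
            ).fetchall()
            if not rows:
                return total
            send([(doc_id, content) for _, doc_id, content in rows])
            self.db.execute(
                "UPDATE checkpoint SET last_seq = ? WHERE id = 1",
                (rows[-1][0],),
            )
            self.db.commit()
            total += len(rows)

    def rewind(self):
        # After a schema change, reset the checkpoint to re-feed all docs.
        self.db.execute("UPDATE checkpoint SET last_seq = 0 WHERE id = 1")
        self.db.commit()
```

Because the checkpoint advances only after send returns, a batch that
was in flight during a failure is sent again on the next run, so
delivery is at-least-once and the target has to tolerate receiving the
same document id twice.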

Karl

On Tue, May 24, 2011 at 8:57 AM, Jan Høydahl <jan.asf@cominvent.com> wrote:
> Hi,
>
> Is there an easy way to separate fetching from ingestion?
> I'd like to first run a crawl for several days, and then feed it to my
> Solr output as fast as possible.
> Also, after schema changes in Solr, there is a need to re-feed all docs.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>
