manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Continuous crawling
Date Sun, 05 Jan 2014 14:50:56 GMT
Hi Florian,

It is impossible to *guarantee* that a document will be checked, because if
load on the crawler is high enough, it will fall behind.  But I will look
into adding the feature you request.

Karl


On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
schmeddi@informatik.uni-freiburg.de> wrote:

> Hi Karl,
>
> yes, in our case it is necessary to make sure that new documents are
> discovered and indexed within a certain interval. I have created a feature
> request on that. In the meantime we will try to use a scheduled job
> instead.
>
> Thanks for your help,
> Florian
>
>
> > Hi Florian,
> >
> > What you are seeing is "dynamic crawling" behavior.  The time between
> > refetches of a document is based on the history of fetches of that
> > document.  The recrawl interval is the initial time between document
> > fetches, but if a document does not change, the interval for the document
> > increases according to a formula.
> >
> > I would need to look at the code to be able to give you the precise
> > formula, but if you need a limit on the amount of time between document
> > fetch attempts, I suggest you create a ticket and I will look into adding
> > that as a feature.
> >
> > Thanks,
> > Karl
> >
> >
> >
> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
> > schmeddi@informatik.uni-freiburg.de> wrote:
> >
> >> Hello,
> >>
> >> the parameters reseed interval and recrawl interval of a continuous
> >> crawling job are not quite clear to me. The documentation says that the
> >> reseed interval is the time after which the seeds are checked again, and
> >> the recrawl interval is the time after which a document is checked for
> >> changes.
> >>
> >> However, we observed that the recrawl interval for a document increases
> >> after each check. On the other hand, the reseed interval seems to be set
> >> up correctly in the database metadata about the seed documents. Yet the
> >> web server does not receive requests each time the interval elapses, but
> >> only after several intervals have elapsed.
> >>
> >> We are using a web connector. The web server does not tell the client to
> >> cache the documents. Any help would be appreciated.
> >>
> >> Best regards,
> >> Florian
> >>
> >>
> >>
> >>
> >
>
>
>
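
The "dynamic crawling" behavior described in this thread can be illustrated with a small sketch. The thread does not give the actual ManifoldCF formula, so everything below is an assumption made for illustration: the doubling factor, the reset to the base interval when a document changes, and the names `next_recrawl_interval`, `simulate`, and `max_minutes` are all hypothetical. The optional `max_minutes` cap stands in for the kind of upper limit requested above, not for an existing ManifoldCF setting.

```python
from typing import List, Optional


def next_recrawl_interval(
    current_minutes: float,
    changed: bool,
    base_minutes: float,
    growth: float = 2.0,            # assumed growth factor, not ManifoldCF's
    max_minutes: Optional[float] = None,  # hypothetical upper bound
) -> float:
    """Return the wait before the next fetch attempt of one document."""
    if changed:
        # Assumption: a changed document goes back to the configured interval.
        return base_minutes
    # Unchanged document: stretch the interval, optionally capped.
    grown = current_minutes * growth
    if max_minutes is not None:
        grown = min(grown, max_minutes)
    return grown


def simulate(
    base_minutes: float, checks: int, max_minutes: Optional[float] = None
) -> List[float]:
    """Intervals used for a document that never changes."""
    intervals = []
    interval = base_minutes
    for _ in range(checks):
        intervals.append(interval)
        interval = next_recrawl_interval(
            interval, changed=False, base_minutes=base_minutes,
            max_minutes=max_minutes,
        )
    return intervals


if __name__ == "__main__":
    print(simulate(5, 5))                  # uncapped growth
    print(simulate(5, 5, max_minutes=30))  # growth with a cap
```

Under these assumptions, an unchanged document is fetched less and less often, which matches the observation that the web server only sees requests after several configured intervals have passed. With a cap, the wait stops growing once it reaches the limit, although, as noted above, a heavily loaded crawler could still fall behind that schedule.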
