manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Continuous crawling
Date Tue, 14 Jan 2014 11:09:45 GMT
Hi Florian,

I've never noted this behavior before.  I'll see if I can reproduce it here.

Karl



On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
schmeddi@informatik.uni-freiburg.de> wrote:

> Hi Karl,
>
> the scheduled job seems to work as expected. However, it runs two times:
> It starts at the beginning of the scheduled time, finishes, and
> immediately starts again. After finishing the second run it waits for the
> next scheduled time. Why does it run two times? The start method is "Start
> at beginning of schedule window".
>
> Yes, you're right about the checking guarantee. Currently, our interval is
> long enough for a complete crawler run.
>
> Best,
> Florian
>
>
> > Hi Florian,
> >
> > It is impossible to *guarantee* that a document will be checked, because if
> > load on the crawler is high enough, it will fall behind.  But I will look
> > into adding the feature you request.
> >
> > Karl
> >
> >
> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
> > schmeddi@informatik.uni-freiburg.de> wrote:
> >
> >> Hi Karl,
> >>
> >> yes, in our case it is necessary to make sure that new documents are
> >> discovered and indexed within a certain interval. I have created a feature
> >> request for that. In the meantime we will try to use a scheduled job instead.
> >>
> >> Thanks for your help,
> >> Florian
> >>
> >>
> >> > Hi Florian,
> >> >
> >> > What you are seeing is "dynamic crawling" behavior.  The time between
> >> > refetches of a document is based on the history of fetches of that
> >> > document.  The recrawl interval is the initial time between document
> >> > fetches, but if a document does not change, the interval for the document
> >> > increases according to a formula.
> >> >
> >> > I would need to look at the code to be able to give you the precise
> >> > formula, but if you need a limit on the amount of time between document
> >> > fetch attempts, I suggest you create a ticket and I will look into adding
> >> > that as a feature.
> >> >
> >> > Thanks,
> >> > Karl
> >> >
> >> >
> >> >
> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
> >> > schmeddi@informatik.uni-freiburg.de> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> the parameters reseed interval and recrawl interval of a continuous
> >> >> crawling job are not quite clear to me. The documentation says that the
> >> >> reseed interval is the time after which the seeds are checked again, and
> >> >> the recrawl interval is the time after which a document is checked for
> >> >> changes.
> >> >>
> >> >> However, we observed that the recrawl interval for a document increases
> >> >> after each check. On the other hand, the reseed interval seems to be set
> >> >> up correctly in the database metadata about the seed documents. Yet the
> >> >> web server does not receive requests each time the interval elapses, but
> >> >> only after several intervals have elapsed.
> >> >>
> >> >> We are using a web connector. The web server does not tell the client to
> >> >> cache the documents. Any help would be appreciated.
> >> >>
> >> >> Best regards,
> >> >> Florian
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >
>
>
>
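
The "dynamic crawling" behavior discussed in this thread (the interval for a
document grows while the document stays unchanged) and the requested limit on
the time between fetch attempts can be sketched roughly as below. This is only
an illustration, not ManifoldCF's code: the thread does not give the actual
formula, so the doubling rule, the class name RecrawlState, and the parameters
growth and max_interval are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RecrawlState:
    """Hypothetical per-document state; not ManifoldCF's real data model."""

    base_interval: float       # configured recrawl interval, in seconds
    max_interval: float        # assumed upper bound (the requested feature)
    growth: float = 2.0        # assumed growth factor; real formula unknown
    unchanged_checks: int = 0  # consecutive checks that found no change

    def record_check(self, changed: bool) -> None:
        # A change resets the history; an unchanged fetch extends it.
        self.unchanged_checks = 0 if changed else self.unchanged_checks + 1

    def next_interval(self) -> float:
        # Grow the interval with each unchanged check, but never past the cap.
        grown = self.base_interval * self.growth ** self.unchanged_checks
        return min(grown, self.max_interval)


state = RecrawlState(base_interval=60.0, max_interval=600.0)
for changed in (False, False, False, False, True):
    state.record_check(changed)
    print(changed, state.next_interval())
```

Under these assumptions the interval starts at the configured recrawl interval,
grows while the document is unchanged, stops growing at max_interval, and drops
back to the configured value as soon as a change is seen.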
