manifoldcf-user mailing list archives

From "Florian Schmedding" <schme...@informatik.uni-freiburg.de>
Subject Re: Continuous crawling
Date Tue, 14 Jan 2014 10:36:38 GMT
Hi Karl,

the scheduled job seems to work as expected. However, it runs two times:
It starts at the beginning of the scheduled time, finishes, and
immediately starts again. After finishing the second run it waits for the
next scheduled time. Why does it run two times? The start method is "Start
at beginning of schedule window".

Yes, you're right about the checking guarantee. Currently, our interval is
long enough for a complete crawler run.

Best,
Florian


> Hi Florian,
>
> It is impossible to *guarantee* that a document will be checked, because if
> load on the crawler is high enough, it will fall behind.  But I will look
> into adding the feature you request.
>
> Karl
>
>
> On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <schmeddi@informatik.uni-freiburg.de> wrote:
>
>> Hi Karl,
>>
>> yes, in our case it is necessary to make sure that new documents are
>> discovered and indexed within a certain interval. I have created a feature
>> request on that. In the meantime we will try to use a scheduled job instead.
>>
>> Thanks for your help,
>> Florian
>>
>>
>> > Hi Florian,
>> >
>> > What you are seeing is "dynamic crawling" behavior.  The time between
>> > refetches of a document is based on the history of fetches of that
>> > document.  The recrawl interval is the initial time between document
>> > fetches, but if a document does not change, the interval for the document
>> > increases according to a formula.
>> >
>> > I would need to look at the code to be able to give you the precise
>> > formula, but if you need a limit on the amount of time between document
>> > fetch attempts, I suggest you create a ticket and I will look into adding
>> > that as a feature.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> >
>> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <schmeddi@informatik.uni-freiburg.de> wrote:
>> >
>> >> Hello,
>> >>
>> >> the parameters reseed interval and recrawl interval of a continuous
>> >> crawling job are not quite clear to me. The documentation says that the
>> >> reseed interval is the time after which the seeds are checked again, and
>> >> the recrawl interval is the time after which a document is checked for
>> >> changes.
>> >>
>> >> However, we observed that the recrawl interval for a document increases
>> >> after each check. On the other hand, the reseed interval seems to be set
>> >> up correctly in the database metadata about the seed documents. Yet the
>> >> web server does not receive requests each time the interval elapses, but
>> >> only after several intervals have elapsed.
>> >>
>> >> We are using a web connector. The web server does not tell the client to
>> >> cache the documents. Any help would be appreciated.
>> >>
>> >> Best regards,
>> >> Florian
>> >>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>
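
Illustration of the "dynamic crawling" behavior described in the thread: the
recrawl interval is only the initial spacing between fetches, and it grows
while a document stays unchanged. The thread does not give the actual formula,
so everything below is a hypothetical sketch, not ManifoldCF code. The doubling
factor, the reset to the base interval when a change is seen, and the names
DocumentSchedule, growth_factor and max_interval_min are all assumptions. The
optional cap models the limit requested in the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentSchedule:
    """Hypothetical per-document recrawl state (not ManifoldCF's data model)."""

    base_interval_min: float      # configured recrawl interval, in minutes
    current_interval_min: float   # interval to wait before the next fetch
    growth_factor: float = 2.0    # assumed multiplier; real formula unknown
    max_interval_min: Optional[float] = None  # assumed cap (requested feature)

    def record_fetch(self, changed: bool) -> float:
        """Update the interval after a fetch and return the new value."""
        if changed:
            # Assumption: a changed document returns to the base interval.
            self.current_interval_min = self.base_interval_min
        else:
            # Unchanged document: back off, optionally limited by the cap.
            grown = self.current_interval_min * self.growth_factor
            if self.max_interval_min is not None:
                grown = min(grown, self.max_interval_min)
            self.current_interval_min = grown
        return self.current_interval_min
```

With a base interval of 60 minutes and a cap of 200 minutes, two unchanged
fetches in a row give intervals of 120 and then 200 minutes, and a fetch that
sees a change brings the interval back to 60. Without a cap the gap keeps
growing, which matches the observation that the web server only received
requests after several intervals had elapsed.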


