manifoldcf-user mailing list archives

From "Florian Schmedding" <schme...@informatik.uni-freiburg.de>
Subject Re: Continuous crawling
Date Wed, 15 Jan 2014 18:04:42 GMT
Hi Karl,

these are the values:
Priority: 5
Start method: Start at beginning of schedule window
Schedule type: Scan every document once
Minimum recrawl interval: Not applicable
Expiration interval: Not applicable
Reseed interval: Not applicable
Scheduled time: Any day of week, every hour (12 am through 11 pm)
Maximum run time: No limit
Job invocation: Complete

Maybe it is because I changed the job from continuous crawling to this
schedule. I also started it a few times manually. I didn't notice
anything strange in the job setup or in the corresponding entries in the
database.

Regards,
Florian

> Hi Florian,
>
> I was unable to reproduce the behavior you described.
>
> Could you view your job, and post a screen shot of that page?  I want to
> see what your schedule record(s) look like.
>
> Thanks,
> Karl
>
>
>
> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Florian,
>>
>> I've never noted this behavior before.  I'll see if I can reproduce it
>> here.
>>
>> Karl
>>
>>
>>
>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
>> schmeddi@informatik.uni-freiburg.de> wrote:
>>
>>> Hi Karl,
>>>
>>> the scheduled job seems to work as expected. However, it runs twice:
>>> it starts at the beginning of the scheduled time, finishes, and
>>> immediately starts again. After finishing the second run it waits for
>>> the next scheduled time. Why does it run twice? The start method is
>>> "Start at beginning of schedule window".
>>>
>>> Yes, you're right about the checking guarantee. Currently, our interval
>>> is
>>> long enough for a complete crawler run.
>>>
>>> Best,
>>> Florian
>>>
>>>
>>> > Hi Florian,
>>> >
>>> > It is impossible to *guarantee* that a document will be checked,
>>> > because if load on the crawler is high enough, it will fall behind.
>>> > But I will look into adding the feature you request.
>>> >
>>> > Karl
>>> >
>>> >
>>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
>>> > schmeddi@informatik.uni-freiburg.de> wrote:
>>> >
>>> >> Hi Karl,
>>> >>
>>> >> yes, in our case it is necessary to make sure that new documents are
>>> >> discovered and indexed within a certain interval. I have created a
>>> >> feature
>>> >> request on that. In the meantime we will try to use a scheduled job
>>> >> instead.
>>> >>
>>> >> Thanks for your help,
>>> >> Florian
>>> >>
>>> >>
>>> >> > Hi Florian,
>>> >> >
>>> >> > What you are seeing is "dynamic crawling" behavior.  The time
>>> >> > between refetches of a document is based on the history of fetches
>>> >> > of that document.  The recrawl interval is the initial time between
>>> >> > document fetches, but if a document does not change, the interval
>>> >> > for the document increases according to a formula.
>>> >> >
>>> >> > I would need to look at the code to be able to give you the precise
>>> >> > formula, but if you need a limit on the amount of time between
>>> >> > document fetch attempts, I suggest you create a ticket and I will
>>> >> > look into adding that as a feature.
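[The adaptive backoff Karl describes can be sketched roughly as below. This is only an illustration of the general idea; it is not ManifoldCF's actual formula, and the function name and all constants here are hypothetical.]

```python
# Hypothetical sketch of a "dynamic crawling" backoff. ManifoldCF's real
# formula lives in the framework code and may differ; all constants here
# are illustrative.

def next_recrawl_interval(current_interval, changed,
                          base_interval=60.0,       # minimum recrawl interval (minutes)
                          growth_factor=2.0,        # how quickly unchanged docs back off
                          max_interval=24 * 60.0):  # cap on the interval (minutes)
    """Return the next recrawl interval in minutes."""
    if changed:
        # A detected change resets the document to the base interval.
        return base_interval
    # An unchanged document is refetched less and less often, up to a cap.
    return min(current_interval * growth_factor, max_interval)

# Example: three unchanged fetches, then a change, then one more unchanged fetch.
interval = 60.0
for changed in (False, False, False, True, False):
    interval = next_recrawl_interval(interval, changed)
    # intervals seen: 120.0, 240.0, 480.0, 60.0, 120.0
```

Under this kind of scheme, a document that never changes ends up being fetched only at the capped interval, which matches the observation that recrawls become less frequent over time.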
>>> >> >
>>> >> > Thanks,
>>> >> > Karl
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
>>> >> > schmeddi@informatik.uni-freiburg.de> wrote:
>>> >> >
>>> >> >> Hello,
>>> >> >>
>>> >> >> the parameters reseed interval and recrawl interval of a
>>> >> >> continuous crawling job are not quite clear to me. The
>>> >> >> documentation says that the reseed interval is the time after
>>> >> >> which the seeds are checked again, and the recrawl interval is
>>> >> >> the time after which a document is checked for changes.
>>> >> >>
>>> >> >> However, we observed that the recrawl interval for a document
>>> >> >> increases after each check. On the other hand, the reseed
>>> >> >> interval seems to be set up correctly in the database metadata
>>> >> >> about the seed documents. Yet the web server does not receive a
>>> >> >> request each time the interval elapses, but only after several
>>> >> >> intervals have elapsed.
>>> >> >>
>>> >> >> We are using a web connector. The web server does not tell the
>>> >> >> client to cache the documents. Any help would be appreciated.
>>> >> >>
>>> >> >> Best regards,
>>> >> >> Florian
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>
>


