Hi Karl,
Thank you for the quick response.
It seems that I have completely misunderstood the specifications, so it would be helpful if you
could show specific examples for each hop count mode.
Is my understanding below correct?
- "Keep unreachable documents, for now" and "Keep unreachable documents, forever" are the
settings that do not delete documents from the index that were not crawled.
- Hop count dependency information is like a cache of the link structure. This link structure
is not recreated in "Keep unreachable documents, forever" mode, so it is faster to crawl.
The reason I am asking these questions is that a document was deleted that I did not expect
to be deleted.
Is there any way to prevent it from being deleted? What does "keep" refer to in "Keep
unreachable documents"?
Sincerely,
Issei Nishigata
On 2019/11/08 2:19, Karl Wright wrote:
> Hi Issei,
>
> The setting of "Keep unreachable documents forever" basically means that no hop count
> dependency information is kept around for any crawls done when that setting is in place.
> That means that when links change or documents change the system does not know how to
> recompute the hopcount accurately. This setting is appropriate if you want your crawl to
> be as fast as possible and do not expect ever to use hop count filtering for the job in
> question.
>
> The "keep unreachable documents for now" means that enough information is kept around
> that if you decided to put a hop count filter into place later, it would still work
> properly.
>
> Hope that helps.
>
> Karl
>
>
> On Thu, Nov 7, 2019 at 11:01 AM Issei Nishigata <duo.2029@gmail.com> wrote:
>
> Hi All,
>
>
> I use MCF 2.12, and I am confused about the specification of the HopFilters setting
> "Keep unreachable documents".
>
> I understand that the "Keep unreachable documents, for now" and "Keep unreachable
> documents, forever" options of HopFilters are settings that take effect when a HopCount
> is specified.
>
> For example, suppose I crawl all data with an empty HopCount value the first time and
> then, the second time, set HopCount to 0 with "Keep unreachable documents, for now".
> My understanding is that only the first layer of the directory will be crawled, and that
> the second and deeper layers, which are not crawled, will not be deleted from the index.
>
> However, when I actually run the job with the above settings, documents in the second
> layer are deleted from the index on the second and subsequent runs. It works the same
> way when using "Keep unreachable documents, forever".
>
> Is there anything wrong with my understanding? Also, does anyone know the difference
> between these two settings, "Keep unreachable documents, for now" and "Keep unreachable
> documents, forever"?
>
> If any of you know the specs of these settings, it would be very helpful if you could
> share your advice.
> Any clue would be much appreciated.
>
>
> Sincerely,
> Issei Nishigata
>