manifoldcf-user mailing list archives

From Issei Nishigata <duo.2...@gmail.com>
Subject Re: Specifications of HopFilters "Keep unreachable documents"
Date Fri, 08 Nov 2019 15:10:21 GMT
Hi Karl,


Thank you for the quick response.

It seems that I have completely misunderstood the specification, so it would be helpful if
you could show specific examples for each hop count mode.

Is my understanding below correct?
- "Keep unreachable documents, for now" and "Keep unreachable documents, forever" are settings
  that do not delete uncrawled documents from the index.
- Hop count dependency information is like a cache of the link structure. This link structure
  is not recreated in "Keep unreachable documents, forever" mode, so crawling is faster.

The reason I am asking is that a document was deleted that I did not expect to be deleted.
Is there any way to prevent that deletion? What exactly is "kept" by "Keep unreachable documents"?


Sincerely,
Issei Nishigata



On 2019/11/08 2:19, Karl Wright wrote:
> Hi Issei,
> 
> The setting of "Keep unreachable documents forever" basically means that no hop count
> dependency information is kept around for any crawls done when that setting is in place.
> That means that when links change or documents change the system does not know how to
> recompute the hopcount accurately.  This setting is appropriate if you want your crawl to
> be as fast as possible and do not expect ever to use hop count filtering for the job in
> question.
> 
> The "keep unreachable documents for now" means that enough information is kept around
> that if you decided to put a hop count filter into place later, it would still work
> properly.
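
To make the term in the quoted reply concrete: a hop count can be read as the smallest number of links that must be followed from a seed to reach a document. The sketch below is only an illustration of that idea and of why a changed link forces a recomputation. It is not ManifoldCF's implementation, and the graph, the document names, and the helper functions hop_counts and within_limit are all hypothetical.

```python
from collections import deque


def hop_counts(links, seeds):
    """Breadth-first search: minimum number of link hops from any seed to each reachable document."""
    counts = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, []):
            if target not in counts:  # first visit is the shortest path in BFS
                counts[target] = counts[doc] + 1
                queue.append(target)
    return counts


def within_limit(counts, max_hops):
    """Documents whose hop count does not exceed the filter value."""
    return {doc for doc, hops in counts.items() if hops <= max_hops}


# Hypothetical link graph: "seed" links to "a" and "b", and "a" links to "c".
links = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
counts = hop_counts(links, ["seed"])

# If the link from "seed" to "a" disappears, "a" and "c" can no longer be reached,
# so their hop counts have to be recomputed from the link structure.
changed_links = {"seed": ["b"], "a": ["c"], "b": [], "c": []}
changed_counts = hop_counts(changed_links, ["seed"])
```

In this toy graph a filter value of 0 leaves only the seed itself, which matches the "first layer only" situation described in the original question. What happens to documents outside the limit depends on the mode selected in ManifoldCF, which this sketch does not model.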
> 
> Hope that helps.
> 
> Karl
> 
> 
> On Thu, Nov 7, 2019 at 11:01 AM Issei Nishigata <duo.2029@gmail.com> wrote:
> 
>     Hi All,
> 
> 
>     I use MCF 2.12, and I am confused about the specification of the HopFilters
>     "Keep unreachable documents" settings.
> 
>     I understand that the HopFilters settings "Keep unreachable documents, for now" and
>     "Keep unreachable documents, forever" are effective when a HopCount is specified.
> 
>     For example, I crawl all data with an empty HopCount value the first time, and the
>     second time I set HopCount to 0 with "Keep unreachable documents, for now". I expected
>     that only the first layer of the directory would be crawled, and that the second and
>     deeper layers, which are not crawled, would not be deleted from the index.
> 
>     However, when I actually run the job with the above settings, documents in the second
>     layer are deleted from the index on the second and subsequent runs. It works the same
>     way when using "Keep unreachable documents, forever".
> 
>     Is there anything wrong with my understanding? Also, does anyone know the difference
>     between these two settings, "Keep unreachable documents, for now" and
>     "Keep unreachable documents, forever"?
> 
>     If anyone knows the specification of these settings, it would be very helpful if you
>     could share your advice. Any clue would be much appreciated.
> 
> 
>     Sincerely,
>     Issei Nishigata
> 
