manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Specifications of HopFilters "Keep unreachable documents"
Date Fri, 08 Nov 2019 15:46:02 GMT
'The reason I am asking these questions is that a document was deleted that I
thought would not be.'

This would only happen if you had "Delete unreachable documents" as the
selection.  Otherwise it would not happen.

It sounds to me like you just want to disable hopcount filters entirely.
In that case, leave the hop count filter value empty, and if you are sure
about this being the way you want to run the job forever, choose the "keep
unreachable documents forever" selection.

I am not in a position (I do not have the time) to work out detailed
examples.  You can explore the behavior on your own if you so desire.
Karl


On Fri, Nov 8, 2019 at 10:10 AM Issei Nishigata <duo.2029@gmail.com> wrote:

> Hi Karl,
>
>
> Thank you for a quick response.
>
> It seems that I have completely misunderstood the specifications so it'd
> be helpful if you could show specific examples for each Hop count mode.
>
> Is my understanding below correct?
> - "Keep unreachable documents, for now" and "... forever" are the settings
> that do not delete documents from the index that were not crawled.
> - Hop count dependency information is like a cache of the link structure.
> This link structure is not recreated in "keep unreachable documents
> forever" mode, so it is faster to crawl.
>
> The reason I am asking these questions is that a document was deleted that I
> thought would not be.
> Is there any way to keep it from being deleted? What does "keep" mean in "keep
> unreachable documents"?
>
>
> Sincerely,
> Issei Nishigata
>
>
>
> On 2019/11/08 2:19, Karl Wright wrote:
> > Hi Issei,
> >
> > The setting of "Keep unreachable documents forever" basically means that
> > no hop count dependency information is kept around for any crawls done
> > when that setting is in place.  That means that when links change or
> > documents change the system does not know how to recompute the hopcount
> > accurately.  This setting is appropriate if you want your crawl to be as
> > fast as possible and do not expect ever to use hop count filtering for
> > the job in question.
> >
> > The "keep unreachable documents for now" means that enough information
> > is kept around that if you decided to put a hop count filter into place
> > later, it would still work properly.
> >
> > Hope that helps.
> >
> > Karl
> >
> >
> > On Thu, Nov 7, 2019 at 11:01 AM Issei Nishigata <duo.2029@gmail.com> wrote:
> >
> >     Hi All,
> >
> >
> >     I use MCF 2.12, and I am confused about the specification of the
> >     HopFilters "Keep unreachable documents" settings.
> >
> >     I understand that "Keep unreachable documents, for now" and "Keep
> >     unreachable documents, forever" in HopFilters
> >     are settings that take effect when a HopCount is specified.
> >
> >     For example, if I crawl all data with an empty HopCount value the
> >     first time and then, the second time,
> >     put 0 in the HopCount value with "Keep unreachable documents,
> >     for now", I expect that only the first layer of the directory
> >     will be crawled and that the second and deeper layers, which are not
> >     crawled, will not be deleted from the index.
> >
> >     However, when I actually run the job with the above settings, documents
> >     on the second layer are deleted from the index
> >     on the second and subsequent runs. It works the same way when
> >     using "Keep unreachable documents, forever".
> >
> >     Is there anything wrong with my understanding? Also, does anyone know
> >     the difference between these two settings,
> >     "Keep unreachable documents, for now" and "Keep unreachable
> >     documents, forever"?
> >
> >     If any of you know the specs of these settings, it would be
> >     very helpful if you could share your advice.
> >     Any clue would be much appreciated.
> >
> >
> >     Sincerely,
> >     Issei Nishigata
> >
>
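
The three hop count modes discussed in this thread can be summarized as a small toy model. This is only a sketch of the behavior as Karl describes it above, not ManifoldCF code: the enum and function names are hypothetical, and the assumption that "Delete unreachable documents" also keeps dependency information is mine rather than something stated in the thread.

```python
from enum import Enum


class HopcountMode(Enum):
    # Labels follow the selections named in the thread; the enum itself
    # is a hypothetical illustration, not part of the ManifoldCF API.
    DELETE_UNREACHABLE = "Delete unreachable documents"
    KEEP_FOR_NOW = "Keep unreachable documents, for now"
    KEEP_FOREVER = "Keep unreachable documents, forever"


def removes_unreachable(mode: HopcountMode) -> bool:
    """Per the thread, deletion only happens under the 'Delete' selection."""
    return mode is HopcountMode.DELETE_UNREACHABLE


def keeps_dependency_info(mode: HopcountMode) -> bool:
    """'Forever' keeps no hop count dependency information.

    Assumption: 'Delete' keeps it as well, since it relies on hop counts.
    """
    return mode is not HopcountMode.KEEP_FOREVER


def can_enable_filter_later(mode: HopcountMode) -> bool:
    """A hop count filter added later only works if dependency info exists."""
    return keeps_dependency_info(mode)


if __name__ == "__main__":
    for mode in HopcountMode:
        print(
            f"{mode.value}: "
            f"deletes={removes_unreachable(mode)}, "
            f"keeps_info={keeps_dependency_info(mode)}, "
            f"filter_later={can_enable_filter_later(mode)}"
        )
```

Read this way, "for now" trades some crawl speed for the option of adding a hop count filter later, while "forever" gives up that option in exchange for the fastest crawl.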
