manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Manifoldcf - Job Deletion Process
Date Fri, 01 Nov 2019 10:51:33 GMT
There is a "Hop filters" tab in the job.  This allows you to specify the
maximum number of hops from the seed documents that are allowed.  Or you
can turn it off entirely, if you do not want this feature.
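
Conceptually, the hop filter bounds a breadth-first walk outward from the
seeds. Here is a rough Python sketch of that idea (an illustration only, not
ManifoldCF's actual code; the helper get_links(url), which returns a page's
outbound links, is hypothetical):

    from collections import deque

    def reachable_within_hops(seeds, get_links, max_hops):
        # Bounded breadth-first walk from the seed documents.
        hops = {url: 0 for url in seeds}   # url -> hop count from a seed
        queue = deque(seeds)
        while queue:
            url = queue.popleft()
            if hops[url] == max_hops:
                continue                   # don't follow links past the limit
            for link in get_links(url):
                if link not in hops:
                    hops[link] = hops[url] + 1
                    queue.append(link)
        return set(hops)

Anything outside that set is, from the job's point of view, unreachable.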

Bear in mind that documents that are unreachable by *any* means from the
seed documents will always be deleted at the end of each job run.  So if
you are relying on some special page you generate to point at all the
documents you want to crawl, make sure it has a complete list.  If you try
to make an incremental list of just the new documents, then all the old
ones will get removed.
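
To see why, here is a minimal sketch of the end-of-job cleanup, building on
the function above (again illustrative only; previously_indexed is a
hypothetical set of URLs already in the index):

    def end_of_job_deletions(previously_indexed, seeds, get_links, max_hops):
        # Whatever was indexed before but is no longer reachable from the
        # current seeds gets deleted at the end of the job run.
        reachable = reachable_within_hops(seeds, get_links, max_hops)
        return previously_indexed - reachable

    # If your generated seed page lists only the *new* documents, every old
    # document drops out of `reachable` and is deleted, even though its URL
    # is still perfectly valid on the site.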

Karl


On Fri, Nov 1, 2019 at 6:41 AM Priya Arora <priya@smartshore.nl> wrote:

> Yes, I have set up authentication properly, as we have configured this
> setting by passing this info in the header.
>
> (1) They are now unreachable, whereas they were reachable before by the
> specified number of hops from the seed documents; - But if I compare it
> with the previous index, where the data is not very old (about a week
> before), the documents (the deleted ones) were ingested, and when I check
> them now they do not result in 404.
> Regarding "the specified number of hops from the seed documents": can you
> please elaborate on that a little?
>
> Thanks
> Priya
>
> On Fri, Nov 1, 2019 at 3:43 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Priya,
>>
>> ManifoldCF doesn't delete documents unless:
>> (1) They are now unreachable, whereas they were reachable before by the
>> specified number of hops from the seed documents;
>> (2) They cannot be fetched due to a 404 error, or something similar which
>> tells ManifoldCF that they are not available.
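>>
>> Roughly, the decision looks like this (an illustrative Python sketch, not
>> the actual ManifoldCF source; GONE_STATUSES and both inputs are
>> hypothetical):
>>
>>     GONE_STATUSES = {404, 410}
>>
>>     def should_delete(doc, reachable, status):
>>         # (1) no longer reachable within the hop limit from the seeds
>>         if doc not in reachable:
>>             return True
>>         # (2) the server says the document is gone (404 or similar)
>>         return status in GONE_STATUSES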
>>
>> Your site, I notice, has a "sso" page.  Are you setting up session
>> authentication properly?
>>
>> Karl
>>
>>
>> On Fri, Nov 1, 2019 at 3:59 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>> [image: del2.JPG]
>>>
>>> Screenshot of the deleted documents other than PDFs
>>>
>>> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora <priya@smartshore.nl> wrote:
>>>
>>>> The job was started as per the below schedule:
>>>> [image: job.JPG]
>>>>
>>>> And just before the completion of the job, it started the deletion
>>>> process. Before starting the job, a new index in ES was created and the
>>>> database was cleaned up.
>>>> [image: deletion.JPG]
>>>>
>>>> Records were processed and indexed successfully. When I check these
>>>> URLs (the deleted ones) in a browser, they seem to be valid URLs and are
>>>> accessible.
>>>> The job is to crawl around 2.25 lakh records, so the seed URLs have
>>>> many sub-links within them. If we think the crawler somehow deletes the
>>>> URLs because they were already present in the database, that should not
>>>> be the case, as the database cleanup was done before the run.
>>>>
>>>> If we think the crawler is deleting only documents with the PDF
>>>> extension, that is not the case, as HTML pages are also deleted.
>>>>
>>>> Can you please suggest something on this?
>>>>
>>>> Thanks
>>>> Priya
>>>>
>>>> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> So it looks like the URL ending in 117047 was successfully processed
>>>>> and indexed, and not removed.  The URLs ending in 119200 and lang-en
>>>>> were both unreachable and were removed.  I don't see a job end at all?
>>>>> There's a new job start at 12:39 though.
>>>>>
>>>>> What I want to see is the lifetime of one of the documents that you
>>>>> think is getting removed for no reason.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <priya@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> You want to test the job's whole process (start, end, and all events
>>>>>> from the History) by seeding one of these URLs. Below are the results:
>>>>>> I changed the seed URL to the one picked identifier; that document was
>>>>>> then fetched and indexed into a new index, and the deletion process
>>>>>> started.
>>>>>> [image: Start.JPG]
>>>>>> [image: Indexation and Deletion.JPG]
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok, so pick ONE of these identifiers.
>>>>>>>
>>>>>>> What I want to see is the entire lifecycle of the ONE identifier.
>>>>>>> That includes what the Web Connection logs as well as what the
>>>>>>> indexation logs.  Ideally I'd like to see:
>>>>>>>
>>>>>>> - job start and end
>>>>>>> - web connection events
>>>>>>> - indexing events
>>>>>>>
>>>>>>> I'd like to see these for both the job that indexes the document
>>>>>>> initially as well as the job run that deletes the document.
>>>>>>>
>>>>>>> My suspicion is that on the second run the document is simply no
>>>>>>> longer reachable from the seeds.  In other words, the seed documents
>>>>>>> either cannot be fetched on the second run or they contain different
>>>>>>> stuff and there's no longer a chain of links between the seeds and
>>>>>>> the documents being deleted.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Indexation screenshot is as below.
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I need both ingestion and deletion.
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The history is shown below; it does not indicate any error.
>>>>>>>>>> [image: 12.JPG]
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Priya
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> What does the history say about these documents?
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> "it may be that (a) they weren't found, or (b) that the document
>>>>>>>>>>>> specification in the job changed and they are no longer included
>>>>>>>>>>>> in the job."
>>>>>>>>>>>>
>>>>>>>>>>>> The URLs that were deleted are valid URLs (they do not result in
>>>>>>>>>>>> a 404 or page-not-found error), and they are not mentioned in the
>>>>>>>>>>>> Exclusions tab of the job configuration.
>>>>>>>>>>>> And the URLs were getting indexed earlier; except for the index
>>>>>>>>>>>> name in Elasticsearch, nothing has changed in the job
>>>>>>>>>>>> specification or in the other connectors.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Priya
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>>>>>>>>>> (non-continuous) job run it sees which documents it can find and
>>>>>>>>>>>>> removes the ones it can't.  The history for the documents being
>>>>>>>>>>>>> deleted should tell you why they are being deleted -- it may be
>>>>>>>>>>>>> that (a) they weren't found, or (b) that the document
>>>>>>>>>>>>> specification in the job changed and they are no longer included
>>>>>>>>>>>>> in the job.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a query regarding the ManifoldCF job process. I have a
>>>>>>>>>>>>>> job to crawl an intranet site:
>>>>>>>>>>>>>> Repository Type: Web
>>>>>>>>>>>>>> Output Connector Type: Elasticsearch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The job has to crawl around 4-5 lakh records in total. I have
>>>>>>>>>>>>>> discarded the previous index and created a new index (in
>>>>>>>>>>>>>> Elasticsearch) with proper mappings and settings, and started
>>>>>>>>>>>>>> the job again after even cleaning the database (the database
>>>>>>>>>>>>>> used is PostgreSQL).
>>>>>>>>>>>>>> While the job runs it ingests the records properly, but just
>>>>>>>>>>>>>> before finishing (sometimes in between as well) it initiates
>>>>>>>>>>>>>> the deletion process, and it does not index the deleted
>>>>>>>>>>>>>> documents into the index again.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you please suggest whether I am doing anything wrong? Or is
>>>>>>>>>>>>>> this part of the ManifoldCF process? If yes, why are the
>>>>>>>>>>>>>> documents not getting ingested again?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks and regards
>>>>>>>>>>>>>> Priya
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
