manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Manifoldcf - Job Deletion Process
Date Fri, 01 Nov 2019 10:13:36 GMT
Hi Priya,

ManifoldCF doesn't delete documents unless:
(1) They are now unreachable, whereas they were reachable before by the
specified number of hops from the seed documents;
(2) They cannot be fetched due to a 404 error, or something similar which
tells ManifoldCF that they are not available.
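Condition (2) can be verified from outside the crawler by checking what HTTP status a document URL actually returns. A minimal sketch (the helper name is mine, not part of ManifoldCF):

```python
import urllib.request
import urllib.error

def check_url(url, timeout=10):
    """Return the HTTP status code a crawler would see for this URL.

    A 200 means the document is fetchable; a 404 (or similar) is the
    kind of response that tells ManifoldCF the document is gone.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # urlopen raises for 4xx/5xx; the status is on the exception
        return e.code
```

Note that a browser session (with SSO cookies) may see 200 where an unauthenticated crawler sees a redirect to a login page or an error, which is why the session-authentication question matters.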

Your site, I notice, has a "sso" page.  Are you setting up session
authentication properly?

Karl


On Fri, Nov 1, 2019 at 3:59 AM Priya Arora <priya@smartshore.nl> wrote:

> [image: del2.JPG]
>
> Screenshot of the deleted documents other than PDFs
>
> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora <priya@smartshore.nl> wrote:
>
>> The job was started as per the below schedule:
>> [image: job.JPG]
>>
>> Just before the completion of the job, it started the deletion
>> process. Before starting the job, a new index in ES was created and
>> the database was cleaned up.
>> [image: deletion.JPG]
>>
>> Records were processed and indexed successfully. When I check these
>> URLs (the deleted ones) in a browser, they appear to be valid and are
>> accessible.
>> The job is to crawl around 2.25 lakh records, so the seed URLs have
>> many sub-links within them. If the thought is that the URLs were
>> already present in the database and that is why the crawler somehow
>> deletes them, that should not be the case, as the database cleanup was
>> done before the run.
>>
>> If the thought is that the crawler is deleting only documents with a
>> PDF extension, that is not the case either, as HTML pages are also
>> deleted.
>>
>> Can you please suggest something on this?
>>
>> Thanks
>> Priya
>>
>> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> So it looks like the URL ending in 117047 was successfully processed and
>>> indexed, and not removed.  The URLs ending in 119200 and lang-en were both
>>> unreachable and were removed.  I don't see a job end at all?  There's a new
>>> job start at 12:39 though.
>>>
>>> What I want to see is the lifetime of one of the documents that you
>>> think is getting removed for no reason.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <priya@smartshore.nl> wrote:
>>>
>>>> You want to test the job's whole process, start, end, and all events
>>>> (from History), by seeding one of these URLs. Below are the results:
>>>> I changed the seed URL to the picked identifier; that document was
>>>> then fetched and indexed in a new index, and the deletion process
>>>> started.
>>>> [image: Start.JPG]
>>>> [image: Indexation and Deletion.JPG]
>>>>
>>>>
>>>>
>>>> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, so pick ONE of these identifiers.
>>>>>
>>>>> What I want to see is the entire lifecycle of the ONE identifier.
>>>>> That includes what the Web Connection logs as well as what the indexation
>>>>> logs.  Ideally I'd like to see:
>>>>>
>>>>> - job start and end
>>>>> - web connection events
>>>>> - indexing events
>>>>>
>>>>> I'd like to see these for both the job that indexes the document
>>>>> initially as well as the job run that deletes the document.
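Pulling out the lifecycle Karl asks for can be done by filtering an exported history report down to one identifier. A rough sketch, assuming the report has been exported to CSV with columns named `Start Time`, `Activity`, `Identifier`, and `Result Code` (those column names are my assumption; adjust them to match your actual export):

```python
import csv

def lifecycle(history_csv_path, identifier):
    """Return all history rows for one document identifier, in time order.

    Assumes a history report exported to CSV with 'Start Time',
    'Activity', 'Identifier', and 'Result Code' columns; the exact
    column names depend on how the report was exported.
    """
    with open(history_csv_path, newline="") as f:
        rows = [r for r in csv.DictReader(f)
                if r["Identifier"] == identifier]
    return sorted(rows, key=lambda r: r["Start Time"])
```

Running this for one deleted URL, once against the initial run's history and once against the deleting run's, gives the fetch/index/delete sequence Karl wants to compare.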
>>>>>
>>>>> My suspicion is that on the second run the document is simply no
>>>>> longer reachable from the seeds.  In other words, the seed documents
>>>>> either cannot be fetched on the second run or they contain different
>>>>> stuff and there's no longer a chain of links between the seeds and
>>>>> the documents being deleted.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> Indexation screenshot is as below.
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I need both ingestion and deletion.
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The history is shown below; it does not indicate any error.
>>>>>>>> [image: 12.JPG]
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Priya
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What does the history say about these documents?
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  it may be that (a) they weren't found, or (b) that the
>>>>>>>>>> document specification in the job changed and they are no
>>>>>>>>>> longer included in the job.
>>>>>>>>>>
>>>>>>>>>> The URLs that were deleted are valid URLs (they do not result
>>>>>>>>>> in a 404 or page-not-found error), and they are not mentioned
>>>>>>>>>> in the Exclusion tab of the job configuration.
>>>>>>>>>> And the URLs were getting indexed earlier; except for the index
>>>>>>>>>> name in Elasticsearch, nothing has changed in the job
>>>>>>>>>> specification or in the other connectors.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Priya
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> ManifoldCF is an incremental crawler, which means that on
>>>>>>>>>>> every (non-continuous) job run it sees which documents it can
>>>>>>>>>>> find and removes the ones it can't.  The history for the
>>>>>>>>>>> documents being deleted should tell you why they are being
>>>>>>>>>>> deleted -- it may be that (a) they weren't found, or (b) that
>>>>>>>>>>> the document specification in the job changed and they are no
>>>>>>>>>>> longer included in the job.
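Karl's incremental-crawl behavior can be illustrated as a reachability computation over the link graph: on each run, anything no longer reachable from the seeds within the hop limit becomes a deletion candidate. A loose sketch of the idea, not ManifoldCF's actual code:

```python
from collections import deque

def reachable(seeds, links, max_hops):
    """Return the set of documents within max_hops link-follows of seeds.

    `links` maps each URL to the URLs it links to.  On an incremental
    re-crawl, a document that was in this set last run but not this run
    is a deletion candidate.
    """
    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # don't follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return set(hops)
```

This is why a URL can be "valid and accessible in a browser" yet still deleted: the crawler deletes it not because fetching it failed, but because no chain of links from the seeds led to it on the latest run.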
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a query regarding the ManifoldCF job process. I have
>>>>>>>>>>>> a job to crawl an intranet site:
>>>>>>>>>>>> Repository Type: Web
>>>>>>>>>>>> Output Connector Type: Elasticsearch.
>>>>>>>>>>>>
>>>>>>>>>>>> The job has to crawl around 4-5 lakh records in total. I have
>>>>>>>>>>>> discarded the previous index and created a new index (in
>>>>>>>>>>>> Elasticsearch) with proper mappings and settings, and started
>>>>>>>>>>>> the job again after even cleaning the database (the database
>>>>>>>>>>>> used is PostgreSQL).
>>>>>>>>>>>> But while the job runs it ingests the records properly, and
>>>>>>>>>>>> then just before finishing (sometimes in between as well) it
>>>>>>>>>>>> initiates the process of deletions, and it does not index the
>>>>>>>>>>>> deleted documents again.
>>>>>>>>>>>>
>>>>>>>>>>>> Can you please suggest something if I am doing anything
>>>>>>>>>>>> wrong? Or is this a ManifoldCF process, and if yes, why are
>>>>>>>>>>>> the documents not getting ingested again?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks and regards
>>>>>>>>>>>> Priya
>>>>>>>>>>>>
>>>>>>>>>>>>
