manifoldcf-user mailing list archives

From Priya Arora <pr...@smartshore.nl>
Subject Re: Manifoldcf - Job Deletion Process
Date Fri, 01 Nov 2019 07:58:29 GMT
The job was started as per the schedule below:
[image: job.JPG]

Just before the job completed, it started the deletion process.
Before starting the job, a new index was created in ES and the database
was cleaned up.
[image: deletion.JPG]

Records were processed and indexed successfully. When I check the deleted
URLs in a browser, they appear to be valid URLs and are accessible.
The job has to crawl around 2.25 lakh records, so the seeded URLs have many
sub-links within them. If we think the URLs were already present in the
database and that is why the crawler somehow deletes them, that should not
be the case, as the database cleanup was done before the run.

If we think the crawler is deleting only documents with a PDF extension,
that is not the case either, as other HTML pages are also deleted.

Can you please suggest something on this?

Thanks
Priya

On Wed, Oct 30, 2019 at 3:39 PM Karl Wright <daddywri@gmail.com> wrote:

> So it looks like the URL ending in 117047 was successfully processed and
> indexed, and not removed.  The URLs ending in 119200 and lang-en were both
> unreachable and were removed.  I don't see a job end at all?  There's a new
> job start at 12:39 though.
>
> What I want to see is the lifetime of one of the documents that you think
> is getting removed for no reason.
>
> Karl
>
>
> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <priya@smartshore.nl> wrote:
>
>> You wanted to test the whole job process (start, end, and all events
>> from the History) by seeding one of these URLs. Below are the results:
>> I changed the seed URL to the chosen identifier; that document was then
>> fetched and indexed in a new index, and the deletion process started.
>> [image: Start.JPG]
>> [image: Indexation and Deletion.JPG]
>>
>>
>>
>> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Ok, so pick ONE of these identifiers.
>>>
>>> What I want to see is the entire lifecycle of the ONE identifier.  That
>>> includes what the Web Connection logs as well as what the indexation logs.
>>> Ideally I'd like to see:
>>>
>>> - job start and end
>>> - web connection events
>>> - indexing events
>>>
>>> I'd like to see these for both the job that indexes the document
>>> initially as well as the job run that deletes the document.
>>>
>>> My suspicion is that on the second run the document is simply no longer
>>> reachable from the seeds.  In other words, the seed documents either cannot
>>> be fetched on the second run or they contain different stuff and there's no
>>> longer a chain of links between the seeds and the documents being deleted.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl> wrote:
>>>
>>>> Indexation screenshot is as below.
>>>>
>>>> [image: image.png]
>>>>
>>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> I need both ingestion and deletion.
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> History is shown below; it does not indicate any error.
>>>>>> [image: 12.JPG]
>>>>>>
>>>>>> Thanks
>>>>>> Priya
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What does the history say about these documents?
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>  it may be that (a) they weren't found, or (b) that the document
>>>>>>>> specification in the job changed and they are no longer included in
>>>>>>>> the job.
>>>>>>>>
>>>>>>>> The URLs that were deleted are valid URLs (they do not result in a
>>>>>>>> 404 or page-not-found error), and they are not mentioned in the
>>>>>>>> Exclusion tab of the job configuration.
>>>>>>>> The URLs were getting indexed earlier, and except for the index
>>>>>>>> name in Elasticsearch nothing has changed in the job specification
>>>>>>>> or in the other connectors.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Priya
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> ManifoldCF is an incremental crawler, which means that
>>>>>>>>> on every (non-continuous) job run it sees which documents it can
>>>>>>>>> find and removes the ones it can't.  The history for the documents
>>>>>>>>> being deleted should tell you why they are being deleted -- it may
>>>>>>>>> be that (a) they weren't found, or (b) that the document
>>>>>>>>> specification in the job changed and they are no longer included in
>>>>>>>>> the job.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I have a query regarding the ManifoldCF job process. I have a job
>>>>>>>>>> to crawl an intranet site:
>>>>>>>>>> Repository Type:- Web
>>>>>>>>>> Output Connector Type:- Elasticsearch
>>>>>>>>>>
>>>>>>>>>> The job has to crawl around 4-5 lakh records in total. I have
>>>>>>>>>> discarded the previous index and created a new index (in
>>>>>>>>>> Elasticsearch) with proper mappings and settings, and started the
>>>>>>>>>> job again after also cleaning the database (the database used is
>>>>>>>>>> PostgreSQL).
>>>>>>>>>> While the job runs it ingests the records properly, but just
>>>>>>>>>> before finishing (sometimes in between as well) it initiates the
>>>>>>>>>> deletion process, and it does not index the deleted documents
>>>>>>>>>> again.
>>>>>>>>>>
>>>>>>>>>> Can you please suggest whether I am doing anything wrong? Or is
>>>>>>>>>> this normal ManifoldCF behaviour, and if so, why are the documents
>>>>>>>>>> not getting ingested again?
>>>>>>>>>>
>>>>>>>>>> Thanks and regards
>>>>>>>>>> Priya
>>>>>>>>>>
>>>>>>>>>>
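
Karl's description of an incremental crawl, and his suspicion that the deleted documents are no longer reachable from the seeds, can be illustrated with a toy model. This is only a sketch and not ManifoldCF's implementation: the names `reachable`, `incremental_run`, and `links`, and the short URL labels, are all hypothetical. A URL missing from `links` stands for a page the crawler could not fetch.

```python
from collections import deque


def reachable(seeds, links):
    """Return every URL the crawler can find by following links from the seeds."""
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        # Skip URLs already visited and URLs that could not be fetched.
        if url in seen or url not in links:
            continue
        seen.add(url)
        queue.extend(links[url])
    return seen


def incremental_run(previously_indexed, seeds, links):
    """Model one job run: index what is found, delete what was indexed but not found."""
    found = reachable(seeds, links)
    deleted = previously_indexed - found
    return found, deleted


# First run: the seed links to "a" and "b", and "a" links to "c".
first_links = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
indexed, deleted = incremental_run(set(), ["seed"], first_links)

# Second run: "a" and "c" still exist, but the seed no longer links to "a".
second_links = {"seed": ["b"], "a": ["c"], "b": [], "c": []}
indexed2, deleted2 = incremental_run(indexed, ["seed"], second_links)

print(sorted(deleted2))  # ['a', 'c']
```

In this model, "a" and "c" would still open fine in a browser, yet they are removed on the second run because no chain of links connects them to the seed any more. Whether that is what happened in the job discussed here can only be confirmed from the job history.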
