manifoldcf-user mailing list archives

From Priya Arora <pr...@smartshore.nl>
Subject Re: Manifoldcf - Job Deletion Process
Date Fri, 01 Nov 2019 07:59:17 GMT
[image: del2.JPG]

Screenshot of the deleted documents other than PDFs

On Fri, Nov 1, 2019 at 1:28 PM Priya Arora <priya@smartshore.nl> wrote:

> The job was started as per the schedule below:
> [image: job.JPG]
>
> Just before the job completed, it started the deletion process. Before
> starting the job, a new index was created in ES and the database was
> cleaned up.
> [image: deletion.JPG]
>
> Records were processed and indexed successfully. When I check these
> URLs (the deleted ones) in a browser, they appear to be valid URLs and
> are accessible.
> The job is to crawl around 2.25 lakh records, so the seeded URLs have many
> sub-links within. If we think the URLs were already present in the database
> and that is why the crawler somehow deletes them, that should not be the
> case, as the database cleanup was done before the run.
>
> If we think the crawler is deleting only documents with a PDF
> extension, this is not the case, as other HTML pages are also deleted.
>
> Can you please suggest something on this?
>
> Thanks
> Priya
>
> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> So it looks like the URL ending in 117047 was successfully processed and
>> indexed, and not removed.  The URLs ending in 119200 and lang-en were both
>> unreachable and were removed.  I don't see a job end at all?  There's a new
>> job start at 12:39 though.
>>
>> What I want to see is the lifetime of one of the documents that you think
>> is getting removed for no reason.
>>
>> Karl
>>
>>
>> On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>> You wanted to see the whole job process (start, end, and all events from
>>> the History) when seeding one of these URLs. Below are the results:
>>> I changed the seed URL to the one identifier I picked; that document
>>> was then fetched and indexed in a new index, and the deletion process started.
>>> [image: Start.JPG]
>>> [image: Indexation and Deletion.JPG]
>>>
>>>
>>>
>>> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Ok, so pick ONE of these identifiers.
>>>>
>>>> What I want to see is the entire lifecycle of the ONE identifier.  That
>>>> includes what the Web Connection logs as well as what the indexation logs.
>>>> Ideally I'd like to see:
>>>>
>>>> - job start and end
>>>> - web connection events
>>>> - indexing events
>>>>
>>>> I'd like to see these for both the job that indexes the document
>>>> initially as well as the job run that deletes the document.
>>>>
>>>> My suspicion is that on the second run the document is simply no longer
>>>> reachable from the seeds.  In other words, the seed documents either cannot
>>>> be fetched on the second run or they contain different stuff and there's no
>>>> longer a chain of links between the seeds and the documents being deleted.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl>
>>>> wrote:
>>>>
>>>>> Indexation screenshot is as below.
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I need both ingestion and deletion.
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>> The history is shown below; it does not indicate any error.
>>>>>>> [image: 12.JPG]
>>>>>>>
>>>>>>> Thanks
>>>>>>> Priya
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> What does the history say about these documents?
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  it may be that (a) they weren't found, or (b) that the document
>>>>>>>>> specification in the job changed and they are no longer included in the job.
>>>>>>>>>
>>>>>>>>> The URLs that were deleted are valid URLs (they do not result
>>>>>>>>> in a 404 or page-not-found error), and they are not mentioned in the
>>>>>>>>> Exclusion tab of the job configuration.
>>>>>>>>> The URLs were getting indexed earlier, and except for the index
>>>>>>>>> name in Elasticsearch nothing has changed in the job specification
>>>>>>>>> or in the other connectors.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Priya
>>>>>>>>>
>>>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>>>>>>> (non-continuous) job run it sees which documents it can find and removes
>>>>>>>>>> the ones it can't.  The history for the documents being deleted should tell
>>>>>>>>>> you why they are being deleted -- it may be that (a) they weren't found, or
>>>>>>>>>> (b) that the document specification in the job changed and they are no
>>>>>>>>>> longer included in the job.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I have a query regarding the ManifoldCF job process. I have a job to
>>>>>>>>>>> crawl an intranet site:
>>>>>>>>>>> Repository Type:- Web
>>>>>>>>>>> Output Connector Type:- Elasticsearch.
>>>>>>>>>>>
>>>>>>>>>>> The job has to crawl around 4-5 lakh records in total. I have
>>>>>>>>>>> discarded the previous index and created a new index (in Elasticsearch) with
>>>>>>>>>>> proper mappings and settings, and started the job again after also
>>>>>>>>>>> cleaning the database (the database used is PostgreSQL).
>>>>>>>>>>> While the job runs it ingests the records properly, but
>>>>>>>>>>> just before finishing (sometimes in between also) it initiates the
>>>>>>>>>>> deletion process, and it does not index the deleted documents again
>>>>>>>>>>> in the index.
>>>>>>>>>>>
>>>>>>>>>>> Can you please suggest something if I am doing anything wrong? Or is
>>>>>>>>>>> this part of the ManifoldCF process? If yes, why are the documents
>>>>>>>>>>> not getting ingested again?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and regards
>>>>>>>>>>> Priya
>>>>>>>>>>>
>>>>>>>>>>>

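Karl's suspicion in this thread, that a deleted document is still a valid URL
but is no longer reachable from the seeds on the later run, can be illustrated
with a small sketch. This is not ManifoldCF code: the function names
(reachable_from_seeds, plan_run), the fetch_links callback, and the toy site
dictionaries are all hypothetical, and the sketch only models the general
"keep what you can find from the seeds, remove the rest" idea described above.

```python
from collections import deque


def reachable_from_seeds(seeds, fetch_links):
    """Walk the link graph breadth-first, starting at the seeds.

    fetch_links(url) is a hypothetical callback: it returns the list of
    outgoing links for a page, or None if the page could not be fetched.
    """
    found = set()      # pages that were fetched successfully
    visited = set()    # pages we already tried, fetched or not
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = fetch_links(url)
        if links is None:
            continue   # could not be fetched, so nothing beyond it is discovered
        found.add(url)
        queue.extend(links)
    return found


def plan_run(previously_indexed, seeds, fetch_links):
    """Return (documents found this run, documents to remove from the index)."""
    found = reachable_from_seeds(seeds, fetch_links)
    to_delete = set(previously_indexed) - found
    return found, to_delete


# Toy data (assumption): on the first run the seed links to a and b, and a links to c.
first_run_site = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
indexed, _ = plan_run(set(), ["seed"], first_run_site.get)

# On the second run b and c still exist as pages, but nothing links to them.
second_run_site = {"seed": ["a"], "a": [], "b": [], "c": []}
found, to_delete = plan_run(indexed, ["seed"], second_run_site.get)
print(sorted(to_delete))
```

In this toy model b and c would open fine in a browser, yet they are scheduled
for removal because no chain of links leads to them from the seed any more.
That is why the full history of one deleted identifier, across both runs, is
the useful thing to look at.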