manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Manifoldcf - Job Deletion Process
Date Wed, 30 Oct 2019 10:09:11 GMT
So it looks like the URL ending in 117047 was successfully processed and
indexed, and not removed.  The URLs ending in 119200 and lang-en were both
unreachable and were removed.  I don't see a job end at all?  There's a new
job start at 12:39 though.

What I want to see is the lifetime of one of the documents that you think
is getting removed for no reason.

Karl


On Wed, Oct 30, 2019 at 3:13 AM Priya Arora <priya@smartshore.nl> wrote:

> You wanted to see the whole job process (start, end, and all events from
> the History) when seeding with one of these URLs. Below are the results.
> I changed the seed URL to the one identifier I picked; that document was
> then fetched and indexed in a new index, and the deletion process started.
> [image: Start.JPG]
> [image: Indexation and Deletion.JPG]
>
>
>
> On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Ok, so pick ONE of these identifiers.
>>
>> What I want to see is the entire lifecycle of the ONE identifier.  That
>> includes what the Web Connection logs as well as what the indexation logs.
>> Ideally I'd like to see:
>>
>> - job start and end
>> - web connection events
>> - indexing events
>>
>> I'd like to see these for both the job that indexes the document
>> initially as well as the job run that deletes the document.
>>
>> My suspicion is that on the second run the document is simply no longer
>> reachable from the seeds.  In other words, the seed documents either cannot
>> be fetched on the second run or they contain different stuff and there's no
>> longer a chain of links between the seeds and the documents being deleted.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>> Indexation screenshot is as below.
>>>
>>> [image: image.png]
>>>
>>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I need both ingestion and deletion.
>>>> Karl
>>>>
>>>>
>>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl>
>>>> wrote:
>>>>
>>>>> The history is shown below; it does not indicate any error.
>>>>> [image: 12.JPG]
>>>>>
>>>>> Thanks
>>>>> Priya
>>>>>
>>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What does the history say about these documents?
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>  it may be that (a) they weren't found, or (b) that the document
>>>>>>> specification in the job changed and they are no longer included
>>>>>>> in the job.
>>>>>>>
>>>>>>> The URLs that were deleted are valid URLs (they do not result in a
>>>>>>> 404 or page-not-found error), and they are not mentioned in the
>>>>>>> Exclusion tab of the job configuration.
>>>>>>> The URLs were getting indexed earlier, and except for the index
>>>>>>> name in Elasticsearch nothing has changed in the job specification
>>>>>>> or in the other connectors.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Priya
>>>>>>>
>>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>>>>> (non-continuous) job run it sees which documents it can find and
>>>>>>>> removes the ones it can't.  The history for the documents being
>>>>>>>> deleted should tell you why they are being deleted -- it may be
>>>>>>>> that (a) they weren't found, or (b) that the document
>>>>>>>> specification in the job changed and they are no longer included
>>>>>>>> in the job.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I have a query regarding the ManifoldCF job process. I have a
>>>>>>>>> job to crawl an intranet site:
>>>>>>>>> Repository Type: Web
>>>>>>>>> Output Connector Type: Elasticsearch
>>>>>>>>>
>>>>>>>>> The job has to crawl around 4-5 lakh (400,000 to 500,000)
>>>>>>>>> records in total. I have discarded the previous index and
>>>>>>>>> created a new index (in Elasticsearch) with proper mappings and
>>>>>>>>> settings, and started the job again after also cleaning the
>>>>>>>>> database (the database used is PostgreSQL).
>>>>>>>>> While the job runs it ingests the records properly, but just
>>>>>>>>> before finishing (and sometimes partway through) it starts a
>>>>>>>>> deletion process, and it does not index the deleted documents
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> Can you please tell me whether I am doing anything wrong? Or is
>>>>>>>>> this normal ManifoldCF behaviour? If so, why are the documents
>>>>>>>>> not getting ingested again?
>>>>>>>>>
>>>>>>>>> Thanks and regards
>>>>>>>>> Priya
>>>>>>>>>
>>>>>>>>>
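
The behaviour discussed in this thread (an incremental crawler removing, on a later run, documents that are no longer reachable from the seeds) can be illustrated with a small sketch. This is not ManifoldCF code and does not use any ManifoldCF API; the function names `reachable_from_seeds` and `documents_to_delete`, the toy URLs, and the dictionary-based link graph are all hypothetical, chosen only to show the idea under the assumption that a run keeps exactly what it can reach by following links from the seeds.

```python
from collections import deque


def reachable_from_seeds(seeds, links):
    """Return every URL reachable from the seeds by following links.

    `links` is a hypothetical stand-in for one crawl run: it maps a URL to
    the URLs found on that page. A URL missing from `links` represents a
    page that could not be fetched on this run.
    """
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or url not in links:
            continue  # already visited, or not fetchable on this run
        seen.add(url)
        queue.extend(links[url])
    return seen


def documents_to_delete(previously_indexed, seeds, links):
    """Documents indexed by an earlier run that this run did not reach."""
    return set(previously_indexed) - reachable_from_seeds(seeds, links)


# First run: the seed links to /a, which links to /b.
first_run = {"/seed": ["/a"], "/a": ["/b"], "/b": []}
indexed = reachable_from_seeds(["/seed"], first_run)

# Second run: /a and /b are still valid pages, but the seed page now
# links only to /c, so no chain of links leads to them any more.
second_run = {"/seed": ["/c"], "/c": [], "/a": ["/b"], "/b": []}
deleted = documents_to_delete(indexed, ["/seed"], second_run)
```

In this toy model `/a` and `/b` end up in `deleted` even though both would still load in a browser, which is the situation Karl suspects: a URL being valid is not enough, it also has to be reachable from the seeds on the run in question.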
