manifoldcf-user mailing list archives

From Priya Arora <pr...@smartshore.nl>
Subject Re: Manifoldcf - Job Deletion Process
Date Wed, 30 Oct 2019 05:49:41 GMT
The indexing screenshot is below.

[image: image.png]

On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com> wrote:

> I need both ingestion and deletion.
> Karl
>
>
> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl> wrote:
>
>> The history is shown below; it does not indicate any error.
>> [image: 12.JPG]
>>
>> Thanks
>> Priya
>>
>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> What does the history say about these documents?
>>> Karl
>>>
>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl> wrote:
>>>
>>>>
>>>>  it may be that (a) they weren't found, or (b) that the document
>>>> specification in the job changed and they are no longer included in the job.
>>>>
>>>> The URLs that were deleted are valid (they do not result in a 404 or
>>>> "page not found" error), and they are not mentioned in the Exclusion tab
>>>> of the job configuration.
>>>> The URLs were being indexed earlier, and apart from the index name in
>>>> Elasticsearch nothing has changed in the job specification or in the
>>>> other connectors.
>>>>
>>>> Thanks
>>>> Priya
>>>>
>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>> (non-continuous) job run it sees which documents it can find and removes
>>>>> the ones it can't.  The history for the documents being deleted should
>>>>> tell you why they are being deleted -- it may be that (a) they weren't
>>>>> found, or (b) that the document specification in the job changed and
>>>>> they are no longer included in the job.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a question about the ManifoldCF job process. I have a job that
>>>>>> crawls an intranet site:
>>>>>> Repository Type: Web
>>>>>> Output Connector Type: Elasticsearch
>>>>>>
>>>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total.
>>>>>> I discarded the previous index and created a new index (in Elasticsearch)
>>>>>> with proper mappings and settings, then started the job again after also
>>>>>> cleaning the database (PostgreSQL).
>>>>>> While the job runs it ingests the records properly, but just before
>>>>>> finishing (and sometimes partway through) it starts deleting documents,
>>>>>> and it does not index the deleted documents again.
>>>>>>
>>>>>> Can you please tell me whether I am doing something wrong? Or is this
>>>>>> normal ManifoldCF behavior, and if so, why are the documents not being
>>>>>> ingested again?
>>>>>>
>>>>>> Thanks and regards
>>>>>> Priya
>>>>>>
>>>>>>

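Karl's description of incremental crawling in the thread above can be illustrated with a rough sketch. This is not ManifoldCF code or its API; the function name `incremental_run`, the `in_scope` predicate, and the example URLs are all hypothetical, chosen only to show how cases (a) and (b) each lead to a deletion.

```python
# Illustrative sketch only: models the idea of an incremental crawl,
# not ManifoldCF's actual implementation.

def incremental_run(previously_indexed, discovered_now, in_scope):
    """Return (to_ingest, to_delete) for one non-continuous job run.

    previously_indexed: set of URLs the output index already holds
    discovered_now:     set of URLs the crawler could reach on this run
    in_scope:           predicate standing in for the job's document specification
    """
    # Documents that are both reachable and still covered by the specification.
    to_ingest = {url for url in discovered_now if in_scope(url)}
    # Everything indexed earlier that is no longer in that set gets removed.
    to_delete = previously_indexed - to_ingest
    return to_ingest, to_delete


# Hypothetical data for the example.
previous = {"http://intranet.example/a",
            "http://intranet.example/b",
            "http://intranet.example/c"}
discovered = {"http://intranet.example/a",
              "http://intranet.example/b",
              "http://intranet.example/d"}

def in_scope(url):
    # Pretend the specification was changed to exclude /b.
    return not url.endswith("/b")

to_ingest, to_delete = incremental_run(previous, discovered, in_scope)
print(sorted(to_delete))
```

Here `/c` is deleted because it was not found on this run (case a), and `/b` is deleted because the specification no longer includes it (case b). In the real system the per-document history is where the actual reason would be recorded.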