manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Manifoldcf - Job Deletion Process
Date Tue, 29 Oct 2019 14:27:11 GMT
I need both ingestion and deletion.
Karl


On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl> wrote:

> The history is shown below; it does not indicate any error.
> [image: 12.JPG]
>
> Thanks
> Priya
>
> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> What does the history say about these documents?
>> Karl
>>
>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>>
>>>  it may be that (a) they weren't found, or (b) that the document
>>> specification in the job changed and they are no longer included in the job.
>>>
>>> The URLs that were deleted are valid (they do not result in a 404
>>> or page-not-found error), and they are not mentioned in the Exclusion tab of
>>> the job configuration.
>>> The URLs were being indexed earlier, and apart from the index name in
>>> Elasticsearch nothing has changed in the job specification or in the other
>>> connectors.
>>>
>>> Thanks
>>> Priya
>>>
>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> ManifoldCF is an incremental crawler, which means that on every
>>>> (non-continuous) job run it sees which documents it can find and removes
>>>> the ones it can't.  The history for the documents being deleted should tell
>>>> you why they are being deleted -- it may be that (a) they weren't found, or
>>>> (b) that the document specification in the job changed and they are no
>>>> longer included in the job.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a question about the ManifoldCF job process. I have a job that
>>>>> crawls an intranet site:
>>>>> Repository Type:- Web
>>>>> Output Connector Type:- Elasticsearch
>>>>>
>>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total.
>>>>> I discarded the previous index and created a new index (in Elasticsearch)
>>>>> with proper mappings and settings, and started the job again after also
>>>>> cleaning the database (PostgreSQL).
>>>>> While the job runs it ingests the records properly, but just before
>>>>> finishing (and sometimes partway through) it starts deleting documents,
>>>>> and it does not index the deleted documents again.
>>>>>
>>>>> Can you please tell me if I am doing anything wrong? Or is this part of
>>>>> ManifoldCF's normal process and, if so, why are the documents not being
>>>>> ingested again?
>>>>>
>>>>> Thanks and regards
>>>>> Priya
>>>>>
>>>>>
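For readers of this thread, the incremental behavior described above (each non-continuous run keeps what it can still find and removes the rest) can be sketched roughly as follows. This is a toy model only, not ManifoldCF's actual code: the function name plan_job_run, its parameters, and the example URLs are all hypothetical, and real deletions also depend on details such as connector responses and job settings.

```python
def plan_job_run(indexed, discovered, in_spec):
    """Toy model of one non-continuous run of an incremental crawler.

    indexed    -- set of URLs ingested by earlier runs (hypothetical input)
    discovered -- set of URLs the crawler could actually find on this run
    in_spec    -- predicate: does the job's document specification include the URL?
    Returns (to_ingest, to_delete) as two sets.
    """
    # Only documents that were found AND still match the specification survive.
    keep = {url for url in discovered if in_spec(url)}
    # Case (a): not found this run; case (b): no longer included by the specification.
    to_delete = indexed - keep
    # Documents seen for the first time are ingested.
    to_ingest = keep - indexed
    return to_ingest, to_delete


if __name__ == "__main__":
    indexed = {"http://intranet/a", "http://intranet/b", "http://intranet/c"}
    discovered = {"http://intranet/a", "http://intranet/b", "http://intranet/d"}
    # Assumed specification change: /b is now excluded from the job.
    in_spec = lambda url: not url.endswith("/b")
    to_ingest, to_delete = plan_job_run(indexed, discovered, in_spec)
    print("ingest:", sorted(to_ingest))
    print("delete:", sorted(to_delete))
```

In this model a document is deleted either because it was not found on the current run or because the specification no longer includes it, which is why the per-document history is the place to look for the reason.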
