manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Manifoldcf - Job Deletion Process
Date Tue, 29 Oct 2019 11:31:48 GMT
What does the history say about these documents?
Karl

On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl> wrote:

>
>  it may be that (a) they weren't found, or (b) that the document
> specification in the job changed and they are no longer included in the job.
>
> The URLs that were deleted are valid (they do not return a 404 or
> page-not-found error), and they are not listed in the Exclusion tab of
> the job configuration.
> The URLs were being indexed earlier, and apart from the index name in
> Elasticsearch, nothing has changed in the job specification or in the
> other connectors.
>
> Thanks
> Priya
>
> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> ManifoldCF is an incremental crawler, which means that on every
>> (non-continuous) job run it sees which documents it can find and removes
>> the ones it can't.  The history for the documents being deleted should tell
>> you why they are being deleted -- it may be that (a) they weren't found, or
>> (b) that the document specification in the job changed and they are no
>> longer included in the job.
>>
>> Karl
>>
>>
>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>> Hi All,
>>>
>>> I have a question about the ManifoldCF job process. I have a job that
>>> crawls an intranet site:
>>> Repository Type: Web
>>> Output Connector Type: Elasticsearch
>>>
>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total.
>>> I discarded the previous index and created a new index (in Elasticsearch)
>>> with proper mappings and settings, and started the job again after also
>>> cleaning the database (PostgreSQL).
>>> While the job runs it ingests the records properly, but just before
>>> finishing (and sometimes partway through) it starts deleting documents,
>>> and it does not index the deleted documents again.
>>>
>>> Can you please suggest whether I am doing anything wrong? Or is this
>>> the expected ManifoldCF behavior, and if so, why are the documents not
>>> being ingested again?
>>>
>>> Thanks and regards
>>> Priya
>>>
>>>
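
Editorial sketch of the incremental-crawl behavior Karl describes above (documents that a non-continuous run cannot find, or that the job's document specification no longer includes, are removed from the index). This is only an illustration under stated assumptions, not ManifoldCF's actual implementation: the function `reconcile`, its parameters, and the example URLs are all hypothetical.

```python
from typing import Callable, Dict, Set, Tuple


def reconcile(
    previously_indexed: Set[str],
    found_this_run: Set[str],
    in_spec: Callable[[str], bool],
) -> Tuple[Set[str], Dict[str, str]]:
    """Hypothetical model of one non-continuous job run.

    Returns the set of URLs that stay in (or enter) the index, and a
    mapping from each deleted URL to the reason it was deleted.
    """
    # A document is eligible only if the crawl reached it AND the job's
    # document specification still includes it.
    eligible = {url for url in found_this_run if in_spec(url)}

    deletions: Dict[str, str] = {}
    for url in previously_indexed - eligible:
        if url not in found_this_run:
            # Case (a): the crawl did not find the document this time.
            deletions[url] = "not found in this run"
        else:
            # Case (b): found, but excluded by the current specification.
            deletions[url] = "no longer included by the document specification"
    return eligible, deletions


# Hypothetical example data; none of these URLs come from the thread.
previous = {
    "http://intranet.example/a",
    "http://intranet.example/b",
    "http://intranet.example/c",
}
found = {
    "http://intranet.example/a",
    "http://intranet.example/c",
    "http://intranet.example/d",
}
keep, deletions = reconcile(previous, found, lambda url: not url.endswith("/c"))
for url, reason in sorted(deletions.items()):
    print(url, "->", reason)
```

In this model a valid URL can still be deleted if no link path reached it during the run, which is why the per-document history is the first thing to check.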
