manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Manifoldcf - Job Deletion Process
Date Wed, 30 Oct 2019 06:55:11 GMT
Ok, so pick ONE of these identifiers.

What I want to see is the entire lifecycle of the ONE identifier.  That
includes what the Web Connection logs as well as what the indexation logs.
Ideally I'd like to see:

- job start and end
- web connection events
- indexing events

I'd like to see these for both the job that indexes the document initially
as well as the job run that deletes the document.

My suspicion is that on the second run the document is simply no longer
reachable from the seeds.  In other words, the seed documents either cannot
be fetched on the second run or they contain different stuff and there's no
longer a chain of links between the seeds and the documents being deleted.
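
To make the reachability point concrete, here is a toy Python sketch. It is
not ManifoldCF's code, only a simplified model of an incremental crawl; the
function names and the link maps are made up for illustration. It shows how a
document that is still a perfectly valid URL gets deleted once nothing
reachable from the seeds links to it any more.

```python
from collections import deque


def reachable(seeds, links):
    """Return every URL reachable from the seeds by following links (BFS).

    `links` maps a fetchable URL to the URLs it links to. A URL missing
    from `links` is treated as a failed fetch.
    """
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or url not in links:
            # Already visited, or the fetch failed on this run.
            continue
        seen.add(url)
        queue.extend(links[url])
    return seen


def documents_to_delete(previously_indexed, seeds, links):
    """Documents indexed on an earlier run that this run could not reach."""
    return set(previously_indexed) - reachable(seeds, links)


# First run: seed -> a -> b, so all three documents get indexed.
run1 = {"seed": ["a"], "a": ["b"], "b": []}
indexed = reachable(["seed"], run1)

# Second run: "b" still exists, but "a" no longer links to it.
run2 = {"seed": ["a"], "a": [], "b": []}
to_delete = documents_to_delete(indexed, ["seed"], run2)
```

In this model `b` ends up in `to_delete` even though fetching it directly
would still succeed, which is why the lifecycle of one identifier across both
runs is the useful thing to look at.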

Thanks,
Karl


On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl> wrote:

> The indexation screenshot is below.
>
> [image: image.png]
>
> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> I need both ingestion and deletion.
>> Karl
>>
>>
>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl> wrote:
>>
>>> The history is shown below; it does not indicate any error.
>>> [image: 12.JPG]
>>>
>>> Thanks
>>> Priya
>>>
>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> What does the history say about these documents?
>>>> Karl
>>>>
>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>> wrote:
>>>>
>>>>>
>>>>>  it may be that (a) they weren't found, or (b) that the document
>>>>> specification in the job changed and they are no longer included in the
>>>>> job.
>>>>>
>>>>> The URLs that were deleted are valid URLs (they do not result in a
>>>>> 404 or page-not-found error), and they are not mentioned in the Exclusion
>>>>> tab of the job configuration.
>>>>> The URLs were being indexed earlier, and except for the index name
>>>>> in Elasticsearch nothing has changed in the job specification or in the
>>>>> other connectors.
>>>>>
>>>>> Thanks
>>>>> Priya
>>>>>
>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>>> (non-continuous) job run it sees which documents it can find and removes
>>>>>> the ones it can't.  The history for the documents being deleted should tell
>>>>>> you why they are being deleted -- it may be that (a) they weren't found, or
>>>>>> (b) that the document specification in the job changed and they are no
>>>>>> longer included in the job.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a question regarding the ManifoldCF job process. I have a job to
>>>>>>> crawl an intranet site:
>>>>>>> Repository Type:- Web
>>>>>>> Output Connector Type:- Elasticsearch.
>>>>>>>
>>>>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total.
>>>>>>> I discarded the previous index and created a new index (in Elasticsearch)
>>>>>>> with proper mappings and settings, and started the job again after also
>>>>>>> cleaning the database (the database used is PostgreSQL).
>>>>>>> While the job runs it ingests the records properly, but just before
>>>>>>> finishing (sometimes partway through as well) it starts the deletion
>>>>>>> process, and it does not index the deleted documents again.
>>>>>>>
>>>>>>> Can you please tell me whether I am doing anything wrong? Or is this a
>>>>>>> normal part of the ManifoldCF process, and if so, why are the documents
>>>>>>> not being ingested again?
>>>>>>>
>>>>>>> Thanks and regards
>>>>>>> Priya
>>>>>>>
>>>>>>>
