manifoldcf-user mailing list archives

From Priya Arora <pr...@smartshore.nl>
Subject Re: Manifoldcf - Job Deletion Process
Date Wed, 30 Oct 2019 07:13:05 GMT
You wanted to see the whole job lifecycle (start, end, and all events from
the History) when seeding with one of these URLs. The results are below.
I changed the seed URL to the one identifier I picked. That document was
then fetched and indexed into a new index, and after that the deletion
process started.
[image: Start.JPG]
[image: Indexation and Deletion.JPG]



On Wed, Oct 30, 2019 at 12:25 PM Karl Wright <daddywri@gmail.com> wrote:

> Ok, so pick ONE of these identifiers.
>
> What I want to see is the entire lifecycle of the ONE identifier.  That
> includes what the Web Connection logs as well as what the indexation logs.
> Ideally I'd like to see:
>
> - job start and end
> - web connection events
> - indexing events
>
> I'd like to see these for both the job that indexes the document initially
> as well as the job run that deletes the document.
>
> My suspicion is that on the second run the document is simply no longer
> reachable from the seeds.  In other words, the seed documents either cannot
> be fetched on the second run or they contain different stuff and there's no
> longer a chain of links between the seeds and the documents being deleted.
>
> Thanks,
> Karl
>
>
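
The suspicion in the message above (that the deleted documents are no longer reachable from the seeds on the second run) can be modelled roughly as follows. This is only a sketch in Python, not ManifoldCF code: the function name `reachable_from_seeds`, the link-map structure, and the `intranet.example` URLs are all assumptions made for illustration.

```python
from collections import deque


def reachable_from_seeds(seeds, links):
    """Return every URL reachable from the seeds by following links.

    `links` is a hypothetical map of URL -> list of URLs found on that page.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# First run: the seed links to page A, which links to page B.
first_run = {
    "http://intranet.example/": ["http://intranet.example/a"],
    "http://intranet.example/a": ["http://intranet.example/b"],
}
# Second run: page A no longer links to B, so the chain is broken.
second_run = {
    "http://intranet.example/": ["http://intranet.example/a"],
    "http://intranet.example/a": [],
}

seeds = ["http://intranet.example/"]
lost = reachable_from_seeds(seeds, first_run) - reachable_from_seeds(seeds, second_run)
print(sorted(lost))
```

Here `lost` holds only page B: it still exists and would still return 200, but nothing reachable from the seed links to it any more, which is the situation described above.
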
> On Wed, Oct 30, 2019 at 1:50 AM Priya Arora <priya@smartshore.nl> wrote:
>
>> The indexation screenshot is below.
>>
>> [image: image.png]
>>
>> On Tue, Oct 29, 2019 at 7:57 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> I need both ingestion and deletion.
>>> Karl
>>>
>>>
>>> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora <priya@smartshore.nl> wrote:
>>>
>>>> The history is shown below; it does not indicate any error.
>>>> [image: 12.JPG]
>>>>
>>>> Thanks
>>>> Priya
>>>>
>>>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> What does the history say about these documents?
>>>>> Karl
>>>>>
>>>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora <priya@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> it may be that (a) they weren't found, or (b) that the document
>>>>>> specification in the job changed and they are no longer included in
>>>>>> the job.
>>>>>>
>>>>>> The URLs that were deleted are valid (they do not return a 404 or
>>>>>> page-not-found error), and they are not listed in the Exclusion
>>>>>> tab of the job configuration.
>>>>>> The URLs were being indexed earlier, and apart from the index name
>>>>>> in Elasticsearch nothing has changed in the job specification or in
>>>>>> the other connectors.
>>>>>>
>>>>>> Thanks
>>>>>> Priya
>>>>>>
>>>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>>>> (non-continuous) job run it sees which documents it can find and
>>>>>>> removes the ones it can't.  The history for the documents being
>>>>>>> deleted should tell you why they are being deleted -- it may be that
>>>>>>> (a) they weren't found, or (b) that the document specification in
>>>>>>> the job changed and they are no longer included in the job.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
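
The incremental behaviour described in the message above (keep what the run can find, remove what it can't) can be pictured as a set difference. This is a simplified model in Python under stated assumptions, not how ManifoldCF is implemented: the function name `documents_to_delete` and the example URLs are hypothetical.

```python
def documents_to_delete(previously_indexed, found_this_run):
    """Return identifiers that were indexed before but not found on this run."""
    return sorted(set(previously_indexed) - set(found_this_run))


# Hypothetical identifiers from an earlier run and from the current run.
previous = {"http://intranet.example/a", "http://intranet.example/b"}
current = {"http://intranet.example/a"}

print(documents_to_delete(previous, current))
```

In this model a document that is perfectly valid still ends up in the delete list if the current run never discovers it.
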
>>>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora <priya@smartshore.nl>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I have a query regarding the ManifoldCF job process. I have a job
>>>>>>>> to crawl an intranet site:
>>>>>>>> Repository Type: Web
>>>>>>>> Output Connector Type: Elasticsearch
>>>>>>>>
>>>>>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in
>>>>>>>> total. I discarded the previous index, created a new index (in
>>>>>>>> Elasticsearch) with proper mappings and settings, and started the
>>>>>>>> job again after also cleaning the database (PostgreSQL).
>>>>>>>> While the job runs it ingests the records properly, but just
>>>>>>>> before finishing (and sometimes partway through) it starts
>>>>>>>> deleting documents, and it does not index the deleted documents
>>>>>>>> again.
>>>>>>>>
>>>>>>>> Can you please suggest whether I am doing anything wrong? Or is
>>>>>>>> this normal ManifoldCF behaviour, and if so, why are the documents
>>>>>>>> not ingested again?
>>>>>>>>
>>>>>>>> Thanks and regards
>>>>>>>> Priya
>>>>>>>>
>>>>>>>>
