manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Job Multiple Outputs
Date Tue, 10 Sep 2019 18:08:11 GMT
Hi Julien,
You must understand that a job with a complex pipeline is not running N
independent jobs; it's running ONE job.  Every document is processed
through the pipeline only once.  The pipeline may have faster components
and slower components, but that doesn't matter: each document takes the sum
of the time that all components need to fetch and process it.
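
The timing model described above can be illustrated with a minimal sketch. This is a simulation only, not ManifoldCF code: the names `Output`, `process_document`, and `run_job` and the per-document costs are hypothetical, and a single worker is assumed for simplicity (a real job can have several worker threads, each handling different documents).

```python
from dataclasses import dataclass, field


@dataclass
class Output:
    """Hypothetical stand-in for an output connection (not a ManifoldCF class)."""
    name: str
    seconds_per_doc: float  # simulated indexing cost, not a real measurement
    indexed: list = field(default_factory=list)

    def index(self, doc: str) -> float:
        # Contract from the thread: the call returns only when indexing is done.
        self.indexed.append(doc)
        return self.seconds_per_doc


def process_document(doc: str, outputs: list) -> float:
    """One worker hands the document to each output in turn, never in parallel."""
    elapsed = 0.0
    for out in outputs:
        elapsed += out.index(doc)
    return elapsed  # sum of the time spent in every output


def run_job(docs: list, outputs: list) -> float:
    """Single simulated worker: the next document starts after all outputs finish."""
    return sum(process_document(d, outputs) for d in docs)


fast = Output("fast", seconds_per_doc=1.0)
slow = Output("slow", seconds_per_doc=2.0)
total = run_job(["doc1", "doc2", "doc3"], [fast, slow])

print(total)                                 # 9.0
print(len(fast.indexed), len(slow.indexed))  # 3 3
```

In this model both outputs always hold the same documents after each step, and the faster output gains nothing from its speed: each document costs the sum of the two indexing times.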

Karl


On Tue, Sep 10, 2019 at 12:48 PM Julien Massiera <
julien.massiera@francelabs.com> wrote:

> OK, so to be sure I understand what you are saying:
>
> Suppose a job has two output connections, and one of the outputs indexes
> documents twice as fast as the other. At any given time t, both outputs
> will have indexed the same number of documents, regardless of which one
> is faster.
> In other words: the faster output will not have finished indexing all the
> crawled documents while the slower one still has half of them left to index.
>
> Am I wrong?
>
> On 10/09/2019 18:09, Karl Wright wrote:
>
> The output connection contract is that a request to index is made to the
> connector, and the connector returns when it is done.
> When there are multiple output connections, these are each handed a copy
> of the document, one after the other, and told to index it.  This is all
> done by one worker thread.  Multiple worker threads are not used for
> multiple outputs of the same document.
>
> The framework is smart enough to not hand a document to a connector if it
> hasn't changed (according to how the connector computes the
> connector-specific output version string).
>
> Karl
>
>
> On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera <
> julien.massiera@francelabs.com> wrote:
>
>> Hi,
>>
>> I would like an explanation of how a job behaves when several outputs
>> are configured. My main question is: for each output, how is document
>> ingestion managed? More precisely, are the ingestion processes
>> synchronized or not? (In other words, does the ingestion of the next
>> document wait until the current document has been ingested by both
>> outputs?) Also, if one output is configured to send a commit at the end
>> of the job, is this commit held back until the last ingestion has
>> occurred in the other output?
>>
>> Thanks for your help,
>> Julien
>>
> --
> Julien MASSIERA
> Head of Product Development
> France Labs – The Search experts
> Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation Makers Summit
> www.francelabs.com
>
>
