manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: Job Multiple Outputs
Date Tue, 10 Sep 2019 18:08:11 GMT
Hi Julien,
You must understand that a job with a complex pipeline is really not
running N independent jobs; it's running ONE job.  Every document is
processed through the pipeline only once.  The pipeline may have faster
components and slower components; doesn't matter; the document takes the
sum total of the time all components need to fetch and process the document.
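
To illustrate the point above, here is a minimal sketch in Python of one worker handing each document to every output in turn. It is not ManifoldCF code: the `Output` class, `run_worker`, the document names, and the per-document timings are all hypothetical, and time is simulated with a counter rather than measured.

```python
from dataclasses import dataclass, field


@dataclass
class Output:
    """Hypothetical stand-in for an output connection (not the ManifoldCF API)."""
    name: str
    seconds_per_doc: float
    indexed: list = field(default_factory=list)

    def index(self, doc: str) -> float:
        # The call returns only once indexing is done; report the time it took.
        self.indexed.append(doc)
        return self.seconds_per_doc


def run_worker(docs, outputs):
    """One worker: hand each document to every output, one after the other."""
    clock = 0.0
    for doc in docs:
        for out in outputs:
            # The worker moves on only after this output has returned.
            clock += out.index(doc)
    return clock


fast = Output("fast", seconds_per_doc=1.0)
slow = Output("slow", seconds_per_doc=2.0)
total = run_worker(["doc1", "doc2", "doc3"], [fast, slow])

print(total)                                 # 9.0, i.e. 3 docs * (1.0 + 2.0)
print(len(fast.indexed), len(slow.indexed))  # 3 3
```

Under these assumptions the per-document cost is the sum of both outputs' times, and the faster output never gets ahead of the slower one by more than the document currently in flight.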


On Tue, Sep 10, 2019 at 12:48 PM Julien Massiera <> wrote:

> OK, so to be sure I understood what you are saying:
> suppose a job has two output connections, and one of the outputs indexes
> documents twice as fast as the other. At any given time t, both outputs
> will have indexed the same number of documents, regardless of which one
> is faster.
> In other words: it will never be the case that the faster output has
> indexed all the crawled documents while the slower one still has half of
> them left to index.
> Am I wrong?
> On 10/09/2019 18:09, Karl Wright wrote:
> The output connection contract is that a request to index is made to the
> connector, and the connector returns when it is done.
> When there are multiple output connections, these are each handed a copy
> of the document, one after the other, and told to index it.  This is all
> done by one worker thread.  Multiple worker threads are not used for
> multiple outputs of the same document.
> The framework is smart enough to not hand a document to a connector if it
> hasn't changed (according to how the connector computes the
> connector-specific output version string).
> Karl
> On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera <> wrote:
>> Hi,
>> I would like an explanation of how a job behaves when several outputs
>> are configured. My main question is: for each output, how is document
>> ingestion managed? More precisely, are the ingestion processes
>> synchronized or not? (In other words, does the ingestion of the next
>> document wait until the current document has been ingested by both
>> outputs?) Also, if one output is configured to send a commit at the end
>> of the job, is this commit held until the last ingestion has occurred in
>> the other output?
>> Thanks for your help,
>> Julien
> --
> Head of Product Development
> France Labs – The Search experts
> Datafari – Winner of the Big Data 2018 trophy at Digital Innovation Makers
