crunch-user mailing list archives

From "Mungre,Surbhi" <>
Subject Same processing in two m/r jobs
Date Tue, 12 Nov 2013 04:30:56 GMT
We have a Crunch pipeline that normalizes and standardizes entities represented as Avro.
The pipeline also captures context information about the errors and warnings we encounter
during processing, so we pass a pair of context information and Avro entity through the
pipeline. At the end of the pipeline, the context information is written to HDFS and the
Avro entities are written to HFiles.

While analyzing the DAG for our Crunch pipeline, we noticed that the same processing is done
in two m/r jobs: once to capture the context information and a second time to generate the
HFiles. I wrote a test that replicates this issue with a simple example. The test and the
DAG created from it are attached to this post. The DAG clearly shows that S2 and S3 are
processed twice. I am not sure why this processing is done twice, or whether there is any
way to avoid it.

Surbhi Mungre
Software Engineer

