crunch-user mailing list archives

From Josh Wills <>
Subject Re: Same processing in two m/r jobs
Date Tue, 12 Nov 2013 18:04:27 GMT
I'm surprised that it's still writing out S1-- are you changing the
downstream operations (PTable.keys and GBK) to read from the union of S2
and S3? If so, I'd like to try to recreate that, it sounds like a planner

Whenever there is a dependency between two GBK operations, the planner
analyzes the operations that link those two GBKs and tries to find a "good"
place to split the pipeline into two separate MR jobs. Good is usually
based on the planner's rough estimate of how much data will be written by
each of the DoFns, which is largely determined by the value of the float
scaleFactor() function for each DoFn-- scaleFactor() > 1.0 means that the
DoFn is expected to write out more data than it reads in, scaleFactor() <
1.0 means that the function is expected to write less.
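
To make the heuristic above concrete, here is a minimal sketch of the idea in plain Java. It is a toy model, not the Crunch planner's actual code: the class name SplitPointSketch, the method chooseSplit, and the input-size figure are all hypothetical, and the only thing taken from Crunch is the notion of a per-DoFn float scale factor. It walks the chain of DoFns between two GBKs, multiplies a size estimate by each scale factor, and picks the point where the estimate is smallest.

```java
// Toy model only: names and logic are illustrative assumptions,
// not the real Crunch planner implementation.
final class SplitPointSketch {

    /**
     * Returns how many DoFns of the chain would run before the split,
     * i.e. in the first MR job (0 means split before the first DoFn).
     *
     * @param inputBytes   rough size estimate of the data entering the chain
     * @param scaleFactors one value per DoFn, in pipeline order
     */
    static int chooseSplit(long inputBytes, float[] scaleFactors) {
        double size = inputBytes;   // estimated bytes at the current point
        double best = size;         // smallest estimate seen so far
        int bestIndex = 0;          // number of DoFns before the best point
        for (int i = 0; i < scaleFactors.length; i++) {
            size *= scaleFactors[i];    // > 1.0 grows the data, < 1.0 shrinks it
            if (size < best) {          // strict '<' keeps the earliest point on ties
                best = size;
                bestIndex = i + 1;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // A shrinking DoFn followed by an expanding one.
        System.out.println(chooseSplit(1000L, new float[] {0.5f, 3.0f}));
        // Two expanding DoFns.
        System.out.println(chooseSplit(1000L, new float[] {2.0f, 1.5f}));
    }
}
```

In this model a chain with scale factors 0.5 then 3.0 is split after the first DoFn, because that is where the estimate is smallest, and the expanding DoFn is pushed into the second job. When every DoFn expands its input, the split lands before the first one.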

The only exception to that rule is if you are already writing a read/write
output file at some point along the chain of operations between the two
GBKs, at which point the planner will just choose that file as the split
point. Note that text outputs are write-only, Crunch does not assume that
it can read back a text file as it was written unless it is a


On Tue, Nov 12, 2013 at 9:31 AM, Mungre,Surbhi <>wrote:

>  Hey Josh,
> Thanks for the reply! I think we will be able to get around this issue
> by materializing the output of the union of S2 and S3.
> However, the DAG shows that the first job is still writing the output of S1
> to disk. Out of curiosity, how does the planner decide to write the output
> of S1 to disk instead of writing the output of the union of S2 and S3?
>  -Surbhi
>   From: Josh Wills <>
> Reply-To: "" <>
> Date: Tuesday, November 12, 2013 10:26 AM
> To: "" <>
> Subject: Re: Same processing in two m/r jobs
>   Hey Surbhi,
>  The planner is trying to minimize the amount of data it writes to disk
> at the end of the first job; it doesn't usually worry so much about
> re-running the same computation in two different jobs if it means that less
> data will be written to disk overall, since most MR jobs aren't CPU bound.
> While that's often a useful heuristic, there are many cases where it isn't
> true, and this sounds like one of them. My advice would be to materialize
> the output of the union of S2 and S3, at which point the planner should run
> the processing of S2 and S3 once at the end of job 1, and then pick up that
> materialized output for grouping in job 2.
>  Best,
> Josh
> On Mon, Nov 11, 2013 at 8:30 PM, Mungre,Surbhi <>wrote:
>>  Background:
>> We have a crunch pipeline which is used to normalize and standardize some
>> entities represented as Avro. In our pipeline, we also capture some context
>> information about the errors and warnings which we encounter during our
>> processing. We pass a pair of context information and Avro entities in our
>> pipeline. At the end of the pipeline, the context information is written to
>> HDFS and Avro entities are written to HFiles.
>> Problem:
>> When we were trying to analyze the DAG for our Crunch pipeline, we noticed
>> that the same processing is done in two m/r jobs: once to capture context
>> information and a second time to generate HFiles. I wrote a test which
>> replicates this issue with a simple example. The test and a DAG created
>> from this test are attached to this post. It is clear from the DAG that S2
>> and S3 are processed twice. I am not sure why this processing is done
>> twice or whether there is any way to avoid this behavior.
>> Surbhi Mungre
>> Software Engineer
>  --
> Director of Data Science
> Cloudera<>
> Twitter: @josh_wills<>

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>
