issue: CRUNCH-624

link: https://issues.apache.org/jira/browse/CRUNCH-624?jql=project%20%3D%20CRUNCH

2016-10-18 13:54 GMT+08:00 Josh Wills <josh.wills@gmail.com>:
Yep, that's right-- can you file a JIRA, and I'll post the patch?

On Mon, Oct 17, 2016 at 10:52 PM, 陈竞 <cj.magina@gmail.com> wrote:
I may have found the root cause in my case:
public void materializeAt(SourceTarget<S> sourceTarget) {
  this.materializedAt = sourceTarget;
  this.size = materializedAt.getSize(getPipeline().getConfiguration());
}

@Override
public long getSize() {
  if (size < 0) {
    this.size = getSizeInternal();
  }
  return size;
}
PCollectionImpl.materializeAt(sourceTarget) is invoked when a node is split and a temporary table is created. The sourceTarget is bound to the new temporary table, whose path has only just been created, so its size is 0 and this.size is set to 0. Later, when getSize() is called to decide the number of reducers, size is 0 rather than negative, so getSize() returns 0 without calling getSizeInternal(), which makes the number of reducers too small.
So I think materializeAt() should check the sourceTarget's size, like below:
public void materializeAt(SourceTarget<S> sourceTarget) {
  this.materializedAt = sourceTarget;
  long size = materializedAt.getSize(getPipeline().getConfiguration());
  if (size > 0) {
    this.size = size;
  }
}
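
The effect described above can be reproduced outside Crunch with a small stand-alone model. This is only a sketch under stated assumptions: MaterializeSizeSketch, its method names, and the reducersFor() heuristic (one reducer plus one per fixed number of bytes) are hypothetical and are not Crunch classes or APIs. It only shows how caching a size of 0 hides the parent-based estimate, and how the positive-size check avoids that.

```java
/** Simplified stand-in for a collection that caches its estimated size. Not Crunch code. */
class MaterializeSizeSketch {
    private long size = -1L;           // negative means "not computed yet"
    private final long estimatedSize;  // stands in for what getSizeInternal() would return

    MaterializeSizeSketch(long estimatedSize) {
        this.estimatedSize = estimatedSize;
    }

    /** Behaviour quoted in the thread: the size of the freshly created target is always cached. */
    void materializeAtOriginal(long targetSize) {
        this.size = targetSize;
    }

    /** Proposed change: only cache the target's size when it is positive. */
    void materializeAtPatched(long targetSize) {
        if (targetSize > 0) {
            this.size = targetSize;
        }
    }

    /** Same caching logic as the quoted getSize(): compute only while size is negative. */
    long getSize() {
        if (size < 0) {
            size = estimatedSize;
        }
        return size;
    }

    /** Assumed heuristic for illustration: one reducer plus one per bytesPerReducer of input. */
    static int reducersFor(long size, long bytesPerReducer) {
        return 1 + (int) (size / bytesPerReducer);
    }

    public static void main(String[] args) {
        long estimate = 5_000_000_000L;    // hypothetical estimate derived from the parents
        long perReducer = 1_000_000_000L;  // hypothetical bytes handled by one reducer

        MaterializeSizeSketch original = new MaterializeSizeSketch(estimate);
        original.materializeAtOriginal(0L);  // the temporary path is empty at this point
        System.out.println(reducersFor(original.getSize(), perReducer));  // prints 1

        MaterializeSizeSketch patched = new MaterializeSizeSketch(estimate);
        patched.materializeAtPatched(0L);    // 0 is ignored, so the estimate is still used
        System.out.println(reducersFor(patched.getSize(), perReducer));   // prints 6
    }
}
```

With the original behaviour the cached 0 wins and the model asks for a single reducer; with the check in place the estimate is used and the model asks for six.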


2016-10-17 11:19 GMT+08:00 David Ortiz <dpo5003@gmail.com>:

That gets tricky if you have input data that is heavily filtered though.  Perhaps play around with the scale factor on operations that may blow up data?


On Sun, Oct 16, 2016, 10:04 PM 陈竞 <cj.magina@gmail.com> wrote:
That's a solution, but since users may not clearly know which step will produce a temporary table, I think setting the number of reducers automatically would improve the user experience. Maybe we could set the number of reducers to 1/3 of the number of mappers before submitting a job if one of the job's inputs is a temporary table.
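
As a rough illustration of that 1/3 rule, the choice could look like the sketch below. Everything here is hypothetical: ReducerFallbackSketch, chooseReducers() and its parameters are made up for this example and do not exist in Crunch, and the lower bound of one reducer is an added assumption so that small jobs still get a reducer.

```java
/** Hypothetical sketch of the "1/3 of the mappers" fallback; not part of Crunch. */
class ReducerFallbackSketch {
    /**
     * @param sizeBasedReducers  reducer count derived from the estimated input size
     * @param numMappers         number of map tasks of the job about to be submitted
     * @param hasTemporaryInput  true if at least one job input is a temporary table
     */
    static int chooseReducers(int sizeBasedReducers, int numMappers, boolean hasTemporaryInput) {
        if (!hasTemporaryInput) {
            // Size estimates are assumed to be trustworthy for ordinary inputs.
            return sizeBasedReducers;
        }
        // Integer division; never go below one reducer (an assumption of this sketch).
        return Math.max(1, numMappers / 3);
    }

    public static void main(String[] args) {
        System.out.println(chooseReducers(1, 90, true));   // prints 30
        System.out.println(chooseReducers(1, 90, false));  // prints 1
    }
}
```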

2016-10-14 18:59 GMT+08:00 David Ortiz <dpo5003@gmail.com>:

You can manually set the reducer number using the conf object among other things.


On Fri, Oct 14, 2016, 5:43 AM 陈竞 <cj.magina@gmail.com> wrote:
Hi, I found that if the pipeline produces a temporary table, the number of reducers for a stage whose input is that temporary table becomes too small, since the temporary table has no content yet.



--
陈竞,中科院计算技术研究所,高性能计算机中心
Jing Chen HPCC.ICT.AC China


