crunch-user mailing list archives

From David Ortiz <dpo5...@gmail.com>
Subject Re: AvroParquetPathPerKeyTarget with Spark
Date Mon, 23 Sep 2019 17:40:07 GMT
Josh,

     To circle back to this after all this time: I was finally able to get permission to hand
this off.  I attached the patch to CRUNCH-670.

Thanks,
     Dave

> On May 25, 2018, at 7:21 PM, David Ortiz <dpo5003@gmail.com> wrote:
> 
> The answer seems to be no from my employer on uploading a patch. Double checking on that
though. 
> 
> On Fri, May 25, 2018, 4:40 PM Josh Wills <josh.wills@gmail.com>
wrote:
> CRUNCH-670 is the issue, FWIW
> 
> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <josh.wills@gmail.com>
wrote:
> Ah, of course-- nice detective work! Can you send me the code so I can patch it in?
> 
> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <dpo5003@gmail.com>
wrote:
> Josh,
> 
>      When I dug into the code a little more, I saw that both AvroPathPerKeyOutputFormat
> and AvroParquetPathPerKeyOutputFormat fall back to "part" as the default when building the
> basePath if there is no value set for "mapreduce.output.basename".  My guess is that when
> running via a SparkPipeline that value is not set.  I changed my local copy to use out0 as
> the defaultValue instead of part, and the job was able to write output successfully.
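For reference, a minimal standalone sketch of that fallback behavior. This is illustrative only, not the actual Crunch source: the Map-based config and the `basename` helper stand in for the lookup the real output formats do against a Hadoop Configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for reading "mapreduce.output.basename" from a
// Hadoop Configuration; the real output formats do an equivalent lookup.
public class BasenameFallback {

    static String basename(Map<String, String> conf, String defaultValue) {
        return conf.getOrDefault("mapreduce.output.basename", defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // When the property is set (as under MapReduce), the fallback is unused.
        conf.put("mapreduce.output.basename", "out0");
        System.out.println(basename(conf, "part")); // prints: out0

        // When the property is left unset (apparently the SparkPipeline case),
        // the hard-coded "part" default leaks into the basePath.
        conf.clear();
        System.out.println(basename(conf, "part")); // prints: part

        // The local workaround described above: default to "out0" so the
        // basePath matches what the target later looks for when copying output.
        System.out.println(basename(conf, "out0")); // prints: out0
    }
}
```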
> 
> Thanks,
>     Dave
> 
> On Fri, May 25, 2018 at 2:20 PM David Ortiz <dpo5003@gmail.com>
wrote:
> Josh,
> 
>      After cleaning up the logs a little bit I noticed this.
> 
> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p12/out0
> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p13/out0
> 
> When I look in those tmp directories while the job runs, it is actually writing out
> to the subdirectory "part" rather than "out0", so that would be another reason it's having
> issues.  Any thoughts on where that output path is coming from?  If you point me in the
> right direction I can try to figure it out.
> 
> Thanks,
>      Dave
> 
> On Fri, May 25, 2018 at 2:01 PM David Ortiz <dpo5003@gmail.com>
wrote:
> Hey Josh,
> 
>      I am still messing around with it a bit, but I seem to be getting the
> same behavior even after rebuilding with the patch.
> 
> Thanks,
>      Dave
> 
> On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wills@gmail.com>
wrote:
> David,
> 
> Take a look at CRUNCH-670; I think that patch fixes the problem in the most minimal way
I can think of.
> 
> https://issues.apache.org/jira/browse/CRUNCH-670
> 
> J
> 
> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wills@gmail.com>
wrote:
> I think that must be it Dave, but I can't for the life of me figure out where in the
code that's happening. Will take another look tonight.
> 
> J
> 
> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5003@gmail.com>
wrote:
> Josh,
> 
>      Is there any chance that somehow the output path to AvroParquetPathPerKey is getting
> twisted up when it goes through the compilation step?  Watching it while it runs, the output
> in the /tmp/crunch/p<stage> directory basically looks like what I would expect to see in
> the output directory.  AvroPathPerKeyTarget also showed similar behavior when I was messing
> around to see if that would work.
> 
> Thanks,
>      Dave
> 
> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wills@gmail.com>
wrote:
> Hrm, got it-- now at least I know where to look (although surprised that overriding the
finalize() didn't fix it, as I ran into similar problems with my own cluster and created a
SlackPipeline class that overrides that method.)
> 
> 
> J
> 
> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5003@gmail.com>
wrote:
> Josh,
> 
>      Those adjustments did not appear to do anything to stop the tmp directory from being
> removed at the end of the job execution (I overrode finalize with an empty block when creating
> the SparkPipeline and ran using pipeline.run() instead of done()).  However, I can confirm
> that I see the stage output for the two output directories, complete with parquet files
> partitioned by key.  But neither they nor anything else ever makes it to the output
> directory, which is never even created.
> 
> Thanks,
>      Dave
> 
> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5003@gmail.com>
wrote:
> Hey Josh,
> 
>      Thanks for taking a look.  I can definitely play with that on Monday when I'm back
at work.
> 
> Thanks,
>      Dave
> 
> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wills@gmail.com>
wrote:
> Hey David,
> 
> Looking at the code, the problem isn't obvious to me, but there are only two places things
could be going wrong: writing the data out of Spark into the temp directory where intermediate
outputs get stored (i.e., Spark isn't writing the data out for some reason) or moving the
data from the temp directory to the final location. The temp data is usually deleted at the
end of a Crunch run, but you can disable this by a) not calling Pipeline.cleanup or Pipeline.done
at the end of the run and b) subclassing SparkPipeline with dummy code that overrides the
finalize() method (which is implemented in the top-level DistributedPipeline abstract base
class) to be a no-op. Is that easy to try out to see if we can isolate the source of the error?
Otherwise I can play with this a bit tomorrow on my own cluster.
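A rough sketch of that no-op override pattern. Since the real SparkPipeline and DistributedPipeline live in Crunch, the classes below are illustrative mimics, not the actual Crunch API; only the idea (subclass and neutralize the cleanup hook so the temp directory survives the run) is taken from the advice above. The temp path comes from the log lines earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Crunch's DistributedPipeline: the cleanup hook that removes
// the /tmp/crunch-* working directory at the end of a run lives in the
// base class, so a subclass can neutralize it.
class FakePipeline {
    final List<String> deletedDirs = new ArrayList<>();

    protected void cleanupTempDirs() {        // stands in for finalize()
        deletedDirs.add("/tmp/crunch-1037479188");
    }

    public void done() {
        cleanupTempDirs();                    // normally wipes temp output
    }
}

// Debug subclass: overriding the hook with an empty body means done()
// leaves the intermediate per-key output in place for inspection.
class KeepTempPipeline extends FakePipeline {
    @Override
    protected void cleanupTempDirs() { /* no-op while debugging */ }
}

public class TempDirDemo {
    public static void main(String[] args) {
        FakePipeline normal = new FakePipeline();
        normal.done();
        System.out.println(normal.deletedDirs); // prints: [/tmp/crunch-1037479188]

        FakePipeline debug = new KeepTempPipeline();
        debug.done();
        System.out.println(debug.deletedDirs);  // prints: []
    }
}
```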
> 
> J
> 
> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5003@gmail.com>
wrote:
> Awesome.  Thanks for taking a look!
> 
> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wills@gmail.com>
wrote:
> hrm, that sounds like something is wrong with the commit operation on the Spark side;
let me take a look at it this evening!
> 
> J
> 
> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5003@gmail.com>
wrote:
> Hello,
> 
>      Are there any known issues with the AvroParquetPathPerKeyTarget when running a Spark
> pipeline?  When I run my pipeline with MapReduce, I get output.  When I run with Spark,
> the step beforehand that lists my partition keys (we use them to add partitions to Hive)
> shows data being present, but the output directory remains empty.  This behavior occurs
> when targeting both HDFS and S3 directly.
> 
> Thanks,
>      Dave
> 