crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: AvroParquetPathPerKeyTarget with Spark
Date Mon, 23 Sep 2019 17:44:30 GMT
Yay!! Thanks Dave!

On Mon, Sep 23, 2019 at 10:40 AM David Ortiz <dpo5003@gmail.com> wrote:

> Josh,
>
>      To circle back to this after forever, I was finally able to get
> permission to hand this off.  I attached it to CRUNCH-670.
>
> Thanks,
>      Dave
>
> On May 25, 2018, at 7:21 PM, David Ortiz <dpo5003@gmail.com> wrote:
>
> The answer from my employer on uploading a patch seems to be no. I'm
> double-checking on that, though.
>
> On Fri, May 25, 2018, 4:40 PM Josh Wills <josh.wills@gmail.com> wrote:
>
>> CRUNCH-670 is the issue, FWIW
>>
>> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>>
>>> Ah, of course-- nice detective work! Can you send me the code so I can
>>> patch it in?
>>>
>>> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>>
>>>> Josh,
>>>>
>>>>      When I dug into the code a little more, I saw that both
>>>> AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use
>>>> "part" as the default when creating the basePath if there is no value
>>>> set for "mapreduce.output.basename".  My guess is that when running via
>>>> a SparkPipeline, that value is not set.  I changed my local copy to use
>>>> out0 as the defaultValue instead of part, and the job was able to write
>>>> output successfully.
>>>>
>>>> Thanks,
>>>>     Dave
>>>>
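For reference, a minimal sketch of the lookup Dave describes above; this is
an illustration, not the actual Crunch source, and BaseNameSketch is a
hypothetical class. FileOutputFormat-style output formats resolve their
file base name from the "mapreduce.output.basename" key, with a hard-coded
fallback when the key is unset:

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical illustration of the fallback Dave describes in the
    // two per-key output formats.
    public class BaseNameSketch {
      static String resolveBaseName(Configuration conf) {
        // Falls back to "part" when the key is unset; Dave's local fix
        // changed this default to "out0".
        return conf.get("mapreduce.output.basename", "part");
      }

      public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println(resolveBaseName(conf)); // prints "part"
        conf.set("mapreduce.output.basename", "out0");
        System.out.println(resolveBaseName(conf)); // prints "out0"
      }
    }

If Dave's guess is right, MapReduce runs set this key while Spark runs
leave it unset, so the default wins and the output lands where the target
never looks.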
>>>> On Fri, May 25, 2018 at 2:20 PM David Ortiz <dpo5003@gmail.com> wrote:
>>>>
>>>>> Josh,
>>>>>
>>>>>      After cleaning up the logs a little bit I noticed this.
>>>>>
>>>>> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from
>>>>> /tmp/crunch-1037479188/p12/out0
>>>>> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from
>>>>> /tmp/crunch-1037479188/p13/out0
>>>>>
>>>>> When I look in those tmp directories while the job runs, they are
>>>>> actually writing out to the subdirectory part rather than out0, so
>>>>> that would be another reason why it's having issues.  Any thoughts on
>>>>> where that output path is coming from?  If you point me in the right
>>>>> direction I can try to figure it out.
>>>>>
>>>>> Thanks,
>>>>>      Dave
>>>>>
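A hypothetical workaround along the same lines (untested; the class name
and constructor arguments are placeholders): set the key explicitly on the
pipeline's Configuration so the Spark job writes under the subdirectory
the target copies from.

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.spark.SparkPipeline;

    public class BaseNameWorkaround {
      public static void main(String[] args) {
        // "local" and "MyApp" are placeholder arguments.
        Pipeline pipeline = new SparkPipeline("local", "MyApp");
        // Force the base name to match the "out0" subdirectory that
        // AvroPathPerKeyTarget tries to copy from (per the logs above).
        pipeline.getConfiguration().set("mapreduce.output.basename", "out0");
      }
    }

Whether that setting actually reaches the Spark-side output format is
unverified; the default-value change Dave describes above fixes it at the
source instead.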
>>>>> On Fri, May 25, 2018 at 2:01 PM David Ortiz <dpo5003@gmail.com> wrote:
>>>>>
>>>>>> Hey Josh,
>>>>>>
>>>>>>      I am still messing around with it a little bit, but I still
>>>>>> seem to be getting the same behavior even after rebuilding with the
>>>>>> patch.
>>>>>>
>>>>>> Thanks,
>>>>>>      Dave
>>>>>>
>>>>>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wills@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> Take a look at CRUNCH-670; I think that patch fixes the problem in
>>>>>>> the most minimal way I can think of.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wills@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think that must be it, Dave, but I can't for the life of me
>>>>>>>> figure out where in the code that's happening. Will take another
>>>>>>>> look tonight.
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5003@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Josh,
>>>>>>>>>
>>>>>>>>>      Is there any chance that somehow the output path to
>>>>>>>>> AvroParquetPathPerKey is getting twisted up when it goes through
>>>>>>>>> the compilation step?  Watching it while it runs, the output in
>>>>>>>>> the /tmp/crunch/p<stage> directory basically looks like what I
>>>>>>>>> would expect to see in the output directory.  It seems that
>>>>>>>>> AvroPathPerKeyTarget was also showing similar behavior when I was
>>>>>>>>> messing around to see if that would work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>      Dave
>>>>>>>>>
>>>>>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wills@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hrm, got it-- now at least I know where to look (although I'm
>>>>>>>>>> surprised that overriding the finalize() didn't fix it, as I ran
>>>>>>>>>> into similar problems with my own cluster and created a
>>>>>>>>>> SlackPipeline class that overrides that method.)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5003@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Josh,
>>>>>>>>>>>
>>>>>>>>>>>      Those adjustments did not appear to do anything to stop
>>>>>>>>>>> the tmp directory from being removed at the end of the job
>>>>>>>>>>> execution (overriding finalize with an empty block when creating
>>>>>>>>>>> the SparkPipeline, and running using pipeline.run() instead of
>>>>>>>>>>> done()).  However, I can confirm that I see the stage output for
>>>>>>>>>>> the two output directories, complete with Parquet files
>>>>>>>>>>> partitioned by key.  Still, neither they nor anything else ever
>>>>>>>>>>> makes it to the output directory, which is not even created.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>      Dave
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5003@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Josh,
>>>>>>>>>>>>
>>>>>>>>>>>>      Thanks for taking a look.  I can definitely play with
>>>>>>>>>>>> that on Monday when I'm back at work.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>      Dave
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wills@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the code, the problem isn't obvious to me, but
>>>>>>>>>>>>> there are only two places things could be going wrong: writing
>>>>>>>>>>>>> the data out of Spark into the temp directory where
>>>>>>>>>>>>> intermediate outputs get stored (i.e., Spark isn't writing the
>>>>>>>>>>>>> data out for some reason) or moving the data from the temp
>>>>>>>>>>>>> directory to the final location. The temp data is usually
>>>>>>>>>>>>> deleted at the end of a Crunch run, but you can disable this
>>>>>>>>>>>>> by a) not calling Pipeline.cleanup or Pipeline.done at the end
>>>>>>>>>>>>> of the run and b) subclassing SparkPipeline with dummy code
>>>>>>>>>>>>> that overrides the finalize() method (which is implemented in
>>>>>>>>>>>>> the top-level DistributedPipeline abstract base class) to be a
>>>>>>>>>>>>> no-op. Is that easy to try out to see if we can isolate the
>>>>>>>>>>>>> source of the error? Otherwise I can play with this a bit
>>>>>>>>>>>>> tomorrow on my own cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> J
>>>>>>>>>>>>>
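A minimal sketch of the debugging setup Josh describes, assuming the
SparkPipeline(String, String) constructor; the exact finalize() override
in DistributedPipeline may look different:

    import org.apache.crunch.impl.spark.SparkPipeline;

    // Keeps Crunch's temp directory around after a run so the
    // intermediate per-key output can be inspected.
    public class KeepTempSparkPipeline extends SparkPipeline {
      public KeepTempSparkPipeline(String sparkConnect, String appName) {
        super(sparkConnect, appName);
      }

      @Override
      protected void finalize() {
        // No-op: skip the temp-directory cleanup the base class performs.
      }
    }

Pair this with pipeline.run() rather than pipeline.done(), since done()
triggers the cleanup explicitly.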
>>>>>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Awesome.  Thanks for taking a look!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wills@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> hrm, that sounds like something is wrong with the commit
>>>>>>>>>>>>>>> operation on the Spark side; let me take a look at it this
>>>>>>>>>>>>>>> evening!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5003@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      Are there any known issues with the
>>>>>>>>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?
>>>>>>>>>>>>>>>> When I run my pipeline with MapReduce, I get output.  When
>>>>>>>>>>>>>>>> I run with Spark, the step before, where I list my
>>>>>>>>>>>>>>>> partition keys out (because we use them to add partitions
>>>>>>>>>>>>>>>> to Hive), shows data as being present, but the output
>>>>>>>>>>>>>>>> directory remains empty.  This behavior occurs when
>>>>>>>>>>>>>>>> targeting both HDFS and S3 directly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>      Dave
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>>
>
