crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-557) Fix file distribution from HDFS in Crunch-on-Spark
Date Fri, 25 Sep 2015 03:10:04 GMT


Josh Wills commented on CRUNCH-557:

Ack, my bad. I thought I had replicated the issue you were seeing in [2], but I misread the
stack trace (and then compounded the problem by posting the wrong patch). On my end,
BloomFilterJoinStrategy works correctly once it gets past the bug in [1], which is fixed by
557a. Are you running the code using spark-submit, or something else?

> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>                 Key: CRUNCH-557
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-557.patch, CRUNCH-557a.patch, CRUNCH-557b.patch
> From the user list:
> I was trying to determine the effect of changing JoinStrategy on a Spark pipeline. I noticed
> that my pipeline works fine with DefaultJoinStrategy, however I could not get it to work
> with MapSideJoinStrategy or BloomFilterJoinStrategy. For MapSideJoinStrategy I get an
> exception [1] on the driver itself, and for BloomFilterJoinStrategy I get exceptions [2] in
> one of the stages. I have not tried any configuration changes, but I did run tests with
> datasets of different sizes to ensure that my PCollection is small enough to fit in memory.
> I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.
> [1]
> [2]
> The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme
> for the URI it's handing to Spark.
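
To illustrate what "include a scheme" means here, the following is a minimal sketch using only
java.net.URI. It is not the code from any of the attached patches. The class name QualifyUri,
the method qualify, and the hdfs://namenode:8020 address are hypothetical and chosen only for
illustration. The idea is that a scheme-less path such as /tmp/crunch/cache.bin is ambiguous
to the receiver, so the sender fills in the scheme and authority of the default filesystem
before handing the URI on.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QualifyUri {
    /**
     * Returns the URI unchanged if it already carries a scheme; otherwise
     * fills in the scheme and authority of defaultFs (a hypothetical default
     * filesystem URI). Only absolute paths are supported in this sketch: a
     * relative path combined with an authority is rejected by java.net.URI.
     */
    public static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // already fully qualified, leave it alone
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           uri.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Assumed address, for illustration only.
        URI defaultFs = URI.create("hdfs://namenode:8020");
        System.out.println(qualify("/tmp/crunch/cache.bin", defaultFs));
        System.out.println(qualify("hdfs://other:8020/a/b", defaultFs));
    }
}
```

In Hadoop-based code the same effect is usually obtained with FileSystem.makeQualified(Path)
rather than by assembling the URI by hand; whether the actual fix takes that route is not
stated in this message.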

This message was sent by Atlassian JIRA