[ https://issues.apache.org/jira/browse/CRUNCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747367#comment-14747367 ]
Surbhi Mungre commented on CRUNCH-557:
--------------------------------------
The patch above addresses only the first issue[1]; the second issue[2] is still unresolved.
It looks like, for some reason, the files cached in the distributed cache are never added to the SparkContext at all.
[1] https://gist.github.com/anonymous/15d6c691b743ad392d42
[2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>
> Key: CRUNCH-557
> URL: https://issues.apache.org/jira/browse/CRUNCH-557
> Project: Crunch
> Issue Type: Bug
> Reporter: Josh Wills
> Attachments: CRUNCH-557.patch, CRUNCH-557a.patch
>
>
> From the user list:
> I was trying to determine the effect of changing the JoinStrategy on a Spark pipeline. I noticed
> that my pipeline works fine with DefaultJoinStrategy, but I could not get it working
> with MapSideJoinStrategy or BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception[1]
> on the driver itself, and for BloomFilterJoinStrategy I get exceptions[2] in one of the stages.
> I have not tried any configuration changes, but I did run tests with datasets of different
> sizes to ensure that my PCollection is small enough to fit in memory. I am running Spark in
> yarn-client mode with Crunch 0.11.0-cdh5.4.2.
> [1] https://gist.github.com/anonymous/15d6c691b743ad392d42
> [2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
> The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme
> for the URI it's handing to Spark.
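As a rough illustration of what "include a scheme for the URI" could mean, here is a minimal sketch. It is not the code from CRUNCH-557.patch. The class `UriSchemeSketch`, the helper `qualify`, and the `hdfs://namenode:8020` default filesystem are all hypothetical names chosen for the example, and only `java.net.URI` is used so that the snippet runs without Hadoop or Spark on the classpath. It assumes the cached file path is absolute. In real Hadoop code the same effect is usually obtained with `FileSystem.makeQualified(Path)` before the path is handed to Spark.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSchemeSketch {

    // Hypothetical helper (not part of Crunch or Spark).
    // If the path already carries a scheme it is returned unchanged;
    // otherwise the scheme and authority of defaultFs are prepended.
    // Assumes an absolute path such as "/tmp/crunch/cache/part-0".
    public static URI qualify(String path, URI defaultFs) {
        URI raw = URI.create(path);
        if (raw.getScheme() != null) {
            return raw; // already fully qualified, e.g. hdfs://host:port/...
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           raw.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Assumed default filesystem; a real job would read it from its configuration.
        URI defaultFs = URI.create("hdfs://namenode:8020");

        // A scheme-less path, the kind the description says is handed to Spark today.
        System.out.println(qualify("/tmp/crunch/cache/part-0", defaultFs));

        // A path that already has a scheme is passed through untouched.
        System.out.println(qualify("hdfs://namenode:8020/tmp/crunch/cache/part-1", defaultFs));
    }
}
```

The point of the sketch is only that a bare path like `/tmp/crunch/cache/part-0` gives the receiver no indication of which filesystem to read from, whereas the qualified form names HDFS explicitly.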