spark-user mailing list archives

From ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
Subject Re: Join highly skewed datasets
Date Sat, 27 Jun 2015 04:23:23 GMT
This is nice. Which version of Spark has this support? Or do I need to
build it myself?
I have never built Spark from git, so please share instructions for Hadoop
2.4.x on YARN.

I am struggling a lot to get a join to work between 200 GB and 2 TB
datasets. I constantly get the exception below; thousands of executors are
failing with:

15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
java.io.IOException: Failed to connect to executor_host_name/executor_ip_address:60162
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
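
For reference, the block-join idea from the quoted thread below (replicate the smaller side, spread each skewed key of the larger side over several buckets) can be sketched on plain Scala collections. This is only an illustration under stated assumptions: `BlockJoinSketch`, `replication`, and `seed` are hypothetical names, and the code is neither Scoobi's `Relational.blockJoin` nor the implementation proposed in the Spark pull request mentioned below.

```scala
import scala.util.Random

object BlockJoinSketch {
  /** Equijoin of a small `left` with a large, key-skewed `right` via salting.
    *
    * Every left record is copied once per salt value in [0, replication);
    * every right record is assigned one random salt. Matching on (key, salt)
    * splits each hot key into `replication` groups while yielding the same
    * pairs as a plain equijoin.
    */
  def blockJoin[K, A, B](left: Seq[(K, A)],
                         right: Seq[(K, B)],
                         replication: Int,
                         seed: Long = 42L): Seq[(K, (A, B))] = {
    require(replication > 0, "replication must be positive")
    val rng = new Random(seed)

    // Small side: one copy of each record per salt value, indexed by (key, salt).
    val saltedLeft: Map[(K, Int), Seq[A]] =
      left
        .flatMap { case (k, a) => (0 until replication).map(salt => ((k, salt), a)) }
        .groupBy(_._1)
        .map { case (saltedKey, rows) => saltedKey -> rows.map(_._2) }

    // Large side: each record lands in exactly one salt bucket of its key.
    right.flatMap { case (k, b) =>
      val salt = rng.nextInt(replication)
      saltedLeft.getOrElse((k, salt), Seq.empty).map(a => (k, (a, b)))
    }
  }
}
```

Calling `BlockJoinSketch.blockJoin(left, right, replication = 4)` returns the same pairs as an ordinary inner join. On pair RDDs the same shape would be a `flatMap` on the small side, a `map` on the large side, and a `join` on the salted key, so that one hot key is shuffled to several partitions instead of one.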




On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <koert@tresata.com> wrote:

> we went through a similar process, switching from scalding (where
> everything just works on large datasets) to spark (where it does not).
>
> spark can be made to work on very large datasets, it just requires a
> little more effort. pay attention to your storage levels (should be
> memory-and-disk or disk-only), number of partitions (should be large,
> multiple of num executors), and avoid groupByKey
>
> also see:
> https://github.com/tresata/spark-sorted (for avoiding in-memory
> operations for certain types of reduce operations)
> https://github.com/apache/spark/pull/6883 (for blockjoin)
>
>
> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
> wrote:
>
>> Not far at all. On large data sets everything simply fails with Spark.
>> Worst of all, I am not able to figure out the reason for the failure: the
>> logs run into millions of lines and I do not know which keywords to search
>> for to find the cause.
>>
>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolfzor@gmail.com>
>> wrote:
>>
>>> How far did you get?
>>>
>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>> wrote:
>>>
>>>> We use Scoobi + MR to perform joins, and in particular we use the
>>>> blockJoin() API of Scoobi:
>>>>
>>>>
>>>> /** Perform an equijoin with another distributed list where this list is
>>>>   * considerably smaller than the right (but too large to fit in memory),
>>>>   * and where the keys of right may be particularly skewed. */
>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>   Relational.blockJoin(left, right)
>>>>
>>>>
>>>> I am working on a POC. Which Spark join API(s) are recommended to
>>>> achieve something similar?
>>>>
>>>> Please suggest.
>>>>
>>>> --
>>>> Deepak
>>>>
>>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>


-- 
Deepak
