spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Join highly skewed datasets
Date Sun, 28 Jun 2015 19:15:40 GMT
You can use the following command to build Spark after applying the pull
request:

mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
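
If you haven't applied the pull request yet, one way to fetch it (assuming
your clone's remote points at the GitHub mirror; the branch name is
illustrative) is:

git fetch origin pull/6883/head:blockjoin
git checkout blockjoin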


Cheers


On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:

> I see that blockJoin support did not make it into the Spark 1.4 release.
>
> Can you share instructions for building Spark with this support for a
> Hadoop 2.4.x distribution?
>
> Appreciate it.
>
> On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
> wrote:
>
>> This is nice. Which version of Spark has this support, or do I need to
>> build it?
>> I have never built Spark from git; please share instructions for Hadoop
>> 2.4.x YARN.
>>
>> I am struggling to get a join working between a 200G and a 2TB dataset,
>> and I am constantly getting this exception:
>>
>> Thousands of executors are failing with
>>
>> 15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to
>> get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
>> java.io.IOException: Failed to connect to
>> executor_host_name/executor_ip_address:60162
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>>
>> On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> we went through a similar process, switching from scalding (where
>>> everything just works on large datasets) to spark (where it does not).
>>>
>>> spark can be made to work on very large datasets; it just requires a
>>> little more effort. pay attention to your storage levels (should be
>>> memory-and-disk or disk-only), the number of partitions (should be
>>> large, a multiple of the number of executors), and avoid groupByKey
>>>
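>>> a rough sketch of those knobs, assuming pair RDDs (the function name
>>> and the partition multiple are just illustrative):
>>>
>>> import org.apache.spark.rdd.RDD
>>> import org.apache.spark.storage.StorageLevel
>>>
>>> def skewFriendlyReduce(pairs: RDD[(String, Long)],
>>>                        numExecutors: Int): RDD[(String, Long)] = {
>>>   // many partitions, a multiple of the executor count
>>>   val repartitioned = pairs.repartition(numExecutors * 8)
>>>   // spill to disk rather than fail when memory runs out
>>>   val persisted = repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
>>>   // reduceByKey combines map-side; groupByKey would buffer every
>>>   // value for a key in memory
>>>   persisted.reduceByKey(_ + _)
>>> }
>>>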
>>> also see:
>>> https://github.com/tresata/spark-sorted (for avoiding in-memory
>>> operations for certain types of reduce operations)
>>> https://github.com/apache/spark/pull/6883 (for blockjoin)
>>>
>>>
>>> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>> wrote:
>>>
>>>> Not far at all. On large datasets everything simply fails with Spark.
>>>> Worst of all, I am not able to figure out the reason for a failure: the
>>>> logs run into millions of lines, and I do not know which keywords to
>>>> search for to find the failure reason.
>>>>
>>>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolfzor@gmail.com>
>>>> wrote:
>>>>
>>>>> How far did you get?
>>>>>
>>>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We use Scoobi + MR to perform joins, and we particularly use the
>>>>>> blockJoin() API of Scoobi:
>>>>>>
>>>>>>
>>>>>> /** Perform an equijoin with another distributed list where this list
>>>>>>   * is considerably smaller than the right (but too large to fit in
>>>>>>   * memory), and where the keys of right may be particularly skewed. */
>>>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>>>   Relational.blockJoin(left, right)
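>>>>>>
>>>>>> A call site looks roughly like this (names are illustrative; the
>>>>>> smaller dataset goes on the left, per the doc comment above):
>>>>>>
>>>>>> val smaller: DList[(Key, Small)] = ... // the ~200G side
>>>>>> val larger: DList[(Key, Big)] = ...    // the ~2TB side, skewed keys
>>>>>> val joined: DList[(Key, (Small, Big))] = smaller.blockJoin(larger)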
>>>>>>
>>>>>>
>>>>>> I am trying to do a POC. Which Spark join API(s) would you recommend
>>>>>> to achieve something similar?
>>>>>>
>>>>>> Please suggest.
>>>>>>
>>>>>> --
>>>>>> Deepak
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Deepak
>>>>
>>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>
>
> --
> Deepak
>
>
