spark-user mailing list archives

From ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
Subject Re: Join highly skewed datasets
Date Sun, 28 Jun 2015 19:56:33 GMT
Running this now:

 ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package


Waiting for it to complete. There is no progress after the initial log messages.


//LOGS

$ ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
+++ dirname ./make-distribution.sh
++ cd .
++ pwd
+ SPARK_HOME=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0
+ DISTDIR=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist
+ SPARK_TACHYON=false
+ TACHYON_VERSION=0.6.4
+ TACHYON_TGZ=tachyon-0.6.4-bin.tar.gz
+ TACHYON_URL=https://github.com/amplab/tachyon/releases/download/v0.6.4/tachyon-0.6.4-bin.tar.gz
+ MAKE_TGZ=false
+ NAME=none
+ MVN=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
+ ((  9  ))
+ case $1 in
+ MAKE_TGZ=true
+ shift
+ ((  8  ))
+ case $1 in
+ break
+ '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
+ '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
++ command -v git
+ '[' /usr/bin/git ']'
++ git rev-parse --short HEAD
++ :
+ GITREV=
+ '[' '!' -z '' ']'
+ unset GITREV
++ command -v /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
+ '[' '!' /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn ']'
++ /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.version -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
++ grep -v INFO
++ tail -n 1

//LOGS
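
One possible reading of the trace above, offered as a sketch and not as a diagnosis: the last command is a pipeline that ends in "grep -v INFO | tail -n 1", and tail -n 1 cannot print anything until its input has ended. Since "clean package" was passed through to the help:evaluate call, that step may be running a full Maven build with all of its output filtered away, which would look like a hang. The snippet below imitates that behaviour. fake_build is a made-up function standing in for the Maven call; it is not part of Spark, and the version string is only an example.

```shell
#!/bin/sh
# fake_build is a hypothetical stand-in for the slow "mvn help:evaluate ..." call.
# It prints Maven-style [INFO] lines around a single non-INFO line (the version).
fake_build() {
  echo "[INFO] Scanning for projects..."
  sleep 1                     # pretend the build is doing real work
  echo "[INFO] Building module"
  echo "1.4.0"                # example version string, the only non-INFO line
  echo "[INFO] BUILD SUCCESS"
}

# Same filter shape as in the trace: drop INFO lines, keep only the last line.
# Nothing reaches the terminal until fake_build has finished.
version=$(fake_build | grep -v INFO | tail -n 1)
echo "version=$version"
```

If this reading is right, the script is working and is simply silent. Checking from another terminal whether a Maven or java process is busy (for example with ps or top) would tell a slow build apart from a real hang.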

On Sun, Jun 28, 2015 at 12:17 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:

> I just did that. Where can I find the "spark-1.4.0-bin-hadoop2.4.tgz"
> file?
>
> On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> You can use the following command to build Spark after applying the pull
>> request:
>>
>> mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
>>
>>
>> Cheers
>>
>>
>> On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>> wrote:
>>
>>> I see that blockJoin support did not make it into the Spark 1.4 release.
>>>
>>> Can you share instructions for building Spark with this support for the
>>> Hadoop 2.4.x distribution?
>>>
>>> Much appreciated.
>>>
>>> On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>> wrote:
>>>
>>>> This is nice. Which version of Spark has this support? Or do I need to
>>>> build it myself?
>>>> I have never built Spark from git, so please share instructions for Hadoop
>>>> 2.4.x on YARN.
>>>>
>>>> I am struggling a lot to get a join to work between 200 GB and 2 TB datasets.
>>>> I am constantly getting this exception.
>>>>
>>>> Thousands of executors are failing with:
>>>>
>>>> 15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
>>>> java.io.IOException: Failed to connect to executor_host_name/executor_ip_address:60162
>>>> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>>>> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>>> at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>>>> at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>> at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>>>> at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <koert@tresata.com>
>>>> wrote:
>>>>
>>>>> we went through a similar process, switching from scalding (where
>>>>> everything just works on large datasets) to spark (where it does not).
>>>>>
>>>>> spark can be made to work on very large datasets, it just requires a
>>>>> little more effort. pay attention to your storage levels (should be
>>>>> memory-and-disk or disk-only), number of partitions (should be large,
>>>>> multiple of num executors), and avoid groupByKey
>>>>>
>>>>> also see:
>>>>> https://github.com/tresata/spark-sorted (for avoiding in memory
>>>>> operations for certain type of reduce operations)
>>>>> https://github.com/apache/spark/pull/6883 (for blockjoin)
>>>>>
>>>>>
>>>>> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Not far at all. On large data sets everything simply fails with
>>>>>> Spark. Worst of all, I am not able to figure out the reason for the
>>>>>> failure: the logs run into millions of lines and I do not know which
>>>>>> keywords to search for.
>>>>>>
>>>>>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolfzor@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> How far did you get?
>>>>>>>
>>>>>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We use Scoobi + MR to perform joins, and in particular we use
>>>>>>>> Scoobi's blockJoin() API:
>>>>>>>>
>>>>>>>>
>>>>>>>> /** Perform an equijoin with another distributed list where this list is considerably smaller
>>>>>>>>  * than the right (but too large to fit in memory), and where the keys of right may be
>>>>>>>>  * particularly skewed. */
>>>>>>>>
>>>>>>>>  def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>>>>>     Relational.blockJoin(left, right)
>>>>>>>>
>>>>>>>>
>>>>>>>> I am trying to do a POC. Which Spark join API(s) are recommended
>>>>>>>> to achieve something similar?
>>>>>>>>
>>>>>>>> Please suggest.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Deepak
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Deepak
>>>>
>>>>
>>>
>>>
>>> --
>>> Deepak
>>>
>>>
>>
>
>
> --
> Deepak
>
>


-- 
Deepak
