The Maven command needs to be passed through the --mvn option.
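For example (a sketch only; adjust the profiles to your build, and note that --mvn points the script at the Maven launcher to use):

./make-distribution.sh --tgz --mvn build/mvn -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver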

Cheers

On Sun, Jun 28, 2015 at 12:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:
Running this now

 ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package


Waiting for it to complete. There is no progress after the initial log messages.


//LOGS

$ ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
+++ dirname ./make-distribution.sh
++ cd .
++ pwd
+ SPARK_HOME=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0
+ DISTDIR=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist
+ SPARK_TACHYON=false
+ TACHYON_VERSION=0.6.4
+ TACHYON_TGZ=tachyon-0.6.4-bin.tar.gz
+ TACHYON_URL=https://github.com/amplab/tachyon/releases/download/v0.6.4/tachyon-0.6.4-bin.tar.gz
+ MAKE_TGZ=false
+ NAME=none
+ MVN=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
+ ((  9  ))
+ case $1 in
+ MAKE_TGZ=true
+ shift
+ ((  8  ))
+ case $1 in
+ break
+ '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
+ '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
++ command -v git
+ '[' /usr/bin/git ']'
++ git rev-parse --short HEAD
++ :
+ GITREV=
+ '[' '!' -z '' ']'
+ unset GITREV
++ command -v /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
+ '[' '!' /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn ']'
++ /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.version -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
++ grep -v INFO
++ tail -n 1

//LOGS


On Sun, Jun 28, 2015 at 12:17 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:
I just did that. Where can I find the "spark-1.4.0-bin-hadoop2.4.tgz" file?

On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu <yuzhihong@gmail.com> wrote:
You can use the following command to build Spark after applying the pull request:
mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
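If you have not applied it locally yet, one way is GitHub's read-only pull-request refs (the local branch name "blockjoin" here is just a placeholder):

git fetch https://github.com/apache/spark pull/6883/head:blockjoin
git checkout blockjoin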

Cheers

On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:
I see that block join support did not make it into the Spark 1.4 release.

Can you share instructions for building Spark with this support for a Hadoop 2.4.x distribution?

Appreciate it.

On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:
This is nice. Which version of Spark has this support? Or do I need to build it?
I have never built Spark from git; please share instructions for Hadoop 2.4.x YARN.

I am struggling a lot to get a join working between 200 GB and 2 TB datasets, and I am constantly getting this exception.

Thousands of executors are failing with:

15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
java.io.IOException: Failed to connect to executor_host_name/executor_ip_address:60162
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)




On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <koert@tresata.com> wrote:
We went through a similar process, switching from Scalding (where everything just works on large datasets) to Spark (where it does not).

Spark can be made to work on very large datasets; it just requires a little more effort. Pay attention to your storage levels (should be memory-and-disk or disk-only), the number of partitions (should be large, a multiple of the number of executors), and avoid groupByKey.
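A minimal sketch of those three points (the paths, key parsing, partition count, and object name are all illustrative, not tuned recommendations):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LargeJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("large-join-sketch"))

    // Key both inputs by their first tab-separated field.
    val left  = sc.textFile("hdfs:///data/left").map(line => (line.split("\t")(0), line))
    val right = sc.textFile("hdfs:///data/right").map(line => (line.split("\t")(0), line))

    // Memory-and-disk: spill partitions to disk instead of failing when
    // they do not fit in memory.
    left.persist(StorageLevel.MEMORY_AND_DISK)

    // Ask for many partitions, ideally a multiple of the executor count.
    val joined = left.join(right, 2000)

    // Prefer reduceByKey (map-side combine) over groupByKey.
    val rowsPerKey = joined.mapValues(_ => 1L).reduceByKey(_ + _)

    rowsPerKey.saveAsTextFile("hdfs:///out/rows_per_key")
    sc.stop()
  }
}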

Also see:
https://github.com/tresata/spark-sorted (for avoiding in memory operations for certain type of reduce operations)
https://github.com/apache/spark/pull/6883 (for blockjoin)
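For badly skewed keys there is also the hand-rolled salting trick (plainly not what the blockjoin PR implements; the helper name and replication factor below are illustrative): scatter each key of the skewed side across r random sub-keys and replicate the smaller side r times, so one hot key no longer lands in a single task.

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Salt the skewed side, replicate the small side so matching pairs still
// meet, join on the (key, salt) pairs, then drop the salt.
def saltedJoin[V: ClassTag, W: ClassTag](
    skewed: RDD[(String, V)],
    small: RDD[(String, W)],
    r: Int = 16): RDD[(String, (V, W))] = {
  val salted = skewed.map { case (k, v) => ((k, Random.nextInt(r)), v) }
  val replicated = small.flatMap { case (k, w) => (0 until r).map(i => ((k, i), w)) }
  salted.join(replicated).map { case ((k, _), vw) => (k, vw) }
}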


On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:
Not far at all. On large datasets everything simply fails with Spark. Worst of all, I am not able to figure out the reason for the failures: the logs run into millions of lines, and I do not know which keywords to search for to find the cause.

On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolfzor@gmail.com> wrote:
How far did you get?

On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:
We use Scoobi + MR to perform joins, and we particularly use the blockJoin() API of Scoobi:


/** Perform an equijoin with another distributed list where this list is considerably smaller
  * than the right (but too large to fit in memory), and where the keys of right may be
  * particularly skewed. */
def blockJoin[B: WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
  Relational.blockJoin(left, right)

I am trying to do a POC; which Spark join API(s) would you recommend to achieve something similar?

Please suggest.

--
Deepak