spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <ko...@tresata.com>
Subject Re: Join highly skewed datasets
Date Sun, 28 Jun 2015 21:57:13 GMT
a blockJoin spreads out one side while replicating the other. i would
suggest replicating the smaller side. so if lstgItem is smaller try 3,1 or
else 1,3. this should spread the "fat" keys out over multiple (3)
executors...


On Sun, Jun 28, 2015 at 5:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:

> I am able to use blockjoin API and it does not throw compilation error
>
> val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary,
> Long))] = lstgItem.blockJoin(viEvents,1,1).map {
>
> }
>
> Here viEvents is highly skewed and both are on HDFS.
>
> What should be the optimal values of replication, i gave 1,1
>
>
>
> On Sun, Jun 28, 2015 at 1:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
> wrote:
>
>> I incremented the version of spark from 1.4.0 to 1.4.0.1 and ran
>>
>>  ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn  -Phive
>> -Phive-thriftserver
>>
>> Build was successful but the script faild. Is there a way to pass the
>> incremented version ?
>>
>>
>> [INFO] BUILD SUCCESS
>>
>> [INFO]
>> ------------------------------------------------------------------------
>>
>> [INFO] Total time: 09:56 min
>>
>> [INFO] Finished at: 2015-06-28T13:45:29-07:00
>>
>> [INFO] Final Memory: 84M/902M
>>
>> [INFO]
>> ------------------------------------------------------------------------
>>
>> + rm -rf /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist
>>
>> + mkdir -p /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib
>>
>> + echo 'Spark 1.4.0.1 built for Hadoop 2.4.0'
>>
>> + echo 'Build flags: -Phadoop-2.4' -Pyarn -Phive -Phive-thriftserver
>>
>> + cp
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/assembly/target/scala-2.10/spark-assembly-1.4.0.1-hadoop2.4.0.jar
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>
>> + cp
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/examples/target/scala-2.10/spark-examples-1.4.0.1-hadoop2.4.0.jar
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>
>> + cp
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/network/yarn/target/scala-2.10/spark-1.4.0.1-yarn-shuffle.jar
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>
>> + mkdir -p
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/examples/src/main
>>
>> + cp -r /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/examples/src/main
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/examples/src/
>>
>> + '[' 1 == 1 ']'
>>
>> + cp
>> '/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar'
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/
>>
>> cp:
>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar:
>> No such file or directory
>>
>> LM-SJL-00877532:spark-1.4.0 dvasthimal$ ./make-distribution.sh  --tgz
>> -Phadoop-2.4 -Pyarn  -Phive -Phive-thriftserver
>>
>>
>>
>> On Sun, Jun 28, 2015 at 1:41 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> you need 1) to publish to inhouse maven, so your application can depend
>>> on your version, and 2) use the spark distribution you compiled to launch
>>> your job (assuming you run with yarn so you can launch multiple versions of
>>> spark on same cluster)
>>>
>>> On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>> wrote:
>>>
>>>> How can i import this pre-built spark into my application via maven as
>>>> i want to use the block join API.
>>>>
>>>> On Sun, Jun 28, 2015 at 1:31 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>>> wrote:
>>>>
>>>>> I ran this w/o maven options
>>>>>
>>>>> ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn  -Phive
>>>>> -Phive-thriftserver
>>>>>
>>>>> I got this spark-1.4.0-bin-2.4.0.tgz in the same working directory.
>>>>>
>>>>> I hope this is built with 2.4.x hadoop as i did specify -P
>>>>>
>>>>> On Sun, Jun 28, 2015 at 1:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>  ./make-distribution.sh  --tgz --*mvn* "-Phadoop-2.4 -Pyarn
>>>>>> -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean
package"
>>>>>>
>>>>>>
>>>>>> or
>>>>>>
>>>>>>
>>>>>>  ./make-distribution.sh  --tgz --*mvn* -Phadoop-2.4 -Pyarn
>>>>>> -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean
package"
>>>>>> ​Both fail with
>>>>>>
>>>>>> + echo -e 'Specify the Maven command with the --mvn flag'
>>>>>>
>>>>>> Specify the Maven command with the --mvn flag
>>>>>>
>>>>>> + exit -1
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Deepak
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Deepak
>>>>
>>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>
>
> --
> Deepak
>
>

Mime
View raw message