spark-user mailing list archives

From Archit Thakur <archit279tha...@gmail.com>
Subject Re: Spark context jar confusions
Date Thu, 02 Jan 2014 12:39:27 GMT
Eugen, you said Spark sends the jar to each worker if we specify it. What
if we only build a fat jar and do not call SparkContext.jarOfClass(class)?
Once we have built a fat jar, won't all of its classes be available on the
slave nodes? What if we access one of them in code that is supposed to run
on a slave node? E.g., an object Z that is present in the fat jar and is
accessed in the map function (which is executed in a distributed fashion).
Won't it be accessible there (because it is available at compile time)? It
usually is, isn't it?
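
To make the question concrete: the path that gets handed to the context has to come from somewhere, and the JVM can usually report which jar (or class directory) a class was loaded from. The following is a minimal sketch of that idea in plain Scala. It is not Spark's implementation of jarOfClass; `JarLocator` and `locationOf` are hypothetical names used only for illustration.

```scala
object JarLocator {
  // Returns the location (jar file or class directory) that `cls` was loaded
  // from, if the JVM knows it. Classes loaded by the bootstrap class loader,
  // such as java.lang.String, have no code source, so the result is None.
  // Note: the returned path is URL-encoded (a space appears as %20).
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.getPath

  def main(args: Array[String]): Unit = {
    // When packaged in a fat jar, this would typically print the jar's path.
    println(locationOf(JarLocator.getClass))
    println(locationOf(classOf[String]))
  }
}
```

Knowing the path on the driver says nothing about whether the workers have the file; that still depends on the jar being listed when the context is created, or already being on the workers' classpath.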


On Thu, Jan 2, 2014 at 6:02 PM, Archit Thakur <archit279thakur@gmail.com> wrote:

> Aureliano, it doesn't actually matter. Specifying "local" as your Spark
> master simply runs the whole application in a single JVM. Setting up
> a cluster and then specifying "spark://localhost:7077" runs it on a set
> of machines. Running Spark in local mode is helpful for debugging
> but will be much slower than running on a cluster of several
> machines. If you do not have a set of machines, you can use your own
> machine as a slave and start both the master and the slave on it. See
> the Apache Spark home page to learn more about starting the various nodes.
> Thanks.
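
The master strings discussed here ("local", "local[N]", "spark://host:port") can be told apart with a few lines of Scala. This is only an illustrative sketch of how an application might classify the value it is given before creating the context; `MasterUrl`, `Mode`, and `parse` are hypothetical names, not part of Spark.

```scala
object MasterUrl {
  sealed trait Mode
  final case class Local(threads: Int) extends Mode                 // single JVM
  final case class Standalone(host: String, port: Int) extends Mode // standalone cluster master
  final case class Other(url: String) extends Mode                  // anything this sketch does not recognize

  // Scala regex extractors must match the whole string.
  private val LocalN   = """local\[(\d+)\]""".r
  private val SparkUrl = """spark://([^:/]+):(\d+)""".r

  def parse(master: String): Mode = master match {
    case "local"              => Local(1)
    case LocalN(n)            => Local(n.toInt)
    case SparkUrl(host, port) => Standalone(host, port.toInt)
    case other                => Other(other)
  }
}
```

With this sketch, `MasterUrl.parse("spark://localhost:7077")` yields `Standalone("localhost", 7077)`, while a bare `"localhost"` falls through to `Other`.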
>
>
>
> On Thu, Jan 2, 2014 at 5:21 PM, Aureliano Buendia <buendia360@gmail.com> wrote:
>
>> How about when developing the spark application, do you use "localhost",
>> or "spark://localhost:7077" for spark context master during development?
>>
>> Using "spark://localhost:7077" is a good way to simulate the production
>> driver and it provides the web ui. When using "spark://localhost:7077", is
>> it required to create the fat jar? Wouldn't that significantly slow down
>> the development cycle?
>>
>>
>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <cepoi.eugen@gmail.com> wrote:
>>
>>> It depends on how you deploy; I don't find it so complicated...
>>>
>>> 1) To build the fat jar I am using Maven (as I am not familiar with sbt).
>>>
>>> In my build I have something like this, specifying which libs should be
>>> included in the fat jar (the others won't be present in the final artifact).
>>>
>>> <plugin>
>>>                 <groupId>org.apache.maven.plugins</groupId>
>>>                 <artifactId>maven-shade-plugin</artifactId>
>>>                 <version>2.1</version>
>>>                 <executions>
>>>                     <execution>
>>>                         <phase>package</phase>
>>>                         <goals>
>>>                             <goal>shade</goal>
>>>                         </goals>
>>>                         <configuration>
>>>                             <minimizeJar>true</minimizeJar>
>>>                             <createDependencyReducedPom>false</createDependencyReducedPom>
>>>                             <artifactSet>
>>>                                 <includes>
>>>                                     <include>org.apache.hbase:*</include>
>>>                                     <include>org.apache.hadoop:*</include>
>>>                                     <include>com.typesafe:config</include>
>>>                                     <include>org.apache.avro:*</include>
>>>                                     <include>joda-time:*</include>
>>>                                     <include>org.joda:*</include>
>>>                                 </includes>
>>>                             </artifactSet>
>>>                             <filters>
>>>                                 <filter>
>>>                                     <artifact>*:*</artifact>
>>>                                     <excludes>
>>>                                         <exclude>META-INF/*.SF</exclude>
>>>                                         <exclude>META-INF/*.DSA</exclude>
>>>                                         <exclude>META-INF/*.RSA</exclude>
>>>                                     </excludes>
>>>                                 </filter>
>>>                             </filters>
>>>                         </configuration>
>>>                     </execution>
>>>                 </executions>
>>>             </plugin>
>>>
>>>
>>> 2) The app is the jar you have built, so you ship it to the driver node
>>> (how depends a lot on how you are planning to use it: Debian packaging, a
>>> plain old scp, etc.). To run it you can do something like:
>>>
>>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>
>>> where MyJob is the entry point to your job (it defines a main method).
>>>
>>> 3) I don't know what the "common way" is, but I am doing things this way:
>>> build the fat jar, provide some launch scripts, build a Debian package, ship
>>> it to a node that plays the role of the driver, and run it over Mesos using
>>> the launch scripts + some conf.
>>>
>>>
>>> 2014/1/2 Aureliano Buendia <buendia360@gmail.com>
>>>
>>>> I wasn't aware of jarOfClass. I wish there was only one good way of
>>>> deploying in Spark, instead of many ambiguous methods. (It seems Spark
>>>> has followed Scala in that there is more than one way of accomplishing a
>>>> job, which makes Scala an overcomplicated language.)
>>>>
>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>> should be used? My local sbt or $SPARK_HOME/sbt/sbt? Why is Spark
>>>> shipped with a separate sbt?
>>>>
>>>> 2. Let's say we have the dependencies fat jar which is supposed to be
>>>> shipped to the workers. Now how do we deploy the main app which is supposed
>>>> to be executed on the driver? Make another jar out of it? Does sbt
>>>> assembly also create that jar?
>>>>
>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>> cannot find any example by googling. What's the most common way that people
>>>> use?
>>>>
>>>>
>>>>
>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <cepoi.eugen@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is the list of the jars you use in your job; the driver will send
>>>>> all those jars to each worker (otherwise the workers won't have the classes
>>>>> you need in your job). The easy way to go is to build a fat jar with your
>>>>> code and all the libs you depend on, and then use this utility to get the
>>>>> path: SparkContext.jarOfClass(YourJob.getClass)
>>>>>
>>>>>
>>>>> 2014/1/2 Aureliano Buendia <buendia360@gmail.com>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I do not understand why spark context has an option for loading jars
>>>>>> at runtime.
>>>>>>
>>>>>> As an example, consider this <https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>:
>>>>>>
>>>>>> object BroadcastTest {
>>>>>>   def main(args: Array[String]) {
>>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> This is *the* example, or *the* application that we want to run, so
>>>>>> what is SPARK_EXAMPLES_JAR supposed to be?
>>>>>> In this particular case, the BroadcastTest example is self-contained, so
>>>>>> why would it want to load other unrelated example jars?
>>>>>>
>>>>>> Finally, how does this help a real world spark application?
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
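
One detail of the BroadcastTest call quoted above: `System.getenv` returns null when a variable is not set, so `Seq(System.getenv("SPARK_EXAMPLES_JAR"))` becomes a one-element list containing null. A minimal sketch of a safer way to build the jar list follows; `JobJars` and `fromEnv` are hypothetical names, and the lookup function is a parameter only so the behaviour can be checked without touching the real environment.

```scala
object JobJars {
  // Returns the jar path stored in the named environment variable as a
  // one-element list, or an empty list when the variable is unset or blank.
  def fromEnv(name: String,
              lookup: String => Option[String] = sys.env.get): Seq[String] =
    lookup(name).map(_.trim).filter(_.nonEmpty).toList
}
```

The result could then be passed as the jar list when constructing the context, for example `JobJars.fromEnv("SPARK_EXAMPLES_JAR")` in place of the `Seq(...)` expression, assuming the rest of the call stays as quoted.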
