spark-user mailing list archives

From Chandeep Singh <...@chandeep.com>
Subject Re: Building Spark packages with SBT or Maven
Date Tue, 15 Mar 2016 12:15:33 GMT
Btw, just to add to the confusion ;) I use Maven as well, since I moved from Java to Scala, but
everyone I talk to has been recommending SBT for Scala.

I use the Eclipse Scala IDE to build: http://scala-ide.org/

Here is my sample POM. You can add dependencies based on your requirements.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>spark</groupId>
	<artifactId>scala</artifactId>
	<version>1.0</version>
	<name>${project.artifactId}</name>

	<properties>
		<maven.compiler.source>1.7</maven.compiler.source>
		<maven.compiler.target>1.7</maven.compiler.target>
		<encoding>UTF-8</encoding>
		<scala.version>2.10.4</scala.version>
		<maven-scala-plugin.version>2.15.2</maven-scala-plugin.version>
	</properties>

	<repositories>
		<repository>
			<id>cloudera-repo-releases</id>
			<url>https://repository.cloudera.com/artifactory/repo/</url>
		</repository>
	</repositories>

	<dependencies>
		<dependency>
			<groupId>org.scala-lang</groupId>
			<artifactId>scala-library</artifactId>
			<version>${scala.version}</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.10</artifactId>
			<version>1.5.0-cdh5.5.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-mllib_2.10</artifactId>
			<version>1.5.0-cdh5.5.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-hive_2.10</artifactId>
			<version>1.5.0-cdh5.5.1</version>
		</dependency>

	</dependencies>
	<build>
		<sourceDirectory>src/main/scala</sourceDirectory>
		<testSourceDirectory>src/test/scala</testSourceDirectory>
		<plugins>
			<plugin>
				<groupId>org.scala-tools</groupId>
				<artifactId>maven-scala-plugin</artifactId>
				<version>${maven-scala-plugin.version}</version>
				<executions>
					<execution>
						<goals>
							<goal>compile</goal>
							<goal>testCompile</goal>
						</goals>
					</execution>
				</executions>
				<configuration>
					<jvmArgs>
						<jvmArg>-Xms64m</jvmArg>
						<jvmArg>-Xmx1024m</jvmArg>
					</jvmArgs>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>1.6</version>
				<executions>
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<filters>
								<filter>
									<artifact>*:*</artifact>
									<excludes>
										<exclude>META-INF/*.SF</exclude>
										<exclude>META-INF/*.DSA</exclude>
										<exclude>META-INF/*.RSA</exclude>
									</excludes>
								</filter>
							</filters>
							<transformers>
								<transformer
									implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
									<mainClass>com.group.id.Launcher1</mainClass>
								</transformer>
							</transformers>
						</configuration>
					</execution>
				</executions>
			</plugin>
		</plugins>
	</build>

</project>
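
With this POM at the project root, build and submit is one command each. A minimal sketch: the jar name follows the artifactId/version above, com.group.id.Launcher1 is just the placeholder mainClass from the shade plugin configuration (substitute your own), and --master local[*] runs it locally, so point it at your cluster instead:

mvn clean package
${SPARK_HOME}/bin/spark-submit \
        --class com.group.id.Launcher1 \
        --master local[*] \
        target/scala-1.0.jar

The shade plugin builds a single uber jar, and the META-INF signature exclusions are there because signature files copied from signed dependency jars would otherwise make the merged jar fail verification at runtime.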


> On Mar 15, 2016, at 12:09 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Ok.
> 
> Sounds like opinion is divided :)
> 
> I will try to build a Scala app with Maven.
> 
> When I build with SBT I follow this directory structure:
> 
> The top-level directory is named after the application, e.g.
> 
> ImportCSV
> 
> Under ImportCSV I have a directory src and the sbt build file ImportCSV.sbt.
> 
> In directory src I have main and scala subdirectories. My Scala source,
> ImportCSV.scala, lives in
> 
> ImportCSV/src/main/scala
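> 
> In tree form:
> 
> ImportCSV/
> ├── ImportCSV.sbt
> └── src/
>     └── main/
>         └── scala/
>             └── ImportCSV.scala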
> 
> I then have a shell script that builds and runs everything under the ImportCSV directory:
> 
> cat generic.ksh
> #!/bin/ksh
> #--------------------------------------------------------------------------------
> #
> # Procedure:    generic.ksh
> #
> # Description:  Compiles and runs a Scala app using sbt and spark-submit
> #
> # Parameters:   none
> #
> #--------------------------------------------------------------------------------
> # Vers|  Date  | Who | DA | Description
> #-----+--------+-----+----+-----------------------------------------------------
> # 1.0 |04/03/15|  MT |    | Initial Version
> #--------------------------------------------------------------------------------
> #
> function F_USAGE
> {
>    echo "USAGE: ${1##*/} -A '<Application>'"
>    echo "USAGE: ${1##*/} -H '<HELP>' -h '<HELP>'"
>    exit 10
> }
> #
> # Main Section
> #
> if [[ "${1}" = "-h" || "${1}" = "-H" ]]; then
>    F_USAGE $0
> fi
> ## MAP INPUT TO VARIABLES
> while getopts A: opt
> do
>    case $opt in
>    (A) APPLICATION="$OPTARG" ;;
>    (*) F_USAGE $0 ;;
>    esac
> done
> [[ -z ${APPLICATION} ]] && print "You must specify an application value" && F_USAGE $0
> ENVFILE=/home/hduser/dba/bin/environment.ksh
> if [[ -f $ENVFILE ]]
> then
>         . $ENVFILE
>         . ~/spark_1.5.2_bin-hadoop2.6.kshrc
> else
>         echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
>         exit 1
> fi
> ##FILE_NAME=`basename $0 .ksh`
> FILE_NAME=${APPLICATION}
> CLASS=`echo ${FILE_NAME}|tr "[:upper:]" "[:lower:]"`
> NOW="`date +%Y%m%d_%H%M`"
> LOG_FILE=${LOGDIR}/${FILE_NAME}.log
> [ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
> print "\n" `date` ", Started $0" | tee -a ${LOG_FILE}
> cd ../${FILE_NAME}
> print "Compiling ${FILE_NAME}" | tee -a ${LOG_FILE}
> sbt package
> print "Submiiting the job" | tee -a ${LOG_FILE}
> 
> ${SPARK_HOME}/bin/spark-submit \
>                 --packages com.databricks:spark-csv_2.10:1.3.0 \
>                 --class "${FILE_NAME}" \
>                 --master spark://50.140.197.217:7077 \
>                 --executor-memory=12G \
>                 --executor-cores=12 \
>                 --num-executors=2 \
>                 target/scala-2.10/${CLASS}_2.10-1.0.jar
> print `date` ", Finished $0" | tee -a ${LOG_FILE}
> exit
> 
> 
> So to run it for ImportCSV all I need is to do
> 
> ./generic.ksh -A ImportCSV
> 
> Now can anyone kindly give me a rough guideline on the directory structure and
> the location of pom.xml to make this work using Maven?
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 15 March 2016 at 10:50, Sean Owen <sowen@cloudera.com> wrote:
> FWIW, I strongly prefer Maven over SBT even for Scala projects. The
> Spark build of reference is Maven.
> 
> On Tue, Mar 15, 2016 at 10:45 AM, Chandeep Singh <cs@chandeep.com> wrote:
> > For Scala, SBT is recommended.
> >
> > On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> >
> > Hi,
> >
> > I build my Spark/Scala packages using SBT, which works fine. I have created
> > generic shell scripts to build and submit them.
> >
> > Yesterday I noticed that some people use Maven and a POM for this purpose.
> >
> > Which approach is recommended?
> >
> > Thanks,
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> >
> >
> 
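
On Mich's layout question: Maven expects pom.xml at the top of the project, and with sourceDirectory set to src/main/scala as in the sample POM above, the tree mirrors the SBT one. A minimal sketch:

ImportCSV/
├── pom.xml
└── src/
    ├── main/
    │   └── scala/
    │       └── ImportCSV.scala
    └── test/
        └── scala/

Running mvn package from ImportCSV/ then drops the jar under target/ rather than SBT's target/scala-2.10/, so the spark-submit line in the script needs that one path change.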

