spark-dev mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Unable to Read/Write Avro RDD on cluster.
Date Thu, 05 Mar 2015 10:04:07 GMT
Here's a workaround:

- Download this jar
<http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar>
and add it to SPARK_CLASSPATH on all workers.
- Make sure the jar is present at the same path on every worker.
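Concretely, something like the following (the /home/dvasthimal paths are just the ones used in this thread; adjust to your install, and run the same steps on every worker):

```shell
# On each worker, fetch the hadoop2 build of avro-mapred into the same path:
wget http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar \
  -P /home/dvasthimal/spark/

# Then prepend it to SPARK_CLASSPATH (e.g. in conf/spark-env.sh) so it is
# found ahead of any hadoop1 Avro classes already on the classpath:
export SPARK_CLASSPATH=/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:$SPARK_CLASSPATH
```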

Thanks
Best Regards

On Thu, Mar 5, 2015 at 10:27 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com> wrote:

> I am trying to read an Avro RDD, transform it, and write the result.
> It runs fine locally, but when I run it on the cluster I see Avro-related
> failures.
>
>
> export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
> export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf"
> export HADOOP_CONF_DIR=/apache/hadoop/conf
> export YARN_CONF_DIR=/apache/hadoop/conf
> export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
> export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
> export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar
>
> cd $SPARK_HOME
>
> ./bin/spark-submit --master yarn-cluster --jars
> /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar
> --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores
> 1  --queue hdmi-spark --class com.company.ep.poc.spark.reporting.SparkApp
> /home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession
> subcommand=successevents
> outputdir=/user/dvasthimal/epdatasets/successdetail
>
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> 15/03/04 03:20:29 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to rm2
> 15/03/04 03:20:30 INFO yarn.Client: Got Cluster metric info from
> ApplicationsManager (ASM), number of NodeManagers: 2221
> 15/03/04 03:20:30 INFO yarn.Client: Queue info ... queueName: hdmi-spark,
> queueCurrentCapacity: 0.7162806, queueMaxCapacity: 0.08,
>       queueApplicationCount = 7, queueChildQueueCount = 0
> 15/03/04 03:20:30 INFO yarn.Client: Max mem capabililty of a single
> resource in this cluster 16384
> 15/03/04 03:20:30 INFO yarn.Client: Preparing Local resources
> 15/03/04 03:20:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/03/04 03:20:30 WARN hdfs.BlockReaderLocal: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
>
>
> 15/03/04 03:20:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token
> 7780745 for dvasthimal on 10.115.206.112:8020
> 15/03/04 03:20:46 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar to hdfs://
>
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark_reporting-1.0-SNAPSHOT.jar
> 15/03/04 03:20:47 INFO yarn.Client: Uploading
>
> file:/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
> to hdfs://
>
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark-assembly-1.0.2-hadoop2.4.1.jar
> 15/03/04 03:20:52 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar to hdfs://
>
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-mapred-1.7.7-hadoop2.jar
> 15/03/04 03:20:52 INFO yarn.Client: Uploading
> file:/home/dvasthimal/spark/avro-1.7.7.jar to hdfs://
>
> apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-1.7.7.jar
> 15/03/04 03:20:54 INFO yarn.Client: Setting up the launch environment
> 15/03/04 03:20:54 INFO yarn.Client: Setting up container launch context
> 15/03/04 03:20:54 INFO yarn.Client: Command for starting the Spark
> ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx4096m,
> -Djava.io.tmpdir=$PWD/tmp,
> -Dspark.app.name=\"com.company.ep.poc.spark.reporting.SparkApp\",
>  -Dlog4j.configuration=log4j-spark-container.properties,
> org.apache.spark.deploy.yarn.ApplicationMaster, --class,
> com.company.ep.poc.spark.reporting.SparkApp, --jar ,
> file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar,  --args
>  'startDate=2015-02-16'  --args  'endDate=2015-02-16'  --args
>  'epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession'  --args
>  'subcommand=successevents'  --args
>  'outputdir=/user/dvasthimal/epdatasets/successdetail' , --executor-memory,
> 2048, --executor-cores, 1, --num-executors , 3, 1>, <LOG_DIR>/stdout, 2>,
> <LOG_DIR>/stderr)
> 15/03/04 03:20:54 INFO yarn.Client: Submitting application to ASM
> 15/03/04 03:20:54 INFO impl.YarnClientImpl: Submitted application
> application_1425075571333_61948
> 15/03/04 03:20:56 INFO yarn.Client: Application report from ASM:
>  application identifier: application_1425075571333_61948
>  appId: 61948
>  clientToAMToken: null
>  appDiagnostics:
>  appMasterHost: N/A
>  appQueue: hdmi-spark
>  appMasterRpcPort: -1
>  appStartTime: 1425464454263
>  yarnAppState: ACCEPTED
>  distributedFinalState: UNDEFINED
>  appTrackingUrl:
>
> https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
>  appUser: dvasthimal
> 15/03/04 03:21:18 INFO yarn.Client: Application report from ASM:
>  application identifier: application_1425075571333_61948
>  appId: 61948
>  clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service:  }
>  appDiagnostics:
>  appMasterHost: phxaishdc9dn0169.phx.company.com
>  appQueue: hdmi-spark
>  appMasterRpcPort: 0
>  appStartTime: 1425464454263
>  yarnAppState: RUNNING
>  distributedFinalState: UNDEFINED
>  appTrackingUrl:
>
> https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
>  appUser: dvasthimal
> ….
> ….
> 15/03/04 03:21:22 INFO yarn.Client: Application report from ASM:
>  application identifier: application_1425075571333_61948
>  appId: 61948
>  clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service:  }
>  appDiagnostics:
>  appMasterHost: phxaishdc9dn0169.phx.company.com
>  appQueue: hdmi-spark
>  appMasterRpcPort: 0
>  appStartTime: 1425464454263
>  yarnAppState: FINISHED
>  distributedFinalState: FAILED
>  appTrackingUrl:
>
> https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/A
>  appUser: dvasthimal
>
>
>
> The AM failed with the following exception:
>
> /apache/hadoop/bin/yarn logs -applicationId application_1425075571333_61948
> 15/03/04 03:21:22 INFO NewHadoopRDD: Input split: hdfs://
>
> apollo-phx-nn.company.com:8020/user/dvasthimal/epdatasets_small/exptsession/2015/02/16/part-r-00000.avro:0+13890
> 15/03/04 03:21:22 ERROR Executor: Exception in task ID 3
> java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
> at
>
> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> 1) Having figured out the error, the fix should be to put the right version
> of the Avro libs on the AM JVM classpath. Hence I included --jars
> /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar
> in the spark-submit command. However, I still see the same exception.
> 2) I also tried including these libs in SPARK_CLASSPATH, but I still see the
> same exception.
>
>
> --
> Deepak
>
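For context on the quoted failure: "Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected" typically means a hadoop1 build of avro-mapred (where TaskAttemptContext was a class) is winning over the hadoop2 build somewhere on the classpath, so passing the hadoop2 jar via --jars alone may not help if an older copy loads first. One thing to try is forcing the hadoop2 Avro jars to the front of both classpaths. A sketch, reusing the paths and arguments from the quoted command; whether classpath ordering alone resolves it depends on what the Spark assembly itself bundles, so treat this as a guess rather than a confirmed fix:

```shell
# Put the hadoop2 Avro jars FIRST in SPARK_CLASSPATH so they shadow any
# hadoop1 avro-mapred classes pulled in by the Hadoop jars:
export SPARK_CLASSPATH=/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar

# Also pin the driver/AM classpath explicitly on spark-submit:
./bin/spark-submit --master yarn-cluster \
  --driver-class-path /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar \
  --jars /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar \
  --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 \
  --queue hdmi-spark --class com.company.ep.poc.spark.reporting.SparkApp \
  /home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar \
  startDate=2015-02-16 endDate=2015-02-16 subcommand=successevents \
  epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession \
  outputdir=/user/dvasthimal/epdatasets/successdetail
```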
