spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Reading from Amazon S3
Date Tue, 03 May 2016 10:52:21 GMT

I'm going to start by letting you know about two secret tools we use for diagnosing faults; one is big data at work, the other a large RDBMS behind a web UI:


1. Google
2. The search field in Apache JIRA



Given this is a senior project, these foundational tools are something you are going to need
to know. They are a lot faster than asking on the mailing list and give you real details, especially
in JIRA, including fixes.

For this problem, I would recommend taking the first line of the stack trace and sticking it
into Google. See what turns up: it may actually be the answer you were looking for.
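
Here that first line is a NoSuchMethodError on TransferManager, which generally means the aws-java-sdk JAR on the classpath doesn't match the one hadoop-aws was compiled against. One quick check (a hypothetical diagnostic sketch, not part of Spark or Hadoop) is to ask the JVM where each of the colliding classes was actually loaded from:

// Hypothetical diagnostic: print the JAR each class came from, to spot
// an aws-java-sdk / hadoop-aws version mismatch on the classpath.
for (String name : new String[] {
        "com.amazonaws.services.s3.transfer.TransferManager",
        "org.apache.hadoop.fs.s3a.S3AFileSystem" }) {
    try {
        Class<?> c = Class.forName(name);
        System.out.println(name + " -> "
            + c.getProtectionDomain().getCodeSource().getLocation());
    } catch (ClassNotFoundException e) {
        System.out.println(name + " -> not on classpath");
    }
}

If the two JARs come from different releases than the ones bundled with your Spark build, that mismatch is the bug.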



On 2 May 2016, at 15:37, Jinan Alhajjaj <j.r.alhajjaj@hotmail.com> wrote:


Because I am doing this for my senior project, using Java.
I tried an s3a URI like this:

s3a://accessId:secret@bucket/path

It shows me this error:
Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V



at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:526)
________________________________
Date: Thu, 28 Apr 2016 11:19:08 +0100
Subject: Re: Reading from Amazon S3
From: gourav.sengupta@gmail.com
To: stevel@hortonworks.com
CC: yuzhihong@gmail.com; j.r.alhajjaj@hotmail.com; user@spark.apache.org

Why would you use JAVA (create a problem and then try to solve it)? Have you tried using Scala
or Python or even R?

Regards,
Gourav

On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran <stevel@hortonworks.com> wrote:

On 26 Apr 2016, at 18:49, Ted Yu <yuzhihong@gmail.com> wrote:

Looking at the cause of the error, it seems hadoop-aws-xx.jar (corresponding to the version
of Hadoop you use) was missing from the classpath.

Yes: s3n was moved from hadoop-common to the new hadoop-aws module, and without anyone realising
it, that broke a lot of things.

you'll need hadoop-aws and jets3t on the classpath
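
Once those are on the classpath, it's cleaner to supply the credentials through the Hadoop configuration than to embed them in the URI. A minimal sketch (the property names are the standard Hadoop s3n keys; the values are placeholders):

SparkConf conf = new SparkConf().setAppName("s3n-example").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
// Standard Hadoop s3n credential properties; replace the placeholder values.
sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");
JavaRDD<String> lines = sc.textFile("s3n://your-bucket/path/to/file");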

If you are using Hadoop 2.7, I'd recommend s3a instead. That means putting hadoop-aws, plus
the exact same amazon-sdk that is bundled with the Hadoop binaries your version of Spark is
built against, on the classpath (the SDK is a bit brittle API-wise).
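
The s3a equivalent is the same configuration pattern with the s3a property names (again, placeholder values):

// Standard Hadoop 2.7 s3a credential properties; replace the placeholders.
sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
JavaRDD<String> lines = sc.textFile("s3a://your-bucket/path/to/file");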


FYI

On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj <j.r.alhajjaj@hotmail.com> wrote:
Hi All,
I am trying to read a file stored in Amazon S3.
I wrote this code:
import java.util.List;
import java.util.Scanner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class WordAnalysis {
    public static void main(String[] args) {
        int startYear = 0;
        int endYear = 0;
        Scanner input = new Scanner(System.in);
        System.out.println("Please enter 1 for the full range 1500-2008, or 2 for a specific range:");
        int choice = input.nextInt();

        if (choice == 1) {
            startYear = 1500;
            endYear = 2008;
        }
        if (choice == 2) {
            System.out.print("Please enter the start year: ");
            startYear = input.nextInt();
            System.out.print("Please enter the end year: ");
            endYear = input.nextInt();
        }

        SparkConf conf = new SparkConf().setAppName("jinantry").setMaster("local");
        JavaSparkContext spark = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(spark);

        // Each line of the 1-gram file is: word TAB year TAB match_count TAB volume_count.
        // Items is a simple JavaBean (word, year, count, volume) defined elsewhere.
        JavaRDD<Items> ngram = spark
            .textFile("s3n://google-books-ngram/1gram/googlebooks-eng-all-1gram-20120701-x.gz")
            .map(new Function<String, Items>() {
                public Items call(String line) throws Exception {
                    String[] parts = line.split("\t");
                    Items item = new Items();
                    if (parts.length == 4) {
                        item.setWord(parts[0]);
                        item.setYear(Integer.parseInt(parts[1]));
                        item.setCount(Integer.parseInt(parts[2]));
                        item.setVolume(Integer.parseInt(parts[3]));
                    } else {
                        // Malformed line: fall back to defaults rather than
                        // Integer.parseInt(" "), which throws NumberFormatException.
                        item.setWord(" ");
                        item.setYear(0);
                        item.setCount(0);
                        item.setVolume(0);
                    }
                    return item;
                }
            });

        DataFrame schemangram = sqlContext.createDataFrame(ngram, Items.class);
        schemangram.registerTempTable("ngram");
        String sql = "SELECT word, SUM(count) FROM ngram WHERE year >= " + startYear
            + " AND year <= " + endYear
            + " AND word LIKE '%_NOUN' GROUP BY word ORDER BY SUM(count) DESC";
        DataFrame matchyear = sqlContext.sql(sql);

        // Print the top ten rows of the result.
        List<Row> words = matchyear.collectAsList();
        int i = 1;
        for (Row scholar : words) {
            System.out.println(scholar);
            if (i == 10)
                break;
            i++;
        }
    }
}


When I run it, this error appears:
Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree:
Exchange rangepartitioning(aggOrder#5L DESC,200), None
+- ConvertToSafe
   +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as bigint)),mode=Final,isDistinct=false)],
output=[word#2,_c1#4L,aggOrder#5L])
      +- TungstenExchange hashpartitioning(word#2,200), None
         +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as bigint)),mode=Partial,isDistinct=false)],
output=[word#2,sum#8L])
            +- Project [word#2,count#0]
               +- Filter (((year#3 >= 1500) && (year#3 <= 1600)) &&
word#2 LIKE %_NOUN)
                  +- Scan ExistingRDD[count#0,volume#1,word#2,year#3]

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1637)
at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1634)
at org.apache.spark.sql.DataFrame$$anonfun$collectAsList$1$$anonfun$apply$12.apply(DataFrame.scala:1493)
at org.apache.spark.sql.DataFrame$$anonfun$collectAsList$1$$anonfun$apply$12.apply(DataFrame.scala:1493)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame$$anonfun$collectAsList$1.apply(DataFrame.scala:1492)
at org.apache.spark.sql.DataFrame$$anonfun$collectAsList$1.apply(DataFrame.scala:1491)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.collectAsList(DataFrame.scala:1491)
at WordAnalysis.main(WordAnalysis.java:60)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as bigint)),mode=Final,isDistinct=false)],
output=[word#2,_c1#4L,aggOrder#5L])
+- TungstenExchange hashpartitioning(word#2,200), None
   +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as bigint)),mode=Partial,isDistinct=false)],
output=[word#2,sum#8L])
      +- Project [word#2,count#0]
         +- Filter (((year#3 >= 1500) && (year#3 <= 1600)) && word#2
LIKE %_NOUN)
            +- Scan ExistingRDD[count#0,volume#1,word#2,year#3]

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:164)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 33 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange hashpartitioning(word#2,200), None
+- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as bigint)),mode=Partial,isDistinct=false)],
output=[word#2,sum#8L])
   +- Project [word#2,count#0]
      +- Filter (((year#3 >= 1500) && (year#3 <= 1600)) && word#2 LIKE
%_NOUN)
         +- Scan ExistingRDD[count#0,volume#1,word#2,year#3]

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:86)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:80)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 47 more
Caused by: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 55 more
Could anyone help me with this?
Thank you
