spark-user mailing list archives

From shenyan zhen <shenya...@gmail.com>
Subject read compressed hdfs files using SparkContext.textFile?
Date Tue, 08 Sep 2015 19:13:23 GMT
Hi,

For HDFS files written with the code below:

rdd.saveAsTextFile(getHdfsPath(...), classOf[org.apache.hadoop.io.compress.GzipCodec])


I can see the HDFS files that were generated:


0      /lz/streaming/am/1441734600000/_SUCCESS

1.6 M  /lz/streaming/am/1441734600000/part-00000.gz

1.6 M  /lz/streaming/am/1441734600000/part-00001.gz

1.6 M  /lz/streaming/am/1441734600000/part-00002.gz

...


How do I read it using SparkContext?


My naive attempt:

val t1 = sc.textFile("/lz/streaming/am/1441734600000")

t1.take(1).head

did not work:


org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/lz/streaming/am/1441734600000

at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
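
Looking at the exception, the path seems to be resolving against the local filesystem (note the file:/ scheme in the message) rather than HDFS. Would passing an explicit hdfs:// URI be the right fix? Something like the sketch below, where "namenode:8020" is just a placeholder for the actual NameNode host/port:

```scala
// Sketch only: "namenode:8020" is a hypothetical NameNode address.
// As far as I understand, textFile picks up the .gz part files and
// decompresses them transparently via Hadoop's registered codecs,
// so no codec argument is needed on the read side.
val t1 = sc.textFile("hdfs://namenode:8020/lz/streaming/am/1441734600000")
t1.take(1).head
```

Or is the cleaner approach to set fs.defaultFS in the Hadoop configuration so that bare paths like "/lz/streaming/am/..." resolve to HDFS by default?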


Thanks,

Shenyan
