spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Spark output compression on HDFS
Date Wed, 02 Apr 2014 22:00:19 GMT
Is this a Scala-only <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile> feature?


On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> For text files I believe we overload saveAsTextFile and let you set a codec directly:
>
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>
> For saveAsSequenceFile, yep, I think Mark is right: you need an Option.
>
>
> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>
>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>
>> The signature is 'def saveAsSequenceFile(path: String, codec:
>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>> Class, not an Option[Class].
>>
>> Try counts.saveAsSequenceFile(output,
>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>
>>
>>
>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev <
>> kudryavtsev.konstantin@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>>
>>> I've started using Spark recently and am evaluating possible use cases in
>>> our company.
>>>
>>> I'm trying to save an RDD as a compressed sequence file. I'm able to save
>>> an uncompressed file by calling:
>>>
>>> counts.saveAsSequenceFile(output)
>>>
>>> where counts is my RDD (IntWritable, Text). However, I haven't managed to
>>> compress the output. I tried several variants and always got a type mismatch error:
>>>
>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>> <console>:21: error: type mismatch;
>>>  found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>               counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>
>>>  counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>> <console>:21: error: type mismatch;
>>>  found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>               counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>
>>> and it doesn't work even for Gzip:
>>>
>>>  counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>> <console>:21: error: type mismatch;
>>>  found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>               counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>
>>> Could you please suggest a solution? Also, I couldn't find how to
>>> specify compression parameters (e.g. the compression type for
>>> Snappy). Could you share code snippets for writing and reading an
>>> RDD with compression?
>>>
>>> Thank you in advance,
>>> Konstantin Kudryavtsev
>>>
>>
>>
>
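
For reference, here is a minimal sketch that pulls together the two calls
discussed in this thread: saveAsSequenceFile with the codec wrapped in Some(...),
and the saveAsTextFile overload that takes the codec class directly. Several
things here are assumptions, not part of the original messages: the object
name CompressedOutputSketch and the helper roundTrip are hypothetical, the job
runs against a local master and writes to a temporary directory, and the RDD
holds plain (Int, String) pairs instead of (IntWritable, Text). It also assumes
a Spark release where the pair-RDD implicits are in scope by default (older
releases need `import org.apache.spark.SparkContext._`). GzipCodec is used
because, depending on the Hadoop version, SnappyCodec may need native libraries.

```scala
import java.nio.file.Files

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

object CompressedOutputSketch {

  // Writes a small pair RDD as a compressed sequence file and as compressed
  // text under baseDir, then reads the sequence file back.
  def roundTrip(sc: SparkContext, baseDir: String): Seq[(Int, String)] = {
    val counts = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

    // saveAsSequenceFile expects Option[Class[_ <: CompressionCodec]],
    // so the codec class is wrapped in Some(...).
    counts.saveAsSequenceFile(s"$baseDir/seq", Some(classOf[GzipCodec]))

    // saveAsTextFile has an overload that takes the codec class directly.
    counts
      .map { case (k, v) => s"$k\t$v" }
      .saveAsTextFile(s"$baseDir/text", classOf[GzipCodec])

    // The codec is recorded in the sequence file header, so no codec is
    // passed when reading. Hadoop reuses Writable instances, so convert
    // to plain values before collecting.
    sc.sequenceFile(s"$baseDir/seq", classOf[IntWritable], classOf[Text])
      .map { case (k, v) => (k.get, v.toString) }
      .collect()
      .sorted
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("compressed-output-sketch")
    val sc = new SparkContext(conf)
    try {
      val baseDir = Files.createTempDirectory("compressed-output").toString
      println(roundTrip(sc, baseDir))
    } finally {
      sc.stop()
    }
  }
}
```

Swapping in another Hadoop codec should only require changing the class passed
to classOf, provided that codec is available on the cluster.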
