spark-user mailing list archives

From Ameet Kini <ameetk...@gmail.com>
Subject Re: Saving compressed sequence files
Date Thu, 29 Aug 2013 11:27:51 GMT
Thanks Reynold, I'll take a look.

Ameet


On Thu, Aug 29, 2013 at 12:47 AM, Reynold Xin <rxin@cs.berkeley.edu> wrote:

> I don't think it's a system property.
>
> There is support for adding compression to the save function in the latest
> 0.8 code:
> https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/PairRDDFunctions.scala#L609
>
> You can take a look at how that is done.
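>
> For context, a minimal sketch of how that save path could be used (the exact
> signature here is an assumption based on the 0.8 code and may differ between
> versions; the codec and output path are placeholders):
>
>     import org.apache.hadoop.io.compress.GzipCodec
>
>     // Hypothetical usage: hand saveAsSequenceFile a compression codec class.
>     val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
>     pairs.saveAsSequenceFile("hdfs:///tmp/seq-compressed", Some(classOf[GzipCodec]))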
>
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
>
> On Wed, Aug 28, 2013 at 6:56 AM, Ameet Kini <ameetkini@gmail.com> wrote:
>
>> Folks,
>>
>> Still stuck on this, so I would greatly appreciate any pointers on how
>> to force Spark to recognize the mapred.output.compression.type Hadoop
>> parameter.
>>
>> Thanks,
>> Ameet
>>
>>
>> On Mon, Aug 26, 2013 at 6:09 PM, Ameet Kini <ameetkini@gmail.com> wrote:
>>
>>>
>>> I'm trying to use saveAsSequenceFile to output compressed sequence
>>> files where the value in each key/value pair is compressed. In Hadoop, I
>>> would set the job configuration parameter
>>> "mapred.output.compression.type=RECORD" for record-level compression.
>>> Previous posts have suggested that this is possible by simply setting this
>>> parameter in core-site.xml. I tried doing just that, but the resulting
>>> sequence file doesn't seem to be compressed.
>>>
>>> I've also tried setting
>>> spark.hadoop.mapred.output.compression.type as a system property just
>>> before initializing the SparkContext:
>>> System.setProperty("spark.hadoop.mapred.output.compression.type",
>>> "RECORD")
>>>
>>> In both cases, I can see that the resulting configuration (as per
>>> SparkContext.hadoopConfiguration) has the property set to RECORD, but the
>>> resulting sequence file still has its values uncompressed.
>>>
>>> At first, I thought this was because io.compression.codecs was set
>>> to null, so I set io.compression.codecs to the long list of codecs that is
>>> its normal default value in a Hadoop environment, but still to no avail. Am
>>> I missing a crucial step?
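>>>
>>> One workaround I'm considering (a sketch only, assuming saveAsHadoopFile
>>> accepts an explicit JobConf; the RDD element types and the output path
>>> below are made up for illustration) is to put the compression settings on
>>> a JobConf directly and save through the Hadoop output format:
>>>
>>>     import org.apache.hadoop.io.{IntWritable, Text}
>>>     import org.apache.hadoop.io.SequenceFile.CompressionType
>>>     import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}
>>>
>>>     // Configure record-level compression on the JobConf itself.
>>>     val jobConf = new JobConf(sc.hadoopConfiguration)
>>>     jobConf.setBoolean("mapred.output.compress", true)
>>>     jobConf.set("mapred.output.compression.type", CompressionType.RECORD.toString)
>>>
>>>     // rdd is assumed to be an RDD[(Text, IntWritable)].
>>>     rdd.saveAsHadoopFile("hdfs:///tmp/seq-record-compressed",
>>>       classOf[Text], classOf[IntWritable],
>>>       classOf[SequenceFileOutputFormat[Text, IntWritable]], jobConf)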
>>>
>>> Thanks,
>>> Ameet
>>>
>>
>>
>
