From Ameet Kini
Subject Saving compressed sequence files
Date Mon, 26 Aug 2013 22:09:49 GMT
I'm trying to use saveAsSequenceFile to output compressed sequence files
where the "value" in each key/value pair is compressed. In Hadoop, I would
set this job configuration parameter:
"mapred.output.compression.type=RECORD" for record-level compression.
Previous posts have suggested that this is possible by simply setting this
parameter in core-site.xml. I tried doing just that, but the sequence
file doesn't seem to be compressed.
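
For reference, the core-site.xml entry described above would look roughly
like this. It is only a sketch of the single property named in this post;
the rest of the file is assumed to be unchanged.

```xml
<!-- core-site.xml fragment: only the property discussed in this post -->
<configuration>
  <property>
    <name>mapred.output.compression.type</name>
    <value>RECORD</value>
  </property>
</configuration>
```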

I've also tried doing this by setting
spark.hadoop.mapred.output.compression.type as a system parameter just
before initializing the spark context:
System.setProperty("spark.hadoop.mapred.output.compression.type", "RECORD")

In both cases, I can see that the resulting configuration, as reported by
SparkContext.hadoopConfiguration, has the property set to RECORD, but the
resulting sequence file still has its values uncompressed.
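
A minimal sketch of the system-property attempt and the configuration check
might look like the following. The package names assume Spark 0.8 or later
(earlier releases used the `spark` package), and the "local" master, the
application name, the sample data and the temporary output directory are all
hypothetical choices made for illustration.

```scala
// Assumes Spark 0.8+ package names; older releases used `spark.SparkContext`.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit conversions used by saveAsSequenceFile

object CompressionTypeCheck {
  def main(args: Array[String]): Unit = {
    // Set before the SparkContext is constructed, as described in the post.
    System.setProperty("spark.hadoop.mapred.output.compression.type", "RECORD")

    // "local" master and the application name are placeholders.
    val sc = new SparkContext("local", "compression-type-check")

    // Print the values that Spark hands on to Hadoop.
    val conf = sc.hadoopConfiguration
    println("mapred.output.compression.type = " + conf.get("mapred.output.compression.type"))
    println("io.compression.codecs = " + conf.get("io.compression.codecs"))

    // Write a tiny pair RDD to a fresh temporary directory (hypothetical location).
    val out = java.nio.file.Files.createTempDirectory("seqfile-check").resolve("out").toString
    sc.parallelize(Seq((1, "a"), (2, "b"))).saveAsSequenceFile(out)
    println("wrote sequence file to " + out)

    sc.stop()
  }
}
```

The header of each written part file records whether compression was applied,
so inspecting the output is a more direct check than reading the
configuration alone.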

At first, I thought this was because io.compression.codecs was set to
null, so I set io.compression.codecs to the long list of codecs that is its
normal default value in a Hadoop environment, but still to no avail. Am I
missing a crucial step?


