spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Turning kryo on does not decrease binary output
Date Fri, 03 Jan 2014 19:10:55 GMT
saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions, whose
methods become available through some Scala magic (an implicit conversion)
when you have an RDD of key-value pairs, i.e. an RDD[(Key, Value)].

https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648
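
For anyone unfamiliar with the pattern, here is a minimal, Spark-free sketch
of how such an implicit conversion works. MiniRDD, PairFunctions and keys are
hypothetical names made up for illustration; they are not Spark's classes. In
Spark itself the real conversion is brought into scope with
`import org.apache.spark.SparkContext._`.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for RDD[T]; not Spark's class.
class MiniRDD[T](val items: Seq[T])

// Extra methods that only make sense for key/value collections
// (the role PairRDDFunctions plays in Spark).
class PairFunctions[K, V](self: MiniRDD[(K, V)]) {
  def keys: Seq[K] = self.items.map(_._1)
}

object MiniRDD {
  // Implicit conversion: applies only when the element type is a pair.
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): PairFunctions[K, V] =
    new PairFunctions(rdd)
}

object ImplicitDemo {
  val pairs = new MiniRDD(Seq("a" -> 1, "b" -> 2))
  val pairKeys: Seq[String] = pairs.keys // compiles: elements are tuples

  val plain = new MiniRDD(Seq(1, 2, 3))
  // plain.keys  // would not compile: MiniRDD[Int] has no pair conversion
}
```

This is why the save methods seem to be "missing" on a plain RDD[T]: they only
appear once the element type is a tuple and the conversion is in scope.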

Agreed, something like Chill would make this much easier for the default
cases.
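
To get a rough feel for the Java serialization overhead discussed in the
quoted messages, here is a small sketch that needs only the JDK. Point is a
hypothetical record type chosen for the measurement, and the result applies
to a single standalone object, not to what any particular Spark save method
writes.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical record type used only for this measurement.
case class Point(x: Int, y: Int)

object SerializationOverhead {
  // Serialize one object with plain Java serialization and return the bytes.
  def javaSerialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  val rawPayloadBytes = 8 // two 4-byte Ints
  val serializedBytes = javaSerialize(Point(1, 2)).length
}
```

Comparing serializedBytes with rawPayloadBytes shows how much of the stream
is header and class description as opposed to field data. A serializer that
registers classes up front, as Kryo does, avoids writing most of that
description.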


On Fri, Jan 3, 2014 at 2:04 PM, Aureliano Buendia <buendia360@gmail.com> wrote:

> RDD only defines saveAsTextFile and saveAsObjectFile. I think
> saveAsHadoopFile and saveAsNewAPIHadoopFile belong to the older versions.
>
> saveAsObjectFile definitely outputs hadoop format.
>
> I'm not trying to save big objects by saveAsObjectFile, I'm just trying to
> minimize the java serialization overhead when saving to a binary file.
>
> I can see spark can benefit from something like
> https://github.com/twitter/chill in this matter.
>
>
> On Fri, Jan 3, 2014 at 6:42 PM, Guillaume Pitel <
> guillaume.pitel@exensa.com> wrote:
>
>>  Hi,
>>
>> After a little bit of thinking, I'm no longer sure whether
>> saveAsObjectFile uses the spark.hadoop.* properties.
>>
>> Also, I made a mistake earlier. The choice between *.mapred.* and
>> *.mapreduce.* does not depend on the Hadoop version you use, but on the
>> API version you use.
>>
>> So, I can assure you that if you use the saveAsNewAPIHadoopFile, with the
>> spark.hadoop.mapreduce.* properties, the compression will be used.
>>
>> If you use the saveAsHadoopFile, it should be used with mapred.*
>>
>> If you use the saveAsObjectFile to a hdfs path, I'm not sure if the
>> output is compressed.
>>
>> Anyway, saveAsObjectFile should be used for small objects, in my opinion.
>>
>> Guillaume
>>
>>   Even
>>
>> someMap.saveAsTextFile("out", classOf[GzipCodec])
>>
>>  has no effect.
>>
>>  Also, I noticed that saving sequence files has no compression option (my
>> original question was about compressing binary output).
>>
>>  Having said this, I still do not understand why kryo cannot be helpful
>> when saving binary output. Binary output uses java serialization, which has
>> a pretty hefty overhead.
>>
>>  How can kryo be applied to T when calling RDD[T]#saveAsObjectFile()?
>>
>>
>> --
>>  *Guillaume PITEL, Président*
>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>
>>  eXenSa S.A.S. <http://www.exensa.com/>
>>  41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>
>
>
