spark-user mailing list archives

From Madhu <>
Subject Re: Hadoop Writable and Spark serialization
Date Thu, 15 May 2014 01:02:02 GMT
I have done this kind of thing successfully using Hadoop serialization, e.g. by
having SessionContainer extend Writable and overriding write/readFields. I didn't
try Kryo.

It's fairly straightforward; I'll see if I can dig up the code if you really
need it.
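For reference, a minimal sketch of the write/readFields pattern being described. The field names are made up, and in a real job the class would declare `implements org.apache.hadoop.io.Writable`; here the method signatures match that interface but use plain java.io types, so the example compiles and runs without Hadoop on the classpath:

```java
import java.io.*;

// Sketch of the Writable pattern. A real class would implement
// org.apache.hadoop.io.Writable; the write/readFields signatures below match
// that interface, but only java.io is used so this runs without Hadoop.
// The fields (sessionId, eventCount) are hypothetical.
public class SessionContainer {
    private String sessionId = "";
    private long eventCount;

    public SessionContainer() {}        // Writable requires a no-arg constructor

    public SessionContainer(String sessionId, long eventCount) {
        this.sessionId = sessionId;
        this.eventCount = eventCount;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(sessionId);
        out.writeLong(eventCount);
    }

    // Deserialize in the same order, overwriting this instance's fields.
    public void readFields(DataInput in) throws IOException {
        sessionId = in.readUTF();
        eventCount = in.readLong();
    }

    public String getSessionId() { return sessionId; }
    public long getEventCount() { return eventCount; }

    public static void main(String[] args) throws IOException {
        // Round-trip one record through a byte buffer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new SessionContainer("abc123", 42L).write(new DataOutputStream(buf));

        SessionContainer copy = new SessionContainer();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy.getSessionId() + " " + copy.getEventCount());
    }
}
```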
I remember that I had to add a map transformation (or something to that
effect), since Hadoop sometimes reuses and mutates a previously returned
object rather than giving you a new one :-(
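That reuse behavior can be demonstrated without Hadoop: if a reader recycles one mutable record (as Hadoop's RecordReader may), holding references to it gives wrong results unless each value is copied out. A toy sketch, with hypothetical names:

```java
import java.util.*;

// Demonstrates why a defensive copy (e.g. via a map transformation) is
// needed: the "reader" below reuses one mutable record per iteration.
public class ObjectReuseDemo {
    static class Record { String value = ""; }

    public static void main(String[] args) {
        Record shared = new Record();       // one instance, reused per read
        String[] input = {"a", "b", "c"};

        List<Record> byReference = new ArrayList<>();
        List<String> byCopy = new ArrayList<>();

        for (String v : input) {
            shared.value = v;               // reader mutates the same object
            byReference.add(shared);        // BUG: three refs to one object
            byCopy.add(shared.value);       // copy the payload instead
        }

        // Every stored reference now reflects only the last value read.
        StringBuilder sb = new StringBuilder();
        for (Record r : byReference) sb.append(r.value).append(" ");
        System.out.println(sb.toString().trim());       // c c c
        System.out.println(String.join(" ", byCopy));   // a b c
    }
}
```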

Also, I don't think you need to parallelize sampledSessions in your code.
I think this will work:

   val sampledSessions = sc.sequenceFile[Text, SessionContainer](inputPath)
     .takeSample(false, 1000, 0)

How many small files are you getting?
I tend to think you will get one output file per partition, and the partition
count is usually not that high.
