Spark uses the Twitter Chill library, which registers a bunch of core Scala and Java classes by default.  I'm assuming that java.util.Date is automatically registered by that, but Joda's DateTime is not.  We could always take a look through the source to confirm too.

As for the class name, my understanding was that it would be written at the start of every serialized object, not just once.  I did some tests at one point to confirm that, but my memory of them is a little fuzzy, so I won't say definitely that the class name is repeated.  Can you look at the Kryo-serialized version of the classes at some point to see what actually happens?

I did not register anything explicitly, based on the belief that the class name is written out in full only once. I also wondered why that problem would be specific to Joda-Time and not show up with java.util.Date. I guess it is possible, based on the internals of Joda-Time.
If I remove DateTime from my RDD, the problem goes away.
I will try explicit registration (and add DateTime back to my RDD) and see if that makes things better.


The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself.  The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory.  Rather than OOM'ing, Spark writes the data out to disk in a sorted order and reads it back from disk later on when it's needed.  That's the job of the ExternalAppendOnlyMap.
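
To make the spill-and-merge idea concrete, here is a toy sketch in Python. It is not Spark's implementation: the class name SpillingCounter, the key-count threshold, and the use of pickle and temp files are all made up for illustration. It only shows the general pattern of writing sorted runs to disk when memory fills up and merging them back later.

```python
import heapq
import itertools
import pickle
import tempfile


class SpillingCounter:
    """Toy sum-by-key aggregator (hypothetical, not Spark code)."""

    def __init__(self, max_keys_in_memory):
        self.max_keys = max_keys_in_memory
        self.current = {}   # in-memory map of key -> running sum
        self.spills = []    # temp files, each holding one sorted run

    def insert(self, key, value):
        self.current[key] = self.current.get(key, 0) + value
        if len(self.current) > self.max_keys:
            self._spill()

    def _spill(self):
        # Write the in-memory map to disk in sorted key order, then clear it.
        f = tempfile.TemporaryFile()
        pickle.dump(sorted(self.current.items()), f)
        self.spills.append(f)
        self.current = {}

    def results(self):
        # Read every sorted run back and merge them with the in-memory data.
        runs = []
        for f in self.spills:
            f.seek(0)
            runs.append(pickle.load(f))
        runs.append(sorted(self.current.items()))
        merged = heapq.merge(*runs)  # stays sorted by key across all runs
        return {
            key: sum(v for _, v in group)
            for key, group in itertools.groupby(merged, key=lambda kv: kv[0])
        }
```

Every spill costs a write and a later read, which is why a job that spills repeatedly feels slow even though the spilling itself is what keeps it from running out of memory.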

I wouldn't normally expect a conversion from Date to a Joda DateTime to take significantly more memory.  But since you're using Kryo, and classes should be registered with it, you may have forgotten to register DateTime with Kryo.  If you don't register a class, Kryo writes the class name at the beginning of every serialized instance. For DateTime objects, which are roughly the size of one long, that is a ton of extra space and very inefficient.
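
A rough back-of-the-envelope sketch of that overhead, in Python. The assumptions are mine: the payload is taken as 8 bytes (one long), the class name is counted as plain UTF-8 bytes, and Kryo's real framing (variable-length integers, IDs for registered classes, any extra fields inside DateTime) is ignored. The helper estimated_bytes is hypothetical and only compares the two cases discussed in this thread: name written with every instance versus name written once.

```python
CLASS_NAME = "org.joda.time.DateTime"  # fully qualified class name
PAYLOAD_BYTES = 8  # assumption: a DateTime is roughly one long


def estimated_bytes(num_records, name_per_record):
    """Estimate serialized size under the simplified model described above."""
    name_bytes = len(CLASS_NAME.encode("utf-8"))
    if name_per_record:
        # Class name repeated in front of every serialized instance.
        return num_records * (name_bytes + PAYLOAD_BYTES)
    # Class name written in full only once, then payloads only.
    return name_bytes + num_records * PAYLOAD_BYTES


if __name__ == "__main__":
    n = 1_000_000
    repeated = estimated_bytes(n, name_per_record=True)
    once = estimated_bytes(n, name_per_record=False)
    print(repeated, once, round(repeated / once, 2))
```

Under this model the name alone is 22 bytes, so repeating it nearly quadruples the size of each 8-byte value. Whether Kryo really repeats it in your setup is the thing to check in the serialized output.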

Can you confirm that DateTime is registered with Kryo?

I changed my application to use Joda time instead of java.util.Date and I started getting this:

WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1 time so far)

What does this mean? How can I fix this? Because of this, a small job takes forever.


P.S.: I am using Kryo serialization and have played around with several values of sparkRddMemFraction.