spark-user mailing list archives

From Philip Ogren <>
Subject unable to serialize analytics pipeline
Date Tue, 22 Oct 2013 17:50:10 GMT

I have a text analytics pipeline that performs a sequence of steps (e.g. 
tokenization, part-of-speech tagging, etc.) on a line of text.  I have 
wrapped the whole pipeline up into a simple interface that allows me to 
call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass 
it a string, and get back some objects.  Now, I would like to do the 
same thing for items in a Spark RDD via a map transformation.  
Unfortunately, my pipeline is not serializable and so I get a 
NotSerializableException when I try this.  I played around with Kryo 
just now to see if that could help and I ended up with a "missing no-arg 
constructor" exception on a class I have no control over.  It seems the 
Spark framework expects that I should be able to serialize my pipeline 
when I can't (or at least don't think I can at first glance).
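For what it's worth, the failure can be reproduced without Spark at all: by default Spark serializes the closure you pass to map with plain Java serialization, so anything the closure captures must implement java.io.Serializable. A minimal sketch of that check (the Pipeline class below is a hypothetical stand-in for the real pipeline, not the actual API):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the real analytics pipeline:
// it does not implement Serializable and has no no-arg constructor.
class Pipeline(modelPath: String) {
  def annotate(line: String): Seq[String] =
    line.split("\\s+").toSeq // placeholder for tokenization, tagging, etc.
}

object SerializationCheck {
  // Returns true if the object survives Java serialization, false otherwise.
  def isJavaSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Running isJavaSerializable(new Pipeline("model")) fails the same way the Spark job does, which at least confirms the problem is the pipeline object itself being captured by the closure, not something Spark-specific.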

Is there a workaround for this scenario?  I am imagining a few possible 
solutions that seem a bit dubious to me, so I thought I would ask for 
direction before wandering about.  Perhaps a better understanding of 
serialization strategies might help me get the pipeline to serialize.  
Or perhaps there is a way to instantiate my pipeline on demand on the 
nodes through a factory call.
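The on-demand idea is a common workaround: keep the pipeline out of the closure entirely and construct it lazily once per executor JVM, for example through a singleton object. A minimal sketch, again using a hypothetical Pipeline class and model path:

```scala
// Hypothetical pipeline class; it never needs to be serializable.
class Pipeline(modelPath: String) {
  def annotate(line: String): Seq[String] =
    line.split("\\s+").toSeq // placeholder for the real analysis steps
}

// One instance per JVM: the lazy val is initialized on first access on each
// executor, so the Pipeline object is never shipped from the driver.
object PipelineHolder {
  lazy val pipeline: Pipeline = new Pipeline("/models/en") // hypothetical path
}

// In the Spark job, the closure only references the singleton statically,
// so nothing non-serializable is captured:
//   rdd.map(line => PipelineHolder.pipeline.annotate(line))
```

An alternative with the same effect is mapPartitions, which builds one pipeline per partition instead of per JVM: rdd.mapPartitions { it => val p = new Pipeline("/models/en"); it.map(p.annotate) }. Either way the construction happens on the worker, which is exactly the "factory call on the nodes" described above.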

Any advice is appreciated.

