spark-user mailing list archives

From: Philip Ogren <>
Subject: Re: unable to serialize analytics pipeline
Date: Tue, 22 Oct 2013 18:51:24 GMT
A simple workaround that seems to work (at least in local mode) is 
to mark my top-level pipeline object (inside my simple interface) as 
transient and add an initialize method.  In the method that calls the 
pipeline and returns the results, I simply call the initialize method if 
needed (i.e., if the pipeline object is null).  This seems reasonable to 
me.  I will try it on an actual cluster next....
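
In case it is useful, here is a minimal sketch of what I mean (Pipeline 
and annotate here are placeholders for my real classes, which I can't 
share):

    // Stand-in for my real pipeline class, which is not serializable.
    class Pipeline {
      def annotate(line: String): Seq[String] = line.split("\\s+").toSeq
    }

    // Serializable wrapper that ships to the workers without the pipeline.
    class PipelineWrapper extends Serializable {
      // Excluded from serialization; rebuilt lazily on each worker.
      @transient private var pipeline: Pipeline = null

      private def initialize(): Unit = {
        pipeline = new Pipeline()
      }

      def process(line: String): Seq[String] = {
        if (pipeline == null) initialize()  // null after deserialization
        pipeline.annotate(line)
      }
    }

    // Usage: rdd is an RDD[String] of input lines.
    val wrapper = new PipelineWrapper()
    val results = rdd.map(wrapper.process)

I believe a @transient lazy val would accomplish the same thing more 
concisely, since a transient lazy val is re-evaluated on first access 
after deserialization.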


On 10/22/2013 11:50 AM, Philip Ogren wrote:
> I have a text analytics pipeline that performs a sequence of steps 
> (e.g. tokenization, part-of-speech tagging, etc.) on a line of text.  
> I have wrapped the whole pipeline up into a simple interface that 
> allows me to call it from Scala as a POJO - i.e. I instantiate the 
> pipeline, I pass it a string, and get back some objects.  Now, I would 
> like to do the same thing for items in a Spark RDD via a map 
> transformation.  Unfortunately, my pipeline is not serializable and so 
> I get a NotSerializableException when I try this.  I played around 
> with Kryo just now to see if that could help, and I ended up with a 
> "missing no-arg constructor" exception on a class I have no control 
> over.  It seems the Spark framework expects that I should be able to 
> serialize my pipeline when I can't (or at least don't think I can at 
> first glance).
>
> Is there a workaround for this scenario?  I am imagining a few 
> possible solutions that seem a bit dubious to me, so I thought I would 
> ask for direction before wandering about.  Perhaps a better 
> understanding of serialization strategies might help me get the 
> pipeline to serialize.  Or perhaps there is a way to instantiate my 
> pipeline on demand on the nodes through a factory call.
>
> Any advice is appreciated.
> Thanks,
> Philip
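
As for the factory idea at the end of the quoted message, one way to 
get the pipeline instantiated on demand on the nodes without serializing 
it at all would be mapPartitions, building one instance per partition 
(again with Pipeline as a placeholder for my real class):

    val results = rdd.mapPartitions { lines =>
      val pipeline = new Pipeline()  // built on the worker, never serialized
      lines.map(pipeline.annotate)
    }

A singleton, e.g. an object holding a lazy val pipeline that the map 
function references, should go a step further and share one instance 
per executor JVM.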
