spark-user mailing list archives

From Mark Hamstra
Subject Re: unable to serialize analytics pipeline
Date Tue, 22 Oct 2013 17:57:14 GMT
If you distribute the needed jar(s) to your Workers, you may well be able
to instantiate what you need using mapPartitions, mapPartitionsWithIndex,
mapWith, flatMapWith, etc.  Be careful, though, about teardown of any
resource allocation that you may need to do within each partition.
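
A minimal sketch of the per-partition approach, written in plain Scala so it can be run without a cluster. Everything named here is an assumption made for illustration: `TextPipeline` is a hypothetical stand-in for the non-serializable pipeline, and `PartitionAnalysis.analyzePartition` is a made-up helper. The idea is that the pipeline is constructed inside the function that runs on the worker, so it never has to be serialized, and teardown is deferred until the partition's iterator is exhausted.

```scala
// Hypothetical stand-in for the non-serializable third-party pipeline.
class TextPipeline {
  private var open = true

  // Toy "analysis": whitespace tokenization only.
  def process(line: String): Seq[String] = {
    require(open, "pipeline already closed")
    line.split("\\s+").filter(_.nonEmpty).toSeq
  }

  def close(): Unit = { open = false }
}

object PartitionAnalysis {
  // Intended to run on a worker: one pipeline is built per partition,
  // so the pipeline object is never shipped from the driver.
  def analyzePartition(lines: Iterator[String]): Iterator[Seq[String]] = {
    val pipeline = new TextPipeline
    val results = lines.map(pipeline.process)

    // Iterators are lazy, so closing the pipeline right after the map call
    // would break it before any line is processed. Close it only once the
    // last element has been consumed.
    new Iterator[Seq[String]] {
      private var closed = false

      def hasNext: Boolean = {
        val more = results.hasNext
        if (!more && !closed) {
          pipeline.close()
          closed = true
        }
        more
      }

      def next(): Seq[String] = results.next()
    }
  }
}
```

With Spark on the classpath and `lines` being an RDD of strings, this would presumably be applied as `lines.mapPartitions(PartitionAnalysis.analyzePartition)`. One caveat of this sketch: if a partition is only partly consumed, `close()` is never called, so a real pipeline holding scarce resources would need a more robust teardown.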

On Tue, Oct 22, 2013 at 10:50 AM, Philip Ogren wrote:

> I have a text analytics pipeline that performs a sequence of steps (e.g.
> tokenization, part-of-speech tagging, etc.) on a line of text.  I have
> wrapped the whole pipeline up into a simple interface that allows me to
> call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass it a
> string, and get back some objects.  Now, I would like to do the same thing
> for items in a Spark RDD via a map transformation.  Unfortunately, my
> pipeline is not serializable and so I get a NotSerializableException when I
> try this.  I played around with Kryo just now to see if that could help and
> I ended up with a "missing no-arg constructor" exception on a class I have
> no control over.  It seems the Spark framework expects that I should be
> able to serialize my pipeline when I can't (or at least don't think I can
> at first glance.)
> Is there a workaround for this scenario?  I am imagining a few possible
> solutions that seem a bit dubious to me, so I thought I would ask for
> direction before wandering about.  Perhaps a better understanding of
> serialization strategies might help me get the pipeline to serialize.  Or
> perhaps there is a way to instantiate my pipeline on demand on the nodes
> through a factory call.
> Any advice is appreciated.
> Thanks,
> Philip
