Thanks, but I should have been clearer: I'm trying to do this in PySpark, not Scala. Using an example I found on Stack Overflow, I was able to implement a Pipeline step in Python, but it seems more difficult (perhaps currently impossible) to make it persist to disk (I tried implementing a _to_java method, to no avail). Any ideas about that?

On Sun, Aug 14, 2016 at 6:02 PM Jacek Laskowski <> wrote:

It should just work if you followed the Transformer interface [1].
When you have the transformers, creating a Pipeline is a matter of
setting them as additional stages (using Pipeline.setStages [2]).


Jacek Laskowski
Mastering Apache Spark 2.0

On Fri, Aug 12, 2016 at 9:19 AM, evanzamir <> wrote:
> I'm building an LDA Pipeline, currently with four steps: Tokenizer,
> StopWordsRemover, CountVectorizer, and LDA. I would like to add more steps,
> for example stemming and lemmatization, and also combined 1-grams and
> 2-grams (which I believe the default NGram class does not support, since it
> takes a single n). Is there a way to add these steps? In sklearn, you can
> create classes with fit() and transform() methods, and that is enough. Is
> the same true in Spark ML (or something similar)?
> --
> Sent from the Apache Spark User List mailing list archive at