spark-user mailing list archives

From femibyte <>
Subject MLeap and Spark ML SQLTransformer
Date Fri, 03 Jan 2020 05:52:27 GMT

I have a question. I am trying to serialize a PySpark ML model to MLeap.
However, the model uses SQLTransformer for some column-based
transformations, e.g. adding log-scaled versions of some columns. MLeap
doesn't support SQLTransformer (see here), so I've implemented the first
of these two suggestions:

1. For non-row operations, move the SQL out of the ML Pipeline that you plan
   to serialize.
2. For row-based operations, use the available ML transformers or write a
   custom transformer <- this is where the custom transformer documentation
   will help.

I've externalized the SQL transformation on the training data used to build
the model, and I do the same for the input data when I run the model for
evaluation.
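
For reference, the externalized step looks roughly like the sketch below. The
column name `amount`, the view name `training`, and the helper `externalize`
are made up for illustration; the only thing taken from Spark is that
SQLTransformer statements refer to their input table as `__THIS__`. Reusing
the same statement string in both paths is meant to rule out differences in
the SQL itself.

```python
# SQLTransformer statements use the placeholder __THIS__ for the input table.
THIS = "__THIS__"


def externalize(statement, view_name):
    """Rewrite a SQLTransformer statement so it can be passed to spark.sql()."""
    if THIS not in statement:
        raise ValueError("statement does not reference __THIS__")
    return statement.replace(THIS, view_name)


# Hypothetical statement; `amount` is an illustrative column name.
STATEMENT = "SELECT *, log(amount) AS log_amount FROM __THIS__"

# Inside the pipeline (Model 1):  SQLTransformer(statement=STATEMENT)
# Externalized (Model 2):         df.createOrReplaceTempView("training")
#                                 prepared = spark.sql(externalize(STATEMENT, "training"))
print(externalize(STATEMENT, "training"))
```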

The problem I'm having is that I'm unable to obtain the same results across
the 2 models.

*Model 1* - Pure Spark ML model containing SQLTransformer + later
transformations:

    StringIndexer -> OneHotEncoderEstimator -> VectorAssembler ->
    RandomForestClassifier

*Model 2* - Externalized version, with the SQL queries run on the training
data before building the model. The transformations are everything after
SQLTransformer in Model 1:

    StringIndexer -> OneHotEncoderEstimator -> VectorAssembler ->

I'm wondering how I could go about debugging this problem. Is there a way to
compare the results after each stage to see where the differences first show
up? Any suggestions are appreciated.
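
To make the question concrete, this is the kind of comparison I have in mind,
as a rough sketch only. The helper names, the unique key column `id`, and the
placeholders `model1`, `model2`, `raw_sample` and `prepared_sample` are
assumptions for illustration. It relies on a fitted PipelineModel exposing its
transformers as `.stages`, and it collects rows to the driver, so it is only
suitable for a small sample.

```python
def first_divergent_stage(stages_a, stages_b, data_a, data_b, rows_of):
    """Run two lists of fitted stages side by side.

    Returns the index of the first stage whose outputs differ, or None
    if every paired stage produces the same rows.
    """
    for i, (stage_a, stage_b) in enumerate(zip(stages_a, stages_b)):
        data_a = stage_a.transform(data_a)
        data_b = stage_b.transform(data_b)
        if rows_of(data_a) != rows_of(data_b):
            return i
    return None


def spark_rows(df, key="id"):
    """Collect a small DataFrame into rows sorted by a unique key column."""
    return sorted(df.collect(), key=lambda row: row[key])


# Intended use (not run here; all four names below are placeholders):
#   sql_stage, *rest = model1.stages          # Model 1: SQLTransformer first
#   first_divergent_stage(rest, model2.stages,
#                         sql_stage.transform(raw_sample), prepared_sample,
#                         spark_rows)
```

The comparison is exact, so floating-point columns may need a tolerance
instead, and both DataFrames need the same column order for rows to compare
equal.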
