When you say the RDD-based models support point prediction, I think you mean that those older models expose a method to score a single Vector. They are not somehow exposing distributed point prediction. You could add this to the newer models, but it raises the question of how to construct the Row to feed it; the .mllib API punts on this and assumes you can construct the Vector yourself.
AK: In my mind, punting is exactly the right solution - no overhead, full control to the user
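For concreteness, a minimal sketch of what the .mllib "punt" looks like. The model path is illustrative, and an existing SparkContext `sc` is assumed:

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors

    // Load a previously trained model; "path/to/model" is a placeholder.
    val model = LogisticRegressionModel.load(sc, "path/to/model")

    // The caller constructs the Vector by hand; predict(Vector) runs
    // entirely on the driver, so this is local point prediction, not
    // a distributed operation.
    val features = Vectors.dense(0.5, 1.2, -0.3)
    val score: Double = model.predict(features)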
I think this sweeps a lot under the rug in assuming that there can just be a "local" version of every Transformer -- but even if there could be, consider how much extra implementation that is. Lots of them probably could have one, but I'm not sure that all can.
AK: I'm not aware of models for which this is not possible - there are no Spark-only algorithms that I'm aware of. The work to convert Spark models to local models may be more involved for some implementations, sure, but I don't think any would be too bad. However, if there is something that's impossible, that's fine too. I'm not sure we have to commit to having local versions of every single model.
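To give a sense of the per-model work implied here, a purely hypothetical sketch -- nothing like LocalTransformer exists in Spark today, and all names and shapes below are made up for illustration:

    // Hypothetical non-distributed counterpart to Transformer that scores
    // one record at a time, represented as column name -> value.
    trait LocalTransformer {
      def transform(row: Map[String, Any]): Map[String, Any]
    }

    // Every Spark model would need a hand-written conversion like this one,
    // which mimics a fitted standard scaler using its extracted parameters.
    class LocalScalerModel(mean: Array[Double], std: Array[Double])
        extends LocalTransformer {
      override def transform(row: Map[String, Any]): Map[String, Any] = {
        val features = row("features").asInstanceOf[Array[Double]]
        val scaled =
          features.indices.map(i => (features(i) - mean(i)) / std(i)).toArray
        row + ("scaled" -> scaled)
      }
    }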
The bigger problem in my experience is that Pipelines don't generally encapsulate the entire flow from source data to score. They encapsulate the part after computing the underlying features. That is, if one of your features is "total clicks from this user", that feature is the product of a DataFrame operation that precedes the Pipeline. This can't be turned into a non-distributed, non-Spark local version.
AK: That's a great point, and a really good argument for keeping any local pipeline logic outside of Spark
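A sketch of the gap being described; the `clicks` and `labels` DataFrames and the column names are illustrative:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.sum

    // The "total clicks" feature comes from a DataFrame aggregation that
    // happens *before* any Pipeline stage, so no Pipeline export captures it.
    val training = clicks                        // raw events: (userId, clicks, ...)
      .groupBy("userId")
      .agg(sum("clicks").as("totalClicks"))
      .join(labels, "userId")                    // labels computed elsewhere

    // Only the stages below are encapsulated by the Pipeline; a real
    // pipeline would add estimator stages after the assembler.
    val assembler = new VectorAssembler()
      .setInputCols(Array("totalClicks"))
      .setOutputCol("features")
    val pipeline = new Pipeline().setStages(Array(assembler))
    val model = pipeline.fit(training)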
Solving subsets of this problem could still be useful, and you've highlighted some external projects that try. I'd also highlight PMML as an established interchange format for just the model part; for cases that don't involve much or any pipeline, it's a better fit when paired with a library that can score from PMML.
AK: The problem with solutions like PMML is that they can tell you WHAT to do, but not HOW EXACTLY to do it. At the end of the day, the best possible model description would be the metadata plus the code itself. That's the crux of my proposal - expose the implementation so users can score Spark models with the exact same code that was used to train them.
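For reference on the PMML route, .mllib already supports export for a handful of model types via PMMLExportable; a sketch with KMeans, assuming an existing SparkContext `sc` and illustrative data and paths:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Tiny illustrative dataset and a trained model.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))
    val model = KMeans.train(data, k = 2, maxIterations = 20)

    // Writes a PMML document that an external scoring library can load;
    // the output path is a placeholder.
    model.toPMML("path/to/kmeans.pmml")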
I think this is one of those things that could live outside the project, because it's more not-Spark than Spark. Remember too that building a solution into the project blesses one approach at the expense of the others.