So you will need to convert your input DataFrame into something with vectors and labels to train on - the Spark ML documentation has examples (although the website seems to be having some issues mid-update to Spark 2.0, so if you want to read it right now )
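A minimal sketch of that conversion, in case it helps. It assumes every column other than the label is numeric and should be used as a feature, and that the label column is numeric and named "label". The column name and both helper functions are made up for illustration; VectorAssembler, LogisticRegression and Pipeline are the standard pyspark.ml classes.

```python
def feature_columns(all_columns, label_col):
    """Treat every column except the label as a feature (an assumption)."""
    return [c for c in all_columns if c != label_col]


def train_logistic_regression(df, label_col="label"):
    """Assemble feature columns into one vector column and fit a model.

    `df` is a Spark DataFrame whose feature columns are numeric.
    """
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Pack the individual numeric columns into a single "features" vector.
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol="features",
    )
    # Same hyperparameters as in the question below.
    lr = LogisticRegression(
        featuresCol="features",
        labelCol=label_col,
        maxIter=10,
        regParam=0.3,
        elasticNetParam=0.8,
    )
    # Fitting the pipeline runs the assembler, then trains the classifier.
    return Pipeline(stages=[assembler, lr]).fit(df)
```

With the df from the question, this would be called as model = train_logistic_regression(df), and model.transform(df) would then add the prediction columns.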

As for why some algorithms are available in the RDD API and not the DataFrame API yet - simply development time. The DataFrame/Pipeline API will be the actively developed API going forward.


Holden :)

On Tuesday, July 26, 2016, Shi Yu wrote:

Question 1: I am new to Spark. I am trying to train a classification model on a Spark DataFrame. I am using PySpark, and I have created a Spark DataFrame object in df:

from pyspark.sql.types import *

query = """select * from table"""

df = sqlContext.sql(query)
My question is how to extend the code to train models (e.g., a classification model) on the object df. I have checked many online resources and haven't seen any approach similar to the following:
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel =
Is this a feasible way to train the model? If yes, where could I find the reference code?
Question 2: Why is there no SVM model support in the MLlib DataFrame-based API, while the RDD-based API has an SVM model?
Thanks a lot!


