spark-user mailing list archives

From Xiangrui Meng
Subject Re: pass unique ID to mllib algorithms pyspark
Date Wed, 05 Nov 2014 03:21:21 GMT
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
this issue. We "carry over" extra columns through training and prediction
and then rely on Spark SQL's execution plan optimization to decide
which columns are really needed. For the current set of APIs, we can
add `predictOnValues` to models, which carries over the input keys.
StreamingKMeans and StreamingLinearRegression implement this method.
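
To illustrate the key-carrying idea behind `predictOnValues`, here is a
minimal sketch in plain Python (no Spark required). `ToyLinearModel` and
`predict_on_values` are hypothetical names made up for this example, not
MLlib classes or methods, and a list of (key, features) pairs stands in for
an RDD or DStream. The point is only that the model is applied to the value
of each pair while the key passes through untouched:

```python
from typing import Hashable, Iterable, List, Sequence, Tuple

Features = Sequence[float]


class ToyLinearModel:
    """Hypothetical stand-in for a trained model; not an MLlib class."""

    def __init__(self, weights: Sequence[float], intercept: float = 0.0) -> None:
        self.weights = list(weights)
        self.intercept = intercept

    def predict(self, features: Features) -> float:
        # Dot product of weights and features, plus the intercept.
        return sum(w * x for w, x in zip(self.weights, features)) + self.intercept


def predict_on_values(
    model: ToyLinearModel,
    keyed_features: Iterable[Tuple[Hashable, Features]],
) -> List[Tuple[Hashable, float]]:
    """Predict on the value of each (key, features) pair and keep the key.

    Because every prediction stays attached to its key, the output can be
    joined back to the original data regardless of element order.
    """
    return [(key, model.predict(features)) for key, features in keyed_features]


# Assumed example data: each row has a unique ID and a feature vector.
model = ToyLinearModel(weights=[2.0, -1.0], intercept=0.5)
rows = [("user-7", [1.0, 3.0]), ("user-3", [0.0, 0.0]), ("user-9", [2.0, 1.0])]

predictions = predict_on_values(model, rows)
for key, value in predictions:
    print(key, value)
```

On an RDD of (key, features) pairs, the same shape would be a `mapValues`
call over the pairs, assuming the model's `predict` can run inside a Python
closure; that assumption does not hold for models whose prediction is
implemented on the JVM side.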

On Tue, Nov 4, 2014 at 2:30 AM, jamborta wrote:
> Hi all,
> There are a few algorithms in pyspark where the prediction part is
> implemented in Scala (e.g. ALS, decision trees), so it is not easy to
> modify the prediction methods.
> I think it is a very common scenario that the user would like to generate
> predictions for a dataset so that each predicted value is identifiable
> (e.g. has a unique ID attached to it). This is not possible in the current
> implementation, as predict functions take a feature vector and return the
> predicted values where, I believe, the order is not guaranteed, so there is
> no way to join them back with the original data the predictions were
> generated from.
> Is there a way around this at the moment?
> thanks,
