spark-user mailing list archives

From: <Saif.A.Ell...@wellsfargo.com>
Subject: MLlib RDD segmentation for logistic regression
Date: Mon, 13 Jul 2015 21:30:01 GMT
Hello all,

I have one big RDD in which there is a column of group labels A1, A2, B1, B2, B3, C1, D1, ..., XY.
From it, I am using map() to build an RDD[LabeledPoint] with dense vectors, for later use in
logistic regression, which takes an RDD[LabeledPoint] as input.
I would like to run a separate logistic regression for each of these N groups (the group label is
NOT part of the features used in the model itself), but I could not find a proper way to do it.
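
For context, the transformation looks roughly like this (just a sketch; the input layout, the field
names, and the ??? placeholder are made up, standing in for our real data source):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical input rows: (group label, binary target, feature values).
val rows: RDD[(String, Double, Array[Double])] = ???   // stands in for the real data source

// Keep the group label next to each LabeledPoint so the data can be segmented later.
val keyed: RDD[(String, LabeledPoint)] = rows.map { case (group, label, features) =>
  (group, LabeledPoint(label, Vectors.dense(features)))
}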

1.      I can't programmatically create sub-RDDs in a loop: org.apache.spark.SparkException:
RDD transformations and actions can only be invoked by the driver, not inside of other transformations
(see the sketch after this list);

2.      I can't create the RDDs manually with split(), since the number of groups is unknown and large;

3.      Pair RDDs seemed a tempting choice, with their reduceByKey/combineByKey/values functions,
but none of them returns a data type usable as an RDD[LabeledPoint], which is what the logistic
regression later takes as input (also sketched below). Any programmatic way to obtain sub-RDDs
takes me back to item 1.
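
To make items 1 and 3 concrete, this is roughly what I tried (sketches only; keyed is the
RDD[(String, LabeledPoint)] from the snippet above):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Item 1: this compiles, but fails at runtime with the SparkException quoted above,
// because keyed.filter(...) is an RDD transformation invoked inside map() on another
// RDD (the RDD of distinct group keys).
val models = keyed.keys.distinct().map { g =>
  val subset: RDD[LabeledPoint] = keyed.filter(_._1 == g).values
  (g, new LogisticRegressionWithLBFGS().run(subset))
}

// Item 3: groupByKey() yields a local Iterable[LabeledPoint] per group, not the
// RDD[LabeledPoint] that run() expects, so it cannot be fed to the regression directly.
val grouped = keyed.groupByKey()   // RDD[(String, Iterable[LabeledPoint])]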

The logit is a simple binary dependent variable over n features; I just need to run one
logit for each group.
There may be some mathematical equivalent that runs this as one big regression, but so far
I'm out of ideas.
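
To spell out what I mean by a mathematical equivalent (just the idea, not something I know how
to set up efficiently in MLlib): a single logit with a separate intercept and coefficient vector
per group,

    P(y = 1 | x, group = g) = 1 / (1 + exp(-(alpha_g + x . beta_g)))

should factorize into the independent per-group fits, since no parameters are shared across
groups. But encoding that as one huge sparse feature vector per row, for an unknown and large
number of groups, does not look practical either, hence the question.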

Saif

