Hello all,
I have one big RDD, in which there is a column of groups A1, A2, B1, B2, B3, C1, D1, ...,
XY.
From it, I use map() to build an RDD[LabeledPoint] with dense vectors, for later
use in logistic regression, which takes an RDD[LabeledPoint] as input.
I would like to run a logistic regression for each one of these N groups (the group is NOT
one of the features used in the model itself), but I could not find a proper way:
1. Can't programmatically create sub-RDDs with a loop: org.apache.spark.SparkException:
RDD transformations and actions can only be invoked by the driver, not inside of other transformations;
2. Can't create the RDDs manually with split(), since the number of groups is unknown and large
3. Pair RDDs seemed a tempting choice, with the reduce/combine/values by-key functions,
but none of them returns a datatype usable as an RDD[LabeledPoint], which is later the input
for the logistic regressions. Any programmatic way to get sub-RDDs brings me back to item 1.
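One possible way around item 1, as a minimal sketch rather than a confirmed fix: the exception is raised when RDD operations are called from inside another transformation, so a plain loop that runs on the driver over the collected group keys should not trigger it. The names fit_per_group, keyed_points and train_logit are hypothetical, and the sketch assumes a pair RDD of (group, LabeledPoint) and the PySpark MLlib API.

```python
def fit_per_group(keyed_points, train):
    """Fit one model per group key with a driver-side loop.

    keyed_points: pair RDD of (group, LabeledPoint)  (assumed layout)
    train: callable that takes an RDD of LabeledPoint and returns a model
    """
    # Every group triggers its own pass over the data, so cache it once.
    keyed_points.cache()

    # Only the distinct keys come back to the driver, not the rows.
    groups = keyed_points.keys().distinct().collect()

    models = {}
    for g in groups:
        # g=g freezes the current key inside the lambda.
        subset = keyed_points.filter(lambda kv, g=g: kv[0] == g).values()
        models[g] = train(subset)
    return models


def train_logit(points):
    # Imported lazily so fit_per_group itself has no MLlib dependency.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    return LogisticRegressionWithLBFGS.train(points)
```

A call would look like models = fit_per_group(keyed_points, train_logit). The cost is one filter pass over the cached data per group, which may be slow when the number of groups is very large.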
The logit has a simple binary dependent variable and n features; I just need to run one
logit for each group.
There may be some mathematically equivalent way to run this as one big regression, but so far
I'm out of ideas.
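One candidate for such an equivalence, offered as a sketch: give every group its own coefficient vector inside a single pooled model. The notation below is an assumption introduced for illustration and does not come from the setup above.

```latex
% Assumed notation: y_i \in \{0,1\} is the label of row i, x_i its feature
% vector (including a constant 1 for the intercept), g_i its group, and
% \beta_g a separate coefficient vector for each group g = 1, \dots, N.
\ell(\beta_1, \dots, \beta_N)
  = \sum_{g=1}^{N} \; \sum_{i \,:\, g_i = g}
    \left[ y_i \, x_i^{\top} \beta_g
           - \log\!\left( 1 + e^{x_i^{\top} \beta_g} \right) \right]
```

Each inner sum is the log-likelihood of one group alone and involves only that group's coefficients, so maximizing the total gives the same estimates as N separate fits. In feature terms this means one sparse vector per row of length N times n, with x_i placed in the block belonging to g_i and zeros elsewhere. The equivalence holds for an unregularized fit; regularization or feature scaling computed over the pooled data would likely make the results differ from the per-group fits.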
Saif
