spark-user mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: Spark random forest - string data
Date Fri, 16 Jan 2015 21:59:39 GMT
An alternative approach would be to translate your categorical variables
into dummy variables.  If your strings represent N classes/categories you
would generate N-1 dummy variables containing 0/1 values.
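
To make the N-1 idea concrete, here is a minimal sketch in plain Python. It
is not Spark code; the helper name dummy_encode and the choice of the first
sorted category as the baseline are my own assumptions for illustration.

```python
def dummy_encode(values):
    """Encode strings as N-1 dummy (0/1) columns.

    Assumption: the first category in sorted order is the baseline and
    gets no column of its own (it is represented by all zeros).
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the baseline category
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, rows


kept, rows = dummy_encode(["red", "green", "blue", "green"])
print(kept)  # ['green', 'red']
print(rows)  # [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
```

Each row could then be used as the numeric feature values of a LabeledPoint.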

Auto-magically creating dummy variables from categorical data definitely
comes in handy.  I assume this is what SPARK-1216 is referring to, but I am
not sure from the description.

https://issues.apache.org/jira/browse/SPARK-1216

Auto-magically doing the scheme that Sean mentioned is referenced in
SPARK-4081, I believe.

https://issues.apache.org/jira/browse/SPARK-4081
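
For comparison, the scheme Sean describes (encode categories as 0, 1, 2 ...
and tell the implementation which features are categorical) can be sketched
by hand as below. This is plain Python under my own assumptions: index_encode
is a hypothetical helper, and the string feature is assumed to sit at
position 0 of the feature vector.

```python
def index_encode(values):
    """Map each distinct string to an index 0.0, 1.0, 2.0, ... (sorted order)."""
    categories = sorted(set(values))
    index = {c: float(i) for i, c in enumerate(categories)}
    return index, [index[v] for v in values]


index, encoded = index_encode(["red", "green", "blue", "green"])
print(encoded)  # [2.0, 1.0, 0.0, 1.0]

# Arity map: feature position -> number of categories.
# Assumption: the encoded column is feature 0.
categorical_features_info = {0: len(index)}
print(categorical_features_info)  # {0: 3}
```

A map of this shape is what MLlib's RandomForest.trainClassifier takes as its
categoricalFeaturesInfo argument, so the trees treat feature 0 as categorical
rather than as an ordered number.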



On Fri, Jan 16, 2015 at 4:45 PM, Sean Owen <sowen@cloudera.com> wrote:

> The implementation accepts an RDD of LabeledPoint only, so you
> couldn't feed in strings from a text file directly. LabeledPoint is a
> wrapper around double values rather than strings. How were you trying
> to create the input then?
>
> No, it only accepts numeric values, although you can encode
> categorical values as 0, 1, 2 ... and tell the implementation which
> features are categorical so that it treats them as such.
>
> On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav <asaf.lahav@gmail.com> wrote:
> > Hi,
> >
> > I have been playing around with the new version of the Spark MLlib random
> > forest implementation and, in the process, tried it with a file containing
> > string features.
> > While training, it fails with:
> > java.lang.NumberFormatException: For input string.
> >
> >
> > Is the MLlib random forest designed to run on numeric data only?
> >
> > Thanks


-- 
Nick Allen <nick@nickallen.org>
