spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide
Date Wed, 04 Mar 2015 23:10:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347745#comment-14347745 ]

Joseph K. Bradley commented on SPARK-4872:
------------------------------------------

Can this issue be closed?

> Provide sample format of training/test data in MLlib programming guide
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4872
>                 URL: https://issues.apache.org/jira/browse/SPARK-4872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.1.1
>            Reporter: zhang jun wei
>              Labels: documentation
>
> I suggest that the samples in the online MLlib programming guide use real-life data
> and show the translated data format that the model consumes.
> The problem blocking me is how to translate real-life data into a format that
> MLlib can understand correctly.
> Here is one sample. I want to use NaiveBayes to train on and predict a tennis-play
> decision. The original data is:
> Weather | Temperature | Humidity | Wind => Decision to play tennis
> Sunny   | Hot         | High     | No   => No
> Sunny   | Hot         | High     | Yes  => No
> Cloudy  | Normal      | Normal   | No   => Yes
> Rainy   | Cold        | Normal   | Yes  => No
> Now, from my understanding, one potential translation is:
> 1) put every feature value word into a line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set the value to 1 if it appears, or 0 if not. For the above example, here is the
> data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
> Another way I can imagine is:
> 1) put every feature name into a line:
> Weather  Temperature  Humidity  Wind
> 2) map them to numbers:
> 1 2 3 4 
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, Rainy to 3;
> Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 1, No to 2). For the above
> example, here is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
> ==> But when I read the source code in NaiveBayes.scala, this does not seem to be
> correct, although I am not sure.
> So which of these data format translations is correct?
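
For reference, the two translations described in the issue can be reproduced with a few lines of plain Python. This is a minimal sketch, not MLlib code: the helper names `one_hot_line` and `index_line` are hypothetical, the value ordering is taken from the issue text, and zeros are written out explicitly as in the issue (the LIBSVM text format would also allow omitting them). It only shows how each set of lines is derived; it does not settle which encoding NaiveBayes expects.

```python
# Sketch only: reproduces the two encodings described in the issue.
# The helper names (one_hot_line, index_line) are hypothetical, not MLlib APIs.

# The four rows from the issue: (Weather, Temperature, Humidity, Wind) => Decision
ROWS = [
    (("Sunny", "Hot", "High", "No"), "No"),
    (("Sunny", "Hot", "High", "Yes"), "No"),
    (("Cloudy", "Normal", "Normal", "No"), "Yes"),
    (("Rainy", "Cold", "Normal", "Yes"), "No"),
]

# Possible values per feature, in the order listed in the issue.
VALUES = [
    ["Sunny", "Cloudy", "Rainy"],  # Weather
    ["Hot", "Normal", "Cold"],     # Temperature
    ["High", "Normal"],            # Humidity
    ["Yes", "No"],                 # Wind
]

# Decision labels as mapped in the issue.
LABELS = {"No": 0, "Yes": 1}


def one_hot_line(features, decision):
    """First translation: one 0/1 column per (feature, value) pair."""
    parts = [str(LABELS[decision])]
    column = 1  # LIBSVM text indices are one-based
    for value, choices in zip(features, VALUES):
        for choice in choices:
            parts.append(f"{column}:{1 if value == choice else 0}")
            column += 1
    return " ".join(parts)


def index_line(features, decision):
    """Second translation: one column per feature, holding the value's number."""
    parts = [str(LABELS[decision])]
    for column, (value, choices) in enumerate(zip(features, VALUES), start=1):
        parts.append(f"{column}:{choices.index(value) + 1}")
    return " ".join(parts)


one_hot_lines = [one_hot_line(f, d) for f, d in ROWS]
index_lines = [index_line(f, d) for f, d in ROWS]

if __name__ == "__main__":
    print("\n".join(one_hot_lines))
    print()
    print("\n".join(index_lines))
```

Writing either list to a text file, one line per row, gives a file in the same shape as the examples quoted above.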



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

