spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3573) Dataset
Date Wed, 29 Oct 2014 20:56:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188980#comment-14188980 ]

Joseph K. Bradley commented on SPARK-3573:
------------------------------------------

[~sparks]  Trying to simplify things, am I right that the main question is:
    _Should ML data instances/examples/rows be flat vectors or have structure?_
Breaking this down,
(1) Should we allow structure?
(2) Should we encourage flatness or structure, and how?
(3) How does a Dataset used in a full ML pipeline resemble or differ from a Dataset used by a specific ML algorithm?

My thoughts:
(1) We should allow structure.  For general (complicated) pipelines, it will be important to provide structure to make it easy to select groups of features.
(2) We should encourage flatness where possible; e.g., unigram features from a document should be stored as a single Vector rather than a bunch of separate Doubles in the Schema.  We should encourage structure where meaningful; e.g., the output of a learning algorithm should by default be appended as a new column (a new element in the Schema), rather than being appended to a big Vector of features.
(3) As in my comment for (2), a Dataset for a full pipeline should have structure where meaningful.  However, I agree that most common ML algorithms expect flat Vectors of features.  There needs to be an easy way to select relevant features and transform them into a Vector, LabeledPoint, etc.  Having structured Datasets in the pipeline should be useful for selecting relevant features.  To transform the selection, it will be important to provide helper methods for mushing the data into Vectors or other common formats.
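To make the "helper methods" idea concrete, here is a minimal sketch in plain Scala of what flattening selected columns into one feature vector could look like.  All names here (Scalar, Vec, assemble) are hypothetical for illustration, not a proposed API:

```scala
// Hypothetical sketch: a row with structured columns, plus a helper that
// concatenates a chosen subset of columns into one flat Array[Double]
// suitable for an ML algorithm that expects flat feature vectors.
object FlattenExample {
  // A column value is either a single Double or an already-grouped vector.
  sealed trait Col
  case class Scalar(x: Double) extends Col
  case class Vec(xs: Array[Double]) extends Col

  type Row = Map[String, Col]

  // Select the named columns and concatenate them into one flat vector.
  def assemble(row: Row, cols: Seq[String]): Array[Double] =
    cols.flatMap { name =>
      row(name) match {
        case Scalar(x) => Array(x)
        case Vec(xs)   => xs
      }
    }.toArray

  def main(args: Array[String]): Unit = {
    val row: Row = Map(
      "genderMatch"      -> Scalar(1.0),
      "userCountryIndex" -> Scalar(3.0),
      "userFeatures"     -> Vec(Array(0.5, 0.25))
    )
    val features = assemble(row, Seq("genderMatch", "userCountryIndex", "userFeatures"))
    println(features.mkString(","))  // 1.0,3.0,0.5,0.25
  }
}
```

The point of the sketch is that the caller only names columns; whether a column is a scalar or an already-grouped Vector is handled by the helper, which is what keeping structure in the Dataset buys us.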

The big challenge in my mind is (2): figuring out default behavior, and perhaps column naming/selection conventions, that make it easy to select subsets of features (or even have an implicit selection if possible).

What do you think?

> Dataset
> -------
>
>                 Key: SPARK-3573
>                 URL: https://issues.apache.org/jira/browse/SPARK-3573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This JIRA is for discussion of the ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.
> Sample code:
> Suppose we have training events stored on HDFS and user/ad features in Hive.  We want to assemble features for training and then apply a decision tree.
> The proposed pipeline with a dataset looks like the following (needs more refinement):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifier = new DecisionTreeClassifier()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifier.maxDepth, 4) // By default, the classifier recognizes the "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

