spark-issues mailing list archives

From "Evan Sparks (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3573) Dataset
Date Wed, 29 Oct 2014 20:13:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919 ]

Evan Sparks commented on SPARK-3573:
------------------------------------

This comment originally appeared on the PR associated with this feature (https://github.com/apache/spark/pull/2919):

I've looked at the code here, and it basically seems reasonable. One high-level concern I
have is around the programming pattern that this encourages: complex nesting of otherwise
simple structures, which may make it difficult to program against Datasets for sufficiently
complicated applications.

A 'dataset' is now a collection of Row, where we have the guarantee that all rows in a Dataset
conform to the same schema. A schema is a list of (name, type) pairs which describe the attributes
available in the dataset. This seems like a good thing to me, and is pretty much what we described
in MLI (and how conventional databases have been structured forever). So far, so good.

The concern that I have is that we are now encouraging these attributes to be complex types.
For example, where I might have had 
val x = Schema(("a", classOf[String]), ("b", classOf[Double]), ..., ("z", classOf[Double]))
This would become
val x = Schema(("a", classOf[String]), ("bGroup", classOf[Vector]), ..., ("zGroup", classOf[Vector]))

So, great, my schema now has these vector things in them, which I can create separately, pass
around, etc.
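
To make the contrast concrete, here is a rough sketch in plain Scala. The Attribute and Schema
classes are just stand-ins for illustration (only the MLlib Vector import is real), not the
proposed Dataset API:

import org.apache.spark.mllib.linalg.Vector

// Toy stand-ins, for illustration only.
case class Attribute(name: String, dataType: Class[_])
case class Schema(attributes: Seq[Attribute])

// Flat layout: one scalar column per feature, thousands of entries.
val flat = Schema(Attribute("label", classOf[Double]) +:
  (1 to 4000).map(i => Attribute(s"f$i", classOf[Double])))

// Grouped layout: a few vector-valued columns, one per feature group.
val grouped = Schema(Seq(
  Attribute("label", classOf[Double]),
  Attribute("bGroup", classOf[Vector]),
  Attribute("zGroup", classOf[Vector])))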

This clearly has its merits:
1) Features are grouped together logically based on the process that creates them.
2) Managing one short schema where each record is comprised of a few large objects (say, 4
vectors, each of length 1000) is probably easier than managing a really big schema comprised
of lots of small objects (say, 4000 doubles).

But there are some major drawbacks:
1) Why stop at only one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA, deal with these Datasets? Is there an implicit
conversion that flattens these things to RDD[LabeledPoint]? Do we want to guarantee these
semantics? (A rough sketch of what such a conversion might look like follows below.)
3) Manipulating and subsetting nested schemas like this might be tricky. Where before I might
have been able to write:

val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))
now I might have to write:
val groupSelections = Seq(Seq(0,1,2,4),Seq(0,1),Seq(0,1,2))
val x: Dataset = groupSelections.zip(input.columns).map {case (gs, col) => col(gs) }

Ignoring the raw syntax and semantics of how you might actually map an operation over the columns
of a Dataset and get back a well-structured dataset, I think this makes two conflicting points:
1) In the first example, presumably all the work goes into figuring out what the subset of
features you want is in this really wide feature space.
2) In the second example, there’s a lot of gymnastics that goes into subsetting feature
groups. I think it’s clear that working with lots of feature groups might get unreasonable
pretty quickly.
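
For what it's worth, the implicit conversion I'm asking about in drawback 2 could plausibly look
something like the sketch below; rowsToPoints and its (label, feature groups) input shape are
hypothetical, and only the existing MLlib Vector/LabeledPoint types are assumed:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical flattening: concatenate every vector-valued feature group
// of a record into one dense feature vector per LabeledPoint.
def rowsToPoints(rows: RDD[(Double, Seq[Vector])]): RDD[LabeledPoint] =
  rows.map { case (label, groups) =>
    LabeledPoint(label, Vectors.dense(groups.flatMap(_.toArray).toArray))
  }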

If we look at R or pandas/scikit-learn as examples of projects that have (arguably quite successfully)
dealt with these interface issues, there is one basic pattern: learning algorithms expect
big tables of numbers as input. Even here, there are some important differences:

For example, in scikit-learn, categorical features aren’t supported directly by most learning
algorithms. Instead, users are responsible for getting data from a “table with heterogeneously
typed columns” to a “table of numbers” with something like OneHotEncoder and other feature
transformers. In R, on the other hand, categorical features are (sometimes frustratingly)
first-class citizens by virtue of the “factor” data type, which is essentially an enum.
Most out-of-the-box learning algorithms (like glm()) accept data frames with categorical inputs
and handle them sensibly, implicitly one-hot encoding (or creating dummy variables, if you
prefer) the categorical features.
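
As a minimal sketch of what that transformation amounts to (plain Scala, not scikit-learn's or
R's actual implementation), one-hot encoding a single categorical column is just:

// Encode one categorical value as a 0/1 indicator vector over the known categories.
def oneHot(categories: Seq[String])(value: String): Array[Double] = {
  val index = categories.indexOf(value)
  Array.tabulate(categories.length)(i => if (i == index) 1.0 else 0.0)
}

val countries = Seq("US", "CA", "FR")
oneHot(countries)("CA")  // Array(0.0, 1.0, 0.0)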

While I have a slight preference for representing things as big flat tables, I would be fine
coding either way - but I wanted to raise the issue for discussion here before the interfaces
are set in stone.

> Dataset
> -------
>
>                 Key: SPARK-3573
>                 URL: https://issues.apache.org/jira/browse/SPARK-3573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.
> Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, and we want to assemble features for training and then apply a decision tree.
> The proposed pipeline with dataset looks like the following (it needs more refinement):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifier = new DecisionTreeClassifier()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifier.maxDepth, 4) // By default, the classifier recognizes the "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}



