spark-issues mailing list archives

From "Weiqiang Zhuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
Date Thu, 01 Oct 2015 17:28:26 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940099#comment-14940099 ]

Weiqiang Zhuang commented on SPARK-10894:
-----------------------------------------

First, there is an inconsistency within SparkR itself between these two subset functions:

> class(df$Sepal_Width)
[1] "Column"
attr(,"package")
[1] "SparkR"

whereas df[, "Sepal_Width"] returns a DataFrame. In my opinion, this needs to be fixed
either way.
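
To make the point concrete, here is a minimal sketch in plain R of how the two accessors could agree. ToyFrame and ToyColumn are hypothetical stand-ins invented for this illustration, not SparkR's actual DataFrame and Column classes, and the sketch only handles the x[, j] form:

```r
library(methods)

# Hypothetical stand-ins for SparkR's DataFrame and Column (assumption, not the real classes).
setClass("ToyFrame", slots = c(data = "list"))
setClass("ToyColumn", slots = c(name = "character", values = "ANY"))

# x$name always yields a column object.
setMethod("$", signature(x = "ToyFrame"), function(x, name) {
  new("ToyColumn", name = name, values = x@data[[name]])
})

# x[, j] yields a column object only when drop is TRUE and exactly one column is selected.
setMethod("[", signature(x = "ToyFrame"), function(x, i, j, ..., drop = TRUE) {
  cols <- x@data[j]
  if (isTRUE(drop) && length(cols) == 1L) {
    return(new("ToyColumn", name = names(cols), values = cols[[1]]))
  }
  new("ToyFrame", data = cols)
})

tf <- new("ToyFrame", data = list(Sepal_Width = c(3.5, 3.0),
                                  Species = c("setosa", "setosa")))
class(tf$Sepal_Width)                     # column object
class(tf[, "Sepal_Width"])                # column object, matching `$`
class(tf[, "Sepal_Width", drop = FALSE])  # stays a frame
```

With drop defaulting to TRUE, the `$` and `[` forms return the same class for a single column, which is the behaviour base R users would expect.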

Second, I understand that SparkR does not have a vector-like data type, but from our Big
R customers' experience, we do see a need for processing on vectors. After all, the vector is
another popular R data type used in many R functions, and I think we can improve 'Column' to
support those functions (TBD). For example, we will need an as.vector() function to collect
just one column of data from the DataFrame. Treating everything as a DataFrame is fine, but
implementing such a function then requires an extra check that the DataFrame contains only
one column. Another example is the R table() function (not the SparkR table() function).
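
As a rough illustration of that extra check, here is a sketch in plain R. The helper name as_vector_single is hypothetical, and a local data.frame stands in for the result of collecting a DataFrame; no SparkR calls are used:

```r
# Hypothetical helper: `collected` stands in for the local data.frame that
# collecting a one-column DataFrame would produce (assumption for illustration).
as_vector_single <- function(collected) {
  stopifnot(is.data.frame(collected))
  # The extra check needed when everything is a DataFrame.
  if (ncol(collected) != 1L) {
    stop("expected exactly one column, got ", ncol(collected))
  }
  collected[[1]]
}

# A single column is reduced to a plain vector that base R functions accept.
widths <- as_vector_single(iris[, "Sepal.Width", drop = FALSE])
species_counts <- table(as_vector_single(iris[, "Species", drop = FALSE]))

# More than one column is rejected instead of being silently coerced.
err <- tryCatch(as_vector_single(iris[, 1:2]), error = function(e) conditionMessage(e))
```

If 'Column' could be collected directly, this check would not be needed, because a Column holds one column by construction.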

Thanks.

> Add 'drop' support for DataFrame's subset function
> --------------------------------------------------
>
>                 Key: SPARK-10894
>                 URL: https://issues.apache.org/jira/browse/SPARK-10894
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Weiqiang Zhuang
>
> A SparkR DataFrame can be subset to get one or more columns of the dataset. The current
> '[' implementation does not support 'drop' when it is asked for just one column. This is
> not consistent with the R syntax:
> x[i, j, ... , drop = TRUE]
> # in R, when drop is FALSE, the result remains a data.frame
> > class(iris[, "Sepal.Width", drop=F])
> [1] "data.frame"
> # when drop is TRUE (the default), the result is dropped to a vector
> > class(iris[, "Sepal.Width", drop=T])
> [1] "numeric"
> > class(iris[,"Sepal.Width"])
> [1] "numeric"
> > df <- createDataFrame(sqlContext, iris)
> # in SparkR, 'drop' argument has no impact
> > class(df[,"Sepal_Width", drop=F])
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> # should have been dropped to a Column instead
> > class(df[,"Sepal_Width", drop=T])
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> > class(df[,"Sepal_Width"])
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> We should add 'drop' support.
