spark-dev mailing list archives

From Hossein Falaki <hoss...@databricks.com>
Subject Re: SparkR DataFrame Column Casts esp. from CSV Files
Date Wed, 03 Jun 2015 17:55:34 GMT
Yes, spark-csv does not infer types yet, but type inference is planned to be implemented soon.

To work around the current limitations (of spark-csv and SparkR), you can specify the schema
in read.df() to get your desired types from spark-csv. For example:

myschema <- structType(structField("id", "integer"), structField("name", "string"),
                       structField("location", "string"))
df <- read.df(sqlContext, "path/to/file.csv", source = "com.databricks.spark.csv",
              schema = myschema)
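
As a quick check (a small sketch, assuming the file really does contain those three columns),
printing the schema of the returned DataFrame should show the declared types:

printSchema(df)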

—Hossein

> On Jun 3, 2015, at 10:29 AM, Shivaram Venkataraman <shivaram@eecs.berkeley.edu> wrote:
> 
> cc Hossein who knows more about the spark-csv options
> 
> You are right that the default CSV reader options end up creating all columns as string.
> I know that the JSON reader infers the schema [1], but I don't know if the CSV reader has any
> options to do that. Regarding the SparkR syntax to cast columns, I think there is a simpler
> way to do it by just assigning to the same column name. For example, I have a flights DataFrame
> with the `year` column typed as string. To cast it to int I just use
> 
> flights$year <- cast(flights$year, "int")
> 
> Now the DataFrame has the same number of columns as before and you don't need a separate select.
> 
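> For comparison, the schema-inference behavior of the JSON reader [1] can be seen with a minimal
> sketch like the following (assuming a sqlContext and a made-up people.json path):
> 
> people <- jsonFile(sqlContext, "path/to/people.json")
> printSchema(people)  # columns come back with inferred types rather than all strings
> 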
> However, this still doesn't address the part about casting multiple columns -- could you
> file a new JIRA to track the need for casting multiple columns, or rather for being able to set
> the schema after loading a DataFrame?
> 
> Thanks
> Shivaram
> 
> [1] http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
> 
> On Wed, Jun 3, 2015 at 7:51 AM, Eskilson, Aleksander <Alek.Eskilson@cerner.com> wrote:
> It appears that casting columns remains a bit of a trick in Spark’s DataFrames. This
> is an issue because tools like spark-csv will set column types to String by default and will
> not attempt to infer types. Although spark-csv supports specifying types for columns in its
> options, it’s not clear how that might be integrated into SparkR (when loading the spark-csv
> package into the R session).
> 
> Looking at the column.R spec, we can cast a column to a different data type with the cast
> function [1], but it’s notable that this is not a mutator: it returns a Column object
> as opposed to a DataFrame. It appears the column cast can only be ‘applied’ by using
> withColumn() or mutate() (an alias for withColumn()).
> 
> The other way to cast with Spark DataFrames is to write UDFs that operate on a column
> value and return a coerced value. It looks like SparkR doesn’t have UDFs just yet [2], but
> it seems like they’d be necessary to do a natural one-off column cast in R, something like
> 
> df.col1toInt <- withColumn(df, "intCol1", udf(df$col1, function(x) as.numeric(x)))
> 
> (where col1 was originally ‘character’ type)
> 
> Currently it seems one has to
> df.col1cast <- cast(df$col1, "int")
> df.col1toInt <- withColumn(df, df.col1cast)
> 
> If we wanted just our casted columns and not the original column from the data frame,
> we’d still have to do a select. There was a conversation about CSV files just yesterday;
> their types are already problematic, but they’re a very common data source in R, even at scale.

> 
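> As a minimal sketch of how those steps fit together (assuming withColumn accepts a column name
> plus a Column and that select takes column names; "col2" stands in for whatever other columns
> we want to keep):
> 
> df.col1cast <- cast(df$col1, "int")
> df.withInt  <- withColumn(df, "intCol1", df.col1cast)
> df.casted   <- select(df.withInt, "intCol1", "col2")  # drops the original string col1
> 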
> But only being able to coerce one column at a time is really unwieldy. Can the current
> spark-csv SQL API for specifying types [3] be extended to SparkR? And are there any thoughts
> on implementing some kind of type inference, perhaps based on a sampling of some number of
> rows (an implementation I’ve seen before)? R’s read.csv() and read.delim() get types by
> inferring from the whole file. Getting something that can achieve that functionality, via explicit
> definition of types or sampling, will probably be necessary to work with CSV files that have
> enough columns to merit R at Spark’s scale.
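> 
> Until something like that exists, a small helper can at least batch the casts. A rough sketch
> only, assuming withColumn replaces an existing column of the same name and that [[ fetches a
> column by name the way $ does; the column names here are made up:
> 
> # Hypothetical helper (sketch): cast several columns given a named vector of target types,
> # e.g. c(id = "int", price = "double")
> castColumns <- function(df, types) {
>   for (name in names(types)) {
>     df <- withColumn(df, name, cast(df[[name]], types[[name]]))
>   }
>   df
> }
> df2 <- castColumns(df, c(id = "int", price = "double"))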
> 
> Regards,
> Alek Eskilson
> 
> [1] - https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
> [2] - https://issues.apache.org/jira/browse/SPARK-6817
> [3] - https://github.com/databricks/spark-csv#sql-api
> 

