spark-dev mailing list archives

From Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Subject Re: SparkR DataFrame Column Casts esp. from CSV Files
Date Wed, 03 Jun 2015 17:29:12 GMT
cc Hossein who knows more about the spark-csv options

You are right that the default CSV reader options end up creating all
columns as string. I know that the JSON reader infers the schema [1] but I
don't know if the CSV reader has any options to do that.  Regarding the
SparkR syntax to cast columns, I think there is a simpler way to do it by
just assigning to the same column name. For example I have a flights
DataFrame with the `year` column typed as string. To cast it to int I just
use

flights$year <- cast(flights$year, "int")

Now the DataFrame has the same number of columns as before and you don't
need a select.

However, this still doesn't address casting multiple columns. Could you
file a new JIRA to track the need for casting multiple columns, or rather
for being able to set the schema after loading a DF?
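
In the meantime, here is a rough sketch of one possible workaround, not
something SparkR or spark-csv provides: guess the types from the first few
rows of a local copy of the file with base R's read.csv(), then apply all the
casts in a single select. The helper names (inferSparkTypes, castColumns), the
mapping from R classes to Spark SQL type names, the sample size, and the
"flights.csv" path are my own assumptions for illustration.

```r
# Hypothetical helper: guess Spark SQL type names from the first `n` rows
# of a local CSV file, relying on read.csv()'s own type inference.
inferSparkTypes <- function(path, n = 100) {
  sampleRows <- read.csv(path, nrows = n, stringsAsFactors = FALSE,
                         check.names = FALSE)
  # Assumed mapping from R classes to Spark SQL type names.
  typeMap <- c(integer = "int", numeric = "double",
               logical = "boolean", character = "string")
  sparkType <- function(column) {
    cls <- class(column)[1]
    # Anything outside the mapping stays a string.
    if (cls %in% names(typeMap)) typeMap[[cls]] else "string"
  }
  # Returns a character vector of type names, named by column.
  vapply(sampleRows, sparkType, character(1))
}

# Hypothetical helper: cast every column named in `types` and keep the
# remaining columns unchanged, all in one select.
castColumns <- function(df, types) {
  cols <- lapply(SparkR::columns(df), function(name) {
    if (name %in% names(types)) {
      SparkR::alias(SparkR::cast(df[[name]], types[[name]]), name)
    } else {
      df[[name]]
    }
  })
  SparkR::select(df, cols)
}

# Usage, assuming `flights` was loaded from "flights.csv" via spark-csv:
# flights <- castColumns(flights, inferSparkTypes("flights.csv"))
```

Types guessed from a sample can be wrong for rows outside it, so an explicit
list such as list(year = "int", month = "int") can be passed to castColumns
instead.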

Thanks
Shivaram

[1]
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets


On Wed, Jun 3, 2015 at 7:51 AM, Eskilson,Aleksander <
Alek.Eskilson@cerner.com> wrote:

>  It appears that casting columns remains a bit of a trick in Spark’s
> DataFrames. This is an issue because tools like spark-csv will set column
> types to String by default and will not attempt to infer types. Although
> spark-csv supports specifying types for columns in its options, it’s not
> clear how that might be integrated into SparkR (when loading the spark-csv
> package into the R session).
>
>  Looking at the column.R spec we can cast a column to a different data
> type with the cast function [1], but it’s notable that this is not a
> mutator, and it returns a column object as opposed to a DataFrame. It
> appears the column cast can only be ‘applied’ by using the withColumn() or
> mutate() (an alias for withColumn).
>
>  The other way to cast with Spark DataFrames is to write UDFs that
> operate on a column value and return a coerced value. It looks like SparkR
> doesn’t have UDFs just yet [2], but it seems like they’d be necessary to do
> a natural one-off column cast in R, something like
>
>  df.col1toInt <- withColumn(df, "intCol1", udf(df$col1, function(x)
> as.integer(x)))
>
>  (where col1 was originally ‘character’ type)
>
>  Currently it seems one has to write
> df.col1cast <- cast(df$col1, "int")
> df.col1toInt <- withColumn(df, "intCol1", df.col1cast)
>
>  If we wanted just our cast columns and not the original column from
> the data frame, we’d still have to do a select. There was a conversation
> about CSV files just yesterday. Types are already problematic with them,
> but CSV files are a very common data source in R, even at scale.
>
>  But only being able to coerce one column at a time is really unwieldy.
> Can the current spark-csv SQL API for specifying types [3] be extended
> to SparkR? And are there any thoughts on implementing some kind of type
> inference, perhaps based on a sample of some number of rows (an
> implementation I’ve seen before)? R’s read.csv() and read.delim() get types
> by inferring from the whole file. Getting something that can achieve that
> functionality via explicit definition of types or sampling will probably be
> necessary to work with CSV files that have enough columns to merit R at
> Spark’s scale.
>
>  Regards,
> Alek Eskilson
>
>  [1] - https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
> [2] - https://issues.apache.org/jira/browse/SPARK-6817
> [3] - https://github.com/databricks/spark-csv#sql-api
