spark-user mailing list archives

From Yin Huai <huaiyin....@gmail.com>
Subject Re: SparkSQL DataType mappings
Date Thu, 02 Oct 2014 15:32:29 GMT
Hi Costin,

I am answering your questions below.

1. You can find the Spark SQL data type reference here
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference>.
It lists the underlying language type of each Spark SQL data type for the
Scala, Java, and Python APIs. For example, in the Scala API the underlying
type of MapType is scala.collection.Map, while in the Java API it is
java.util.Map. For StructType, yes, the value should be cast to Row.
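
As a rough illustration of those casts in the Scala API, here is a minimal sketch. It assumes Spark 1.1, where the data types and Row are available through org.apache.spark.sql._; the column names and values are made up for the example.

```scala
import org.apache.spark.sql._

// Hypothetical schema: column 0 is a map<string,int>, column 1 is a struct<city:string>.
val schema = StructType(Seq(
  StructField("tags", MapType(StringType, IntegerType, true), true),
  StructField("address", StructType(Seq(StructField("city", StringType, true))), true)))

// A row matching that schema (illustrative values only).
val row = Row(Map("a" -> 1), Row("Berlin"))

// MapType values are scala.collection.Map in the Scala API.
val tags = row(0).asInstanceOf[scala.collection.Map[String, Int]]

// StructType values are nested Rows.
val address = row(1).asInstanceOf[Row]
val city = address(0).asInstanceOf[String]
```

The schema is not consulted by the casts themselves; it is only there to show which DataType each cast corresponds to.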

2. Accessors like getFloat and getInt are for primitive data types. For
other types, you can access values by ordinal, for example row(1). Right
now, you have to cast values accessed by ordinal. Once
https://github.com/apache/spark/pull/1759 is in, accessing values in a row
will be much easier.
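
A small sketch of the difference, again assuming the Spark 1.1 Scala API. The Decimal and Binary casts follow the data type reference linked in point 1 (scala.math.BigDecimal and Array[Byte]); the row contents are invented for the example.

```scala
import org.apache.spark.sql._

// Illustrative row: an int, a float, a decimal, and a binary value.
val row = Row(7, 2.5f, BigDecimal("12.34"), Array[Byte](1, 2))

// Primitive types have dedicated accessors.
val count: Int = row.getInt(0)
val ratio: Float = row.getFloat(1)

// Other types are read by ordinal as Any and cast by the caller.
val price = row(2).asInstanceOf[BigDecimal]
val payload = row(3).asInstanceOf[Array[Byte]]
```

The cast is unchecked against the schema, so it is up to the caller to pick the type that matches the column's DataType.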

3. We are working on supporting CSV files (
https://github.com/apache/spark/pull/1351). Right now, you can use our
programmatic APIs
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema>
to create SchemaRDDs. Basically, you first define the schema (represented by
a StructType) of the SchemaRDD. Then, convert your RDD (for example,
RDD[String]) directly to RDD[Row]. Finally, use applySchema provided in
SQLContext/HiveContext to apply the defined schema to the RDD[Row]. The
return value of applySchema is the SchemaRDD you want.
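
The three steps could look roughly like this in Scala. This is a sketch against the Spark 1.1 API with a local master; the input lines, column names, and app name are placeholders, and in spark-shell you would reuse the existing sc instead of creating one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

val sc = new SparkContext(
  new SparkConf().setMaster("local").setAppName("apply-schema-sketch"))
val sqlContext = new SQLContext(sc)

// Placeholder input standing in for lines read from a CSV file.
val lines = sc.parallelize(Seq("alice,30", "bob,25"))

// Step 1: define the schema once, as a StructType.
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))

// Step 2: convert RDD[String] to RDD[Row]; no case class is needed.
val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))

// Step 3: apply the schema; the result is a SchemaRDD.
val people = sqlContext.applySchema(rowRDD, schema)

// The SchemaRDD can then be registered and queried with SQL.
people.registerTempTable("people")
val names = sqlContext.sql("SELECT name FROM people WHERE age > 26")
  .collect().map(_.getString(0))

sc.stop()
```

Since the schema is defined once and shared by all rows, this avoids wrapping each row in a case class.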

Thanks,

Yin

On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <costin.leau@gmail.com> wrote:

> Hi,
>
> I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1] but I'm
> having some issues with the SQL API, in particular in what the DataTypes
> translate to.
>
> 1. A SchemaRDD is composed of a Row and StructType - I'm using the latter
> to decompose a Row into primitives. I'm not clear however how to deal with
> _rich_ types, namely array, map and struct.
> MapType gives me type information about the key and its value however
> what's the actual Map object? j.u.Map, scala.Map?
> For example assuming row(0) has a MapType associated with it, to what do I
> cast row(0)?
> Same goes for StructType; if row(1) has a StructType associated with it,
> do I cast the value to Row?
>
> 2. Similar to the above, I've noticed the Row interface has cast methods
> so ideally one should use row(index).getFloat|Integer|Boolean etc... but
> I didn't see any methods for Binary or Decimal. Also the _rich_ types are
> missing; I presume this is for pluggability reasons however what's the
> generic way to access/unwrap the generic Any/Object in this case to the
> desired DataType?
>
> 3. On a separate note, for RDDs containing just values (think CSV,TSV
> files) is there an option to have a header associated with it without
> having to wrap each row with a case class? As each entry has exactly the
> same structure, the wrapping is just overhead that doesn't provide any
> extra information (you know the structure of one row, you know it for all
> of them).
>
> Thanks,
>
> [1] github.com/elasticsearch/elasticsearch-hadoop
> --
> Costin
