spark-user mailing list archives

From Costin Leau <costin.l...@gmail.com>
Subject Re: SparkSQL DataType mappings
Date Thu, 02 Oct 2014 21:59:32 GMT
Hi Yin,

Thanks for the reply. I found the section as well a couple of days ago and managed to
integrate es-hadoop with Spark SQL [1].

Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

On 10/2/14 6:32 PM, Yin Huai wrote:
> Hi Costin,
>
> I am answering your questions below.
>
> 1. You can find the Spark SQL data type reference here
> <http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference>. It explains the
> underlying type for each Spark SQL data type in the Scala, Java, and Python APIs. For example, in the Scala API,
> the underlying type of MapType is scala.collection.Map, while in the Java API it is java.util.Map. For StructType,
> yes, the value should be cast to Row.
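>
> For illustration, a minimal sketch of what the casts could look like in the Scala API (untested; the schema here - a
> map field at ordinal 0 and a nested struct at ordinal 1 - is made up):
>
>     import org.apache.spark.sql._
>
>     // row(0) is declared as MapType(StringType, StringType),
>     // row(1) as a nested StructType whose first field is an IntegerType.
>     val tags  = row(0).asInstanceOf[scala.collection.Map[String, String]]
>     val inner = row(1).asInstanceOf[Row]
>     val x     = inner.getInt(0)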
>
> 2. Interfaces like getFloat and getInt are for primitive data types. For other types, you can access values by
> ordinal, for example row(1). Right now, you have to cast values accessed by ordinal. Once
> https://github.com/apache/spark/pull/1759 is in, accessing values in a row will be much easier.
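>
> A rough sketch of both styles (assuming a row whose ordinals 0 and 1 hold primitives and whose ordinals 2 and 3 hold
> Binary and Decimal values, which the data type reference maps to Array[Byte] and scala.math.BigDecimal in Scala):
>
>     val f = row.getFloat(0)                          // primitive getter
>     val b = row.getBoolean(1)                        // primitive getter
>     val bytes   = row(2).asInstanceOf[Array[Byte]]   // BinaryType: cast the Any
>     val decimal = row(3).asInstanceOf[BigDecimal]    // DecimalType: cast the Any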
>
> 3. We are working on supporting CSV files (https://github.com/apache/spark/pull/1351). Right now, you can use our
> programmatic APIs
> <http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema> to create
> SchemaRDDs. Basically, you first define the schema (represented by a StructType) of the SchemaRDD. Then, convert
> your RDD (for example, RDD[String]) to an RDD[Row]. Finally, use applySchema, provided by SQLContext/HiveContext,
> to apply the defined schema to the RDD[Row]. The return value of applySchema is the SchemaRDD you want.
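>
> Roughly the pattern from the programmatic-schema section of the guide (a sketch; the file path and column names are
> placeholders):
>
>     import org.apache.spark.sql._
>
>     val sqlContext = new SQLContext(sc)
>     // An RDD of raw CSV-like lines, e.g. "Michael,29".
>     val lines = sc.textFile("people.txt")
>     // 1. Define the schema as a StructType.
>     val schema = StructType(Seq(
>       StructField("name", StringType, nullable = true),
>       StructField("age", StringType, nullable = true)))
>     // 2. Convert the RDD[String] to an RDD[Row].
>     val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim))
>     // 3. Apply the schema to get the SchemaRDD.
>     val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)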
>
> Thanks,
>
> Yin
>
> On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <costin.leau@gmail.com <mailto:costin.leau@gmail.com>> wrote:
>
>     Hi,
>
>     I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1], but I'm having some issues with the SQL API,
>     in particular with what the DataTypes translate to.
>
>     1. A SchemaRDD is composed of a Row and a StructType - I'm using the latter to decompose a Row into primitives.
>     I'm not clear, however, on how to deal with _rich_ types, namely array, map and struct.
>     MapType gives me type information about the key and its value, but what's the actual Map object? j.u.Map, scala.Map?
>     For example, assuming row(0) has a MapType associated with it, to what do I cast row(0)?
>     The same goes for StructType; if row(1) has a StructType associated with it, do I cast the value to Row?
>
>     2. Similar to the above, I've noticed the Row interface has cast methods, so ideally one should use
>     row(index).getFloat|Integer|Boolean etc., but I didn't see any methods for Binary or Decimal. Also, the _rich_
>     types are missing; I presume this is for pluggability reasons, but what's the generic way to access/unwrap the
>     generic Any/Object in this case to the desired DataType?
>
>     3. On a separate note, for RDDs containing just values (think CSV/TSV files), is there an option to have a header
>     associated with them without having to wrap each row in a case class? As each entry has exactly the same structure,
>     the wrapping is just overhead that doesn't provide any extra information (if you know the structure of one row,
>     you know it for all of them).
>
>     Thanks,
>
>     [1] github.com/elasticsearch/elasticsearch-hadoop <http://github.com/elasticsearch/elasticsearch-hadoop>
>     --
>     Costin
>
>

-- 
Costin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

