spark-user mailing list archives

From Frank Austin Nothaft <fnoth...@berkeley.edu>
Subject Re: Is it possible to use Parquet with Dremel encoding
Date Fri, 26 Sep 2014 14:51:41 GMT
Hi Matthes,

Can you post an example of your schema? When you refer to nesting, are you referring to optional
columns, nested schemas, or tables where there are repeated values? Parquet uses run-length
encoding to compress columns with repeated values, which is the case that your example
seems to refer to. The point Matt is making in his post is that if you have Parquet files
that contain records with a nested schema, e.g.:

record MyNestedSchema {
  int nestedSchemaField;
}

record MySchema {
  int nonNestedField;
  MyNestedSchema nestedRecord;
}

then not all systems support queries against these schemas. If you want to load the data directly
into Spark, it isn’t an issue. I’m not familiar with how SparkSQL is handling this, but
I believe the bit you quoted is saying that support for nested queries (e.g., select ... from
… where nestedRecord.nestedSchemaField == 0) will be added in Spark 1.0.1 (which is currently
available, BTW).
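
To make the predicate concrete, here is a rough sketch of what that filter means when applied to the records directly. It is plain Python standing in for a Spark job, not SparkSQL itself; the classes are a hypothetical mirror of the schema above and the sample values are made up:

```python
from dataclasses import dataclass


# Hypothetical Python mirror of the MyNestedSchema / MySchema records above.
@dataclass
class MyNestedSchema:
    nestedSchemaField: int


@dataclass
class MySchema:
    nonNestedField: int
    nestedRecord: MyNestedSchema


# Made-up sample rows, standing in for records loaded from a Parquet file.
records = [
    MySchema(1, MyNestedSchema(0)),
    MySchema(2, MyNestedSchema(7)),
    MySchema(3, MyNestedSchema(0)),
]

# Same condition as: where nestedRecord.nestedSchemaField == 0
matches = [r for r in records if r.nestedRecord.nestedSchemaField == 0]

print([r.nonNestedField for r in matches])
```

When the data is loaded directly, the nested field is just an attribute of each record, which is why nesting is not an obstacle there; the open question is only whether the SQL layer can express the same path.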

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Sep 26, 2014, at 7:38 AM, matthes <mdiekstall@sensenetworks.com> wrote:

> Thank you Jey,
> 
> That is a nice introduction, but it may be too old (Aug 21st, 2013):
> 
> "Note: If you keep the schema flat (without nesting), the Parquet files you
> create can be read by systems like Shark and Impala. These systems allow you
> to query Parquet files as tables using SQL-like syntax. The Parquet files
> created by this sample application could easily be queried using Shark for
> example."
> 
> But in this post
> (http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-CaseClass-Parquet-failure-td8377.html)
> I found this: Nested parquet is not supported in 1.0, but is part of the
> upcoming 1.0.1 release.
> 
> So the question now is: can I take advantage of nested Parquet files to
> query my data quickly with SQL, or do I have to write a special map/reduce
> job to transform and find my data?
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186p15234.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 



