spark-user mailing list archives

From Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Tue, 26 Jul 2016 11:52:35 GMT
So did you actually try to run your use case with Spark 2.0 and ORC files?
It’s hard to tell what you mean by ‘apparently’.

Best,
Ovidiu
> On 26 Jul 2016, at 13:10, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
> 
> If you have ever tried to use ORC via SPARK you will know that SPARK's promise of accessing
> ORC files is a sham. SPARK cannot access partitioned tables via HiveContext which are ORC,
> SPARK cannot stripe through ORC faster and, what's more, if you are using SQL and have thought
> of using HIVE with ORC on TEZ, then it runs way better, faster and leaner than SPARK.
> 
> I can process a few billion records, close to a terabyte, on a cluster with around
> 100GB RAM and 40 cores in a few hours, and find it a challenge to do the same with SPARK.
> 
> But apparently, everything is resolved in SPARK 2.0.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor <ofir.manor@equalum.io> wrote:
> One additional point specific to Spark 2.0: for the alpha Structured Streaming API (only),
> the file sink only supports the Parquet format (I'm sure that limitation will be lifted in a
> future release before Structured Streaming is GA):
>      "File sink - Stores the output to a directory. As of Spark 2.0, this only supports
>      Parquet file format, and Append output mode."
>      http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
> 