spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Tue, 26 Jul 2016 11:10:32 GMT
If you have ever tried to use ORC via SPARK you will know that SPARK's
promise of accessing ORC files is a sham. SPARK cannot access partitioned
ORC tables via HiveContext, SPARK cannot stripe through ORC any faster,
and what's more, if you are using SQL and have thought of using HIVE
with ORC on TEZ, that runs way better, faster and leaner than SPARK.

I can process a few billion records, close to a terabyte, on a cluster
with around 100GB RAM and 40 cores in a few hours, and I find it a challenge
to do the same with SPARK.

But apparently, everything is resolved in SPARK 2.0.


Regards,
Gourav Sengupta

On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor <ofir.manor@equalum.io> wrote:

> One additional point specific to Spark 2.0 - for the alpha Structured
> Streaming API (only), the file sink only supports Parquet format (I'm sure
> that limitation will be lifted in a future release before Structured
> Streaming is GA):
>      "File sink - Stores the output to a directory. As of Spark 2.0, this
> only supports Parquet file format, and Append output mode."
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
>
