spark-dev mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Re: Faster Spark on ORC with Apache ORC
Date Fri, 14 Jul 2017 00:37:17 GMT
Awesome, Dong Joon. It's a great improvement. Looking forward to its merge.





On Wed, Jul 12, 2017 at 6:53 AM, Dong Joon Hyun <dhyun@hortonworks.com> wrote:

> Hi, All.
>
>
>
> Since Apache Spark 2.2 vote passed successfully last week,
>
> I think it’s a good time for me to ask your opinions again about the
> following PR.
>
>
>
> https://github.com/apache/spark/pull/17980  (+3,887, −86)
>
>
>
> It’s for the following issues.
>
>
>
>    - SPARK-20728: Make ORCFileFormat configurable between sql/hive and
>    sql/core
>    - SPARK-20682: Support a new faster ORC data source based on Apache ORC
>
>
>
> Basically, the approach is to officially adopt the latest Apache ORC
> release, 1.4.0.
>
> You can switch between the legacy ORC data source and the new ORC data source.
>
>
>
> Could you help me move this forward so that it can improve Apache Spark 2.3?
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
> *From: *Dong Joon Hyun <dhyun@hortonworks.com>
>
>
> *Date: *Tuesday, May 9, 2017 at 6:15 PM
> *To: *"dev@spark.apache.org" <dev@spark.apache.org>
> *Subject: *Faster Spark on ORC with Apache ORC
>
>
>
> Hi, All.
>
>
>
> Apache Spark has always been a fast and general engine, and
>
> since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive`
> module with a Hive dependency.
>
>
>
> With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC
> faster and get some benefits.
>
>
>
>     - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together,
> which means full vectorization support.
>
>
>
>     - Stability: Apache ORC 1.4.0 already includes many fixes, and we can
> rely on the ORC community's effort in the future.
>
>
>
>     - Usability: Users can use `ORC` data sources without the Hive module
> (-Phive).
>
>
>
>     - Maintainability: Reduce the Hive dependency and eventually remove
> some old legacy code from the `sql/hive` module.
>
>
>
> As a first step, I made a PR adding a new ORC data source to the `sql/core`
> module.
>
>
>
> https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)
>
>
>
> Could you give some opinions on this approach?
>
>
>
> Bests,
>
> Dongjoon.
>
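
As a rough illustration of the switch between the legacy and new ORC data sources mentioned in the quoted message, here is a minimal Python sketch. It assumes the configuration key `spark.sql.orc.impl` with the values `native` and `hive`, which is how released Spark versions (2.3 and later) expose the choice. The thread itself does not name the option, so the PR may have used a different key at the time. The helper names `orc_impl_conf` and `apply_conf` are hypothetical.

```python
# Assumption: this key and its values come from released Spark (2.3+),
# not from this thread; the PR under discussion may have differed.
ORC_IMPL_KEY = "spark.sql.orc.impl"


def orc_impl_conf(use_new_reader):
    """Return the setting that selects the new ("native") or
    legacy ("hive") ORC data source."""
    return {ORC_IMPL_KEY: "native" if use_new_reader else "hive"}


def apply_conf(builder, conf):
    """Apply each key/value pair to any builder exposing
    .config(key, value), such as SparkSession.builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, orc_impl_conf(True)).getOrCreate()` should start a session that reads ORC through the `sql/core` implementation, and passing `False` should fall back to the Hive-based one.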
