spark-dev mailing list archives

From DB Tsai <dbt...@dbtsai.com>
Subject Re: parquet predicate / projection pushdown into unionAll
Date Wed, 01 Oct 2014 09:25:12 GMT
Hi Cody and Michael,

We ran into the same issue. Each day of data is stored in its own parquet
file, and we want to query a month's worth of parquet data at once. The
data for each day is around 600GB, and we use 300 executors with 8GB of
memory each. Without the patch, the query took forever and crashed in the
end.

With the patch, a simple Hive query against a table obtained by unionAll
of 5 parquet files (around 3TB) takes 199.126 seconds, while the same
query against an individual parquet file takes 64.36 seconds. Great work;
this patch solves the issue. Hoping to see this in the next release of
the 1.1 series.

Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Sep 12, 2014 at 9:07 PM, Michael Armbrust
<michael@databricks.com> wrote:
> Yeah, thanks for implementing it!
>
> Since Spark SQL is an alpha component and moving quickly, the plan is to
> backport all of master into the next point release in the 1.1 series.
>
> On Fri, Sep 12, 2014 at 9:27 AM, Cody Koeninger <cody@koeninger.org> wrote:
>
>> Cool, thanks for your help on this.  Any chance of adding it to the 1.1.1
>> point release, assuming there ends up being one?
>>
>> On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust <michael@databricks.com>
>> wrote:
>>
>>> Hey Cody,
>>>
>>> Thanks for doing this!  Will look at your PR later today.
>>>
>>> Michael
>>>
>>> On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger <cody@koeninger.org>
>>> wrote:
>>>
>>>> Tested the patch against a cluster with some real data.  Initial results
>>>> suggest that going from one table to a union of 2 tables now roughly
>>>> doubles query time, as expected, instead of taking 5 to 10x longer.
>>>>
>>>> Let me know if you see any issues with that PR.
>>>>
>>>> On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninger <cody@koeninger.org>
>>>> wrote:
>>>>
>>>>> So the obvious thing I was missing is that the analyzer has already
>>>>> resolved attributes by the time the optimizer runs, so the references in
>>>>> the filter / projection need to be fixed up to match the children.
>>>>>
>>>>> Created a PR, let me know if there's a better way to do it.  I'll see
>>>>> about testing performance against some actual data sets.
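
A minimal sketch of what that fix-up can look like, assuming the union's
output shares the left child's attribute ids so only the right side needs
remapping (helper names are illustrative rather than taken from the PR; the
projection case is handled the same way as the filter case):

import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

object UnionPushdown extends Rule[LogicalPlan] {
  // Map the union's output attributes to the corresponding attributes of one child
  private def rewrites(u: Union, child: LogicalPlan): Map[Attribute, Attribute] =
    u.output.zip(child.output).toMap

  // Rewrite an expression (e.g. a filter condition) in terms of a child's attributes
  private def fixUp(e: Expression, m: Map[Attribute, Attribute]): Expression =
    e transform { case a: Attribute => m.getOrElse(a, a) }

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, u @ Union(left, right)) =>
      Union(
        Filter(condition, left),
        Filter(fixUp(condition, rewrites(u, right)), right))
  }
}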
>>>>>
>>>>> On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger <cody@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> Ok, so looking at the optimizer code for the first time and trying the
>>>>>> simplest rule that could possibly work,
>>>>>>
>>>>>> object UnionPushdown extends Rule[LogicalPlan] {
>>>>>>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>>>>>>     // Push down filter into union
>>>>>>     case f @ Filter(condition, u @ Union(left, right)) =>
>>>>>>       u.copy(left = f.copy(child = left), right = f.copy(child = right))
>>>>>>
>>>>>>     // Push down projection into union
>>>>>>     case p @ Project(projectList, u @ Union(left, right)) =>
>>>>>>       u.copy(left = p.copy(child = left), right = p.copy(child = right))
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> If I try manually applying that rule to a logical plan in the repl, it
>>>>>> produces the query shape I'd expect, and executing that plan results in
>>>>>> parquet pushdowns as I'd expect.
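
A rough sketch of that manual application, reusing the /foo/d1 and /foo/d2
directories mentioned further down the thread (the table name and query are
made up; a Rule is just a function over logical plans, and
queryExecution.analyzed exposes the analyzed plan):

val unioned = sqlContext.parquetFile("/foo/d1")
  .unionAll(sqlContext.parquetFile("/foo/d2"))
unioned.registerTempTable("people")

val query = sqlContext.sql("SELECT name FROM people WHERE age > 21")

// Apply the rule by hand to the analyzed plan and inspect the result
val rewritten = UnionPushdown(query.queryExecution.analyzed)
println(rewritten)  // the Filter / Project should now sit beneath each side of the Union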
>>>>>>
>>>>>> But adding those cases to ColumnPruning results in a runtime exception
>>>>>> (below)
>>>>>>
>>>>>> I can keep digging, but it seems like I'm missing some obvious initial
>>>>>> context around naming of attributes.  If you can provide any pointers to
>>>>>> speed me on my way I'd appreciate it.
>>>>>>
>>>>>>
>>>>>> java.lang.AssertionError: assertion failed: ArrayBuffer() +
>>>>>> ArrayBuffer() != WrappedArray(name#6, age#7), List(name#9, age#10,
>>>>>> phones#11)
>>>>>>         at scala.Predef$.assert(Predef.scala:179)
>>>>>>         at
>>>>>> org.apache.spark.sql.parquet.ParquetTableScan.<init>(ParquetTableOperations.scala:75)
>>>>>>         at
>>>>>> org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234)
>>>>>>         at
>>>>>> org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$$anonfun$9.apply(SparkStrategies.scala:234)
>>>>>>         at
>>>>>> org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:367)
>>>>>>         at
>>>>>> org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:230)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>>>>>         at
>>>>>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>>>>>>         at
>>>>>> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282)
>>>>>>         at
>>>>>> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$12.apply(SparkStrategies.scala:282)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>>>>         at scala.collection.immutable.List.foreach(List.scala:318)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>>>>         at
>>>>>> scala.collection.AbstractTraversable.map(Traversable.scala:105)
>>>>>>         at
>>>>>> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:282)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>>>>>         at
>>>>>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>>>         at
>>>>>> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>>>>>>         at
>>>>>> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
>>>>>>         at
>>>>>> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
>>>>>>         at
>>>>>> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
>>>>>>         at
>>>>>> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
>>>>>>         at
>>>>>> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:431)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 3:02 PM, Michael Armbrust <
>>>>>> michael@databricks.com> wrote:
>>>>>>
>>>>>>> What Patrick said is correct.  Two other points:
>>>>>>>  - In the 1.2 release we are hoping to beef up the support for
>>>>>>> working with partitioned parquet independent of the metastore.
>>>>>>>  - You can actually do operations like INSERT INTO for parquet tables
>>>>>>> to add data.  This creates new parquet files for each insertion.  This
>>>>>>> will break if there are multiple concurrent writers to the same table.
>>>>>>>
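
A rough sketch of the INSERT INTO pattern being described (the path and
schema are made up, and createParquetFile / insertInto were experimental
APIs at the time, so treat the exact calls as illustrative):

case class Person(name: String, age: Int)

// Create an (initially empty) parquet-backed table and register it by name
val people = sqlContext.createParquetFile[Person]("/foo/people")
people.registerTempTable("people")

// Each insertion writes new parquet part-files under /foo/people;
// as noted above, this is not safe with multiple concurrent writers
val batch = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
sqlContext.createSchemaRDD(batch).insertInto("people")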
>>>>>>> On Tue, Sep 9, 2014 at 12:09 PM, Patrick Wendell <pwendell@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think what Michael means is people often use this to read existing
>>>>>>>> partitioned Parquet tables that are defined in a Hive metastore rather
>>>>>>>> than data generated directly from within Spark and then reading it
>>>>>>>> back as a table. I'd expect the latter case to become more common, but
>>>>>>>> for now most users connect to an existing metastore.
>>>>>>>>
>>>>>>>> I think you could go this route by creating a partitioned external
>>>>>>>> table based on the on-disk layout you create. The downside is that
>>>>>>>> you'd have to go through a hive metastore whereas what you are doing
>>>>>>>> now doesn't need hive at all.
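
For concreteness, a sketch of that metastore route against the /foo/d1 and
/foo/d2 layout shown below (the schema and partition column are made up,
and the exact parquet DDL depends on the Hive version in use):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// One external table, with each daily directory registered as a partition
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS people (name STRING, age INT)
    |PARTITIONED BY (day STRING)
    |STORED AS PARQUET""".stripMargin)

hiveContext.sql("ALTER TABLE people ADD PARTITION (day='d1') LOCATION '/foo/d1'")
hiveContext.sql("ALTER TABLE people ADD PARTITION (day='d2') LOCATION '/foo/d2'")

// A predicate on the partition column prunes whole directories at read time
hiveContext.sql("SELECT name FROM people WHERE day = 'd2' AND age > 21")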
>>>>>>>>
>>>>>>>> We should also just fix the case you are mentioning where a union is
>>>>>>>> used directly from within spark. But that's the context.
>>>>>>>>
>>>>>>>> - Patrick
>>>>>>>>
>>>>>>>> On Tue, Sep 9, 2014 at 12:01 PM, Cody Koeninger <cody@koeninger.org>
>>>>>>>> wrote:
>>>>>>>> > Maybe I'm missing something, I thought parquet was generally a
>>>>>>>> > write-once format and the sqlContext interface to it seems that way
>>>>>>>> > as well.
>>>>>>>> >
>>>>>>>> > d1.saveAsParquetFile("/foo/d1")
>>>>>>>> >
>>>>>>>> > // another day, another table, with same schema
>>>>>>>> > d2.saveAsParquetFile("/foo/d2")
>>>>>>>> >
>>>>>>>> > Will give a directory structure like
>>>>>>>> >
>>>>>>>> > /foo/d1/_metadata
>>>>>>>> > /foo/d1/part-r-1.parquet
>>>>>>>> > /foo/d1/part-r-2.parquet
>>>>>>>> > /foo/d1/_SUCCESS
>>>>>>>> >
>>>>>>>> > /foo/d2/_metadata
>>>>>>>> > /foo/d2/part-r-1.parquet
>>>>>>>> > /foo/d2/part-r-2.parquet
>>>>>>>> > /foo/d2/_SUCCESS
>>>>>>>> >
>>>>>>>> > // ParquetFileReader will fail, because /foo/d1 is a directory,
>>>>>>>> > // not a parquet partition
>>>>>>>> > sqlContext.parquetFile("/foo")
>>>>>>>> >
>>>>>>>> > // works, but has the noted lack of pushdown
>>>>>>>> > sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2"))
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Is there another alternative?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust <
>>>>>>>> michael@databricks.com>
>>>>>>>> > wrote:
>>>>>>>> >
>>>>>>>> >> I think usually people add these directories as multiple
>>>>>>>> >> partitions of the same table instead of union.  This actually
>>>>>>>> >> allows us to efficiently prune directories when reading in
>>>>>>>> >> addition to standard column pruning.
>>>>>>>> >>
>>>>>>>> >> On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf <
>>>>>>>> malouf.gary@gmail.com>
>>>>>>>> >> wrote:
>>>>>>>> >>
>>>>>>>> >>> I'm kind of surprised this was not run into before.  Do people
>>>>>>>> >>> not segregate their data by day/week in the HDFS directory
>>>>>>>> >>> structure?
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust <michael@databricks.com>
>>>>>>>> >>> wrote:
>>>>>>>> >>>
>>>>>>>> >>>> Thanks!
>>>>>>>> >>>>
>>>>>>>> >>>> On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger <cody@koeninger.org>
>>>>>>>> >>>> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> > Opened
>>>>>>>> >>>> >
>>>>>>>> >>>> > https://issues.apache.org/jira/browse/SPARK-3462
>>>>>>>> >>>> >
>>>>>>>> >>>> > I'll take a look at ColumnPruning and see what I can do
>>>>>>>> >>>> >
>>>>>>>> >>>> > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust <michael@databricks.com>
>>>>>>>> >>>> > wrote:
>>>>>>>> >>>> >
>>>>>>>> >>>> >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger <cody@koeninger.org>
>>>>>>>> >>>> >> wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Is there a reason in general not to push projections and
>>>>>>>> >>>> >>> predicates down into the individual ParquetTableScans in a
>>>>>>>> >>>> >>> union?
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>
>>>>>>>> >>>> >> This would be a great case to add to ColumnPruning.  Would be
>>>>>>>> >>>> >> awesome if you could open a JIRA or even a PR :)
>>>>>>>> >>>> >>
>>>>>>>> >>>> >
>>>>>>>> >>>> >
>>>>>>>> >>>>
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

