spark-issues mailing list archives

From "Shay Elbaz (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-30399) Bucketing is not compatible with partitioning in practice
Date Tue, 31 Dec 2019 19:38:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shay Elbaz updated SPARK-30399:
-------------------------------
    Description: 
When using a Spark bucketed table, Spark uses as many read partitions as there are buckets
for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static"
tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned
key-value table grows by 100GB every day, so after 100 days there are 10TB of data we want to
join with. For this scenario we would need thousands of buckets if we want every task to
successfully *read and sort* all of its data in a map-side join. But in that case, every daily
increment would emit thousands of small files, leading to other big issues.
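
For illustration, a minimal sketch of the kind of write that produces this layout (the table
name, input path and column names below are made up):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-partitioned-table")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical daily increment: ~100GB of key-value rows for a single date.
val dailyIncrement = spark.read.parquet("/data/incoming/dt=2019-12-31")

// The bucket count is fixed for the whole table, so every daily `dt` partition
// gets its own set of bucket files -- with 1000 buckets that means at least
// 1000 (mostly small) files added per day.
dailyIncrement.write
  .mode("append")
  .partitionBy("dt")
  .bucketBy(1000, "key")
  .sortBy("key")
  .format("parquet")
  .saveAsTable("kv_events_bucketed")
{code}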

In practice, and hoping for some hidden optimization, we set the number of buckets to 1000 and
backfilled such a table with 10TB. When trying to join it with the smallest input, every
executor was killed by YARN due to over-allocating memory in the sorting phase. Even without
such failures, it would take every executor an unreasonable amount of time to locally sort all
of its data.
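
A sketch of the kind of join involved, and of the knob that makes Spark ignore the bucket
metadata altogether (again, names are illustrative; _spark.sql.sources.bucketing.enabled=false_
only disables bucketed reads, it does not really solve the problem):

{code:scala}
// Hypothetical small lookup input to join against the 10TB bucketed table.
val lookup = spark.read.parquet("/data/lookup").select("key")

val bucketed = spark.table("kv_events_bucketed")

// With bucketing in effect, FileSourceScanExec.createBucketedReadRDD gives the
// scan exactly 1000 partitions (one per bucket), no matter how many daily
// partitions are read -- so here each task has to read and sort ~10GB.
val joined = bucketed.join(lookup, "key")
joined.explain()  // expect SortMergeJoin with no Exchange on the bucketed side

// Workaround sketch: treat the table as non-bucketed, so the scan is split by
// file/size instead of by bucket (and the shuffle-free join is lost).
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
{code}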

A question on Stack Overflow has remained unanswered for a while, so I thought I'd ask here:
is it by design that bucketing cannot be used with time-partitioned tables, or am I doing
something wrong?



> Bucketing is not compatible with partitioning in practice
> ---------------------------------------------------------
>
>                 Key: SPARK-30399
>                 URL: https://issues.apache.org/jira/browse/SPARK-30399
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: HDP 2.7
>            Reporter: Shay Elbaz
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

