spark-dev mailing list archives

From Bruce Robbins <bersprock...@gmail.com>
Subject Re: Hive Hash in Spark
Date Tue, 07 May 2019 21:01:46 GMT
Mildly off-topic:

From a *correctness* perspective only, it seems Spark can read bucketed
Hive tables just fine. I am ignoring the fact that Spark doesn't take
advantage of the bucketing.

Is that a fair assessment? Or is it more complicated than that?

Also, Spark has code to prevent an application from accidentally writing to
a bucketed Hive table (except it has a hole
<https://issues.apache.org/jira/browse/SPARK-27498>). Except for that hole,
the write case is covered.

Spark apps reading bucketed Hive tables seems to be common, so I hope it
works (as it seems to).
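
To make the hash mismatch discussed in this thread concrete, here is a small pure-Python sketch (not Spark or Hive code) comparing bucket assignment for INT keys under the two schemes. It assumes the older Hive hash, where an int hashes to its own value and the bucket is `(hash & Integer.MAX_VALUE) % numBuckets`, and it assumes Spark's hash partitioning is a non-negative modulo of a 32-bit Murmur3 hash with seed 42. The function names are made up for illustration, and the Murmur3 routine is my transcription of the standard algorithm, so treat the exact values as unverified against a real cluster.

```python
# Illustrative sketch only: function names are hypothetical, not Spark/Hive APIs.
MASK32 = 0xFFFFFFFF
INT_MAX = 0x7FFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed Java-style int."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_hash_int(value, seed=42):
    """Standard Murmur3 x86_32 over one 4-byte int (seed 42 is an assumption)."""
    k1 = ((value & MASK32) * 0xCC9E2D51) & MASK32
    k1 = rotl32(k1, 15)
    k1 = (k1 * 0x1B873593) & MASK32
    h1 = (seed & MASK32) ^ k1
    h1 = rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & MASK32
    # Finalization: mix in the input length (4 bytes), then avalanche.
    h1 ^= 4
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return to_signed32(h1)


def hive_bucket_int(value, num_buckets):
    """Older Hive-style bucketing: an int hashes to itself (assumption)."""
    return (value & INT_MAX) % num_buckets


def spark_bucket_int(value, num_buckets):
    """Murmur3-style bucketing; Python's % already yields a non-negative result."""
    return murmur3_hash_int(value) % num_buckets


if __name__ == "__main__":
    for key in range(8):
        print(key, hive_bucket_int(key, 8), spark_bucket_int(key, 8))
```

If the two columns disagree for a key, a file that Hive labeled as bucket N does not hold the rows a Murmur3-based layout would place in bucket N. A plain scan is unaffected because it reads every file regardless of bucket, which is consistent with the correctness observation above; only a bucketed join or a write would need the hashes to agree.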


On Thu, Mar 7, 2019 at 12:58 PM <tcondie@gmail.com> wrote:

> Thanks Ryan and Reynold for the information!
>
>
>
> Cheers,
>
> Tyson
>
>
>
> *From:* Ryan Blue <rblue@netflix.com>
> *Sent:* Wednesday, March 6, 2019 3:47 PM
> *To:* Reynold Xin <rxin@databricks.com>
> *Cc:* tcondie@gmail.com; Spark Dev List <dev@spark.apache.org>
> *Subject:* Re: Hive Hash in Spark
>
>
>
> I think this was needed to add support for bucketed Hive tables. Like
> Tyson noted, if the other side of a join can be bucketed the same way, then
> Spark can use a bucketed join. I have long-term plans to support this in
> the DataSourceV2 API, but I don't think we are very close to implementing
> it yet.
>
>
>
> rb
>
>
>
> On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin <rxin@databricks.com> wrote:
>
> I think they might be used in bucketing? Not 100% sure.
>
>
>
>
>
> On Wed, Mar 06, 2019 at 1:40 PM, <tcondie@gmail.com> wrote:
>
> Hi,
>
>
>
> I noticed the existence of a Hive Hash partitioning implementation in
> Spark, but also noticed that it’s not being used, and that the Spark hash
> partitioning function is presently hardcoded to Murmur3. My question is
> whether Hive Hash is dead code or are there future plans to support reading
> and understanding data that has been partitioned using Hive Hash? By
> understanding, I mean that I’m able to avoid a full shuffle join on Table A
> (partitioned by Hive Hash) when joining with a Table B that I can shuffle
> via Hive Hash to Table A.
>
>
>
> Thank you,
>
> Tyson
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
