spark-issues mailing list archives

From "Tejas Patil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17495) Hive hash implementation
Date Wed, 01 Mar 2017 06:57:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889645#comment-15889645 ]

Tejas Patil commented on SPARK-17495:
-------------------------------------

>> Is it possible to figure out the hashing function based on file names? 

Datasource files are named differently than Hive's, so this would work. I was thinking of
an even simpler approach: use hive-hash only when writing to Hive bucketed tables. Since Spark
doesn't support Hive bucketing at the moment, any existing data must have been generated by
Hive .... so, this will not cause breakage for users.
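
For context, the routing of a row to a bucket file is deterministic on the Hive side: the
hash of an int column is the value itself (Java's Integer.hashCode), and the bucket number
is (hashCode & Integer.MAX_VALUE) % numBuckets. A minimal Python sketch of that routing
(illustrative; the real logic lives in Hive's ObjectInspectorUtils):

```python
def hive_hash_int(value):
    # Hive's hash of an int column is Java's Integer.hashCode: the value itself
    return value

def hive_bucket(hash_code, num_buckets):
    # Hive's bucket formula: (hashCode & Integer.MAX_VALUE) % numBuckets
    return (hash_code & 0x7FFFFFFF) % num_buckets

print(hive_bucket(hive_hash_int(12345), 8))   # -> 1
```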

>> 3. In general it'd be useful to allow users to configure which actual hash function
"hash" maps to. This can be a dynamic config.

For any operation involving Hive bucketed tables, we should not let users change the hashing
function; we should do the right thing underneath. Otherwise, users can shoot themselves in
the foot (eg. joining two Hive tables that are both bucketed but with different hashing
functions). One option was to store the hashing function used to populate a table in the
metastore, but that would not be compatible with Hive and would mess things up in environments
where Spark and Hive are used together.
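
To make the foot-gun concrete: if one table was bucketed with Hive's hash and the other with
Spark's Murmur3 (Spark's `hash()` uses seed 42), the same join key can land in different
bucket numbers, and a bucket-wise join would silently drop matches. A rough Python sketch
(the function below follows Murmur3 x86_32 on a single 4-byte int; treat it as illustrative
rather than the exact Spark code path):

```python
def rotl32(x, r):
    # 32-bit rotate left
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def murmur3_int(value, seed=42):
    # Murmur3 x86_32 of a single 4-byte int (Spark's hash() uses seed 42).
    # Returned here as an unsigned 32-bit value, whereas Spark returns a
    # signed Java int -- the bucket math below is unaffected by that.
    c1, c2 = 0xcc9e2d51, 0x1b873593
    k1 = (value & 0xFFFFFFFF) * c1 & 0xFFFFFFFF
    k1 = rotl32(k1, 15) * c2 & 0xFFFFFFFF
    h1 = (seed ^ k1) & 0xFFFFFFFF
    h1 = (rotl32(h1, 13) * 5 + 0xe6546b64) & 0xFFFFFFFF
    h1 ^= 4                      # total input length in bytes
    h1 ^= h1 >> 16
    h1 = h1 * 0x85ebca6b & 0xFFFFFFFF
    h1 ^= h1 >> 13
    h1 = h1 * 0xc2b2ae35 & 0xFFFFFFFF
    h1 ^= h1 >> 16
    return h1

def bucket(hash_code, num_buckets):
    # Hive's bucket formula: (hashCode & Integer.MAX_VALUE) % numBuckets;
    # applied to both hashes here purely for comparison
    return (hash_code & 0x7FFFFFFF) % num_buckets

NUM_BUCKETS = 8
# Hive's hash of an int column is the value itself (Integer.hashCode)
diverging = [k for k in range(32)
             if bucket(k, NUM_BUCKETS) != bucket(murmur3_int(k), NUM_BUCKETS)]
print(f"{len(diverging)} of 32 sample keys land in different buckets")
```

With 8 buckets, a large share of sample keys end up in different buckets under the two
functions, which is exactly the silent-mismatch scenario described above.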

As far as the plain `hash()` UDF / function is concerned, I am a bit conservative about adding
a dynamic config, as I feel it might cause problems. Say you start off a session with the
default murmur3 hash, compute some data, and cache it. If the user later switches to hive hash,
reusing the cached data as-is would no longer be correct. Keeping the config static for a
session avoids such problems.

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from the hash used
> by Hive. For queries that use bucketing, this leads to different results when the same query
> is run on both engines. We want backward compatibility so that users can switch parts of
> their applications across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

