spark-issues mailing list archives

From Maciej Bryński (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SPARK-16996) Hive ACID delta files not seen
Date Mon, 09 Apr 2018 19:32:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431100#comment-16431100 ]

Maciej Bryński commented on SPARK-16996:
----------------------------------------

[~stevel@apache.org]
Are you prepared for a lot of problems in HDP3?
{quote}
ACID-Based Tables Enabled by Default

ACID properties of Hive facilitate database transactions. ACID (which stands for Atomicity, Consistency,
Isolation, and Durability) is turned on for Hive tables by default starting with this HDP
release, which means Hive tables do not require special flags or configurations to accept
updates (in particular configurations and bucketing).
{quote}
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/bk_hive-performance-tuning/content/ch_wn-hptg.html

> Hive ACID delta files not seen
> ------------------------------
>
>                 Key: SPARK-16996
>                 URL: https://issues.apache.org/jira/browse/SPARK-16996
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
>         Environment: Hive 1.2.1, Spark 1.5.2
>            Reporter: Benjamin BONNET
>            Priority: Critical
>
> spark-sql does not seem to see data stored as delta files in an ACID Hive table.
> I encountered the same problem as described here: http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row:
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 BUCKETS
>     ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>     STORED AS 
>       INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>       OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>     TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then run a query with the spark-sql CLI:
> {code}
> SELECT * FROM deltas;
> {code}
> That query returns no rows, and there are no errors in the logs.
> If you inspect the table files in HDFS, you find only a delta directory:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs          0 2016-08-10 14:03 /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then, if you run a compaction on that table (in HiveCLI):
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta is compacted into a base directory:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql, and the same query now returns a result:
> {code}
> SELECT * FROM deltas;
> a       a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But the next time you insert into the Hive table:
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql sees the change immediately:
> {code}
> SELECT * FROM deltas;
> a       a
> b       b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> No further compaction was run, yet spark-sql "sees" both the base AND the delta directory:
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs          0 2016-08-10 15:31 /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}
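
The directory listings above follow the naming pattern base_<txn> and delta_<min>_<max>. As a rough illustration of what a reader has to do to see uncompacted rows, here is a minimal Python sketch that picks the newest base and the deltas written after it from a list of directory names. This is not Spark's or Hive's actual code: the function name select_acid_dirs and the Snapshot type are hypothetical, and the sketch ignores things Hive's real logic takes into account, such as the list of valid transactions.

```python
import re
from typing import List, NamedTuple, Optional

# Directory name patterns as they appear in the listings above.
BASE_RE = re.compile(r"^base_(\d+)$")
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


class Snapshot(NamedTuple):
    base: Optional[str]   # newest base directory, or None if never compacted
    deltas: List[str]     # deltas not yet folded into that base


def select_acid_dirs(names: List[str]) -> Snapshot:
    """Hypothetical helper: choose the directories a reader should merge."""
    best_base, best_txn = None, -1
    deltas = []
    for name in names:
        m = BASE_RE.match(name)
        if m:
            txn = int(m.group(1))
            if txn > best_txn:
                best_base, best_txn = name, txn
            continue
        m = DELTA_RE.match(name)
        if m:
            deltas.append((int(m.group(1)), int(m.group(2)), name))
    # Keep only deltas whose max transaction id is newer than the base.
    live = [name for _lo, hi, name in sorted(deltas) if hi > best_txn]
    return Snapshot(best_base, live)


# The three states from the report: delta only, base only, base plus delta.
print(select_acid_dirs(["delta_0020943_0020943"]))
print(select_acid_dirs(["base_0020943"]))
print(select_acid_dirs(["base_0020943", "delta_0020956_0020956"]))
```

In the first state there is no base at all, so a reader that only handles a table once a base directory exists would return nothing, which matches the empty result reported before the major compaction.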


