hive-issues mailing list archives

From "Alan Gates (JIRA)"
Subject [jira] [Commented] (HIVE-11317) ACID: Improve transaction Abort logic due to timeout
Date Wed, 05 Aug 2015 17:26:04 GMT


Alan Gates commented on HIVE-11317:

Why did you decide to go with a separate thread rather than integrating this with the initiator
or the cleaner?  The functionality here is pretty simple and it seems like it would be easy
to integrate with either of those.

TxnHandler line 1730 (in heartbeatTxn) you added code to check if the heartbeat failed because
the txn was already committed.  A comment to make clear what you're checking for here would
be helpful.

TxnHandler, new method performTimeouts.  You run a query with a hard-coded limit (of 2500)
and then have a do{}while loop that adds those values to the list to be deleted until you've
reached your batch size.  Once you reach the batch size you call abortTxns, and then go rerun
the query.  So why have both the limit clause and the do/while loop?  Why not just ask up front
for the number of entries in a batch with the limit clause?
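
To illustrate the suggestion, here is a minimal sketch in which each query asks for exactly one batch. It is not the patch's code: the class, the in-memory map standing in for the TXNS table, and the selectTimedOut helper are all hypothetical. Only the names performTimeouts and abortTxns are taken from the patch, and their signatures here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the TXNS table; this is not Hive's TxnHandler.
class TimeoutBatchSketch {
    // txn id -> last heartbeat timestamp (ms) for open transactions
    private final Map<Long, Long> openTxns = new TreeMap<>();
    int queriesRun = 0;
    int abortCalls = 0;

    void open(long txnId, long lastHeartbeat) {
        openTxns.put(txnId, lastHeartbeat);
    }

    // Simulates a query of the form:
    //   SELECT txn_id FROM TXNS WHERE <open> AND <last heartbeat> < cutoff LIMIT limit
    List<Long> selectTimedOut(long cutoff, int limit) {
        queriesRun++;
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, Long> e : openTxns.entrySet()) {
            if (ids.size() == limit) {
                break;
            }
            if (e.getValue() < cutoff) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Assumed shape of abortTxns: one small metastore transaction per batch.
    void abortTxns(List<Long> ids) {
        abortCalls++;
        for (Long id : ids) {
            openTxns.remove(id);
        }
    }

    // The LIMIT is the batch size, so no inner loop is needed to fill a batch.
    int performTimeouts(long now, long timeoutMs, int batchSize) {
        long cutoff = now - timeoutMs;
        int aborted = 0;
        while (true) {
            List<Long> batch = selectTimedOut(cutoff, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            abortTxns(batch);
            aborted += batch.size();
            if (batch.size() < batchSize) {
                break; // a short batch means nothing is left to time out
            }
        }
        return aborted;
    }

    int openCount() {
        return openTxns.size();
    }

    public static void main(String[] args) {
        TimeoutBatchSketch s = new TimeoutBatchSketch();
        for (long id = 1; id <= 7; id++) {
            s.open(id, 0L);      // stale heartbeats
        }
        s.open(8, 900L);         // still heartbeating
        s.open(9, 900L);
        System.out.println("aborted=" + s.performTimeouts(1000L, 500L, 3));
    }
}
```

With this shape, each pass is one query plus one abortTxns call, and the loop ends on the first short or empty batch.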

Tests in general:  I have found tests that rely on sleeps to be flaky.  They will usually
work locally, but placed on an EC2 box as part of the auto-patch testing they fail because
the box is so busy the timeouts are no longer large enough.  In the other compactor threads
I've put in flags to make sure the thread ran once rather than relying on timeouts.  This
has produced much more reliable results.
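
As a rough sketch of the flag approach (not the actual compactor code), the thread below signals when its first pass has finished, and the test waits on that signal instead of sleeping. The class name, the CountDownLatch, and the method names are all hypothetical choices for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical housekeeping thread; the names are illustrative, not Hive's.
class HousekeeperSketch extends Thread {
    private final AtomicBoolean stop = new AtomicBoolean(false);
    private final CountDownLatch ranOnce = new CountDownLatch(1);
    final AtomicInteger passes = new AtomicInteger();
    private final long intervalMs;

    HousekeeperSketch(long intervalMs) {
        this.intervalMs = intervalMs;
        setDaemon(true);
    }

    @Override
    public void run() {
        while (!stop.get()) {
            passes.incrementAndGet(); // stand-in for the real timeout work
            ranOnce.countDown();      // flag: at least one full pass has completed
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    // Blocks until the first pass is done. maxWait is only a safety cap so a
    // broken thread cannot hang the test; it is not what the test synchronizes on.
    boolean awaitFirstPass(long maxWait, TimeUnit unit) {
        try {
            return ranOnce.await(maxWait, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void shutdown() {
        stop.set(true);
        interrupt();
    }

    public static void main(String[] args) {
        HousekeeperSketch h = new HousekeeperSketch(60_000);
        h.start();
        boolean ran = h.awaitFirstPass(30, TimeUnit.SECONDS);
        System.out.println("first pass completed: " + ran);
        h.shutdown();
    }
}
```

A test written this way returns as soon as the pass completes on a fast machine and simply waits longer on a busy one, so the result does not depend on how loaded the box is.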

> ACID: Improve transaction Abort logic due to timeout
> ----------------------------------------------------
>                 Key: HIVE-11317
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore, Transactions
>    Affects Versions: 1.0.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>              Labels: triage
>         Attachments: HIVE-11317.2.patch, HIVE-11317.patch
> the logic to Abort transactions that have stopped heartbeating is in
> TxnHandler.timeOutTxns()
> This is only called when DbTxnManager.getValidTxns() is called.
> So if there are a lot of txns that need to be timed out and there are no SQL clients
> talking to the system, there is nothing to abort dead transactions, and thus compaction can't
> clean them up, so garbage accumulates in the system.
> Also, streaming api doesn't call DbTxnManager at all.
> Need to move this logic into Initiator (or some other metastore side thread).
> Also, make sure it is broken up into multiple small(er) transactions against metastore
> Also move timeOutLocks() there as well.
> see about adding TXNS.COMMENT field which can be used for "Auto aborted due to timeout"
> for example.

