cassandra-commits mailing list archives

From "Aaron Morton (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-4601) Ensure unique commit log file names
Date Sun, 02 Sep 2012 22:43:07 GMT


Aaron Morton updated CASSANDRA-4601:

    Affects Version/s: 0.8.10
> Ensure unique commit log file names
> -----------------------------------
>                 Key: CASSANDRA-4601
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.10, 1.0.11, 1.1.4
>         Environment: Sun JVM 1.6.33 / Ubuntu 10.04.4 LTS 
>            Reporter: Aaron Morton
>            Assignee: Aaron Morton
>            Priority: Critical
> The commit log segment name uses System.nanoTime() as part of the file name. There is
no guarantee that successive calls to nanoTime() will return different values, and on less
than optimal hypervisors duplicate values happen a lot.
> I observed the following in the wild:
> {code:java}
> ERROR [COMMIT-LOG-ALLOCATOR] 2012-08-31 15:56:49,815 (line
134) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
> java.lang.AssertionError: attempted to delete non-existing file CommitLog-13926764209796414.log
>         at
>         at org.apache.cassandra.db.commitlog.CommitLogSegment.discard(
>         at org.apache.cassandra.db.commitlog.CommitLogAllocator$
>         at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(
>         at
>         at Source)
> {code}
> My _assumption_ is that it was caused by duplicate file names, since this is running on
a hypervisor that is less than optimal.
> After a while (about 30 minutes) mutations stopped being processed and the pending count
skyrocketed. I _think_ this was because log writing was blocked waiting for a new segment,
so writers could not submit to the commit log queue. The only way to stop the affected nodes
was kill -9.
> Over about 24 hours this happened 5 times. I have deployed a patch that has been running
for 12 hours without incident; will attach.
> The affected nodes could still read, and I'm checking logs to see how the other nodes
handled the situation.
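
The core problem can be sketched in a few lines: the JDK makes no guarantee that two successive System.nanoTime() calls return different values, so a file name derived solely from it can repeat. A minimal sketch of the uniqueness idea (not the actual patch attached to the ticket; the class and method names here are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class UniqueSegmentId {
    // Counter mixed into the name so back-to-back calls can never produce
    // the same id, even when nanoTime() itself does not advance.
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Hypothetical naming scheme for illustration; the real commit log
    // segment file name format may differ.
    static String nextSegmentName() {
        return "CommitLog-" + System.nanoTime() + "-"
                + COUNTER.incrementAndGet() + ".log";
    }

    public static void main(String[] args) {
        // On a coarse clock both calls may see the same nanoTime() value,
        // but the counter keeps the names distinct.
        String a = nextSegmentName();
        String b = nextSegmentName();
        System.out.println(a.equals(b) ? "collision" : "unique");
    }
}
```

Running this prints "unique" regardless of clock resolution, because the monotonically increasing counter disambiguates names even when the timestamp component repeats.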

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
