cassandra-commits mailing list archives

From "Akhtar Hussain (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-8352) Strange problem regarding Cassandra nodes
Date Fri, 21 Nov 2014 05:59:34 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akhtar Hussain updated CASSANDRA-8352:
--------------------------------------
    Since Version: 2.0.3
      Description: 
We have a geo-redundant setup with 2 data centers of 3 nodes each. When we bring down a single
Cassandra node in DC2 with kill -9 <Cassandra-pid>, reads fail on DC1 with TimedOutException
for a brief period (about 15-20 seconds).

Questions:
1. Why do reads fail on DC1 when a node in another DC (DC2) fails? Since we use LOCAL_QUORUM
for both reads and writes in DC1, a request should return once 2 nodes in the local DC have
replied, instead of timing out because of a node in the remote DC.
2. We want to make sure that no Cassandra requests fail when a node fails. We tried the rapid
read protection settings ALWAYS, 99percentile and 10ms, as described in http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2,
but none of them helped. How can we ensure zero request failures when a node fails?
3. What is the right way to handle HTimedOutException in Hector?
4. Please confirm whether we are using the public and private hostnames as expected.

We are using Cassandra 2.0.3.



      Environment: Unix, Cassandra 2.0.3
           Labels: DataCenter GEO-Red  (was: )
          Summary: Strange problem regarding Cassandra nodes  (was: trange problem regarding
Cassandra)

Exception in Application Logs:
2014-11-20 15:36:50.653 WARN  m.p.c.connection.HConnectionManager - Exception: 
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
                at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:286)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:269)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:132)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:290)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:101)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at com.ericsson.rm.cassandra.xa.keyspace.row.KeyedRowQuery.execute(KeyedRowQuery.java:77)
[com.ericsson.bss.common.cassandra.xa_3.4.12.jar:na]
                at com.ericsson.rm.voucher.traffic.persistence.cassandra.CassandraPersistence.getRow(CassandraPersistence.java:765)
[com.ericsson.bss.voucher.traffic.persistence.cassandra_4.7.11.jar:na]
                at com.ericsson.rm.voucher.traffic.persistence.cassandra.CassandraPersistence.deleteVoucher(CassandraPersistence.java:400)
[com.ericsson.bss.voucher.traffic.persistence.cassandra_4.7.11.jar:na]
                at com.ericsson.rm.voucher.traffic.VoucherTraffic.commit(VoucherTraffic.java:647)
[com.ericsson.bss.voucher.traffic_4.7.11.jar:na]
                at com.ericsson.bss.voucher.traffic.proxy.VoucherTrafficDeproxy.callCommit(VoucherTrafficDeproxy.java:448)
[com.ericsson.bss.voucher.traffic.proxy_4.7.11.jar:na]
                at com.ericsson.bss.voucher.traffic.proxy.VoucherTrafficDeproxy.call(VoucherTrafficDeproxy.java:312)
[com.ericsson.bss.voucher.traffic.proxy_4.7.11.jar:na]
                at com.ericsson.rm.cluster.router.jgroups.destination.RouterDestination$RouterMessageTask.run(RouterDestination.java:333)
[com.ericsson.bss.common.cluster.router.jgroups_3.4.12.jar:na]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_51]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_51]
                at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
                at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11504)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11453)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:11379)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:653)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:637)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:274)
~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
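
Regarding question 3, one common way to deal with a short window of timeouts on the client side is a bounded retry with backoff around the read. The following is only a minimal sketch under stated assumptions, not Hector's own mechanism: TransientTimeoutException is a hypothetical stand-in for HTimedOutException so that the example compiles without Hector on the classpath, and callWithRetry and its parameters are illustrative names. The trace above passes through HConnectionManager.operateWithFailover, so Hector's own failover handling was presumably already involved before the exception reached the application.

```java
import java.util.concurrent.Callable;

public class RetryOnTimeout {

    /** Hypothetical stand-in for Hector's HTimedOutException (assumption for this sketch). */
    public static class TransientTimeoutException extends RuntimeException {
        public TransientTimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation and retries it when it times out.
     * At most maxAttempts calls are made; the sleep between attempts
     * starts at initialBackoffMillis and doubles after every failure.
     * The last timeout is rethrown once the attempts are used up.
     */
    public static <T> T callWithRetry(Callable<T> operation,
                                      int maxAttempts,
                                      long initialBackoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (TransientTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the timeout to the caller
                }
                Thread.sleep(backoff);
                backoff *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated read that times out twice and then succeeds.
        String row = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new TransientTimeoutException("simulated read timeout");
            }
            return "row";
        }, 5, 10L);
        System.out.println(row + " returned after " + calls[0] + " calls");
    }
}
```

Retrying blindly is only safe for idempotent operations such as this read. A retry budget shorter than the observed 15-20 second window would still surface failures, so the attempt count and backoff would need to be sized against that window.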

Exception in system logs of Cassandra
DEBUG [Thrift:4] 2014-11-20 15:36:50,652 ReadCallback.java (line 100) Read timeout: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 5 responses.
DEBUG [Thrift:4] 2014-11-20 15:36:50,652 Tracing.java (line 159) request complete
TRACE [Thrift:49] 2014-11-20 15:36:50,653 AbstractReadExecutor.java (line 109) reading digest
from /10.61.16.18
DEBUG [Thrift:4] 2014-11-20 15:36:50,653 CustomTThreadPoolServer.java (line 204) Thrift transport
error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried
to read 4 bytes, but only got 0 bytes. (This is often indicative of an internal error on the
server side. Please check your server logs.)
                at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
                at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
                at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
                at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
                at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
                at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:744)

Cassandra.yaml configuration on all nodes
Rpc_address: private hostname
Listen_address: public hostname
Seeds: public hostnames of all 6 nodes in both Data centers

Cassandra Topology file
host2_pub=DC1:RAC1
host3_pub=DC1:RAC1
host1_pub=DC1:RAC1
geo1_host=DC2:RAC1
geo2_host=DC2:RAC1
geo3_host=DC2:RAC1
default=DC1:RAC1 (on DC1 nodes) / default=DC2:RAC1 (on DC2 nodes)

host<n>_pub= public hostname
geo<n>_host= public hostname of nodes in remote DC

Keyspace configuration

CREATE KEYSPACE vs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC2': '3',
  'DC1': '3'
};
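
For reference, the number of replica responses each consistency level waits for with this keyspace definition can be worked out as below. This is a small arithmetic sketch; the class and method names are illustrative assumptions, and the formula used is the usual quorum rule floor(n / 2) + 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QuorumMath {

    /** Quorum for n replicas: floor(n / 2) + 1. */
    public static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    /** LOCAL_QUORUM: quorum of the replicas in the coordinator's own data center. */
    public static int localQuorum(Map<String, Integer> rfByDc, String localDc) {
        Integer rf = rfByDc.get(localDc);
        if (rf == null) {
            throw new IllegalArgumentException("unknown data center: " + localDc);
        }
        return quorum(rf);
    }

    /** ALL: every replica in every data center. */
    public static int all(Map<String, Integer> rfByDc) {
        int total = 0;
        for (int rf : rfByDc.values()) {
            total += rf;
        }
        return total;
    }

    /** QUORUM: quorum of the replicas summed over all data centers. */
    public static int globalQuorum(Map<String, Integer> rfByDc) {
        return quorum(all(rfByDc));
    }

    public static void main(String[] args) {
        // Replication factors taken from the keyspace definition above.
        Map<String, Integer> rfByDc = new LinkedHashMap<>();
        rfByDc.put("DC1", 3);
        rfByDc.put("DC2", 3);

        System.out.println("LOCAL_QUORUM (DC1): " + localQuorum(rfByDc, "DC1"));
        System.out.println("QUORUM:             " + globalQuorum(rfByDc));
        System.out.println("ALL:                " + all(rfByDc));
    }
}
```

With 3 replicas per data center, LOCAL_QUORUM needs 2 responses, QUORUM needs 4 and ALL needs 6. The server log above reports a timeout after "only 5 responses", which is more than LOCAL_QUORUM alone would wait for. This is an observation about the numbers only, not a diagnosis of the cause.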

Cassandra Version: 2.0.3
Hector: 1.1.0.E001

> Strange problem regarding Cassandra nodes
> -----------------------------------------
>
>                 Key: CASSANDRA-8352
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8352
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Unix, Cassandra 2.0.3
>            Reporter: Akhtar Hussain
>              Labels: DataCenter, GEO-Red
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
