cassandra-commits mailing list archives

From "siddharth verma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-11680) Inconsistent data while paging through a table
Date Thu, 28 Apr 2016 11:10:12 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

siddharth verma updated CASSANDRA-11680:
----------------------------------------
    Description: 
We have the following table structure:
CREATE TABLE keyspace.book_properties (
book_id text,
group_id bigint,
property_display_name text,
created timestamp,
property_name text,
property_uuid uuid,
property_value text,
updated timestamp,
PRIMARY KEY (book_id, group_id, property_display_name)
) WITH CLUSTERING ORDER BY (group_id ASC, property_display_name ASC);

We have Lucene indexes on group_id, property_display_name, created, property_name, property_uuid,
and updated.

We run a full table scan using the following sample code snippet:

boundStatement = new BoundStatement(session.prepare("select * from keyspace.book_properties"));
boundStatement.setConsistencyLevel(ConsistencyLevel.ALL);
boundStatement.setFetchSize(fetchSize);
PagingState currentPageInfo = null;
do {
    try {
        // Resume from the previous page, if there is one.
        if (currentPageInfo != null) {
            boundStatement.setPagingState(currentPageInfo);
        }

        ResultSet rs = session.execute(boundStatement);
        processResultSet(rs);
        currentPageInfo = rs.getExecutionInfo().getPagingState();
    } catch (NoHostAvailableException e) {
        // The exception is ignored; the loop condition decides whether to continue.
    }
} while (currentPageInfo != null);
......
void processResultSet(ResultSet rs) {
    // Consume only the rows already fetched, so iterating does not trigger the next page.
    int remaining = rs.getAvailableWithoutFetching();
    if (remaining != 0) {
        for (Row row : rs) {
            processCassandraRow(row);
            if (--remaining == 0) {
                break;
            }
        }
    }
}
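
In the snippet above, a NoHostAvailableException on the very first page leaves currentPageInfo null, so the loop ends without processing anything, while a failure on a later page retries the same page with no limit. A minimal sketch of a bounded-retry variant is given here. Page, PageSource and PagedScan are hypothetical stand-ins (not DataStax driver classes), so the example runs without a cluster; it only illustrates the control flow and is not meant as the fix for this issue.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for one page of results plus its paging state.
final class Page {
    final List<String> rows;
    final String nextState; // null means there are no more pages

    Page(List<String> rows, String nextState) {
        this.rows = rows;
        this.nextState = nextState;
    }
}

// Hypothetical stand-in for "execute the statement with this paging state".
interface PageSource {
    // state == null requests the first page; may throw on a transient failure.
    Page fetch(String state);
}

final class PagedScan {
    // Walks all pages, retrying a failed fetch at most maxRetries times in a row.
    static int scan(PageSource source, int maxRetries, Consumer<String> rowHandler) {
        String state = null;
        int processed = 0;
        int failures = 0;
        while (true) {
            Page page;
            try {
                page = source.fetch(state);
            } catch (RuntimeException e) {
                if (++failures > maxRetries) {
                    throw new IllegalStateException("giving up at paging state " + state, e);
                }
                continue; // retry the same page; state is unchanged
            }
            failures = 0;
            for (String row : page.rows) {
                rowHandler.accept(row);
                processed++;
            }
            if (page.nextState == null) {
                return processed;
            }
            state = page.nextState;
        }
    }
}
```

With this shape, a failure on the first page is retried rather than silently ending the scan, and a persistent failure surfaces as an exception instead of an endless loop.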

Many times, we got corrupted data in this process:
1. property_uuid was returned as null in many cases, even though the actual data had a value for it.
2. The value returned for property_uuid in the table scan differed from the property_uuid seen
in cqlsh.
3. The value returned for group_id in the table scan differed from the group_id seen in cqlsh.
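
The three symptoms above can be told apart mechanically once a scanned row is paired with a reference row read separately (for example, the values seen in cqlsh). The sketch below is illustrative only: PropertyRow and RowCheck are hypothetical names, and it assumes rows are paired by book_id and property_display_name, which is an assumption because group_id is itself part of the primary key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.UUID;

// Hypothetical projection of a book_properties row (only the compared columns).
final class PropertyRow {
    final String bookId;
    final long groupId;
    final String displayName;
    final UUID propertyUuid; // may be null

    PropertyRow(String bookId, long groupId, String displayName, UUID propertyUuid) {
        this.bookId = bookId;
        this.groupId = groupId;
        this.displayName = displayName;
        this.propertyUuid = propertyUuid;
    }
}

final class RowCheck {
    // Lists how a scanned row differs from the reference row it was paired with.
    static List<String> mismatches(PropertyRow scanned, PropertyRow reference) {
        List<String> out = new ArrayList<>();
        if (scanned.propertyUuid == null && reference.propertyUuid != null) {
            out.add("property_uuid null in scan");          // symptom 1
        } else if (!Objects.equals(scanned.propertyUuid, reference.propertyUuid)) {
            out.add("property_uuid differs");               // symptom 2
        }
        if (scanned.groupId != reference.groupId) {
            out.add("group_id differs");                    // symptom 3
        }
        return out;
    }
}
```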

book_properties has around 140 million records.

book_properties receives heavy read, write and update requests while paging is in progress.

Cassandra version dsc3.0.3

Side Note:
For one of the inconsistent columns, we specifically checked the writetime(..) to make sure
the data hadn't been changed while the job was in progress. It had not been changed.
Checked for case 2: select property_uuid, writetime(property_uuid) from book_properties where
book_id = 'BOOK31263786';
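
The writetime(..) function returns the write timestamp in microseconds since the Unix epoch, so the check described above amounts to testing whether that timestamp falls inside the job's run window. A minimal sketch, assuming the job's start and end instants were recorded and that write timestamps reflect wall-clock time (WriteTimeCheck and its methods are hypothetical helper names):

```java
import java.time.Instant;

final class WriteTimeCheck {
    // Converts a writetime(..) value (microseconds since epoch) to an Instant.
    static Instant toInstant(long writetimeMicros) {
        long seconds = Math.floorDiv(writetimeMicros, 1_000_000L);
        long nanos = Math.floorMod(writetimeMicros, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // True if the cell was written within [jobStart, jobEnd], inclusive.
    static boolean writtenDuring(long writetimeMicros, Instant jobStart, Instant jobEnd) {
        Instant written = toInstant(writetimeMicros);
        return !written.isBefore(jobStart) && !written.isAfter(jobEnd);
    }
}
```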

Edit1:
-> When we run "select * from book_properties where book_id = 'BOOK31263786';" we get two
records.
-> When, during the pagination job, I match and print each Row where book_id = 'BOOK31263786',
we get 4 records.
We speculate that the other two might have been deleted some time back (definitely
not during the job). Again, this is speculation; we are not sure.
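
To pin down which two rows are the unexpected ones, the clustering keys seen during the scan can be compared with those returned by the single-partition query. The sketch below is only an illustration: ExtraRows is a hypothetical helper, and each key is assumed to be a string built from group_id and property_display_name (the separator is arbitrary).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ExtraRows {
    // Returns clustering keys seen in the scan but absent from the partition query,
    // preserving the order in which the scan produced them.
    static Set<String> extras(List<String> scanKeys, List<String> queryKeys) {
        Set<String> out = new LinkedHashSet<>(scanKeys);
        out.removeAll(queryKeys);
        return out;
    }
}
```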


> Inconsistent data while paging through a table
> ----------------------------------------------
>
>                 Key: CASSANDRA-11680
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11680
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: siddharth verma



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
