cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format
Date Mon, 14 Mar 2016 10:41:33 GMT


Stefania commented on CASSANDRA-11206:

bq. IndexInfo is also used from {{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}}
(CASSANDRA-8180) - not sure whether it's worth to deserialize the index for this functionality,
*as it is currently restricted to the entries that are present in the key cache*. I tend to
remove this access. 

If I am not mistaken, when the sstable iterator is created the partition should be added to
the key cache if it is not already present. Please have a look at BigTableReader {{iterator()}}
and {{getPosition()}} to confirm. The reason we need the index info is that the lower bounds
in the sstable metadata do not work for tombstones. The index info is the only lower bound we
have for tombstones. If it's removed then the optimization of CASSANDRA-8180 no longer works in
the presence of tombstones (whether this is acceptable is up for discussion).

Can't we add the partition bounds to the offset map? 

For completeness, I should add that we don't necessarily need a lower bound for the partition;
it can be a lower bound for the entire sstable if that is easier. However, it should work for
tombstones, that is, it should be an instance of {{ClusteringPrefix}} rather than an array of
{{ByteBuffer}}, as it is currently stored in the sstable metadata.

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>                 Key: CASSANDRA-11206
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition
> of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then
> scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize
> the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754)
> but is also non-negligible i/o activity (relative to reading a single 64KB row range) as partitions
> get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo
> bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.
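
The idea in the description above, binary searching the index while deserializing only the entries that are actually compared, can be illustrated with a minimal sketch. This is not Cassandra's code or on-disk format: the class {{OffsetMapSearchSketch}}, the {{Entry}} type, the 16-byte entry layout and the integer keys are all hypothetical stand-ins for IndexInfo and clustering prefixes, chosen only to show why an offset table limits the work to roughly log(N) deserializations.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: NOT Cassandra's IndexInfo or its serialization format.
final class OffsetMapSearchSketch {

    // Stand-in for IndexInfo: the key range of one block and where the block starts.
    static final class Entry {
        final int firstKey;
        final int lastKey;
        final long blockOffset;

        Entry(int firstKey, int lastKey, long blockOffset) {
            this.firstKey = firstKey;
            this.lastKey = lastKey;
            this.blockOffset = blockOffset;
        }
    }

    private final ByteBuffer serialized; // all entries, serialized back to back
    private final int[] offsets;         // the "offset map": offsets[i] = start of entry i
    int deserializations = 0;            // counts how many entries were materialized

    OffsetMapSearchSketch(ByteBuffer serialized, int[] offsets) {
        this.serialized = serialized;
        this.offsets = offsets;
    }

    // Builds a synthetic index: entry i covers keys [i * rowsPerBlock, (i + 1) * rowsPerBlock - 1].
    static OffsetMapSearchSketch build(int entryCount, int rowsPerBlock) {
        ByteBuffer buf = ByteBuffer.allocate(entryCount * 16);
        int[] offsets = new int[entryCount];
        for (int i = 0; i < entryCount; i++) {
            offsets[i] = buf.position();
            buf.putInt(i * rowsPerBlock);
            buf.putInt(i * rowsPerBlock + rowsPerBlock - 1);
            buf.putLong((long) i * 65536); // assumed 64KB blocks, for illustration only
        }
        return new OffsetMapSearchSketch(buf, offsets);
    }

    // Materializes a single entry by jumping straight to its recorded offset.
    private Entry deserialize(int i) {
        deserializations++;
        int pos = offsets[i];
        return new Entry(serialized.getInt(pos),
                         serialized.getInt(pos + 4),
                         serialized.getLong(pos + 8));
    }

    // Returns the index of the block whose range contains key, or -1 if there is none.
    int findBlock(int key) {
        int lo = 0;
        int hi = offsets.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Entry e = deserialize(mid); // only the probed entry is deserialized
            if (key < e.firstKey) {
                hi = mid - 1;
            } else if (key > e.lastKey) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        OffsetMapSearchSketch index = build(1024, 100);
        int block = index.findBlock(51234);
        System.out.println("block=" + block + " deserializations=" + index.deserializations);
    }
}
```

With 1024 entries, a lookup in this sketch touches at most 11 entries instead of all 1024, which is the effect the offset map is meant to achieve. A real implementation would compare clustering prefixes rather than integers and would read variable-length entries.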

