cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format
Date Mon, 14 Mar 2016 09:37:34 GMT


Robert Stupp commented on CASSANDRA-11206:

Quick progress status:
* refactored the code to handle "flat byte structures" (currently a {{byte[]}}), as a
prerequisite for accessing the index file directly
* IndexInfo is only used from {{AbstractSSTableIterator.IndexState}} - a reference to an open
index file is available there, so removing the {{byte[]}} and accessing the index file directly
is the next step.
* unit tests and dtests are mostly passing (some are flaky on cassci but pass locally).
Still need to identify what's going on with the failing paging dtests.
* cstar tests show results similar to current trunk
* IndexInfo is also used from {{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}}
(CASSANDRA-8180) - not sure whether it's worth deserializing the index for this functionality,
as it is currently restricted to the entries that are present in the key cache. I am inclined
to remove this access. (/cc [~Stefania])

* accesses to IndexInfo objects are "random" during the binary search operation (as expected)
* accesses to IndexInfo objects are "nearly sequential" during scan operations - "nearly"
means that it accesses index N, then index N-1, then index N+1 before it actually moves ahead,
and afterwards does some random accesses to previously accessed IndexInfo instances. Therefore
{{IndexState}} "caches" the already deserialised {{IndexInfo}} objects. These should stay
in new-gen, as they are only referenced during the lifetime of the actual read. Alternatively,
a plain LRU-like cache holding the last 10 IndexInfo objects could be used in IndexState.
* index-file writes (flushes/compactions) also used {{IndexInfo}} objects - replaced with
a buffered write ({{DataOutputBuffer}})
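
The LRU alternative mentioned above could look roughly like the following. This is only a sketch under assumptions: {{RecentIndexInfoCache}} and its {{loader}} callback are hypothetical names, not classes from the patch, and the loader stands in for whatever deserializes entry i from the index data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch, not Cassandra's IndexState: a small LRU cache keyed by
// the position of an index entry, filled on demand by a loader callback.
final class RecentIndexInfoCache<V> {
    private final int capacity;
    private final IntFunction<V> loader; // assumed to deserialize entry i on a miss
    private final LinkedHashMap<Integer, V> entries;
    private int loads = 0;

    RecentIndexInfoCache(int capacity, IntFunction<V> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder=true keeps the least recently accessed entry first,
        // so removeEldestEntry evicts in LRU order.
        this.entries = new LinkedHashMap<Integer, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
                return size() > RecentIndexInfoCache.this.capacity;
            }
        };
    }

    // Returns the cached entry, deserializing it only if it is not cached.
    V get(int index) {
        V cached = entries.get(index);
        if (cached != null) {
            return cached;
        }
        V loaded = loader.apply(index);
        loads++;
        entries.put(index, loaded);
        return loaded;
    }

    int loads() { return loads; }

    int size() { return entries.size(); }
}
```

With a capacity of 10, the N, N-1, N+1, N access pattern seen during scans would deserialize three entries and serve the repeated access from the cache.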

* heap pressure due to the vast number of {{IndexInfo}} objects is already handled by this
patch (they are replaced by a single {{byte[]}} at the moment), both for reads and
flushes/compactions
* after replacing the {{byte[]}} with index file access, we could lower the (default) key-cache
size since we then no longer cache {{IndexInfo}} objects on heap

So the next step is to remove the {{byte[]}} from {{IndexedEntry}} and replace it with index-file
access from {{IndexState}}.

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>                 Key: CASSANDRA-11206
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition
> of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then
> scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize
> the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754)
> but is also non-negligible i/o activity (relative to reading a single 64KB row range) as partitions
> get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo
> bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.
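
The offset-map idea in the issue description can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: {{LazyIndexSearch}} is a hypothetical class, each entry is reduced to a length-prefixed UTF-8 "last name", and the serialized form is not the real sstable index format. The point is only that the binary search decodes the entries it compares against and nothing else.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: variable-length entries stored back to back in one
// buffer, plus an offset map giving the start position of each entry.
final class LazyIndexSearch {
    private final ByteBuffer data; // serialized entries (assumed layout: int length + UTF-8 bytes)
    private final int[] offsets;   // offsets[i] = start of entry i in data
    int deserializations = 0;      // counts how many entries were decoded

    LazyIndexSearch(String[] sortedLastNames) {
        offsets = new int[sortedLastNames.length];
        byte[][] encoded = new byte[sortedLastNames.length][];
        int size = 0;
        for (int i = 0; i < sortedLastNames.length; i++) {
            encoded[i] = sortedLastNames[i].getBytes(StandardCharsets.UTF_8);
            offsets[i] = size;
            size += Integer.BYTES + encoded[i].length;
        }
        data = ByteBuffer.allocate(size);
        for (byte[] e : encoded) {
            data.putInt(e.length).put(e);
        }
    }

    // Decodes only entry i, using absolute reads at the offset from the map.
    private String lastNameAt(int i) {
        deserializations++;
        int pos = offsets[i];
        int len = data.getInt(pos);
        byte[] bytes = new byte[len];
        for (int k = 0; k < len; k++) {
            bytes[k] = data.get(pos + Integer.BYTES + k);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns the position of the first entry whose last name is >= target,
    // or the number of entries if there is none.
    int search(String target) {
        int lo = 0;
        int hi = offsets.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (lastNameAt(mid).compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

For 1024 entries, a single search in this sketch decodes at most 11 of them, rather than all 1024.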

