cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format
Date Sun, 27 Mar 2016 10:40:25 GMT


Robert Stupp updated CASSANDRA-11206:
    Status: Patch Available  (was: In Progress)

Pushed the latest version to the [git branch|].
CI results ([testall|])
and cstar results (see below) look good.

The initial approach was to “ban” all {{IndexInfo}} instances from the key cache. Although
this is a great option for big partitions, “moderately” sized partitions suffer under that
approach (see the “0kB” cstar run below). As a compromise, a new {{cassandra.yaml}} option,
{{column_index_cache_size_in_kb}}, has been introduced; it defines when {{IndexInfo}} objects
should no longer be kept on heap. The new option defaults to 2 kB and can be tuned to
lower values (down to 0) or to higher ones. Some thoughts on both directions:
* Setting the value to 0 or some other very low value will obviously reduce GC pressure, at
the cost of higher I/O.
* The cost of accessing index samples on disk is twofold: first, there’s the obvious
I/O cost via a {{RandomAccessReader}}; second, each {{RandomAccessReader}} instance has
its own buffer (which can be off- or on-heap, but seems to default to off-heap), so there
seems to be some (quite small) overhead to borrow/release that buffer.
* The higher the value of {{column_index_cache_size_in_kb}}, the more objects remain on the
heap - and therefore the more GC pressure.
* Note that the parameter refers to the _serialized_ size, not the _number_ of {{IndexInfo}}
objects. This was chosen to establish a more obvious relation between the size of {{IndexInfo}}
objects and the amount of consumed heap - the size of {{IndexInfo}} objects is mostly driven
by the size of the clustering keys.
* Also note that some internal system/schema tables - especially those for LWTs - use clustering
keys and therefore index samples.
* For workloads with a huge number of random reads against a large data set, small values
for {{column_index_cache_size_in_kb}} (like the default) are beneficial if the key cache
is always full (i.e. it is evicting a lot).
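To make the threshold concrete, here is a minimal sketch of the decision it controls; the class and method names ({{IndexCachePolicy}}, {{keepOnHeap}}) are hypothetical illustrations, not the names used in the patch:

```java
// Hypothetical sketch of the column_index_cache_size_in_kb decision described
// above; the real patch integrates this check into RowIndexEntry construction.
public class IndexCachePolicy
{
    private final long thresholdBytes;

    // columnIndexCacheSizeInKb corresponds to the new cassandra.yaml option
    // column_index_cache_size_in_kb (default: 2).
    public IndexCachePolicy(int columnIndexCacheSizeInKb)
    {
        this.thresholdBytes = columnIndexCacheSizeInKb * 1024L;
    }

    // Keep IndexInfo objects on heap (and in the key cache) only while the
    // serialized index payload stays below the configured threshold.
    public boolean keepOnHeap(long serializedIndexSizeBytes)
    {
        return serializedIndexSizeBytes < thresholdBytes;
    }
}
```

With the default of 2 kB, a partition whose serialized index samples take 1 kB keeps its {{IndexInfo}} on heap, while a 4 kB index is read from disk on demand.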

Some local tests with the new {{LargePartitionTest}} on my MacBook (Time Machine + indexing
turned off) indicate that caching seems to work for shallow indexed entries.

I’ve scheduled some cstar runs against the _trades_ workload. Only the run with {{column_index_cache_size_in_kb:
0}} (which means that no {{IndexInfo}} is kept on heap or in the key cache) shows
a performance regression. The default value of 2 kB for {{column_index_cache_size_in_kb}} was
chosen as a result of this experiment.
* {{column_index_cache_size_in_kb: 0}} - [cstar result|]
* {{column_index_cache_size_in_kb: 2}} - [cstar result|]
* {{column_index_cache_size_in_kb: 4}} - [cstar result|]
* {{column_index_cache_size_in_kb: 8}} - [cstar result|]

Other cstar runs ([here|],
and [here|])
have shown that there’s no change for some plain workloads.

Daily regression tests show a similar performance: [compaction|],
[1 MV|],
[3 MV|], [rolling upgrade|]

Summary of the changes:
* Ungenerified {{RowIndexEntry}}
* {{RowIndexEntry}} now has a method to create an accessor to the {{IndexInfo}} objects on
disk - that accessor requires an instance of {{FileDataInput}}
* {{RowIndexEntry}} now has three subclasses: {{ShallowIndexedEntry}}, which is basically
the old {{IndexedEntry}} with the {{IndexInfo}} array-list removed and is responsible only for
index files with an offsets table; {{LegacyShallowIndexedEntry}}, which is responsible
for index files without an offsets table (i.e. pre-3.0); and {{IndexedEntry}}, which keeps the
{{IndexInfo}} objects in an array and is used if the serialized size of the RIE’s payload is
less than the new {{cassandra.yaml}} parameter {{column_index_cache_size_in_kb}}.
* {{RowIndexEntry.IndexInfoRetriever}} is the interface to access {{IndexInfo}} on disk using
a {{FileDataInput}}. It has two concrete implementations: one for sstable versions with an offsets
table and one for legacy sstable versions. It is only used from {{AbstractSSTableIterator.IndexState}}.

* Added a “cache” of already deserialized {{IndexInfo}} instances in the base class of {{IndexInfoRetriever}}
for “shallow” indexed entries. This is not necessary for binary search, but it helps many other
access patterns, which sometimes appear to “jump around” in the {{IndexInfo}} objects.
Since {{IndexState}} is a short-lived object, these cached {{IndexInfo}} instances get garbage
collected early.
* Writing of index files has also changed: it now switches to serializing into a byte buffer,
instead of collecting an array-list of {{IndexInfo}} objects, when {{column_index_cache_size_in_kb}}
is hit.
* Bumped the version of the serialized key cache from {{d}} to {{e}}. The key cache and its serialized
form no longer contain {{IndexInfo}} objects for indexed entries that exceed {{column_index_cache_size_in_kb}};
those entries instead need the position in the index file. Therefore, the serialized format of the
key cache has changed.
* {{Serializers}} (of which there is an instance per {{CFMetaData}}) keeps a “singleton” {{IndexInfo.Serializer}}
instance for {{BigFormat.latestVersion}} and constructs and keeps instances for other versions.
For “shallow” RIEs we need an instance of {{IndexInfo.Serializer}} to read {{IndexInfo}}
from disk - a “singleton” further reduces the number of objects on heap. To be clear: we create(d)
a lot of these instances (roughly one per {{IndexInfo}} instance/operation). We could also
reduce the number of {{IndexSerializer}} instances in the future - but that did not feel necessary
for this ticket.
* Merged RIE’s {{IndexSerializer}} interface and {{Serializer}} class (that interface had
only a single implementation)
* Added methods to {{IndexSerializer}} to handle the special serialization for saved key caches
* Added specialized {{deserializePosition}} method to {{IndexSerializer}} as some invocations
just need the position in the data file.
* Moved {{IndexInfo}} binary-search into {{AbstractSSTableIterator.IndexState}} class (the
only place, where it’s used)
* Added some more {{skip}} methods in various places. These are required to calculate the
offsets array for legacy sstable versions.
* Classes {{ColumnIndex}} and {{IndexHelper}} have been removed (functionality moved); {{IndexInfo}}
is now a top-level class.
* Added some {{Pre_11206_*}} classes that are copies of the previous implementations
* Added new {{PagingQueryTest}} to test paged queries
* Added new {{LargePartitionsTest}} to test/compare various partition sizes (to be run explicitly,
otherwise ignored)
* Added test methods in {{KeyCacheTest}} and {{KeyCacheCqlTest}} for shallow/non-shallow indexed entries
* Also re-added behavior of CASSANDRA-8180 for {{IndexedEntry}} (but not for {{ShallowIndexedEntry}})
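As a rough illustration of how the three subclasses relate to the new option, here is a hedged sketch; only the class names come from the list above, while the factory shape, the stand-in types, and the constructor arguments are assumptions:

```java
// Hedged sketch of the subclass selection summarized above. The real
// RowIndexEntry hierarchy carries positions, deletion times, IndexInfo
// accessors, etc.; these empty stand-ins only show the decision logic.
public class RowIndexEntryFactory
{
    public static class Entry { }
    // IndexInfo array kept on heap (payload below the threshold)
    public static class IndexedEntry extends Entry { }
    // 3.0+ index files with an offsets table; IndexInfo stays on disk
    public static class ShallowIndexedEntry extends Entry { }
    // pre-3.0 index files without an offsets table
    public static class LegacyShallowIndexedEntry extends Entry { }

    public static Entry create(long serializedIndexSize,
                               long thresholdBytes,
                               boolean hasOffsetsTable)
    {
        if (serializedIndexSize < thresholdBytes)
            return new IndexedEntry();
        return hasOffsetsTable ? new ShallowIndexedEntry()
                               : new LegacyShallowIndexedEntry();
    }
}
```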

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>                 Key: CASSANDRA-11206
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>         Attachments: 11206-gc.png, trunk-gc.png
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition
of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then
scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize
the entire set of IndexInfo, which not only creates a lot of GC overhead (as noted in CASSANDRA-9754)
but is also non-negligible i/o activity (relative to reading a single 64KB row range) as partitions
get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo
bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.
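For illustration, the log(N) idea could be sketched like this; this is not the actual Cassandra code - the plain offsets array and the reader callback ({{readStart}}) are simplifications of the on-disk offsets table and {{FileDataInput}}-based deserialization:

```java
import java.util.function.LongUnaryOperator;

// Illustrative sketch of the CASSANDRA-10314 idea: with an offsets table we
// can binary-search the serialized IndexInfo samples and deserialize only the
// O(log N) entries we actually compare against.
public class IndexBinarySearch
{
    /**
     * @param target    row position in the data file we are looking for
     * @param offsets   offset of each serialized IndexInfo sample
     * @param readStart stand-in reader returning the first row position
     *                  stored in the sample at a given offset
     * @return index of the last sample whose first row position is <= target
     */
    public static int indexOf(long target, long[] offsets, LongUnaryOperator readStart)
    {
        int lo = 0, hi = offsets.length - 1, found = 0;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            // deserialize exactly one sample per comparison
            long first = readStart.applyAsLong(offsets[mid]);
            if (first <= target) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found;
    }
}
```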

This message was sent by Atlassian JIRA
