cassandra-commits mailing list archives

From "DOAN DuyHai (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format
Date Sun, 20 Mar 2016 09:37:33 GMT


DOAN DuyHai commented on CASSANDRA-11206:

I have some questions related to the outcome of this JIRA.

 Since 2.1, incremental repair only repairs *chunks* of a partition (i.e. the chunks that are
in the un-repaired SSTables set), so even in case of a mismatch we no longer stream the *entire*
partition. And using paging we can read through very wide partitions. With the improvement
brought by this JIRA, does it mean that we can now handle *virtually* unbounded partitions,
i.e. partitions exceeding 2×10^9 physical columns ?
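The paging behaviour mentioned above can be modelled with a small sketch. This is an illustrative simulation only, not Cassandra's actual read path or driver API: `fetch_page`, `read_partition`, and the clustering-key cursor are hypothetical names standing in for the driver's fetch-size/paging-state mechanism.

```python
# Illustrative sketch: reading a very wide partition page by page,
# resuming from the last clustering key seen (a stand-in for the
# driver's paging state). Not Cassandra code; names are hypothetical.

def fetch_page(rows, after_clustering, page_size):
    """Return up to page_size rows whose clustering key is > after_clustering."""
    start = 0 if after_clustering is None else after_clustering + 1
    return rows[start:start + page_size]

def read_partition(rows, page_size=3):
    """Iterate the whole partition one page at a time."""
    cursor = None          # paging state: last clustering key consumed
    while True:
        page = fetch_page(rows, cursor, page_size)
        if not page:
            break
        yield from page
        cursor = page[-1]  # resume point if the read is interrupted

partition = list(range(10))  # 10 "rows", clustering key == value
assert list(read_partition(partition)) == partition
```

The point of the cursor is that a read (or, in principle, a stream) interrupted mid-partition could restart from the last clustering key instead of from the beginning.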

 I'm asking because it will greatly impact the way we model data. There are still some points
that can cause trouble with ultra-wide partitions:

 - bootstrapping/adding new nodes to the cluster --> streaming of an ultra-wide partition.
What happens if the streaming fails in the middle ? Do we restart the streaming of the whole
partition, or can we *resume* from the last clustering ?
 - compaction. With LCS, ultra-wide partitions can create overly huge SSTables. In general,
how will compacting ultra-wide partitions impact node stability ?
 - read path with STCS --> more SSTables to touch on disk

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>                 Key: CASSANDRA-11206
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition
of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then
scan the partition from the start of the matching range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize
the entire set of IndexInfo objects, which both creates a lot of GC overhead (as noted in CASSANDRA-9754)
and incurs non-negligible I/O (relative to reading a single 64KB row range) as partitions
get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo
bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.
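The quoted description can be illustrated with a small sketch: sample one index entry per fixed-size block, then binary-search the serialized samples while deserializing only the entries actually compared, giving ~log2(N) deserializations instead of N. This is a toy model, not Cassandra's IndexInfo or offset-map code; the serialization format and the names here are invented for illustration.

```python
import struct

# Toy model of the CASSANDRA-10314 idea: each index entry is serialized
# independently, and the binary search deserializes only the entries it
# compares against (~log2(N) of them), not the whole list.

deserializations = 0

def serialize(first_key, offset):
    # Pack (first row key of the block, byte offset of the block).
    return struct.pack(">qq", first_key, offset)

def deserialize(blob):
    global deserializations
    deserializations += 1
    return struct.unpack(">qq", blob)

def find_block(entries, key):
    """Binary search serialized (first_key, offset) entries for `key`,
    returning the offset of the block that should contain it."""
    lo, hi, best = 0, len(entries) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        first_key, offset = deserialize(entries[mid])
        if first_key <= key:
            best = offset      # candidate block; look right for a closer one
            lo = mid + 1
        else:
            hi = mid - 1
    return best

# One entry per (say) 64KB block: block i starts at row key i*100.
entries = [serialize(i * 100, i * 65536) for i in range(1024)]
offset = find_block(entries, 2570)   # key 2570 falls in block 25
assert offset == 25 * 65536
assert deserializations <= 11        # ~log2(1024) + 1, far fewer than 1024
```

Deserializing the full sample up front would touch all 1024 entries; the lazy search above touches at most 11.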

This message was sent by Atlassian JIRA
