cassandra-commits mailing list archives

From Piotr Kołaczkowski (JIRA) <>
Subject [jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table
Date Mon, 01 Dec 2014 14:21:13 GMT


Piotr Kołaczkowski commented on CASSANDRA-7688:

We only need estimates, not exact values. An error factor of 1.5x is considered an awesome estimate;
a factor of 3x is still fairly good.
Also note that Spark/Hadoop do many token range scans. Maybe collecting some statistics
on the fly, during the scans (or during compaction), would be viable? And running a full
compaction to get more accurate statistics - why not? You need to do it anyway to get top
speed when scanning data in Spark, because a full table scan does a kind of implicit compaction
anyway, doesn't it?

Also, one more thing - it would be good to have those values per column (sorry for making
it even harder, I know it is not an easy task). At least to know that a column is responsible
for xx% of the data in the table - knowing such a thing would make a huge difference when estimating
data size, because we're not always fetching all columns and they may vary in size a lot (e.g.
collections!). Some sampling on insert would probably be enough.
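
To make the per-column idea concrete, here is a minimal sketch of sampling on insert. It is an illustration only, not Cassandra code: the `ColumnSizeSampler` class, the `on_insert` hook, the `estimate_fetch_size` helper and the representation of a row as a dict of serialized byte values are all assumptions made for this example.

```python
import random
from collections import defaultdict


class ColumnSizeSampler:
    """Hypothetical helper: samples a fraction of inserts and tracks bytes per column."""

    def __init__(self, sample_rate=0.01, rng=None):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.bytes_by_column = defaultdict(int)
        self.sampled_rows = 0

    def on_insert(self, row):
        """row: dict mapping column name -> serialized value (bytes). Assumed shape."""
        # Skip most inserts so that the bookkeeping cost stays small.
        if self.rng.random() >= self.sample_rate:
            return
        self.sampled_rows += 1
        for column, value in row.items():
            self.bytes_by_column[column] += len(value)

    def column_shares(self):
        """Return the estimated fraction of table data held by each column."""
        total = sum(self.bytes_by_column.values())
        if total == 0:
            return {}
        return {c: b / total for c, b in self.bytes_by_column.items()}


def estimate_fetch_size(table_bytes, shares, columns):
    """Scale a whole-table size estimate down to the columns actually fetched."""
    return table_bytes * sum(shares.get(c, 0.0) for c in columns)
```

With shares like these, a client that reads only a few small columns could scale the table-level estimate down accordingly instead of assuming every column is fetched.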

> Add data sizing to a system table
> ---------------------------------
>                 Key: CASSANDRA-7688
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
> Currently you can't implement something similar to describe_splits_ex purely from
a native protocol driver. is open to expose
easily getting ownership information to a client in the java-driver.  But you still need the
data sizing part to get splits of a given size.  We should add the sizing information to a
system table so that native clients can get to it.
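
As a rough illustration of why the sizing part matters, the sketch below shows how a client might cut one token range into splits of a target size once it has a size estimate for that range. The function name `split_range`, its parameters, and the assumptions that data is spread uniformly over the range and that the range does not wrap around the ring are all hypothetical simplifications, not the behaviour of describe_splits_ex or of any driver.

```python
import math


def split_range(start, end, estimated_bytes, target_split_bytes):
    """Cut the token range (start, end] into sub-ranges of roughly target_split_bytes.

    Assumes start < end (no wrap-around) and data spread evenly over the range.
    """
    # Number of splits needed to stay near the target size; always at least one.
    n = max(1, math.ceil(estimated_bytes / target_split_bytes))
    width = end - start
    # Never create more splits than there are tokens in the range.
    n = min(n, width) if width > 0 else 1
    # Integer arithmetic keeps the boundaries exact for large token values.
    bounds = [start + (width * i) // n for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]
```

For example, `split_range(0, 100, 300, 100)` yields three sub-ranges. Since the split count scales linearly with the estimate, an estimate that is off by some factor changes the number of splits by about the same factor.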

This message was sent by Atlassian JIRA