cassandra-commits mailing list archives

From Piotr Kołaczkowski (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7688) Add data sizing to a system table
Date Mon, 01 Dec 2014 12:02:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229701#comment-14229701 ]

Piotr Kołaczkowski edited comment on CASSANDRA-7688 at 12/1/14 12:01 PM:
-------------------------------------------------------------------------

It would also be nice to know the average partition size in the given table, both in bytes
and in number of CQL rows. This would be useful for setting an appropriate fetch.size. Additionally,
the current split generation API does not allow setting the split size in terms of data size in bytes
or number of CQL rows, but only by number of partitions. The number of partitions doesn't make
a nice default, as partitions can vary greatly in size and are extremely use-case dependent.
So please, don't just copy the current describe_splits_ex functionality to the new driver, but
*improve this*.

We really don't need the driver / Cassandra to do the splitting for us. Instead we need to
know:

1. estimate of total amount of data in the table in bytes
2. estimate of total number of CQL rows in the table
3. estimate of total number of partitions in the table

We're interested both in totals (whole cluster; logical sizes, i.e. without replicas), and
in a breakdown by token range per node (physical; including replicas).
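
To illustrate why these three estimates are enough for a client to do its own splitting, here is a minimal Python sketch. The `TableEstimate` type, its field names, and the helper functions are hypothetical and not part of Cassandra or any driver; the sketch only assumes the three totals are available from somewhere.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEstimate:
    """Hypothetical container for the three estimates (not a real driver type)."""
    total_bytes: int
    total_rows: int
    total_partitions: int


def mean_partition_bytes(est: TableEstimate) -> float:
    """Average partition size in bytes; 0.0 for an empty table."""
    if est.total_partitions == 0:
        return 0.0
    return est.total_bytes / est.total_partitions


def mean_partition_rows(est: TableEstimate) -> float:
    """Average number of CQL rows per partition; 0.0 for an empty table."""
    if est.total_partitions == 0:
        return 0.0
    return est.total_rows / est.total_partitions


def split_count(est: TableEstimate, target_split_bytes: int) -> int:
    """Number of splits needed so each holds roughly target_split_bytes."""
    if target_split_bytes <= 0:
        raise ValueError("target_split_bytes must be positive")
    return max(1, math.ceil(est.total_bytes / target_split_bytes))


def partitions_per_split(est: TableEstimate, target_split_bytes: int) -> int:
    """Translate a byte-based split size into a partition count per split."""
    if target_split_bytes <= 0:
        raise ValueError("target_split_bytes must be positive")
    avg = mean_partition_bytes(est)
    if avg == 0.0:
        return 1
    return max(1, int(target_split_bytes // avg))
```

The same arithmetic could be applied per token range instead of per table, provided the estimates are reported at that granularity.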

Note that this information is useful not just for Spark/Hadoop split generation, but also
for things like the SparkSQL optimizer, so that it knows how much data it will have to process.

The next step would be providing column data histograms to guide predicate selectivity.
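
As a rough illustration of how a column histogram could guide predicate selectivity, the sketch below estimates the fraction of rows matching `column < x`. The histogram layout (sorted bucket edges plus a row count per bucket) and the function name are assumptions made for this example; nothing here reflects an actual Cassandra or Spark API, and values are assumed to be uniformly distributed within each bucket.

```python
from bisect import bisect_right
from typing import Sequence


def selectivity_less_than(bounds: Sequence[float],
                          counts: Sequence[int],
                          x: float) -> float:
    """Estimate the fraction of rows with value < x.

    bounds: sorted bucket edges, len(bounds) == len(counts) + 1 (assumed layout)
    counts: number of rows falling in each bucket
    """
    if len(bounds) != len(counts) + 1:
        raise ValueError("bounds must have exactly one more entry than counts")
    total = sum(counts)
    if total == 0 or x <= bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    # Index of the bucket that contains x.
    i = bisect_right(bounds, x) - 1
    rows_below = sum(counts[:i])
    lo, hi = bounds[i], bounds[i + 1]
    # Linear interpolation inside the bucket (uniformity assumption).
    fraction_of_bucket = (x - lo) / (hi - lo)
    return (rows_below + fraction_of_bucket * counts[i]) / total
```

Multiplying such a selectivity by the estimated total number of CQL rows gives an approximate result size for the predicate.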


was (Author: pkolaczk):
It would also be nice to know the average partition size in the given table, both in bytes
and in number of CQL rows. This would be useful for setting an appropriate fetch.size. Additionally,
the current split generation API does not allow setting the split size in terms of data size in bytes
or number of CQL rows, but only by number of partitions. The number of partitions doesn't make
a nice default, as partitions can vary greatly in size and are extremely use-case dependent.
So please, don't just copy the current describe_splits_ex functionality to the new driver, but
*improve this*.

We really don't need the driver / Cassandra to do the splitting for us. Instead we need to
know:

1. estimate of total amount of data in the table in bytes
2. estimate of total number of CQL rows in the table
3. estimate of total number of partitions in the table

We're interested both in totals (whole cluster; logical sizes, i.e. without replicas), and
in a breakdown by token range per node (physical; including replicas).

> Add data sizing to a system table
> ---------------------------------
>
>                 Key: CASSANDRA-7688
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
>
>
> Currently you can't implement something similar to describe_splits_ex purely from
> a native protocol driver.  https://datastax-oss.atlassian.net/browse/JAVA-312 is open to make
> it easy for a client to get ownership information in the java-driver.  But you still need the
> data sizing part to get splits of a given size.  We should add the sizing information to a
> system table so that native clients can get to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
