cassandra-commits mailing list archives

From "Edward Capriolo (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
Date Sun, 09 Jun 2013 14:56:22 GMT


Edward Capriolo commented on CASSANDRA-4175:

2995 says:

It could be advantageous for Cassandra to make the storage engine pluggable. This could allow
Cassandra to

    * deal with potential use cases where maybe the current sstables are not the best fit
    * allow several types of internal storage formats (at the same time) optimized for different
      data types

Since this issue talks about reducing disk space, it will change how data is written, which
seems to benefit people with mostly static columns. It sounds right on the money with
2995. However, it goes beyond storage layer changes.

The feature makes a ton of sense and does not only benefit the CQL3 case. Many people have
static columns, and since 0.7 standard column families have had schema as well.

If Cassandra had a 'pluggable storage format', one of the things the 'ColumnMapIdStorageFormat'
could do is write the known schema to a small file, loaded in memory with each sstable (like
the bloom filter), that would contain the mappings. In the end I think you would have to store
this anyway, because the mappings would change over time and what is in the schema now may
not be fully accurate for old flushed tables. This would only save storage, as mentioned, and
the internode traffic could not be optimized with pluggable storage alone.
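
To make the idea concrete, here is a rough sketch of a per-sstable mapping file. It is not
Cassandra code: ColumnIdMap, encode_row, decode_row and the JSON sidecar file are all names
and choices made up for illustration. It only shows that rows can be stored with small integer
ids and still be decoded later from the map that was saved next to them, whatever the current
schema looks like.

```python
import json
import os
import tempfile


class ColumnIdMap:
    """Hypothetical bidirectional map between column names and small integer ids."""

    def __init__(self, names=()):
        self._ids = {}    # name -> id
        self._names = []  # id -> name (the list index is the id)
        for name in names:
            self.intern(name)

    def intern(self, name):
        """Return the id for a name, assigning the next free id if it is new."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_of(self, col_id):
        """Resolve an id back to its column name."""
        return self._names[col_id]

    def save(self, path):
        """Write the mapping to a small sidecar file (JSON is an assumption)."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self._names, f)

    @classmethod
    def load(cls, path):
        """Rebuild the mapping from a sidecar file written by save()."""
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))


def encode_row(row, id_map):
    """Replace column names with ids before the row is written."""
    return {id_map.intern(name): value for name, value in row.items()}


def decode_row(encoded, id_map):
    """Replace ids with column names before the row is returned."""
    return {id_map.name_of(col_id): value for col_id, value in encoded.items()}


if __name__ == "__main__":
    id_map = ColumnIdMap()
    stored = encode_row({"age": 30, "name": "ed"}, id_map)

    # The sidecar lives next to the (imaginary) sstable it describes.
    sidecar = os.path.join(tempfile.mkdtemp(), "columnmap.json")
    id_map.save(sidecar)

    # Later, an old table is decoded with its own saved map.
    print(decode_row(stored, ColumnIdMap.load(sidecar)))
```

Because each table carries its own copy of the map, a later schema change does not affect how
older files are decoded. As noted above, this covers storage only.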

For compare and swap, well whatever, it's just one feature and no one has to use it if they
do not want to. However, requiring all schema changes to need ZK is crazy scary to me. It is
true that schema has always needed to propagate before it can be used. I personally do not want
to have to install ZK side by side with all my Cassandra installs, and I do not want to rely
on it for schema changes.

Architecturally, building on ZK is a house of cards. This was originally why I chose Cassandra
over HBase (HBase had metadata on HDFS, and state information with ZK). The WORST thing that
ever happens to Cassandra is a node has a corrupt schema or a disagreement. I restart, or
decommission and rejoin, the node and it is fixed.

If we start storing bits of information (column ids, schema) in ZooKeeper, we become totally
reliant on it. Nodes may or may not be able to start up without it, we may or may not be able
to make schema changes without it, and MOST IMPORTANTLY, IT'S A SPOF THAT, WHEN IT GOES CORRUPT,
will likely cause the entire cluster to die, or likely function in a way worse than death,
something like writing corrupt column ids to files and hopelessly corrupting everything.

No thanks to any ZK integration. ZK and centrally managed metadata = HBase.

> Reduce memory, disk space, and cpu usage with a column name/id map
> ------------------------------------------------------------------
>                 Key: CASSANDRA-4175
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>             Fix For: 2.1
> We spend a lot of memory on column names, both transiently (during reads) and more permanently
> (in the row cache).  Compression mitigates this on disk but not on the heap.
> The overhead is significant for typical small column values, e.g., ints.
> Even though we intern once we get to the memtable, this affects writes too via very high
> allocation rates in the young generation, hence more GC activity.
> Now that CQL3 provides us some guarantees that column names must be defined before they
> are inserted, we could create a map of (say) 32-bit int column id, to names, and use that
> internally right up until we return a resultset to the client.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
