cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <>
Subject [Cassandra Wiki] Update of "Operations_ZH" by jian.huang
Date Tue, 03 Aug 2010 03:10:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "Operations_ZH" page has been changed by jian.huang.


New page:
== 硬件 ==
See [[CassandraHardware_ZH|Cassandra的硬件]]

== 性能调优 ==
See [[PerformanceTuning|Cassandra的性能调优]]

== 数据结构管理 ==
See [[LiveSchemaUpdates|联机数据结构跟新]] [refers to functionality in 0.7]

== 令牌环管理 ==
每台 !Cassandra 服务器 [节点] 被指定为一个唯一的令牌,作为键值的第一个副本。


在使用时会随机分配从 0 到 2 ** 127 之间的整数令牌。键将转换为此范围的
MD5 哈希算法的令牌比较。(因此,键始终是转换为令牌,但是相反并不总是如此。)

=== 令牌选择 ===
`i * (2**127 / N)` for i = 1 .. N。

With order preserving partitioners, your key distribution will be application-dependent. 
You should still take your best guess at specifying initial tokens (guided by sampling actual
data, if possible), but you will be more dependent on active load balancing (see below) and/or
adding new nodes to hot spots.

Once data is placed on the cluster, the partitioner may not be changed without wiping and
starting over.

=== Replication ===
A Cassandra cluster always divides up the key space into ranges delimited by Tokens as described
above, but additional replica placement is customizable via IReplicaPlacementStrategy in the
configuration file.  The standard strategies are

 * !RackUnawareStrategy: replicas are always placed on the next (in increasing Token order)
N-1 nodes along the ring
 * !RackAwareStrategy: replica 2 is placed in the first node along the ring the belongs in
'''another''' data center than the first; the remaining N-2 replicas, if any, are placed on
the first nodes along the ring in the '''same''' rack as the first

Note that with !RackAwareStrategy, succeeding nodes along the ring should alternate data centers
to avoid hot spots.  For instance, if you have nodes A, B, C, and D in increasing Token order,
and instead of alternating you place A and B in DC1, and C and D in DC2, then nodes C and
A will have disproportionately more data on them because they will be the replica destination
for every Token range in the other data center.

 * The corollary to this is, if you want to start with a single DC and add another later,
when you add the second DC you should add as many nodes as you have in the first rather than
adding a node or two at a time gradually.

Replication factor is not really intended to be changed in a live cluster either, but increasing
it may be done if you (a) read at ConsistencyLevel.QUORUM or ALL (depending on your existing
replication factor) to make sure that a replica that actually has the data is consulted, (b)
are willing to accept downtime while anti-entropy repair runs (see below), or (c) are willing
to live with some clients potentially being told no data exists if they read from the new
replica location(s) until repair is done.

The same options apply to changing replication strategy.

Reducing replication factor is easily done and only requires running cleanup afterwards to
remove extra replicas.

=== Network topology ===
Besides datacenters, you can also tell Cassandra which nodes are in the same rack within a
datacenter.  Cassandra will use this to route both reads and data movement for Range changes
to the nearest replicas.  This is configured by a user-pluggable !EndpointSnitch class in
the configuration file.

!EndpointSnitch is related to, but distinct from, replication strategy itself: !RackAwareStrategy
needs a properly configured Snitch to place replicas correctly, but even absent a Strategy
that cares about datacenters, the rest of Cassandra will still be location-sensitive.

There is an example of a custom Snitch implementation in

== Range changes ==
=== Bootstrap ===
Adding new nodes is called "bootstrapping."

To bootstrap a node, turn !AutoBootstrap on in the configuration file, and start it.

If you explicitly specify an !InitialToken in the configuration, the new node will bootstrap
to that position on the ring.  Otherwise, it will pick a Token that will give it half the
keys from the node with the most disk space used, that does not already have another node
bootstrapping into its Range.

Important things to note:

 1. You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping
node via gossip before starting another bootstrap.  The new node will log "Bootstrapping"
when this is safe, 2 minutes after starting.  (90s to make sure it has accurate load information,
and 30s waiting for other nodes to start sending it inserts happening in its to-be-assumed
part of the token ring.)
 1. Relating to point 1, one can only boostrap N nodes at a time with automatic token picking,
where N is the size of the existing cluster. If you need to more than double the size of your
cluster, you have to wait for the first N nodes to finish until your cluster is size 2N before
bootstrapping more nodes. So if your current cluster is 5 nodes and you want add 7 nodes,
bootstrap 5 and let those finish before boostrapping the last two.
 1. As a safety measure, Cassandra does not automatically remove data from nodes that "lose"
part of their Token Range to a newly added node.  Run `nodetool cleanup` on the source node(s)
(neighboring nodes that shared the same subrange) when you are satisfied the new node is up
and working. If you do not do this the old data will still be counted against the load on
that node and future bootstrap attempts at choosing a location will be thrown off.
 1. When bootstrapping a new node, existing nodes have to divide the key space before beginning
replication.  This can take awhile, so be patient.
 1. During bootstrap, a node will drop the Thrift port and will not be accessible from `nodetool`.
 1. Bootstrap can take many hours when a lot of data is involved.  See [[Streaming]] for how
to monitor progress.

Cassandra is smart enough to transfer data from the nearest source node(s), if your !EndpointSnitch
is configured correctly.  So, the new node doesn't need to be in the same datacenter as the
primary replica for the Range it is bootstrapping into, as long as another replica is in the
datacenter with the new one.

Bootstrap progress can be monitored using `nodetool` with the `streams` argument.

During bootstrap `nodetool` may report that the new node is not receiving nor sending any
streams, this is because the sending node will copy out locally the data they will send to
the receiving one, which can be seen in the sending node through the the "AntiCompacting...
AntiCompacted" log messages.

== Moving or Removing nodes ==
=== Removing nodes entirely ===
You can take a node out of the cluster with `nodetool decommission` to a live node, or `nodetool
removetoken` (to any other machine) to remove a dead one.  This will assign the ranges the
old node was responsible for to other nodes, and replicate the appropriate data there. If
`decommission` is used, the data will stream from the decommissioned node. If `removetoken`
is used, the data will stream from the remaining replicas.

No data is removed automatically from the node being decommissioned, so if you want to put
the node back into service at a different token on the ring, it should be removed manually.

=== Moving nodes ===
`nodetool move`: move the target node to a given Token. Moving is essentially a convenience
over decommission + bootstrap.

As with bootstrap, see [[Streaming]] for how to monitor progress.

=== Load balancing ===
`nodetool loadbalance`: also essentially a convenience over decommission + bootstrap, only
instead of telling the target node where to move on the ring it will choose its location based
on the same heuristic as Token selection on bootstrap.

The status of move and balancing operations can be monitored using `nodetool` with the `streams`

== Consistency ==
Cassandra allows clients to specify the desired consistency level on reads and writes.  (See
[[API]].)  If R + W > N, where R, W, and N are respectively the read replica count, the
write replica count, and the replication factor, all client reads will see the most recent
write.  Otherwise, readers '''may''' see older versions, for periods of typically a few ms;
this is called "eventual consistency."  See
and for more.

See below about consistent backups.

=== Repairing missing or inconsistent data ===
Cassandra repairs data in two ways:

 1. Read Repair: every time a read is performed, Cassandra compares the versions at each replica
(in the background, if a low consistency was requested by the reader to minimize latency),
and the newest version is sent to any out-of-date replicas.
 1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle tree of the data
on that node, and compares it with the versions on other replicas, to catch any out of sync
data that hasn't been read recently.  This is intended to be run infrequently (e.g., weekly)
since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans
ALL the data on the machine (but it is is very network efficient).  

Running `nodetool repair`:
Like all nodetool operations, repair is non-blocking; it sends the command to the given node,
but does not wait for the repair to actually finish.  You can tell that repair is finished
when (a) there are no active or pending tasks in the CompactionManager, and after that when
(b) there are no active or pending tasks on AE-SERVICE-STAGE.

Repair should be run against one machine at a time.  (This limitation will be fixed in 0.7.)

=== Handling failure ===
If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to
deal with any inconsistent data.  Remember though that if a node is down longer than your
configured GCGraceSeconds (default: 10 days), it could have missed remove operations permanently.
 Unless your application performs no removes, you should wipe its data directory, re-bootstrap
it, and removetoken its old entry in ghe ring (see below).

If a node goes down entirely, then you have two options:

 1. (Recommended approach) Bring up the replacement node with a new IP address, and !AutoBootstrap
set to true in storage-conf.xml. This will place the replacement node in the cluster and find
the appropriate position automatically. Then the bootstrap process begins. While this process
runs, the node will not receive reads until finished. Once this process is finished on the
replacement node, run `nodetool removetoken` once, supplying the token of the dead node, and
`nodetool cleanup` on each node.
 1. You can obtain the dead node's token by running `nodetool ring` on any live node, unless
there was some kind of outage, and the others came up but not the down one -- in that case,
you can retrieve the token from the live nodes' system tables.

 1. (Alternative approach) Bring up a replacement node with the same IP and token as the old,
and run `nodetool repair`. Until the repair process is complete, clients reading only from
this node may get no data back.  Using a higher !ConsistencyLevel on reads will avoid this.

The reason why you run `nodetool cleanup` on all live nodes is to remove old Hinted Handoff
writes stored for the dead node.

== Backing up data ==
Cassandra can snapshot data while online using `nodetool snapshot`.  You can then back up
those snapshots using any desired system, although leaving them where they are is probably
the option that makes the most sense on large clusters.

With some combinations of operating system/jvm you may receive an error related to the inability
to create a process during the snapshotting, such as this on Linux

Exception in thread "main" Cannot run program "ln":
error=12, Cannot allocate memory
Remain calm. The operating system is trying to allocate for the "ln" process a memory space
as large as the parent process (the cassandra server), even if '''it's not going to use it'''.
So if you have a machine with 8GB of RAM and no swap, and you gave 6 to the cassandra server,
it will fail during this because the operating system will wan 12 GB of memory before allowing
you to create the process.

This can be worked around depending on the operating system by either creating a swap file,
snapshotting, turning it off or by turning on "memory overcommit". Since the child process
memory is the same as the parent, until it performs an `exec("ln")` call the operating system
will not use the new memory and will just refer to the old one, and everything will work.

Currently, only flushed data is snapshotted (not data that only exists in the commitlog).
 Run `nodetool flush` first and wait for that to complete, to make sure you get '''all'''
data in the snapshot.

To revert to a snapshot, shut down the node, clear out the old commitlog and sstables, and
move the sstables from the snapshot location to the live data directory.

=== Consistent backups ===
You can get an eventually consistent backup by flushing all nodes and snapshotting; no individual
node's backup is guaranteed to be consistent but if you restore from that snapshot then clients
will get eventually consistent behavior as usual.

There is no such thing as a consistent view of the data in the strict sense, except in the
trivial case of writes with consistency level = ALL.

=== Import / export ===
As an alternative to taking snapshots it's possible to export SSTables to JSON format using
the `bin/sstable2json` command:

Usage: sstable2json [-f outfile] <sstable> [-k key [-k key [...]]]
`bin/sstable2json` accepts as a required argument, the full path to an SSTable data file,
(files ending in -Data.db), and an optional argument for an output file (by default, output
is written to stdout). You can also pass the names of specific keys using the `-k` argument
to limit what is exported.

Note: If you are not running the exporter on in-place SSTables, there are a couple of things
to keep in mind.

 1. The corresponding configuration must be present (same as it would be to run a node).
 1. SSTables are expected to be in a directory named for the keyspace (same as they would
be on a production node).

JSON exported SSTables can be "imported" to create new SSTables using `bin/json2sstable`:

Usage: json2sstable -K keyspace -c column_family <json> <sstable>
`bin/json2sstable` takes arguments for keyspace and column family names, and full paths for
the JSON input file and the destination SSTable file name.

You can also import pre-serialized rows of data using the BinaryMemtable interface.  This
is useful for importing via Hadoop or another source where you want to do some preprocessing
of the data to import.

NOTE: Starting with version 0.7, json2sstable and sstable2json must be run in such a way that
the schema can be loaded from system tables.  This means that cassandra.yaml must be found
in the classpath and refer to valid storage directories.

== Monitoring ==
Cassandra exposes internal metrics as JMX data.  This is a common standard in the JVM world;
OpenNMS, Nagios, and Munin at least offer some level of JMX support. The specifics of the
JMX Interface are documented at JmxInterface.

Running `nodetool cfstats` can provide an overview of each Column Family, and important metrics
to graph your cluster. Some folks prefer having to deal with non-jmx clients, there is a JMX-to-REST
bridge available at

Important metrics to watch on a per-Column Family basis would be: '''Read Count, Read Latency,
Write Count and Write Latency'''. '''Pending Tasks''' tell you if things are backing up. These
metrics can also be exposed using any JMX client such as `jconsole`

You can also use jconsole, and the MBeans tab to look at PendingTasks for thread pools. If
you see one particular thread backing up, this can give you an indication of a problem. One
example would be ROW-MUTATION-STAGE indicating that write requests are arriving faster than
they can be handled. A more subtle example is the FLUSH stages: if these start backing up,
cassandra is accepting writes into memory fast enough, but the sort-and-write-to-disk stages
are falling behind.

If you are seeing a lot of tasks being built up, your hardware or configuration tuning is
probably the bottleneck.

Running `nodetool tpstats` will dump all of those threads to console if you don't want to
use jconsole. Example:

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            119
MESSAGING-SERVICE-POOL            3         4       81594002
STREAM-STAGE                      0         0              3
RESPONSE-STAGE                    0         0       48353537
ROW-READ-STAGE                    0         0          13754
LB-OPERATIONS                     0         0              0
COMMITLOG                         1         0       78080398
GMFD                              0         0        1091592
MESSAGE-DESERIALIZER-POOL         0         0      126022919
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           2899
ROW-MUTATION-STAGE                1         2       81719765
MESSAGE-STREAMING-POOL            0         0            129
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0            218
MEMTABLE-POST-FLUSHER             0         0            218
COMPACTION-POOL                   0         0            464
FLUSH-WRITER-POOL                 0         0            218
HINTED-HANDOFF-POOL               0         0            154

View raw message