cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis
Date Tue, 24 Nov 2009 17:02:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis.
http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=1&rev2=2

--------------------------------------------------

- =General=
+ = General =
   * Configuration file is parsed by !DatabaseDescriptor (which also has all the default values,
if any)
   * Thrift generates an API interface in Cassandra.java; the implementation is !CassandraServer,
and !CassandraDaemon ties it together.
   * !CassandraServer turns thrift requests into the internal equivalents, then !StorageProxy
does the actual work, then !CassandraServer turns it back into thrift again
@@ -8, +8 @@

   * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas
of each key range.  Primary replica is always determined by the token ring (in !TokenMetadata)
but you can do a lot of variation with the others.  !RackUnaware just puts replicas on the
next N-1 nodes in the ring.  !RackAware puts the first non-primary replica on the next node
in the ring that is in a different data center from the primary, then the remaining replicas
in the same data center as the primary.
   * !MessagingService handles connection pooling and running internal commands on the appropriate
stage (basically, a threaded executorservice).  Stages are set up in !StageManager; currently
there are read, write, and stream stages.  (Streaming is for when one node copies large sections
of its sstables to another, for bootstrap or relocation on the ring.)  The internal commands
are defined in !StorageService; look for `registerVerbHandlers`.
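
The two placement rules described above can be sketched as follows. This is a minimal illustration in Python, not Cassandra's Java code: the function names, the dict-based ring, and the `datacenter_of` map are all hypothetical, and the sketch assumes a node owns every key token up to and including its own token. What happens when the primary's data center has too few nodes is not covered by the text, so the sketch simply returns fewer replicas in that case.

```python
from bisect import bisect_left


def _ring_walk(ring, key_token):
    """Return all nodes in ring order, starting at the primary for key_token.

    `ring` is a hypothetical dict mapping token -> node name. Assumption:
    the primary is the first node whose token is >= the key token,
    wrapping around to the lowest token.
    """
    tokens = sorted(ring)
    start = bisect_left(tokens, key_token) % len(tokens)
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(len(tokens))]


def rack_unaware_replicas(ring, key_token, replication_factor):
    """Primary plus the next N-1 nodes on the ring."""
    return _ring_walk(ring, key_token)[:replication_factor]


def rack_aware_replicas(ring, datacenter_of, key_token, replication_factor):
    """Primary, then the next node in another data center, then nodes in
    the primary's own data center."""
    walk = _ring_walk(ring, key_token)
    primary, others = walk[0], walk[1:]
    replicas = [primary]
    if replication_factor > 1:
        remote = next(
            (n for n in others if datacenter_of[n] != datacenter_of[primary]),
            None,
        )
        if remote is not None:
            replicas.append(remote)
    for node in others:
        if len(replicas) >= replication_factor:
            break
        if node not in replicas and datacenter_of[node] == datacenter_of[primary]:
            replicas.append(node)
    return replicas
```

With a four-node ring `{0: "A", 25: "B", 50: "C", 75: "D"}`, a key token of 30 lands on C as primary, and a key token of 80 wraps around to A.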
  
- =Write path=
+ = Write path =
   * !StorageProxy gets the nodes responsible for replicas of the keys from the !ReplicationStrategy,
then sends !RowMutation messages to them.
     * If nodes are changing position on the ring, "pending ranges" are associated with their
destinations in !TokenMetadata and these are also written to.
     * If nodes that should accept the write are down, but the remaining nodes can fulfill
the requested !ConsistencyLevel, the writes for the down nodes will be sent to another node
instead, with a header (a "hint") saying that data associated with that key should be sent
to the replica node when it comes back up.  This is called "hinted handoff" and reduces the
"eventual" in "eventual consistency."
-  * !RowMutationVerbHandler hands the write first to !CommitLog.java, then to the Memtable
for the appropriate !ColumnFamily (through Table.apply).
+  * On the destination node, !RowMutationVerbHandler hands the write first to !CommitLog.java,
then to the Memtable for the appropriate !ColumnFamily (through Table.apply).
   * When a Memtable is full, it gets sorted and written out as an !SSTable asynchronously
by !ColumnFamilyStore.switchMemtable
-    * When enough !SSTables exist, they are merged by !ColumnFamilyStore.doFileCompaction
+    * When enough SSTables exist, they are merged by !ColumnFamilyStore.doFileCompaction
-      * Making this concurrency-safe without blocking writes or reads while we remove the
old !SSTables from the list and add the new one is tricky, because naive approaches require
waiting for all readers of the old sstables to finish before deleting them (since we can't
know if they have actually started opening the file yet; if they have not and we delete the
file first, they will error out).  The approach we have settled on is to not actually delete
old !SSTables synchronously; instead we register a phantom reference with the garbage collector,
so when no references to the !SSTable exist it will be deleted.  (We also write a compaction
marker to the file system so if the server is restarted before that happens, we clean out
the old !SSTables at startup time.)
+      * Making this concurrency-safe without blocking writes or reads while we remove the
old SSTables from the list and add the new one is tricky, because naive approaches require
waiting for all readers of the old sstables to finish before deleting them (since we can't
know if they have actually started opening the file yet; if they have not and we delete the
file first, they will error out).  The approach we have settled on is to not actually delete
old SSTables synchronously; instead we register a phantom reference with the garbage collector,
so when no references to the !SSTable exist it will be deleted.  (We also write a compaction
marker to the file system so if the server is restarted before that happens, we clean out
the old SSTables at startup time.)
   * See ArchitectureSSTable and ArchitectureCommitLog for more details
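
As a rough illustration of the write path above, the sketch below models the coordinator's decision (including hinted handoff) and the destination node's commit-log-then-memtable ordering. It is written in Python with hypothetical names (`plan_write`, `NodeStore`, `memtable_limit`) and is not the actual Cassandra code. Two assumptions go beyond the text: hints are given to live non-replica nodes when any exist, otherwise to live replicas, and the memtable flush runs inline here although the real one is asynchronous.

```python
def plan_write(replicas, live_nodes, required_acks):
    """Return a list of (destination, hint_for) messages for one mutation.

    hint_for is None for a normal write, or the name of the down replica
    the data should be forwarded to once it comes back up.
    """
    up = [n for n in replicas if n in live_nodes]
    down = [n for n in replicas if n not in live_nodes]
    if len(up) < required_acks:
        raise RuntimeError("not enough live replicas for the consistency level")
    messages = [(n, None) for n in up]
    # Assumption: prefer live nodes that are not replicas as hint carriers.
    carriers = [n for n in live_nodes if n not in replicas] or up
    if down and not carriers:
        raise RuntimeError("no live node available to hold hints")
    for i, dead in enumerate(down):
        messages.append((carriers[i % len(carriers)], dead))
    return messages


class NodeStore:
    """Destination-node side: commit log first, then memtable, then flush."""

    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold
        self.commit_log = []
        self.memtable = {}
        self.sstables = []

    def apply(self, key, value):
        self.commit_log.append((key, value))  # record the write first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Sort and write out as an immutable table, then start fresh.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}
```

For example, with replicas A, B, C, live nodes A, B, D, and two required acknowledgements, the plan sends normal writes to A and B and a hinted write for C to D.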
  
- =Read path=
+ = Read path =
+  * !StorageProxy gets the nodes responsible for replicas of the keys from the !ReplicationStrategy,
then sends read messages to them
+    * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceReadCommand,
depending on the request
+  * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice
and sends it back as a !ReadResponse
+  * If a quorum read was requested, !StorageProxy waits for a majority of nodes to reply
and makes sure the answers match before returning.  Otherwise, it returns the data reply as
soon as it gets it, and checks the other replies for discrepancies in the background in !StorageService.doConsistencyCheck.
 This is called "read repair," and also helps achieve consistency sooner.
+    * As an optimization, !StorageProxy only asks the closest replica for the actual data;
the other replicas are asked only to compute a hash of the data.
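
The digest optimization and the quorum check above can be sketched like this. The Python below is illustrative only: `replica_reply` and `coordinate_read` are hypothetical names, SHA-256 is an assumed hash choice, and the text does not say what happens on a mismatch, so the sketch just reports which replicas disagreed with the data reply.

```python
import hashlib


def digest(value):
    """Hash of a stored value (hash function chosen here as an assumption)."""
    return hashlib.sha256(value).hexdigest()


def replica_reply(stored_value, want_data):
    """A replica returns either the full data or only a digest of it."""
    return stored_value if want_data else digest(stored_value)


def coordinate_read(stored, closest, replication_factor):
    """Quorum read: `stored` maps each replying node to the bytes it holds.

    Only the closest replica is asked for data; the others send digests.
    Returns the data and a sorted list of nodes whose digest did not match.
    """
    majority = replication_factor // 2 + 1
    if len(stored) < majority:
        raise RuntimeError("fewer than a majority of replicas replied")
    data = replica_reply(stored[closest], want_data=True)
    expected = digest(data)
    mismatched = sorted(
        node
        for node, value in stored.items()
        if node != closest and replica_reply(value, want_data=False) != expected
    )
    return data, mismatched
```

A non-empty mismatch list is the signal that the replicas have diverged and that the discrepancy needs to be reconciled.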
  
+ = Further reading =
+  * Cassandra's distribution is closely related to the one presented in Amazon's Dynamo paper.
 Read repair, adjustable consistency levels, hinted handoff, and other concepts are discussed
there.  This is required background material: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
+ 
