cassandra-commits mailing list archives

From Apache Wiki <>
Subject [Cassandra Wiki] Update of "HadoopSupport" by jeremyhanna
Date Sat, 23 Oct 2010 17:15:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna.
The comment on this change is: adding an initial cluster configuration section.


   * [[#MapReduce|MapReduce Support]]
   * [[#Pig|Pig Support]]
   * [[#Hive|Hive Support]]
+  * [[#ClusterConfig|Cluster Configuration]]
@@ -73, +74 @@

+ <<Anchor(ClusterConfig)>>
+ == Cluster Configuration ==
+ To configure a Cassandra cluster so that Hadoop can operate over its data, overlay a Hadoop
cluster on your Cassandra nodes.  Run the Hadoop `namenode` and `jobtracker` on a separate
server, then install a Hadoop `tasktracker` on each of your Cassandra nodes.  That allows the
`jobtracker` to assign tasks to the Cassandra nodes that hold the data for those tasks.  At
least one node in your cluster must also be a `datanode`, because Hadoop uses HDFS for its
distributed cache: jar dependencies for your job, static data (such as stop words for a word
count), and similar files.  This is a very small amount of data, but the Hadoop cluster needs
it to run properly.
+ Running a `tasktracker` on every node has two benefits: you get data locality, and your
analytics engine scales with your data.
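
The locality benefit can be illustrated with a toy model. This is only a sketch of the idea, not Hadoop's actual scheduler: the function names (`assign_splits`, `locality_ratio`) and the host names are hypothetical, and it assumes each input split is described by the list of hosts that store its data.

```python
from collections import defaultdict


def assign_splits(splits, tasktrackers):
    """Assign each split to a tasktracker, preferring a host that holds its data.

    splits: dict mapping split id -> list of hosts storing that split's data
    tasktrackers: iterable of hosts that run a tasktracker
    Returns a dict mapping host -> list of (split id, is_local) tuples.
    """
    trackers = list(tasktrackers)
    load = {host: 0 for host in trackers}
    assignment = defaultdict(list)
    for split_id, data_hosts in sorted(splits.items()):
        # Hosts that both store the data and run a tasktracker.
        local = [h for h in data_hosts if h in load]
        # Fall back to any tasktracker when no local one exists.
        candidates = local or trackers
        # Pick the least-loaded candidate; break ties by host name.
        host = min(candidates, key=lambda h: (load[h], h))
        load[host] += 1
        assignment[host].append((split_id, bool(local)))
    return dict(assignment)


def locality_ratio(assignment):
    """Fraction of tasks that run on a host storing their data."""
    tasks = [t for host_tasks in assignment.values() for t in host_tasks]
    if not tasks:
        return 0.0
    return sum(1 for _, is_local in tasks if is_local) / len(tasks)


# Hypothetical three-node Cassandra ring; each split lives on two nodes.
splits = {
    "s1": ["cass1", "cass2"],
    "s2": ["cass2", "cass3"],
    "s3": ["cass3", "cass1"],
}

overlaid = assign_splits(splits, ["cass1", "cass2", "cass3"])
separate = assign_splits(splits, ["hadoop1"])
```

With a tasktracker on every Cassandra node, every split in this model can be processed where its data lives; with tasktrackers only on a separate host, every task has to read its data over the network.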
+ [[#Top|Top]]
