cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "HadoopSupport" by jeremyhanna
Date Mon, 01 Aug 2011 14:50:47 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=37&rev2=38

Comment:
Updated cluster config section.  Took out single datanode thought as intermediate results
need more than that realistically.  Added bit about Brisk.

  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
- If you would like to configure a Cassandra cluster so that Hadoop may operate over its data,
it's best to overlay a Hadoop cluster over your Cassandra nodes.  You'll want to have a separate
server for your Hadoop namenode/`JobTracker`.  Then install a Hadoop `TaskTracker` on each
of your Cassandra nodes.  That will allow the `Jobtracker` to assign tasks to the Cassandra
nodes that contain data for those tasks.  At least one node in your cluster will also need
to be a datanode.  That's because Hadoop uses HDFS to store information like jar dependencies
for your job, static data (like stop words for a word count), and things like that - it's
the distributed cache.  It's a very small amount of data but the Hadoop cluster needs it to
run properly.
+ 
+ The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk,
the open-source packaging of Cassandra with Hadoop.  That will start the `JobTracker` and
`TaskTracker` processes for you.  It also uses CFS, an HDFS-compatible distributed filesystem
built on Cassandra that removes the need for Hadoop `NameNode` and `DataNode` processes.
 For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]]
and [[http://github.com/riptano/brisk|code]].
+ 
+ Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may
operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes.  You'll
want to have a separate server for your Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop
`TaskTracker` on each of your Cassandra nodes.  That will allow the `JobTracker` to assign
tasks to the Cassandra nodes that contain data for those tasks.  Also install a Hadoop `DataNode`
on each Cassandra node.  Hadoop requires a distributed filesystem to store dependency jars,
static data, and intermediate results.
  
  The nice thing about having a `TaskTracker` on every node is that you get data locality
and your analytics engine scales with your data. You also never need to shuttle around your
data once you've performed analytics on it - you simply output to Cassandra and you are able
to access that data with high random-read performance.
  
@@ -79, +82 @@

  }}}
  ==== Virtual Datacenter ====
  One thing that many have asked about is whether Cassandra with Hadoop will be usable from
a random access perspective. For example, you may need to use Cassandra for serving latency-sensitive
web requests. You may also need to run analytics over your data. In Cassandra 0.7+ there is the
!NetworkTopologyStrategy which allows you to customize your cluster's replication strategy
by datacenter. What you can do with this is create a 'virtual datacenter' to separate nodes
that serve data with high random-read performance from nodes that are meant to be used for
analytics. You need to have a snitch configured with your topology and then according to the
datacenters defined there (either explicitly or implicitly), you can indicate how many replicas
you would like in each datacenter. You would install task trackers on nodes in your analytics
section and make sure that a replica is written to that 'datacenter' in your !NetworkTopologyStrategy
configuration. The practical upshot of this is your analytics nodes always have current data
and your high random-read performance nodes always serve data with predictable performance.
- 
- For an example of configuring Cassandra with Hadoop in the cloud, see the [[http://github.com/digitalreasoning/PyStratus|PyStratus]]
project on Github.
  
  [[#Top|Top]]
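
The 'virtual datacenter' layout described in the Virtual Datacenter section can be
sketched as a small planning check. This is only an illustration under stated assumptions:
the node addresses, the datacenter names "Serving" and "Analytics", the replica counts, and
the helper functions are all hypothetical, and nothing here calls a Cassandra or Hadoop API.
It models a snitch topology as a mapping from node to datacenter, picks the nodes that would
run a `TaskTracker`, and checks that the analytics datacenter receives at least one replica.

```python
# Illustrative planning model only; none of these names are Cassandra or Hadoop APIs.

# Hypothetical snitch topology: node address -> datacenter name.
TOPOLOGY = {
    "10.0.0.1": "Serving",
    "10.0.0.2": "Serving",
    "10.0.0.3": "Serving",
    "10.0.1.1": "Analytics",
    "10.0.1.2": "Analytics",
}

# Hypothetical per-datacenter replica counts, in the spirit of the
# options passed to NetworkTopologyStrategy.
REPLICATION = {"Serving": 2, "Analytics": 1}


def tasktracker_nodes(topology, analytics_dc):
    """Return the nodes in the analytics datacenter, which would run a TaskTracker."""
    return sorted(node for node, dc in topology.items() if dc == analytics_dc)


def validate(topology, replication, analytics_dc):
    """Sanity-check the plan; raise ValueError if it cannot work as intended."""
    nodes_per_dc = {}
    for dc in topology.values():
        nodes_per_dc[dc] = nodes_per_dc.get(dc, 0) + 1

    # Each replica in a datacenter needs its own node in this model.
    for dc, replicas in replication.items():
        available = nodes_per_dc.get(dc, 0)
        if replicas > available:
            raise ValueError(
                f"{dc}: {replicas} replicas requested, only {available} nodes"
            )

    # Analytics nodes only see current data if a replica is written to them.
    if replication.get(analytics_dc, 0) < 1:
        raise ValueError(f"{analytics_dc} must receive at least one replica")


if __name__ == "__main__":
    validate(TOPOLOGY, REPLICATION, "Analytics")
    for node in tasktracker_nodes(TOPOLOGY, "Analytics"):
        print("install TaskTracker on", node)
```

Keeping the replica check separate from node selection mirrors the two steps in the text:
first define the datacenters through the snitch, then state how many replicas each one gets.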
  
