cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <>
Subject [Cassandra Wiki] Trivial Update of "HadoopSupport" by EricGilmore
Date Wed, 02 Mar 2011 02:56:24 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by EricGilmore.


  == MapReduce ==
  ==== Input from Cassandra ====
  Cassandra 0.6+ adds support for retrieving data from Cassandra.  This is based on implementations
of [[|InputSplit]],
and [[|RecordReader]]
so that Hadoop !MapReduce jobs can retrieve data from Cassandra.  For an example of how this
works, see the contrib/word_count example in 0.6 or later.  Cassandra rows or row  fragments
(that is, pairs of key + `SortedMap`  of columns) are input to Map tasks for  processing by
your job, as specified by a `SlicePredicate`  that describes which columns to fetch from each
@@ -31, +30 @@

              SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml.
See the `README` in the `word_count` and `pig` contrib modules for more details.
  ==== Output To Cassandra ====
- As of 0.7, there is be a basic mechanism included in Cassandra for outputting data to Cassandra.
 The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to
the filesystem and one to output data to Cassandra (default) using this new mechanism.  See
that example in the latest release for details.
+ As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra.
 The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to
the filesystem and one to output data to Cassandra (default) using this new mechanism.  See
that example in the latest release for details.
  ==== Hadoop Streaming ====
  As of 0.7, there is support for [[|Hadoop
Streaming]].  For examples on how to use Streaming with Cassandra, see the contrib section
of the Cassandra source.  The relevant tickets are [[|CASSANDRA-1368]]
and [[|CASSANDRA-1497]].
  ==== Some troubleshooting ====
  Releases before  0.6.2/0.7 are affected by a small  resource leak that may cause jobs to
fail (connections are not released  properly, causing a resource leak). Depending on your
local setup you  may hit this issue, and workaround it by raising the limit of open file 
descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`).  The error will
be reported on  the hadoop job side as a thrift !TimedOutException.
  If you are testing the integration against a single node and you obtain  some failures,
this may be normal: you are probably overloading the  single machine, which may again result
in timeout errors. You can  workaround it by reducing the number of concurrent tasks
@@ -57, +52 @@

               ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
@@ -66, +60 @@

  Cassandra 0.6+ also adds support for [[|Pig]] with its own implementation
of [[|LoadFunc]].  This
allows Pig queries to be run against data stored in Cassandra.  For an example of this, see
the `contrib/pig` example in 0.6 and later.
  When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the `README` notes
in the `<cassandra_src>/contrib/pig` directory, the [[#ClusterConfig|Cluster Configuration]]
section on this page, and some additional notes here:
   * Set the `HADOOP_HOME` environment variable to `<hadoop_dir>`, e.g. `/opt/hadoop`
or `/etc/hadoop`
   * Set the `PIG_CONF` environment variable to `<hadoop_dir>/conf`
   * Set the `JAVA_HOME`
@@ -82, +77 @@

  == Cluster Configuration ==
  If you would like to configure a Cassandra cluster so that Hadoop may operate over its data,
it's best to overlay a Hadoop cluster over your Cassandra nodes.  You'll want to have a separate
server for your Hadoop namenode/`JobTracker`.  Then install a Hadoop `TaskTracker` on each
of your Cassandra nodes.  That will allow the `Jobtracker` to assign tasks to the Cassandra
nodes that contain data for those tasks.  At least one node in your cluster will also need
to be a datanode.  That's because Hadoop uses HDFS to store information like jar dependencies
for your job, static data (like stop words for a word count), and things like that - it's
the distributed cache.  It's a very small amount of data but the Hadoop cluster needs it to
run properly.
  The nice thing about having a `TaskTracker` on every node is that you get data locality
and your analytics engine scales with your data. You also never need to shuttle around your
data once you've performed analytics on it - you simply output to Cassandra and you are able
to access that data with high random-read performance.
  One configuration note on getting the task trackers to be able to perform queries over Cassandra,
you'll want to update your `HADOOP_CLASSPATH` in your `<hadoop>/conf/`
to include the Cassandra lib libraries.  For example you'll want to do something like this
in the `` on each of your task trackers:
  export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
  ==== Virtual Datacenter ====
  One thing that many have asked about is whether Cassandra with Hadoop will be usable from
a random access perspective. For example, you may need to use Cassandra for serving web latency
requests. You may also need to run analytics over your data. In Cassandra 0.7+ there is the
!NetworkTopologyStrategy which allows you to customize your cluster's replication strategy
by datacenter. What you can do with this is create a 'virtual datacenter' to separate nodes
that serve data with high random-read performance from nodes that are meant to be used for
analytics. You need to have a snitch configured with your topology and then according to the
datacenters defined there (either explicitly or implicitly), you can indicate how many replicas
you would like in each datacenter. You would install task trackers on nodes in your analytics
section and make sure that a replica is written to that 'datacenter' in your !NetworkTopologyStrategy
configuration. The practical upshot of this is your analytics nodes always have current data
and your high random-read performance nodes always serve data with predictable performance.
@@ -100, +94 @@

  == Support ==
  Sometimes configuration and integration can get tricky. To get support for this functionality,
start with the `contrib` examples in the source download of Cassandra. Make sure you are following
instructions in the `README` file for that example. You can search the Cassandra user mailing
list or post on there as it is very active. You can also ask in the #Cassandra irc channel
on freenode for help. Other channels that might be of use are #hadoop, #hadoop-pig, and #hive.
Those projects' mailing lists are also very active.

View raw message