Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna.
The comment on this change is: Consolidating the cluster configuration stuff.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=16&rev2=17

--------------------------------------------------

  SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
  ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
- Cassandra's splits are location-aware (this is the nature of the Hadoop [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]] design). Cassandra gives the Hadoop [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]] a list of locations with each split of data. That way, the !JobTracker can try to preserve data locality when assigning tasks to [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s. Therefore, when using Hadoop alongside Cassandra, it is best to have a !TaskTracker running on each Cassandra node.
- As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the READMEs in the word_count and pig contrib modules for more details.
+ As of 0.7, configuration for Hadoop no longer resides in your job's storage-conf.xml. See the `README` in each of the word_count and pig contrib modules for more details.

==== Output To Cassandra ====

@@ -78, +77 @@

== Cluster Configuration ==
- If you would like to configure a Cassandra cluster so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop `namenode`/`jobtracker`. Then install Hadoop `tasktracker`s on each of your Cassandra nodes.
That will allow the `jobtracker` to assign tasks to the Cassandra nodes that contain data for those tasks. At least one node in your cluster will also need to be a `datanode`. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly.
+ If you would like to configure a Cassandra cluster so that Hadoop can operate over its data, it's best to overlay a Hadoop cluster on your Cassandra nodes. You'll want a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes, so that the `JobTracker` can assign tasks to the Cassandra nodes that contain the data for those tasks. At least one node in your cluster will also need to be a `DataNode`, because Hadoop uses HDFS to store information such as jar dependencies for your job and static data (like stop words for a word count): this is the distributed cache. It's a very small amount of data, but the Hadoop cluster needs it to run properly.
- The nice thing about having `tasktracker`s on every node is that 1, you get data locality and 2, your analytics engine scales with your data.
+ The nice thing about having a `TaskTracker` on every node is twofold: you get data locality, and your analytics engine scales with your data.

[[#Top|Top]]
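The overlay described above could be sketched as a pair of Hadoop (0.20-era) configuration fragments. The hostname and ports here are hypothetical placeholders; adjust them for your cluster. The dedicated server runs the `NameNode` and `JobTracker`, and every node (including each Cassandra node) points at it:

```xml
<!-- core-site.xml on every node: HDFS NameNode lives on the dedicated server.
     "hadoop-master.example.com" and port 8020 are placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-master.example.com:8020</value>
</property>

<!-- mapred-site.xml on every node: JobTracker also lives on the dedicated server. -->
<property>
  <name>mapred.job.tracker</name>
  <value>hadoop-master.example.com:8021</value>
</property>
```

With these in place, starting a `TaskTracker` on each Cassandra node (`hadoop-daemon.sh start tasktracker`) lets the `JobTracker` schedule tasks data-locally, and only the dedicated server (or one other node) needs to run a `DataNode` to back the distributed cache.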