Like, would Crunch support 0.94.5? I'm not really sure. Our HBase dependencies are pretty minimal, which makes me think a Source based on MultiTableInputFormat would be easy to write, but HBase has a tendency to change out from underneath us in ways that I have a hard time diagnosing without help from folks who know it better than I do.


On Mon, Apr 8, 2013 at 1:52 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
What's the minimum version of HBase that Crunch will support?  We have the exact same need, but because the fix for HBASE-3996 requires region server changes, it wasn't as easy to patch back to 0.92 or 0.94.2 (CDH 4.2).



On Mon, Apr 8, 2013 at 3:47 PM, Josh Wills <jwills@cloudera.com> wrote:
Maybe we need something based on this?



On Mon, Apr 8, 2013 at 1:41 PM, Chad Urso McDaniel <chadum@gmail.com> wrote:
This may be a core Hadoop question.

We are using Crunch with HBase. 
We typically set up the input PTable like so:
---
      Scan scan = ...
      HBaseSourceTarget source = new HBaseSourceTarget(tableName, scan);
      PTable<ImmutableBytesWritable, Result> data = pipeline.read(source);
---

One thing we want to do to speed up processing with Crunch is read multiple Scans into a single PTable.

We know which sections of the HBase table we want and they are not contiguous.

We have tried unioning the PTables but that turns out to be incredibly slow.
Currently we are using a filter that results in many unnecessary reads.
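
To make the non-contiguous sections concrete, here is a minimal sketch in plain Java (no HBase or Crunch classes) that describes them as half-open [start, stop) row-key ranges and merges any that overlap, so that each remaining range could supply the start and stop row of one Scan. RowRanges, Range, and coalesce are hypothetical names, not part of either library, and String keys stand in for byte[] row keys. HBase orders row keys as unsigned bytes, which matches String ordering only for plain ASCII keys, so treat this as an illustration rather than a drop-in.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Crunch or HBase.
final class RowRanges {

    /** Half-open row-key range [start, stop), mirroring a Scan's start/stop row. */
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            if (start.compareTo(stop) >= 0) {
                throw new IllegalArgumentException("start must sort before stop");
            }
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + stop + ")";
        }
    }

    /** Sorts ranges by start key and merges any that overlap or touch. */
    static List<Range> coalesce(List<Range> wanted) {
        List<Range> sorted = new ArrayList<>(wanted);
        sorted.sort(Comparator.comparing((Range r) -> r.start));

        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                if (r.start.compareTo(last.stop) <= 0) {
                    // Overlapping or adjacent: extend the previous range.
                    String stop = last.stop.compareTo(r.stop) >= 0 ? last.stop : r.stop;
                    merged.set(merged.size() - 1, new Range(last.start, stop));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> wanted = new ArrayList<>();
        wanted.add(new Range("user500", "user600"));
        wanted.add(new Range("user100", "user200"));
        wanted.add(new Range("user150", "user250"));
        // Each printed range is a candidate for one Scan.
        for (Range r : coalesce(wanted)) {
            System.out.println(r);
        }
    }
}
```

With the three example ranges in main, the first two overlap and collapse into [user100, user250), leaving [user500, user600) as a second, separate range.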

How do others solve this?

I'm tempted to write a TableSource that can do this.

thanks



--
Director of Data Science
Cloudera
Twitter: @josh_wills