Stefan De Smit created CRUNCH-586:
-------------------------------------
Summary: SparkPipeline does not work with HBaseSourceTarget
Key: CRUNCH-586
URL: https://issues.apache.org/jira/browse/CRUNCH-586
Project: Crunch
Issue Type: Bug
Components: Spark
Affects Versions: 0.13.0
Reporter: Stefan De Smit
final Pipeline pipeline = new SparkPipeline("local", "crunchhbase", HBaseInputSource.class, conf);
final PTable<ImmutableBytesWritable, Result> read = pipeline.read(new HBaseSourceTarget("t1", new Scan()));
returns an empty PTable, while the same code works with MRPipeline.
The root cause is the combination of Spark's getJavaRDDLike method:
source.configureSource(job, -1);
Converter converter = source.getConverter();
JavaPairRDD<?, ?> input = runtime.getSparkContext().newAPIHadoopRDD(
    job.getConfiguration(),
    CrunchInputFormat.class,
    converter.getKeyClass(),
    converter.getValueClass());
which hard-codes CrunchInputFormat.class (and always passes inputId = -1),
and HBaseSourceTarget's configureSource method:
if (inputId == -1) {
    job.setMapperClass(CrunchMapper.class);
    job.setInputFormatClass(inputBundle.getFormatClass());
    inputBundle.configure(conf);
} else {
    Path dummy = new Path("/hbase/" + table);
    CrunchInputs.addInputPath(job, dummy, inputBundle, inputId);
}
With inputId = -1, the HBase source only sets the input format on the job; it never registers itself via CrunchInputs.addInputPath, so CrunchInputFormat (which Spark uses unconditionally) finds no inputs and produces an empty table. The easiest solution I see is to always call CrunchInputs.addInputPath, in every source.
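A rough sketch of that idea, reusing the field names (table, inputBundle) from the snippet above; this is only an illustration of the proposed change, not a tested patch, and whether CrunchInputs.addInputPath accepts an inputId of -1 would still need to be verified:

// Sketch: configureSource registers the input with CrunchInputs in all
// cases, so CrunchInputFormat (hard-coded by Spark's getJavaRDDLike) can
// resolve the source whether inputId is -1 (SparkPipeline) or >= 0.
@Override
public void configureSource(Job job, int inputId) throws IOException {
    if (inputId == -1) {
        // Keep the existing MR-style setup for the single-input case.
        job.setMapperClass(CrunchMapper.class);
        job.setInputFormatClass(inputBundle.getFormatClass());
        inputBundle.configure(job.getConfiguration());
    }
    // Always register the dummy path mapping, instead of only in the
    // else branch, so the Spark code path sees this input too.
    Path dummy = new Path("/hbase/" + table);
    CrunchInputs.addInputPath(job, dummy, inputBundle, inputId);
}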
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)