spark-user mailing list archives

From Laurent T <>
Subject Re: Advanced log processing
Date Tue, 20 May 2014 14:23:16 GMT
Thanks for the advice. I think you're right. I'm not sure we're going to use
HBase but starting by partitioning data into multiple buckets will be a
first step. I'll see how it performs on large datasets.

My original question, though, was more like: is there a Spark trick I don't
know about?
Currently, here's what I'm doing:

JavaPairRDD originalData = ...;

// Incomplete records, cleaned and cached because they are used twice below
JavaPairRDD incompleteData = originalData
    .filter(KeepIncompleteData)
    .map(CleanData)
    .cache();

// Every file that may hold the missing part of an incomplete record
List<String> pathList = incompleteData
    .flatMap(GetPossibleConciliationPaths)
    .distinct()
    .collect();

// Load each file and union the results into a single RDD
JavaPairRDD conciliationRDD = null;
for (String filePath : pathList) {
    JavaPairRDD fileData = sc
        .textFile(filePath)
        .flatMap(ProcessData);
    if (conciliationRDD == null) {
        conciliationRDD = fileData;
    } else {
        conciliationRDD = conciliationRDD.union(fileData);
    }
}

// Complete records plus the incomplete ones joined with their stored state
JavaPairRDD finalData = originalData
    .filter(KeepCompleteData)
    .union(conciliationRDD.join(incompleteData));
finalData.saveAsTextFile(dir);

The collect part is what frightens me the most, as there may be a lot of
different paths. Does that seem fine? Would an approach with HBase allow me
to simply join the incomplete data with the stored state using a key?

Thanks
