spark-user mailing list archives

From Mayur Rustagi <>
Subject Re: Advanced log processing
Date Mon, 19 May 2014 21:06:46 GMT
It seems you are not reducing the data in size. If you are not, you are
better off partitioning the data into buckets (folders?) and keeping the
data in those buckets sorted.
A cleaner approach is to use HBase to keep track of keys: keep adding keys
as you find them and let HBase handle it.
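The bucketing idea can be sketched in plain Java. An in-memory `TreeMap` stands in for dated folders on disk, and every name below (`BucketSketch`, `add`, `findBucketFor`, the sample keys) is an illustrative assumption, not a Spark or HBase API:

```java
import java.util.*;

class BucketSketch {
    // bucket name (e.g. a day) -> sorted set of event keys seen in that bucket
    static final TreeMap<String, TreeSet<String>> buckets = new TreeMap<>();

    static void add(String day, String eventKey) {
        buckets.computeIfAbsent(day, d -> new TreeSet<>()).add(eventKey);
    }

    // Look for the bucket a key lives in, scanning the newest day first, so
    // a recently-missing parent is found after touching only a few buckets.
    static Optional<String> findBucketFor(String key) {
        return buckets.descendingMap().entrySet().stream()
                .filter(e -> e.getValue().contains(key))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        add("2014-05-18", "parent-42");
        add("2014-05-19", "child-of-42");
        System.out.println(findBucketFor("parent-42").orElse("not found"));
        // prints "2014-05-18"
    }
}
```

Because each bucket is sorted, the per-bucket membership check stays cheap even when a day holds many keys; HBase gives you the same sorted-by-key lookup without managing the folders yourself.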

Mayur Rustagi
Ph: +1 (760) 203 3257
@mayur_rustagi <>

On Mon, May 19, 2014 at 2:14 PM, Laurent T <> wrote:

> (resending this as a lot of mails seem not to be delivered)
> Hi,
> I have some complex behavior I'd like to be advised on, as I'm really new
> to Spark.
> I'm reading some log files that contain various events. There are two
> types of events: parents and children. A child event can only have one
> parent and a parent can have multiple children.
> Currently I'm mapping my lines to get a Tuple2(parentID, Tuple2(Parent,
> List<Child>)) and then reducing by key to combine all children into one
> list and associate them with their parent:
> .reduceByKey(new Function2<Tuple2<Parent, List<Child>>,
> Tuple2<Parent, List<Child>>,
> Tuple2<Parent, List<Child>>>() {...})
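To make that combine step concrete, here is a plain-Java sketch of the same reduce. A `HashMap.merge` stands in for `reduceByKey` so it can run without a cluster, and `Parent`/`Child`/`Pair` are hypothetical stand-ins for the event classes, not the poster's actual code:

```java
import java.util.*;

class ReduceSketch {
    // Hypothetical stand-ins for the event classes from the post.
    record Parent(String id) {}
    record Child(String id) {}
    record Pair(Parent parent, List<Child> children) {}

    // The Function2 body passed to reduceByKey: keep whichever side carries
    // the parent and concatenate the child lists.
    static Pair combine(Pair a, Pair b) {
        Parent p = (a.parent() != null) ? a.parent() : b.parent();
        List<Child> kids = new ArrayList<>(a.children());
        kids.addAll(b.children());
        return new Pair(p, kids);
    }

    public static void main(String[] args) {
        // A HashMap plays the role of the keyed RDD; merge() applies the
        // same pairwise combine that reduceByKey would.
        Map<String, Pair> byParentId = new HashMap<>();
        byParentId.merge("p1", new Pair(new Parent("p1"), List.of()), ReduceSketch::combine);
        byParentId.merge("p1", new Pair(null, List.of(new Child("c1"))), ReduceSketch::combine);
        byParentId.merge("p1", new Pair(null, List.of(new Child("c2"))), ReduceSketch::combine);

        Pair merged = byParentId.get("p1");
        System.out.println(merged.parent().id() + " -> " + merged.children().size() + " children");
        // prints "p1 -> 2 children"
    }
}
```

Note that if the parent event never shows up, the combine leaves `parent` null, which is exactly the orphan case the rest of the question is about.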
> It works fine on static data. But in production, I will have to process
> only part of the log files; for instance, every day at midnight I'll
> process the last day of logs.
> So I'm facing the problem that a parent may arrive one day and its
> children on the next. Right after reducing, I'm left with tuples that have
> no parent and, only for those, I'd like to go check the previous log files
> to find the parent in an efficient way.
> My first idea would be to branch the data using a filter and its opposite.
> I'd then read previous files one by one until I've found all parents or
> I've reached a predefined limit. I would finally merge everything back to
> finalize my job.
> The problem is, I'm not even sure how I can do that. The filter part
> should be easy, but how am I going to scan files one by one using Spark?
> I hope someone can guide me through this.
> FYI, there will be gigs of data to process.
> Thanks
> Laurent
