spark-user mailing list archives

From Laurent T <>
Subject Re: Advanced log processing
Date Mon, 19 May 2014 08:44:50 GMT
(Resending this, as a lot of mails seem not to have been delivered.)


I have some complex behavior I'd like advice on, as I'm really new to Spark.

I'm reading some log files that contain various events. There are two types
of events: parents and children. A child event can only have one parent and
a parent can have multiple children. 

Currently I'm mapping my lines to get a Tuple2(parentID, Tuple2(Parent,
List<Child>)) and then reducing by key to combine all children into one list
and associate them with their parent:
.reduceByKey(new Function2<Tuple2<Parent, List<Child>>, Tuple2<Parent,
List<Child>>, Tuple2<Parent, List<Child>>>(){...}).
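
To make the reduce step concrete, here is a minimal sketch of the merge logic such a Function2 could contain. It is written in plain Java so it runs without a cluster. Parent, Child and Family are hypothetical stand-ins (Family plays the role of Tuple2<Parent, List<Child>>), and I'm assuming a tuple built from a child line carries a null parent until the parent line is seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FamilyMerge {
    // Hypothetical stand-ins for the Parent and Child event types.
    static final class Parent {
        final String id;
        Parent(String id) { this.id = id; }
    }

    static final class Child {
        final String id;
        Child(String id) { this.id = id; }
    }

    // Plays the role of Tuple2<Parent, List<Child>>.
    // Assumption: parent is null when only child lines have been seen so far.
    static final class Family {
        final Parent parent;
        final List<Child> children;
        Family(Parent parent, List<Child> children) {
            this.parent = parent;
            this.children = children;
        }
    }

    // Combines two partial results for the same parentID.
    // Keeps whichever parent is present and concatenates the child lists
    // into a new list, so neither input is mutated.
    static Family merge(Family a, Family b) {
        Parent parent = (a.parent != null) ? a.parent : b.parent;
        List<Child> children = new ArrayList<>(a.children);
        children.addAll(b.children);
        return new Family(parent, children);
    }

    public static void main(String[] args) {
        Family parentOnly = new Family(new Parent("p1"), Collections.<Child>emptyList());
        Family childOnly = new Family(null, Arrays.asList(new Child("c1")));
        Family merged = merge(childOnly, parentOnly);
        System.out.println(merged.parent.id + " has " + merged.children.size() + " child(ren)");
    }
}
```

After the reduce, any Family whose parent is still null is one of the parentless tuples described below.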

It works fine on static data. But in production, I will have to process only
part of the log files; for instance, every day at midnight I'll process the
previous day's logs.

So I'm facing the problem that a Parent may arrive one day and its children on
the next day. Right after reducing, I end up with tuples that have no parent,
and I'd like, only for those, to check the previous log files to find the
parent in an efficient way.

My first idea would be to branch the data using a filter and its opposite. I'd
then read the previous files one by one until I've found all parents or I've
reached a predefined limit. Finally, I would merge everything back together to
finalize my job.
The problem is, I'm not even sure how I can do that. The filter part should
be easy, but how am I going to scan files one by one using Spark?
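
To illustrate the control flow I have in mind, here is a rough sketch in plain Java, not Spark code. OrphanResolver, resolve and maxDaysBack are hypothetical names, parents are represented as plain Strings, and each previous day's file is assumed to be already loaded as a map from parentID to parent. In Spark, I'd expect each day to be read with something like JavaSparkContext.textFile and matched against the orphans, but that part is exactly what I'm unsure about.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OrphanResolver {
    // Looks up missing parents in previous days, newest day first.
    // Stops as soon as every parent is found or maxDaysBack days were scanned.
    static Map<String, String> resolve(Set<String> orphanParentIds,
                                       List<Map<String, String>> previousDays,
                                       int maxDaysBack) {
        Map<String, String> found = new HashMap<>();
        Set<String> missing = new HashSet<>(orphanParentIds);
        int limit = Math.min(maxDaysBack, previousDays.size());

        for (int day = 0; day < limit && !missing.isEmpty(); day++) {
            Map<String, String> parentsThatDay = previousDays.get(day);
            for (Iterator<String> it = missing.iterator(); it.hasNext(); ) {
                String id = it.next();
                String parent = parentsThatDay.get(id);
                if (parent != null) {
                    found.put(id, parent);
                    it.remove(); // this parent no longer needs to be searched for
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> yesterday = Collections.singletonMap("p1", "parent-1");
        Map<String, String> twoDaysAgo = Collections.singletonMap("p2", "parent-2");
        Set<String> orphans = new HashSet<>(Arrays.asList("p1", "p2"));

        Map<String, String> found = resolve(orphans, Arrays.asList(yesterday, twoDaysAgo), 7);
        System.out.println("Resolved " + found.size() + " of " + orphans.size() + " parents");
    }
}
```

The resolved parents would then be merged back with the tuples that already had a parent, and anything still missing after the limit would be handled separately.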

I hope someone can guide me through this. 
FYI, there will be gigs of data to process. 

