spark-user mailing list archives

From Horia <ho...@alum.berkeley.edu>
Subject Re: RDD function question
Date Mon, 16 Sep 2013 23:12:57 GMT
Stepping away from any particular framework, it seems to me that you can
never guarantee that you only read rows in that date range.

Even with a sorted array, you need an O(log N) binary search to find
each of your boundary dates. You could maintain explicit pointers to these
boundaries, but that turns out to be moot because a) your dates are changing
dynamically, so updating them and maintaining the sorted order requires at
least O(log N) operations anyway, and b) you are dealing with files, not
arrays: finding a particular line in a plain text file means scanning it one
line at a time, in O(N).
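
To illustrate the in-memory case only, here is a minimal sketch of the
binary search on a sorted array of dates. The function name range_bounds
and the sample dates are made up for illustration; it relies on
YYYY-MM-DD strings sorting lexicographically in chronological order.

```python
from bisect import bisect_left, bisect_right

def range_bounds(sorted_dates, start, end):
    """Return (lo, hi) so that sorted_dates[lo:hi] lies in [start, end].

    Hypothetical helper: assumes sorted_dates is an in-memory list of
    YYYY-MM-DD strings already sorted in ascending order.
    """
    lo = bisect_left(sorted_dates, start)   # first index with date >= start
    hi = bisect_right(sorted_dates, end)    # first index with date > end
    return lo, hi

# Made-up sample data, sorted by date.
dates = ["2004-06-30", "2005-01-01", "2006-03-15", "2007-12-31", "2008-01-01"]
lo, hi = range_bounds(dates, "2005-01-01", "2007-12-31")
in_range = dates[lo:hi]
```

Each boundary costs one O(log N) search, but only because the dates are
already held in a random-access array.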

So, back to your Spark-specific question: you cannot do better than O(N)
with a file anyway, so why worry about anything more sophisticated than a
'filter' transformation?
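
A rough sketch of such a filter, in plain Python so it runs without a
cluster. The predicate name in_date_range and the sample records are
hypothetical, and it assumes every line starts with a well-formed
YYYY-MM-DD date.

```python
def in_date_range(line, start="2005-01-01", end="2007-12-31"):
    """True if the line's first ten characters fall within [start, end].

    Assumes the line begins with a YYYY-MM-DD date, which compares
    correctly as a plain string.
    """
    return start <= line[:10] <= end

# Made-up sample records standing in for the lines of the text file.
lines = [
    "2004-12-31 record a",
    "2005-01-01 record b",
    "2006-07-04 record c",
    "2007-12-31 record d",
    "2008-01-01 record e",
]

# Every line is still read once; only the matching ones are kept.
selected = [line for line in lines if in_date_range(line)]
```

In PySpark the same predicate could presumably be handed to the RDD
directly, along the lines of sc.textFile(path).filter(in_date_range),
where sc and path are whatever your job already defines.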

On Sep 16, 2013 3:51 PM, "Satheessh" <satheessh1@gmail.com> wrote:

> 1. The date is dynamic (i.e., if the date changes, we shouldn't read all
> records).
> It looks like the solution below will read all the records if the date
> changes. (Please correct me if I am wrong.)
>
> 2. We can assume the file is sorted by date.
>
> On Sep 16, 2013, at 5:27 PM, Horia <horia@alum.berkeley.edu> wrote:
>
> Without sorting, you can implement this using the 'filter' transformation.
>
> This will eventually read all the rows once, but subsequently only shuffle
> and send the transformed data which passed the filter.
>
> Does this help, or did I misunderstand?
> On Sep 16, 2013 1:37 PM, "satheessh chinnu" <satheessh1@gmail.com> wrote:
>
>> I have a text file. Each line is a record, and the first ten characters
>> on each line are a date in YYYY-MM-DD format.
>>
>> I would like to run a map function on this RDD over a specific date range
>> (i.e., from 2005-01-01 to 2007-12-31). I would like to avoid reading the
>> records outside the specified date range (i.e., a kind of primary index
>> sorted by date).
>>
>> Is there a way to implement this?
>>
>>
>>
