spark-user mailing list archives

From satheessh chinnu <>
Subject Re: RDD function question
Date Tue, 17 Sep 2013 16:05:51 GMT
Thanks for the explanation. I understand we cannot do better than O(N) with files.

  Is there any way we can achieve O(log n) using Spark? For example, by
storing the data as an in-memory distributed binary tree (key = date,
value = line) and then executing a map function on each node.
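For reference, since the dates are zero-padded YYYY-MM-DD strings, lexicographic string comparison already matches chronological order, so the 'filter' predicate suggested earlier in the thread needs no date parsing. A minimal sketch in plain Python (the date bounds come from the original question; the `sc.textFile(...)` usage shown in the comment is the standard Spark API, and the file name there is a placeholder):

```python
def in_range(line, start="2005-01-01", end="2007-12-31"):
    # The first ten characters of each record hold the date.
    # Zero-padded YYYY-MM-DD strings compare correctly as strings,
    # so a plain lexicographic comparison gives the date-range test.
    return start <= line[:10] <= end

# In PySpark this predicate would be used roughly as:
#   matched = sc.textFile("records.txt").filter(in_range)
#   result = matched.map(some_transform)  # some_transform is hypothetical

# Plain-Python demonstration of the same predicate:
lines = [
    "2004-12-31 before the range",
    "2005-01-01 first day in range",
    "2006-06-15 inside the range",
    "2008-01-01 after the range",
]
kept = [l for l in lines if in_range(l)]
```

Note this still touches every line once, which is the O(N) cost discussed below; the predicate only avoids shuffling and transforming records outside the range.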

On Mon, Sep 16, 2013 at 7:12 PM, Horia <> wrote:

> Stepping away from any particular framework, it seems to me that you can
> never guarantee that you only read rows in that date range.
> Even with a sorted array, you need an O(log N) binary search to find
> each of your boundary dates. You could maintain explicit pointers to these
> boundaries, but that turns out to be moot because a) your dates change
> dynamically, so updating the pointers while maintaining sorted order
> requires at least O(log N) operations anyway, and b) you are dealing with
> files, not arrays - files require you to seek to a particular line number
> one line at a time, in O(N).
> So, back to your Spark-specific question: you cannot do better than O(N)
> with a file anyway, so why worry about anything more sophisticated than a
> 'filter' transformation?
>  On Sep 16, 2013 3:51 PM, "Satheessh" <> wrote:
>> 1. The date is dynamic (i.e., if the date changes we shouldn't read all
>> records). It looks like the solution below will read all the records
>> whenever the date changes. (Please correct me if I am wrong.)
>> 2. We can assume the file is sorted by date.
>> Sent from my iPhone
>> On Sep 16, 2013, at 5:27 PM, Horia <> wrote:
>> Without sorting, you can implement this using the 'filter' transformation.
>> This will still read all the rows once, but subsequently only shuffles
>> and sends the transformed data that passed the filter.
>> Does this help, or did I misunderstand?
>> On Sep 16, 2013 1:37 PM, "satheessh chinnu" <> wrote:
>>> I have a text file. Each line is a record, and the first ten characters
>>> of each line are a date in YYYY-MM-DD format.
>>> I would like to run a map function on this RDD over a specific date range
>>> (i.e., from 2005-01-01 to 2007-12-31). I would like to avoid reading the
>>> records outside the specified date range (i.e., something like a primary
>>> index sorted by date).
>>> Is there a way to implement this?
