spark-user mailing list archives

From Nirav Patel <npa...@xactlycorp.com>
Subject Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD
Date Sat, 02 Apr 2016 09:58:00 GMT
@Ilya Ganelin, I'm not sure how zipWithIndex() would do less than an O(n) scan.
The Spark docs don't mention anything about it.

I found a solution in Spark 1.5.2: OrderedRDDFunctions has a filterByRange
API.
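
To illustrate why a range filter over key-ordered, range-partitioned data can
skip most of the data, here is a minimal plain-Python sketch. It is a
conceptual model only, not Spark code and not Spark's implementation: the
helper names range_partition and filter_by_range are hypothetical, and the
"partitions" are ordinary Python lists. The assumption it encodes is that each
partition covers a contiguous key range, which in Spark holds when the RDD is
range-partitioned (for example after sortByKey); without that, a range filter
has to look at every partition, like a plain filter.

```python
from bisect import bisect_left, bisect_right


def range_partition(sorted_pairs, num_partitions):
    """Split key-sorted (key, value) pairs into contiguous, non-empty chunks."""
    size = -(-len(sorted_pairs) // num_partitions)  # ceiling division
    return [sorted_pairs[i:i + size] for i in range(0, len(sorted_pairs), size)]


def filter_by_range(partitions, lower, upper):
    """Return pairs with lower <= key <= upper, plus how many partitions were scanned.

    Only partitions whose key range can overlap [lower, upper] are visited.
    """
    firsts = [p[0][0] for p in partitions]   # smallest key in each partition
    lasts = [p[-1][0] for p in partitions]   # largest key in each partition
    first = bisect_left(lasts, lower)        # first partition that may hold `lower`
    last = bisect_right(firsts, upper) - 1   # last partition that may hold `upper`
    scanned = partitions[first:last + 1]
    hits = [(k, v) for part in scanned for (k, v) in part if lower <= k <= upper]
    return hits, len(scanned)


# Hypothetical data: 1000 rows keyed 1..1000, split into 10 range partitions.
pairs = [(k, f"row-{k}") for k in range(1, 1001)]
partitions = range_partition(pairs, 10)
hits, scanned = filter_by_range(partitions, 150, 250)
print(len(hits), "rows found after scanning", scanned, "of", len(partitions), "partitions")
```

Both bounds are treated as inclusive here, matching how filterByRange is
documented. Within each visited partition the scan is still linear; the saving
comes from never touching the partitions outside the range.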

Thanks

On Sun, Jan 24, 2016 at 10:27 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:

> One thing you can also look at is to save your data in a way that can be
> accessed through file patterns, e.g. by hour, zone, etc., so that you only
> load what you need.
> On Jan 24, 2016 10:00 PM, "Ilya Ganelin" <ilganeli@gmail.com> wrote:
>
>> The solution I normally use is to zipWithIndex() and then use the filter
>> operation. Filter is an O(m) operation where m is the size of your
>> partition, not an O(N) operation.
>>
>> -Ilya Ganelin
>>
>> On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel <npatel@xactlycorp.com>
>> wrote:
>>
>>> The problem is that I have an RDD of about 10M rows and it keeps growing.
>>> Every time we want to query and compute on a subset of the data, we have
>>> to use filter and then some aggregation. I know filter goes through every
>>> partition and every row of the RDD, which may not be efficient at all.
>>>
>>> Given that Spark has OrderedRDDFunctions, I don't see why it's so
>>> difficult to implement such a function. Cassandra/HBase have had it for
>>> years: they can fetch data only from certain partitions based on your row
>>> key. Scala's TreeMap has a range function to do the same.
>>>
>>> I think people have been looking for this for a while. I see several posts
>>> asking about this.
>>>
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-td20170.html#a26048
>>>
>>> By the way, I assume there
>>> Thanks
>>> Nirav

