spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Accessing the reduce key
Date Thu, 20 Mar 2014 21:52:26 GMT
Are you using the grouped (sorted?) data to create the Bloom filter?
On Mar 20, 2014 4:35 PM, "Surendranauth Hiraman" <suren.hiraman@velos.io>
wrote:

> Mayur,
>
> To be a little clearer, for creating the Bloom Filters, I don't think
> broadcast variables are the way to go, though definitely that would work
> for using the Bloom Filters to filter data.
>
> The reason why is that the creation needs to happen in a single thread.
> Otherwise, some type of locking/distributed locking is needed on the
> individual Bloom Filter itself, with performance impact.
>
> Agreed?
>
> -Suren
>
>
>
>
> On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman <
> suren.hiraman@velos.io> wrote:
>
>> Mayur,
>>
>> Thanks. This step is for creating the Bloom Filter, not using it to
>> filter data, actually. But your answer still stands.
>>
>> Partitioning by key, having the Bloom filters as a broadcast variable, and
>> then doing mapPartitions makes sense.
>>
>> Are there performance implications for this approach, such as with using
>> the broadcast variable, versus the approach we used, in which the Bloom
>> Filter (again, for creating it) is only referenced by the single map
>> application?
>>
>> -Suren
>>
>>
>>
>>
>>
>> On Thu, Mar 20, 2014 at 3:20 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>
>>> Why are you trying to reduceByKey? Are you looking to work on the data
>>> sequentially?
>>> If I understand correctly, you are looking to filter your data using the
>>> Bloom filter, and each Bloom filter is tied to the key that instantiates it.
>>> Following are some of the options:
>>>
>>> 1. Partition your data by key and use the mapPartitions operator to run a
>>> function on each partition independently. The same function will be
>>> applied to each partition.
>>>
>>> 2. If your Bloom filters are large, you can bundle all of them into a
>>> broadcast variable and apply the transformation to your data with a
>>> simple map operation: look up the right Bloom filter for each key and
>>> apply the filter to the record.
>>>
>>> 3. If deserializing the Bloom filter is time consuming, you can partition
>>> the data by key and then use the broadcast variable to look up the Bloom
>>> filter for each key and apply the filter to all of that key's data
>>> serially.
>>>
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Thu, Mar 20, 2014 at 1:55 PM, Surendranauth Hiraman <
>>> suren.hiraman@velos.io> wrote:
>>>
>>>> We ended up going with:
>>>>
>>>> map() - set the group_id as the key in a Tuple
>>>> reduceByKey() - end up with (K,Seq[V])
>>>> map() - create the bloom filter and loop through the Seq and persist
>>>> the Bloom filter
>>>>
>>>> This seems to be fine.
>>>>
>>>> I guess Spark cannot optimize the reduceByKey and map steps to occur
>>>> together since the fact that we are looping through the Seq is out of
>>>> Spark's control.
>>>>
>>>> -Suren
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman <
>>>> suren.hiraman@velos.io> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My team is trying to replicate an existing Map/Reduce process in Spark.
>>>>>
>>>>> Basically, we are creating Bloom Filters for quick set membership
>>>>> tests within our processing pipeline.
>>>>>
>>>>> We have a single column (call it group_id) that we use to partition
>>>>> into sets.
>>>>>
>>>>> As you would expect, in the map phase, we emit the group_id as the key
>>>>> and in the reduce phase, we instantiate the Bloom Filter for a given key
>>>>> in the setup() method and persist that Bloom Filter in the cleanup()
>>>>> method.
>>>>>
>>>>> In Spark, we can do something similar with map() and reduceByKey() but
>>>>> we have the following questions.
>>>>>
>>>>>
>>>>> 1. Accessing the reduce key
>>>>> In reduceByKey(), how do we get access to the specific key within the
>>>>> reduce function?
>>>>>
>>>>>
>>>>> 2. Equivalent of setup/cleanup
>>>>> Where should we instantiate and persist each Bloom Filter by key? In
>>>>> the driver and then pass in the references to the reduce function? But
>>>>> if so, how does the reduce function know which set's Bloom Filter it
>>>>> should be writing to (question 1 above)?
>>>>>
>>>>> It seems if we use groupByKey and then reduceByKey, that gives us
>>>>> access to all of the values at one go. I assume there, Spark will manage
>>>>> if those values don't all fit in memory in one go.
>>>>>
>>>>>
>>>>>
>>>>> SUREN HIRAMAN, VP TECHNOLOGY
>>>>> Velos
>>>>> Accelerating Machine Learning
>>>>>
>>>>> 440 NINTH AVENUE, 11TH FLOOR
>>>>> NEW YORK, NY 10001
>>>>> O: (917) 525-2466 ext. 105
>>>>> F: 646.349.4063
>>>>> E: suren.hiraman@velos.io
>>>>> W: www.velos.io
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
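
Editor's sketch: the pattern discussed in this thread (group records by group_id, then build one Bloom filter per key in a single pass, with the key available next to its values) can be illustrated in plain Python without Spark. Everything named here is hypothetical and for illustration only: the BloomFilter class, its num_bits and num_hashes parameters, the group_by_key and build_filter helpers, and the sample records. In Spark the rough equivalent would be a groupByKey followed by a map over the resulting (key, values) pairs, which is one way the key becomes visible to the function that builds the filter.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical helper, not a Spark API)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))


def group_by_key(records):
    """Stand-in for the shuffle: collect all values for each group_id."""
    grouped = defaultdict(list)
    for group_id, value in records:
        grouped[group_id].append(value)
    return grouped.items()


def build_filter(pair):
    """Runs once per key, so each filter is only ever written by one caller."""
    group_id, values = pair
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return group_id, bloom


# Assumed sample data: (group_id, value) pairs.
records = [("g1", "alice"), ("g2", "bob"), ("g1", "carol"), ("g2", "dave")]
filters = dict(map(build_filter, group_by_key(records)))

print(sorted(filters))                       # ['g1', 'g2']
print(filters["g1"].might_contain("alice"))  # True
```

Because each filter is filled by exactly one call to build_filter, no locking on the filter is needed during creation, which is the concern raised above. The finished filters dictionary is the kind of read-only lookup table that could later be shared for filtering, for example as a broadcast variable.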
