spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sonal Goyal <>
Subject Re: Cartesian join on RDDs taking too much time
Date Wed, 25 May 2016 17:13:52 GMT
You can look at ways to group records from both rdds together instead of
doing Cartesian.  Say generate pair rdd from each with first letter as key.
Then do a partition and a join.
On May 25, 2016 8:04 PM, "Priya Ch" <> wrote:

> Hi,
>   RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would
> like to filter out the strings that have greater than 85% match and
> generate a score for it which is used in the susbsequent calculations.
> I tried generating pair rdd from both rdds A and B with same key for all
> the records. Now performing A.join(B) is also resulting in huge execution
> time..
> How do I go about with map and reduce here ? To generate pairs from 2 rdds
> I dont think map can be used because we cannot have rdd inside another rdd.
> Would be glad if you can throw me some light on this.
> Thanks,
> Padma Ch
> On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <> wrote:
>> Solr or Elastic search provide much more functionality and are faster in
>> this context. The decision for or against them depends on your current and
>> future use cases. Your current use case is still very abstract so in order
>> to get a more proper recommendation you need to provide more details
>> including size of dataset, what you do with the result of the matching do
>> you just need the match number or also the pairs in the results etc.
>> Your concrete problem can also be solved in Spark (though it is not the
>> best and most efficient tool for this, but it has other strength) using the
>> map reduce steps. There are different ways to implement this (Generate
>> pairs from the input datasets in the map step or (maybe less recommendable)
>> broadcast the smaller dataset to all nodes and do the matching with the
>> bigger dataset there.
>> This highly depends on the data in your data set. How they compare in
>> size etc.
>> On 25 May 2016, at 13:27, Priya Ch <> wrote:
>> Why do i need to deploy solr for text anaytics...i have files placed in
>> HDFS. just need to look for matches against each string in both files and
>> generate those records whose match is > 85%. We trying to Fuzzy match
>> logic.
>> How can use map/reduce operations across 2 rdds ?
>> Thanks,
>> Padma Ch
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <>
>> wrote:
>>> Alternatively depending on the exact use case you may employ solr on
>>> Hadoop for text analytics
>>> > On 25 May 2016, at 12:57, Priya Ch <>
>>> wrote:
>>> >
>>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD
>>> B of
>>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i
>>> need
>>> > to check the matches found in rdd B as such for string "hi" i have to
>>> check
>>> > the matches against all strings in RDD B which means I need generate
>>> every
>>> > possible combination r

View raw message