spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Cartesian join on RDDs taking too much time
Date Wed, 25 May 2016 14:09:22 GMT
Solr or Elasticsearch provide much more functionality and are faster in this context. The
decision for or against them depends on your current and future use cases. Your current use
case is still very abstract, so to get a more suitable recommendation you need to provide
more details, including the size of the datasets and what you do with the result of the
matching: do you just need the number of matches, or also the matched pairs, etc.

Your concrete problem can also be solved in Spark (it is not the best or most efficient
tool for this, but it has other strengths) using map and reduce steps. There are different
ways to implement this: generate the pairs from the input datasets in the map step, or (maybe
less recommendable) broadcast the smaller dataset to all nodes and do the matching against the
bigger dataset there.
Which approach fits depends heavily on the data in your datasets, how they compare in size, etc.
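
As a rough illustration of the broadcast variant, here is a minimal sketch in plain Python
(no Spark needed to run it). The names similarity, match_partition and THRESHOLD are
hypothetical, difflib's SequenceMatcher is only a stand-in for whatever fuzzy metric you
actually use, and the nested lists stand in for RDD partitions. In Spark the small list
would roughly correspond to the value of a broadcast variable (sc.broadcast) and
match_partition to a function passed to mapPartitions on the bigger RDD.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # the "> 85%" cut-off mentioned in the question


def similarity(a, b):
    """Similarity ratio in [0, 1]. Stand-in metric, not necessarily the one you need."""
    return SequenceMatcher(None, a, b).ratio()


def match_partition(partition, small_side, threshold=THRESHOLD):
    """Compare every string in one partition of the bigger dataset
    with every string of the smaller (broadcast) dataset."""
    for big in partition:
        for small in small_side:
            score = similarity(small, big)
            if score > threshold:
                yield (small, big, score)


# Sample data from the thread; the partitioning of B is an assumption.
small = ["hi", "bye", "ch"]
big_partitions = [["padma", "hihi"], ["chch", "priya"]]

# Each "node" only sees its own partition plus the full copy of the small side.
matches = [m for part in big_partitions for m in match_partition(part, small)]
print(matches)
```

The full pair space is still compared, but the smaller dataset is shipped once to each node
instead of being shuffled as part of a cartesian product. This only works if the smaller
dataset fits in memory on every node.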



> On 25 May 2016, at 13:27, Priya Ch <learnings.chitturi@gmail.com> wrote:
> 
> Why do I need to deploy Solr for text analytics? I have files placed in HDFS and just need
> to look for matches against each string in both files and generate those records whose match
> is > 85%. We are trying to apply fuzzy match logic.
> 
> How can I use map/reduce operations across 2 RDDs?
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>> 
>> Alternatively, depending on the exact use case, you may employ Solr on Hadoop for text
>> analytics.
>> 
>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitturi@gmail.com> wrote:
>> >
>> > Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
>> > strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
>> > to check the matches found in RDD B; for example, for string "hi" I have to check
>> > the matches against all strings in RDD B, which means I need to generate every
>> > possible combination r
> 
