spark-user mailing list archives

From Somnath Pandeya <Somnath_Pand...@infosys.com>
Subject how to find near duplicate items from given dataset using spark
Date Thu, 02 Apr 2015 08:18:23 GMT
Hi All,

I want to find near-duplicate items in a given dataset.
For example, consider this dataset:

1.       Cricket,bat,ball,stumps

2.       Cricket,bowler,ball,stumps

3.       Football,goalie,midfielder,goal

4.       Football,referee,midfielder,goal

Here 1 and 2 are near duplicates (only field 2 differs), and 3 and 4 are near duplicates (again, only field 2 differs).

This is what I did:
I created an Article class and implemented the equals and hashCode methods (my hashCode method returns a constant (1) for all objects).
In Spark I use Article as the key and do a groupByKey on it.
Is this approach correct, or is there a better approach?

This is what my code looks like:

Article Class
import java.io.Serializable;

public class Article implements Serializable {

    private static final long serialVersionUID = 1L;
    private String first;
    private String second;
    private String third;
    private String fourth;

    public Article() {
        set("", "", "", "");
    }

    public Article(String first, String second, String third, String fourth) {
        set(first, second, third, fourth);
    }

    @Override
    public int hashCode() {
        // Constant hash code: every Article falls into the same hash bucket,
        // so grouping is decided entirely by equals().
        return 1;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Article other = (Article) obj;
        // Two articles are considered equal when ANY single field matches.
        return first.equals(other.first) || second.equals(other.second)
                || third.equals(other.third) || fourth.equals(other.fourth);
    }

    private void set(String first, String second, String third, String fourth) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.fourth = fourth;
    }
}
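One thing I am not sure about: since equals() above returns true when any single field matches, the relation it defines is not transitive (a.equals(b) and b.equals(c) do not imply a.equals(c)), which breaks the equals() contract and could make the grouping depend on the order records are seen. A small made-up check (class name and sample rows are just for illustration):

public class EqualsCheck {
    public static void main(String[] args) {
        Article a = new Article("Cricket", "bat", "ball", "stumps");
        Article b = new Article("Cricket", "bowler", "net", "goal");
        Article c = new Article("Football", "bowler", "midfielder", "goal");

        System.out.println(a.equals(b)); // true  (field 1 matches)
        System.out.println(b.equals(c)); // true  (fields 2 and 4 match)
        System.out.println(a.equals(c)); // false (no field matches)
    }
}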


Spark Code

import java.util.Map;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class NearDuplicateApp { // class name assumed; the listing omitted it

    public static void main(String[] args) throws Exception {

        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount")
                .setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile("data1/*");

        // Parse each CSV line into an Article. (This RDD is currently unused;
        // the pair RDD below is built directly from `lines`.)
        JavaRDD<Article> articles = lines.map(new Function<String, Article>() {

            private static final long serialVersionUID = 1L;

            public Article call(String line) throws Exception {
                String[] words = line.split(",");
                return new Article(words[0], words[1], words[2], words[3]);
            }
        });

        // Key each line by its Article; with the custom equals()/hashCode(),
        // groupByKey buckets "equal" (near-duplicate) articles together.
        JavaPairRDD<Article, String> articlePair = lines
                .mapToPair(new PairFunction<String, Article, String>() {

                    private static final long serialVersionUID = 1L;

                    public Tuple2<Article, String> call(String line)
                            throws Exception {
                        String[] words = line.split(",");
                        Article article = new Article(words[0], words[1],
                                words[2], words[3]);
                        return new Tuple2<Article, String>(article, line);
                    }
                });

        JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair
                .groupByKey();

        Map<Article, Iterable<String>> dupArticles = articlePairs
                .collectAsMap();

        System.out.println("size: " + dupArticles.size());

        Set<Article> uniqueArticle = dupArticles.keySet();

        for (Article article : uniqueArticle) {
            Iterable<String> temps = dupArticles.get(article);
            System.out.println("keys " + article);
            for (String string : temps) {
                System.out.println(string);
            }
            System.out.println("==============");
        }
        ctx.stop();
    }
}
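
For comparison, one alternative I have been thinking about avoids overriding equals()/hashCode() altogether: emit one key per "leave one field out" combination, so two records that differ in exactly one field share at least one key. A rough, untested sketch of that idea (class name, input path, and key format are illustrative):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

public class LeaveOneOutKeys { // illustrative class name

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LeaveOneOutKeys")
                .setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        JavaPairRDD<String, String> keyed = ctx.textFile("data1/*")
                .flatMapToPair(new PairFlatMapFunction<String, String, String>() {

                    private static final long serialVersionUID = 1L;

                    public Iterable<Tuple2<String, String>> call(String line) {
                        String[] f = line.split(",");
                        List<Tuple2<String, String>> out =
                                new ArrayList<Tuple2<String, String>>();
                        // Emit one key per "leave one field out" combination;
                        // records differing in exactly one field share a key.
                        for (int skip = 0; skip < f.length; skip++) {
                            StringBuilder key = new StringBuilder(skip + ":");
                            for (int i = 0; i < f.length; i++) {
                                if (i != skip) {
                                    key.append(f[i]).append('|');
                                }
                            }
                            out.add(new Tuple2<String, String>(key.toString(), line));
                        }
                        return out;
                    }
                });

        // Any key with more than one line is a near-duplicate candidate group.
        for (Tuple2<String, Iterable<String>> group : keyed.groupByKey().collect()) {
            System.out.println("key " + group._1());
            for (String line : group._2()) {
                System.out.println(line);
            }
            System.out.println("==============");
        }
        ctx.stop();
    }
}

(Exact duplicates would show up under every key, and each candidate group would still need a verification pass, but this keeps keys properly hashable instead of sending every record into one bucket.)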

