lucene-solr-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Document Similarity Algorithm at Solr/Lucene
Date Tue, 23 Jul 2013 14:42:06 GMT
Thanks for your comments.

2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>

> if you need a specialized algorithm for detecting blogposts plagiarism /
> quotations (which are different tasks IMHO) I think you have 2 options:
> 1. implement a dedicated one based on your features / metrics / domain
> 2. try to fine tune an existing algorithm that is flexible enough
>
> If I were to do it with Solr I'd probably do something like:
> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
> about ngrams / shingles)
> 2. do MLT queries using the text of each "candidate blogpost copy"
> 3. get the first, say, 2-3 hits
> 4. mark the candidate as quote / plagiarism of those hits
> 5. eventually train a classifier to help you mark other texts as quote /
> plagiarism
>
> HTH,
> Tommaso
>
>
>
> 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
>
> > Actually I need a specialized algorithm. I want to use that algorithm to
> > detect duplicate blog posts.
> >
> > 2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>
> >
> > > Hi,
> > >
> > > I think you may leverage and / or improve the MLT component [1].
> > >
> > > HTH,
> > > Tommaso
> > >
> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >
> > >
> > > 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
> > >
> > > > Hi;
> > > >
> > > > Sometimes a large part of a document may exist in another document,
> > > > as in student plagiarism or the quotation of one blog post in another.
> > > > Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class
> > > > to detect it?
> > > >
> > >
> >
>
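
To make the steps Tommaso lists above concrete, here is a minimal in-memory sketch of the shingle idea in Python. It is not Solr's MLT component and makes no Solr calls. It only illustrates steps 1-4 under these assumptions: posts are plain strings, a shingle is a run of k consecutive words, and a candidate is flagged when at least half of its shingles occur in an original. The function names (shingles, containment, top_matches) and the defaults (k=3, limit=3, threshold=0.5) are hypothetical and would need tuning on real data.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(candidate, original, k=3):
    """Fraction of the candidate's shingles that also occur in the original."""
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    return len(cand & shingles(original, k)) / len(cand)


def top_matches(candidate, originals, k=3, limit=3, threshold=0.5):
    """Return up to `limit` (doc_id, score) pairs scoring at least `threshold`.

    `originals` maps a document id to its text (the "index" of step 1).
    Results are sorted by descending score, ties broken by document id.
    """
    scored = [(containment(candidate, text, k), doc_id)
              for doc_id, text in originals.items()]
    kept = [(score, doc_id) for score, doc_id in scored if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in kept[:limit]]


if __name__ == "__main__":
    # Made-up example posts, for illustration only.
    originals = {
        "post-1": "Solr is an open source enterprise search platform "
                  "built on Apache Lucene",
        "post-2": "Cooking pasta requires plenty of salted boiling water",
    }
    candidate = ("As someone wrote, Solr is an open source enterprise "
                 "search platform built on Apache Lucene")
    for doc_id, score in top_matches(candidate, originals):
        print(doc_id, round(score, 3))
```

With a real index, the scoring loop would be replaced by a query against Solr (step 2), and the threshold check could later be replaced by the classifier mentioned in step 5.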
