lucene-solr-user mailing list archives

From Domènec Sos i Vallès <>
Subject MoreLikeThis for finding PDF/Word/etc documents with similar or copied sections.
Date Tue, 06 Sep 2011 08:40:08 GMT
Hello, our current goal is to find a solution for a translation company. Their issue is that
they very often have to translate documents containing parts that were copied and pasted
from another document that was already translated, so they do the same work more than once.

I am a newcomer to Lucene/Solr, so I took the Solr tutorial (kudos to whoever contributed
it, very good) and did some reading of Lucene in Action and the Solr 3.1 Cookbook.

My understanding of the solution is as follows, and I'd appreciate criticism of whatever
is wrong or missing, or alternative solutions. Thanks in advance. At the end, I ask about
potential issues I see right now.

- Add a field to schema.xml for the contents of the document; it must be stored and have
termVectors enabled (thanks, oh thy cookbook).

- Import the documents with Solr Cell and take care of routing the document contents (which
may come in different fields depending on the import tool used by Tika) to the stored and
termVector field.

- Store the document with a unique id (this is mandatory as the document is associated with
an id in the main system).

- Do searches on the unique id with the "more like this" parameters in the URL.
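
As a minimal sketch of that last step, the query URL could be built as below. The base URL
(http://localhost:8983/solr/select), the field names `id` and `content`, and the threshold
values are assumptions for illustration only; `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf` and
`mlt.count` are the standard MoreLikeThis request parameters. The helper name
`build_mlt_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, field="content", count=10):
    """Build a Solr select URL that asks for documents similar to doc_id.

    Assumes the unique key field is called `id` and the stored,
    termVector-enabled text field is called `content` (both hypothetical).
    """
    params = {
        "q": f'id:"{doc_id}"',   # look up the source document by its unique id
        "mlt": "true",           # enable the MoreLikeThis component
        "mlt.fl": field,         # field(s) used to compute similarity
        "mlt.mintf": 2,          # ignore terms occurring less often in the source doc
        "mlt.mindf": 2,          # ignore terms occurring in fewer documents
        "mlt.count": count,      # number of similar documents to return
        "fl": "id,score",        # only return ids and scores
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_mlt_url("http://localhost:8983/solr/select", "doc-42"))
```

The thresholds would likely need tuning for long documents; the values shown are placeholders.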

My concerns about possible issues are:

- Performance: Will this work with thousands of documents containing from one page up to hundreds
of pages?

- Correctness: If 10 out of 50 pages are copied and pasted, should we get at least 20% similarity?
Will this score higher than documents that have the same words but in different positions?
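
To make the "same words, different positions" concern concrete, here is a small sketch that
compares plain word overlap with overlap of word n-grams (shingles). This is not what
MoreLikeThis computes; it is a separate, commonly used technique shown only to illustrate the
difference. The function names and the sample sentences are made up for the example.

```python
def shingles(text, k):
    """Return the set of k-word shingles (consecutive word runs) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(source, candidate, k=3):
    """Fraction of the source's k-word shingles that also occur in candidate."""
    src = shingles(source, k)
    if not src:
        return 0.0
    return len(src & shingles(candidate, k)) / len(src)


original = "the contract shall be governed by the laws of spain"
# Same words as the original, but in a different order.
shuffled = "spain laws governed the contract by shall be of the"
# Contains a long verbatim run taken from the original.
copied = "this agreement and the contract shall be governed by the laws of france"

if __name__ == "__main__":
    # With k=1 only the vocabulary is compared, so word order is ignored.
    print("word overlap, shuffled:   ", containment(original, shuffled, k=1))
    # With k=3 only runs of three consecutive words count as a match.
    print("shingle overlap, shuffled:", containment(original, shuffled, k=3))
    print("shingle overlap, copied:  ", containment(original, copied, k=3))
```

In this toy case the shuffled text shares every word with the original but no three-word run,
while the partially copied text shares most of its three-word runs.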

Please note, in the subject I said "similar or copied sections" instead of some other term
such as, say, chapters, which may require understanding of the document structure. Sources of
documents are very diverse and there is no easy way to find out any sort of structure.

Thanks for reading, thanks for replying.

Domènec Sos i Vallès
