mahout-user mailing list archives

From Sean Owen
Subject Re: Some guidance for this noob - "Metadata Matching Engine"
Date Thu, 10 May 2012 22:02:16 GMT
It's closest to a clustering problem. Because your clusters are so
particular -- the elements are very close to each other, very distinct
from others -- it reduces to something similar.

If you had a good similarity metric for docs, you would just compare a
new doc against every existing doc and see which one it is nearly
identical to. (You could speed it up by keeping just one
representative doc for each cluster.)
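
A rough sketch of that matching loop, in Python for brevity. Everything
here is an assumption for illustration: the names assign_cluster and
toy_similarity are made up, docs are plain dicts, and the similarity
function is whatever metric you settle on, passed in as a parameter.

```python
def assign_cluster(new_doc, representatives, similarity, threshold):
    """Return (cluster_id, score) for the best-matching representative,
    or (None, score) when nothing clears the threshold."""
    best_id, best_score = None, 0.0
    for cluster_id, rep in representatives.items():
        score = similarity(new_doc, rep)
        if score > best_score:
            best_id, best_score = cluster_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None, best_score  # caller can start a new cluster of one


def toy_similarity(a, b):
    """Placeholder metric: share of common fields with equal values."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[f] == b[f] for f in shared) / len(shared)


# One representative doc per existing cluster (hypothetical data).
reps = {
    "c1": {"title": "blue widget", "maker": "acme"},
    "c2": {"title": "garden hose", "maker": "hydro"},
}
known = assign_cluster({"title": "blue widget", "maker": "acme"},
                       reps, toy_similarity, threshold=0.5)
unknown = assign_cluster({"title": "red kettle", "maker": "zed"},
                         reps, toy_similarity, threshold=0.5)
```

This is a linear scan over the representatives, so it is only meant to
show the shape of the idea; with many thousands of clusters you would
probably want some way to narrow the candidates first.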

The question is just one of constructing a similarity metric. Is it
true that duplicates will match on most fields, and non-duplicates
will match on virtually none? Then there's your metric: the fraction
of fields that match. There should be some bright-line threshold
between close and not-close.
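
To make that concrete, here is one way such a metric could look, as a
Python sketch. The function names, the 0.85 text ratio and the 1%
numeric tolerance are all assumptions picked for illustration, not
tuned values. Since your field values are often similar but not
identical, each field is compared fuzzily, and fields missing from
either doc are skipped.

```python
from difflib import SequenceMatcher


def fields_match(a, b, text_ratio=0.85, rel_tol=0.01):
    """True if two field values are close enough to count as the same."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= rel_tol * max(abs(a), abs(b))
    a, b = str(a).strip().lower(), str(b).strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= text_ratio


def similarity(doc_a, doc_b):
    """Fraction of the fields valued in both docs that match."""
    shared = [f for f in doc_a
              if doc_a[f] is not None and doc_b.get(f) is not None]
    if not shared:
        return 0.0
    hits = sum(fields_match(doc_a[f], doc_b[f]) for f in shared)
    return hits / len(shared)


# Hypothetical docs: a and b are logical duplicates, c is unrelated.
a = {"title": "Blue Widget 3000", "maker": "Acme Corp",
     "price": 19.99, "color": None}
b = {"title": "blue widget 3000", "maker": "ACME Corp.",
     "price": 20.00, "color": "blue"}
c = {"title": "Garden Hose", "maker": "Hydro Inc", "price": 45.0}

dup_score = similarity(a, b)    # every shared field matches
other_score = similarity(a, c)  # no shared field matches
```

If the scores on your already-clustered docs really do separate into
"nearly 1" for duplicates and "nearly 0" for everything else, picking
the threshold is easy; checking that on the existing data would be the
first thing to try.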


On Thu, May 10, 2012 at 10:57 PM, mBria wrote:
> Hi everyone,
> This may be a bit long, and I apologize up front.  I'm new to Mahout (and
> Machine Learning in general), and haven't actually built anything with it
> beyond the MiA book's examples.
> I'm looking for a little nudge/guidance on where to direct my next level of
> research/experimentation for a real-world problem.
> Basically, I need "document matching" support.  Context laundry-list:
> - "doc" is a somewhat sparse document with a set of 10-15 fields of varying
> length text (usually phrases) & numerical fields.
> - it's sparse in that not all fields will be valued for all docs
> - docs are almost always "logical duplicates" of a few other docs (say, 2-5
> on average);  we'll call a set of "dup docs" a "cluster"
> - there are millions of docs (and thus many thousands of "clusters")
> - although they are logical duplicates, the field values may be similar, but
> are often not identical (degree of "similarity" will vary non-trivially)
> - I've got an "example" document set (millions) already clustered (manually)
> in production
> So, what I want to build is a system that can take NEW documents, and give
> automated insight into which of the existing clusters a document belongs to,
> or an indication that it belongs to none.
> Initially, I saw this mostly as a "*CLASSIFICATION* problem":
> - I've got an immense /training set/ already
> - I want to "classify" new stuff based on smart /field-level similarity/
> evaluation
> - I want to pick one "class" (ie, cluster) the doc belongs to
> The problem with this (maybe?) is that I'm gathering that classification
> really works best for BINARY classes ("you go here, or you go there").  My
> case is that there are thousands of classes (clusters), and it may even be
> that the given doc doesn't really fit any of them well (in which case it
> should become a new cluster of one).  To a lesser degree, I'd like to know
> that, if I wanted, I could get the system to tell me a small set of clusters
> the new doc may fit well with, each with a "score".
> Looking at this then from a "*CLUSTERING* problem" angle:
> - yes, I want docs "clustered" based on similarity of their field values
> - but, I've already got the existing millions of docs clustered, and
> I just want to funnel new docs into the clusters
> So, while "clustering docs" is definitely the end result of the system, I
> don't really think this is an obvious "clustering problem" from the
> ML/Mahout POV.  Least not a standard one.
> Looking at this from a "*RECOMMENDATION* problem" angle:
> - I can kinda think of the existing clusters as containing
> docs "related" to the other docs in the cluster
> - Then I could say this new doc is like another existing doc, which
> "associates" to these other docs (in the cluster) therefore this new one
> associates to those other ones (and belongs in the cluster)
> But, beyond this being a real stretch and probably silly (useless), the big
> missing aspect is the ability to leverage doc field similarity.  It's
> advanced field value similarity which really drives the "match".  So, I
> don't think Recommenders help much here.
> My gut is telling me I want some hybrid of clustering and classification,
> but I'm not sure.
> So, my head is still running full-speed trying to see this in various ways
> and work out what I can use from Mahout in my system, but before I get
> too far down my own rabbit holes I wanted to Ask The Expert.
> Again, sorry for the novel!
> Any ideas, references to things to look at, anything at all that you think
> might be helpful would be great.  Not looking for anyone to "hand me the
> solution", but polling for guidance.
> Thanks much!
> Mike
