mahout-user mailing list archives

From Daniel Weitzenfeld <>
Subject Re: Some guidance for this noob - "Metadata Matching Engine"
Date Thu, 10 May 2012 22:11:37 GMT
I agree with Sean.

Once you've constructed an appropriate similarity metric, you can calculate
the similarity for a given document and all current clusters.
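
As a rough sketch of that step, assuming each cluster is summarized by a
single representative doc (as Sean suggests) and using a placeholder
token-overlap similarity: the names jaccard, score_against_clusters and the
sample data are all made up for illustration, and nothing here is a Mahout API.

```python
def jaccard(a: str, b: str) -> float:
    """Placeholder similarity: token-set overlap of two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score_against_clusters(doc, representatives, similarity=jaccard):
    """Score one doc against every cluster representative, best match first."""
    scores = [(cid, similarity(doc, rep)) for cid, rep in representatives.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: one representative string per existing cluster.
representatives = {
    "c1": "acme anvil 10 lb steel",
    "c2": "road runner bird seed",
}
ranked = score_against_clusters("ACME steel anvil 10 lb", representatives)
```

The ranked list also gives you the "small set of candidate clusters with a
score" that the original question asks about: just take the first few entries.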

Then, the decision rule you apply is essentially classifying the document.
You also have the option of saying, 'This is not similar enough to any of
my clusters, so I'm starting a new one.' For example, if you are using
hierarchical clustering, you can set a similarity threshold below which a
document will not be considered a cluster member.
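
A decision rule of that kind might look like the following sketch. The
function name assign, the score dictionary and the threshold value are
assumptions for illustration; the right threshold would have to come from
your own already-clustered data.

```python
from typing import Dict, Optional, Tuple


def assign(scores: Dict[str, float], threshold: float) -> Tuple[Optional[str], float]:
    """Return the best-scoring cluster id, or None if nothing clears the threshold.

    `scores` maps cluster id -> similarity of the new doc to that cluster.
    A None result means "start a new cluster for this doc".
    """
    if not scores:
        return None, 0.0
    best_id = max(scores, key=scores.get)
    best = scores[best_id]
    if best < threshold:
        return None, best
    return best_id, best


# Hypothetical similarity scores for one incoming doc.
matched = assign({"c1": 0.9, "c2": 0.2}, threshold=0.8)
unmatched = assign({"c1": 0.4, "c2": 0.2}, threshold=0.8)
```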

A side effect of treating it as a clustering problem is that the clusters
can adapt as new documents are added.  Think of K-means, for example: if you
add a document to a cluster, its mean may change, influencing further
classifications.  I say that you have the 'option' of this happening because
you could instead fix the cluster locations based on your 'training'
dataset, so to speak.
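
To make the two options concrete, here is a minimal sketch of a cluster whose
mean either moves with each added document or stays fixed. It assumes docs
have already been turned into numeric vectors; the Cluster class and its
frozen flag are hypothetical names, not anything from Mahout.

```python
from typing import List


class Cluster:
    """A cluster summarized by its mean vector and member count."""

    def __init__(self, centroid: List[float], count: int, frozen: bool = False):
        self.centroid = list(centroid)
        self.count = count
        self.frozen = frozen  # True: keep the location learned from 'training' data

    def add(self, point: List[float]) -> None:
        """Add a member; update the running mean unless the cluster is frozen."""
        if self.frozen:
            return
        self.count += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.centroid = [c + (x - c) / self.count
                         for c, x in zip(self.centroid, point)]


adaptive = Cluster([0.0, 0.0], count=1)
adaptive.add([2.0, 4.0])   # mean moves halfway toward the new point

fixed = Cluster([0.0, 0.0], count=1, frozen=True)
fixed.add([2.0, 4.0])      # mean stays where it was
```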

The first step is to construct a similarity metric that you think
appropriately captures your human understanding of dis/similarity.
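
One possible starting point, given the sparse docs with mixed text and
numerical fields described in the original question, is to average per-field
similarities over the fields both docs actually have. Everything below is an
assumption to be tuned against your manually clustered data: the field names,
the token-overlap text measure, the relative-difference numeric measure and
the equal field weights.

```python
def text_sim(a: str, b: str) -> float:
    """Token-set overlap of two phrases, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def num_sim(a: float, b: float) -> float:
    """1 minus the relative difference, clamped to [0, 1]."""
    scale = max(abs(a), abs(b))
    if scale == 0:
        return 1.0  # both values are zero
    return max(0.0, 1.0 - abs(a - b) / scale)


def doc_similarity(d1: dict, d2: dict) -> float:
    """Average field similarity over fields that are valued in both docs."""
    shared = [f for f in d1
              if d1[f] is not None and d2.get(f) is not None]
    if not shared:
        return 0.0
    total = 0.0
    for f in shared:
        a, b = d1[f], d2[f]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            total += num_sim(a, b)
        else:
            total += text_sim(str(a), str(b))
    return total / len(shared)


# Hypothetical docs; missing or None fields are simply skipped.
a = {"title": "acme steel anvil", "weight": 10.0, "color": None}
b = {"title": "ACME anvil steel", "weight": 10.0, "vendor": "acme"}
c = {"title": "bird seed", "weight": 5.0}
```

Skipping unshared fields keeps sparse docs from being penalized for missing
values, though you may want to require a minimum number of shared fields
before trusting the score.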

On Thu, May 10, 2012 at 6:02 PM, Sean Owen <> wrote:

> It's closest to a clustering problem. Because your clusters are so
> particular -- the elements are very close to each other, very distinct
> from others -- it reduces to something similar.
> If you had a good similarity metric for docs, you would just match a
> new doc against each other doc and figure out where it's
> nearly-identical to an existing doc. (You could speed it up by keeping
> just one representative doc for each cluster.)
> The question is just one of constructing a similarity metric. Is it
> true that duplicates will match on most fields, and non-duplicates
> will match on virtually none? Then there's your metric, and there
> should be some bright-line threshold between close and not-close
> documents.
> Sean
> On Thu, May 10, 2012 at 10:57 PM, mBria <> wrote:
> > Hi everyone,
> >
> > This may be a bit long, and I apologize up front.  I'm new to Mahout (And
> > Machine Learning in general), and haven't actually built anything beyond
> > the MiA book's examples with it.
> >
> > I'm looking for a little nudge/guidance on where to direct my next level
> > of research/experimentation for a real-world problem.
> >
> > Basically, I need "document matching" support.  Context laundry-list:
> > - "doc" is a somewhat sparse document with a set of 10-15 fields of
> >   varying length text (usually phrases) & numerical fields.
> > - it's sparse in that not all fields will be valued for all docs
> > - docs are almost always "logical duplicates" of a few other docs (say,
> >   2-5 on average);  we'll call a set of "dup docs" a "cluster"
> > - there are millions of docs (and thus many thousands of "clusters")
> > - although they are logical duplicates, the field values may be similar,
> >   but are often not identical (degree of "similarity" will vary
> >   non-trivially)
> > - I've got an "example" document set (millions) already clustered
> >   (manually) in production
> >
> > So, what I want to build is a system that can take NEW documents, and
> > give automated insight into which of the existing clusters this document
> > belongs to, or an indication that it belongs to none.
> >
> > Initially, I saw this mostly as a "*CLASSIFICATION* problem":
> > - I've got an immense /training set/ already
> > - I want to "classify" new stuff based on smart /field-level similarity/
> >   evaluation
> > - I want to pick one "class" (ie, cluster) the doc belongs to
> >
> > The problem with this (maybe?) is that I'm gathering that classification
> > really works best for BINARY classes ("you go here, or you go there").
> > My case is that there are thousands of classes (clusters), and it may
> > even be that the given doc doesn't really fit any of them well (in which
> > case it should become a new cluster of one).  To a lesser degree, I'd
> > like to know that, if I wanted, I could get the system to tell me a small
> > set of clusters the new doc may fit well with, each with a "score".
> >
> > Looking at this then from a "*CLUSTERING* problem" angle:
> > - yes, I want docs "clustered" based on similarity of their field values
> > - but, I've already got the existing millions of docs clustered, and I
> >   just want to funnel new docs into the clusters
> >
> > So, while "clustering docs" is definitely the end result of the system, I
> > don't really think this is an obvious "clustering problem" from the
> > ML/Mahout POV.  At least not a standard one.
> >
> > Looking at this from a "*RECOMMENDATION* problem" angle:
> > - I can kinda think of the existing clusters as containing docs
> >   "related" to the other docs in the cluster
> > - Then I could say this new doc is like another existing doc, which
> >   "associates" to these other docs (in the cluster), therefore this new
> >   one associates to those other ones (and belongs in the cluster)
> >
> > But, beyond this being a real stretch and probably silly (useless), the
> > big missing aspect is the ability to leverage doc field similarity.  It's
> > advanced field value similarity which really drives the "match".  So, I
> > don't think Recommenders help much here.
> >
> > My gut is telling me I want some hybrid of clustering and classification,
> > but I'm not sure.
> >
> > So, my head is still running full-speed trying to see this in various
> > ways to see what I can use from Mahout to contribute to my system, but
> > before I got too far down my own rabbit holes I wanted to Ask The Expert.
> >
> > Again, sorry for the novel!
> >
> > Any ideas, references to things to look at, anything at all that you
> > think might be helpful would be great.  Not looking for anyone to "hand
> > me the solution", but polling for guidance.
> >
> > Thanks much!
> > Mike
