mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Wed, 09 Jun 2010 17:31:04 GMT
On Wed, Jun 9, 2010 at 5:11 AM, Kris Jack <mrkrisjack@gmail.com> wrote:

> Hi,
>
> Thanks very much for the code.  In implementing it, I got a little stuck
> on specifying the JobConf's "similarity job" -
>
>  JobConf conf = new JobConf("similarity job");
>
> I assume that this is where I should define how I would like two vectors
> to be compared with one another?  Please do correct me if that's wrong.
> If it is, though, could you point me to any examples of what this code
> should look like (e.g. cosine similarity)?  I'm sure that these kinds of
> similarity jobs must already exist in Mahout but, being new to both
> Mahout and MapReduce, I'm not sure where to find them.
>

In the sample I mentioned (using sparse matrix multiplication), you don't
get to choose the similarity - if the input vectors are unit-length
normalized, then the computation is cosine similarity.  You would have to
write your own map-reduce job to do a different one.
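
For illustration, here is a rough, untested sketch (class name made up, using
the same old-style mapred API as the code earlier in this thread) of a
map-only pass that unit-length normalizes each row before the multiplication,
so that the products come out as cosine similarities.  It assumes the matrix
is stored the usual way, as a SequenceFile of IntWritable row ids mapped to
VectorWritable rows:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.mahout.math.VectorWritable;

  public class UnitNormalizeMapper extends MapReduceBase
      implements Mapper<IntWritable, VectorWritable,
                        IntWritable, VectorWritable> {
    public void map(IntWritable row, VectorWritable vec,
        OutputCollector<IntWritable, VectorWritable> out, Reporter reporter)
        throws IOException {
      // normalize(2) returns a copy of the vector scaled to unit L2 length
      out.collect(row, new VectorWritable(vec.get().normalize(2)));
    }
  }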

  -jake




>
> Thanks,
> Kris
>
>
>
> 2010/6/8 Jake Mannix <jake.mannix@gmail.com>
>
> > Hi Kris,
> >
> >  So you already know how to make a sparse feature matrix out of
> > your Solr index, based on Grant's instructions?  Once you have that
> > matrix loaded into HDFS, then the following Java code should
> > create a document-document similarity matrix:
> >
> > //
> >  String inputPath = "/path/to/matrix/on/hdfs";
> >  String tmpPath = "/tmp/matrixmultiplyspace";
> >  int numDocuments = // whatever your numDocuments is
> >  int numTerms = // total number of terms in the matrix
> >
> >  // rows of this matrix are documents, columns are terms
> >  DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
> >    tmpPath, numDocuments, numTerms);
> >  JobConf conf = new JobConf("similarity job");
> >  text.configure(conf);
> >
> >  DistributedRowMatrix transpose = text.transpose();
> >
> >  // times() computes transpose^T * transpose = text * text^T, i.e. the
> >  // numDocuments x numDocuments matrix of document dot-products
> >  DistributedRowMatrix similarity = transpose.times(transpose);
> >  System.out.println("Similarity matrix lives: " +
> >    similarity.getRowPath());
> > //
> >
> > Now, the rows of this similarity matrix are going to be way too dense,
> > so you'll want to write a small map-reduce job (well, no reduce is
> > necessary) to run over the matrix and trim out all the useless entries
> > of each row (a rough sketch follows below), but that shouldn't be too
> > hard to do.
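> >
> > As a minimal, untested sketch (class name and N are made up; the usual
> > java.util, org.apache.hadoop.mapred and org.apache.mahout.math imports
> > are omitted), a map-only job keeping just the top N entries of each
> > similarity row might look like:
> >
> >  public class TopNMapper extends MapReduceBase
> >      implements Mapper<IntWritable, VectorWritable,
> >                        IntWritable, VectorWritable> {
> >    private static final int N = 50;  // neighbors to keep per document
> >
> >    public void map(IntWritable row, VectorWritable vec,
> >        OutputCollector<IntWritable, VectorWritable> out, Reporter rep)
> >        throws IOException {
> >      Vector v = vec.get();
> >      // min-heap of {value, index} pairs, holding the N largest values
> >      PriorityQueue<double[]> topN = new PriorityQueue<double[]>(N + 1,
> >          new Comparator<double[]>() {
> >            public int compare(double[] a, double[] b) {
> >              return Double.compare(a[0], b[0]);
> >            }
> >          });
> >      Iterator<Vector.Element> it = v.iterateNonZero();
> >      while (it.hasNext()) {
> >        Vector.Element e = it.next();
> >        topN.add(new double[] {e.get(), e.index()});
> >        if (topN.size() > N) {
> >          topN.poll();  // evict the smallest value seen so far
> >        }
> >      }
> >      Vector trimmed = new RandomAccessSparseVector(v.size(), N);
> >      for (double[] pair : topN) {
> >        trimmed.set((int) pair[1], pair[0]);
> >      }
> >      out.collect(row, new VectorWritable(trimmed));
> >    }
> >  }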
> >
> > Of course, to do it really efficiently, that functionality could be
> > folded into the reducer of the matrix multiply job, and done in the same
> > pass over the data as that one.
> >
> >  -jake
> >
> >
> >
> > On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
> >
> > > Hi Jake,
> > >
> > > Thanks for that.  The first solution that you suggest is more like
> > > what I was imagining.
> > >
> > > Please excuse me, I'm new to Mahout and don't know how to use it to
> > > generate the full document-document similarity matrix.  I would rather
> > > not have to re-implement the moreLikeThis algorithm which, although
> > > rather straightforward, may take time for a newbie to MapReduce like
> > > me.  Could you guide me a little in finding the relevant Mahout code
> > > for generating the matrix, or is it not really designed for that?
> > >
> > > For the moment, I would be happy to have an off-line batch version
> > > working.  Also, it is desirable to take advantage of the text
> > > processing features that I have already configured using solr, so I
> > > would prefer to read in the feature vectors for the documents from a
> > > lucene index, as I am doing at present (e.g.
> > > http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> > > ).
> > >
> > > Thanks,
> > > Kris
> > >
> > >
> > >
> > > 2010/6/8 Jake Mannix <jake.mannix@gmail.com>
> > >
> > > > Hi Kris,
> > > >
> > > >  If you generate a full document-document similarity matrix offline,
> > > > and make sure to sparsify the rows (trim off all similarities below
> > > > a threshold, or only take the top N for each row, etc...), then
> > > > encoding these values directly in the index would indeed allow for
> > > > *superfast* MoreLikeThis functionality, because you've already
> > > > computed all of the similar results offline.
> > > >
> > > >  The only downside is that it won't apply to newly indexed
> > > > documents.  If your indexing setup is such that you don't fold in
> > > > new documents live, but do so in batch, then this should be fine.
> > > >
> > > >  An alternative is to use something like a Locality Sensitive Hash
> > > > (something one of my co-workers is writing up a nice implementation
> > > > of now, and I'm going to get him to contribute it once it's fully
> > > > tested) to reduce the search space (as a lucene Filter) and speed up
> > > > the query.
> > > >
> > > >  -jake
> > > >
> > > > On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
> > > >
> > > > > Hi Olivier,
> > > > >
> > > > > Thanks for your suggestions.  I have over 10 million documents
> > > > > and they have quite a lot of meta-data associated with them,
> > > > > including rather large text fields.  It is possible to tweak the
> > > > > moreLikeThis function from solr.  I have tried changing the
> > > > > parameters (http://wiki.apache.org/solr/MoreLikeThis) but am not
> > > > > managing to get results in under 300ms without sacrificing the
> > > > > quality of the results too much.
> > > > >
> > > > > I suspect that there would be gains to be made from reducing the
> > > > > dimensionality of the feature vectors before indexing with lucene,
> > > > > so I may give that a try.  I'll keep you posted if I come up with
> > > > > other solutions.
> > > > >
> > > > > Thanks,
> > > > > Kris
> > > > >
> > > > >
> > > > >
> > > > > 2010/6/8 Olivier Grisel <olivier.grisel@ensta.org>
> > > > >
> > > > > > 2010/6/8 Kris Jack <mrkrisjack@gmail.com>:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I currently use lucene's moreLikeThis function through solr to
> > > > > > > find documents that are related to one another.  A single
> > > > > > > call, however, takes around 4 seconds to complete and I would
> > > > > > > like to reduce this.  I got to thinking that I might be able
> > > > > > > to use Mahout to generate a document similarity matrix offline
> > > > > > > that could then be looked up in real time for serving.  Is
> > > > > > > this a reasonable use of Mahout?  If so, what functions will
> > > > > > > generate a document similarity matrix?  Also, I would like to
> > > > > > > be able to keep the text processing advantages provided
> > > > > > > through lucene, so it would help if I could still use my
> > > > > > > lucene index.  If not, then could you recommend any
> > > > > > > alternative solutions please?
> > > > > >
> > > > > > How many documents do you have in your index?  Have you tried
> > > > > > to tweak the MoreLikeThis parameters?  (I don't know if it's
> > > > > > possible using the solr interface; I use it directly via the
> > > > > > lucene java API.)
> > > > > >
> > > > > > For instance, you can trade off recall for speed by decreasing
> > > > > > the number of terms to use in the query, and trade recall for
> > > > > > precision and speed by increasing the percentage of terms that
> > > > > > should match.
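> > > > > >
> > > > > > As a minimal, untested sketch of that tuning against the raw
> > > > > > lucene API (the field name "text", the numbers, and the already
> > > > > > open directory/searcher/docNum are all made up here):
> > > > > >
> > > > > >  IndexReader reader = IndexReader.open(directory);
> > > > > >  MoreLikeThis mlt = new MoreLikeThis(reader);
> > > > > >  mlt.setFieldNames(new String[] {"text"});
> > > > > >  // fewer query terms: faster queries, lower recall
> > > > > >  mlt.setMaxQueryTerms(10);
> > > > > >  Query query = mlt.like(docNum);  // docNum: reference document
> > > > > >  // requiring more of the terms to match trades recall for
> > > > > >  // precision and speed (like() builds a BooleanQuery)
> > > > > >  ((BooleanQuery) query).setMinimumNumberShouldMatch(5);
> > > > > >  TopDocs hits = searcher.search(query, 20);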
> > > > > >
> > > > > > You could also use Mahout's implementation of SVD to build
> > > > > > low-dimensional semantic vectors representing your documents
> > > > > > (a.k.a. Latent Semantic Indexing) and then index those
> > > > > > transformed frequency vectors in a dedicated lucene index (or
> > > > > > document field, provided you name the resulting terms with
> > > > > > something that does not match real-life terms present in other
> > > > > > fields).  However, using standard SVD will probably result in
> > > > > > dense (as opposed to sparse) low-dimensional semantic vectors.
> > > > > > I don't think lucene's lookup performance is good with dense
> > > > > > frequency vectors, even though the number of terms is greatly
> > > > > > reduced by SVD.  Hence it would probably be better to either
> > > > > > threshold to the top 100 absolute values of each semantic
> > > > > > vector before indexing (probably the simpler solution) or use a
> > > > > > sparsifying, penalty-constrained variant of SVD / LSI.  You
> > > > > > should have a look at the literature on sparse coding or sparse
> > > > > > dictionary learning, Sparse-PCA and, more generally, L1-penalty
> > > > > > regression methods such as the Lasso and LARS.  I don't know of
> > > > > > any library for sparse semantic coding of documents that works
> > > > > > automatically with lucene, so probably some non-trivial coding
> > > > > > is needed there.
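> > > > > >
> > > > > > As a rough illustration (untested, function name made up), that
> > > > > > thresholding is a small variation on the row-trimming mapper
> > > > > > sketched earlier in the thread: compare on absolute value and
> > > > > > scan every component of the dense vector, not just the non-zeros:
> > > > > >
> > > > > >  // keep the k largest-magnitude entries, zeroing the rest
> > > > > >  static Vector sparsify(Vector dense, int k) {
> > > > > >    PriorityQueue<double[]> top = new PriorityQueue<double[]>(
> > > > > >        k + 1, new Comparator<double[]>() {
> > > > > >          public int compare(double[] a, double[] b) {
> > > > > >            return Double.compare(Math.abs(a[0]), Math.abs(b[0]));
> > > > > >          }
> > > > > >        });
> > > > > >    for (int i = 0; i < dense.size(); i++) {
> > > > > >      top.add(new double[] {dense.get(i), i});
> > > > > >      if (top.size() > k) {
> > > > > >        top.poll();  // evict the smallest magnitude so far
> > > > > >      }
> > > > > >    }
> > > > > >    Vector sparse = new RandomAccessSparseVector(dense.size(), k);
> > > > > >    for (double[] pair : top) {
> > > > > >      sparse.set((int) pair[1], pair[0]);
> > > > > >    }
> > > > > >    return sparse;
> > > > > >  }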
> > > > > >
> > > > > > Another alternative is finding low-dimensional (64 or 32
> > > > > > components) dense codes, binary-thresholding them, storing the
> > > > > > integer codes in the DB or the lucene index, and then building
> > > > > > smart exact-match queries to find all documents lying in the
> > > > > > Hamming ball of size 1 or 2 around the reference document's
> > > > > > binary code.  But I think this approach, while promising for
> > > > > > web-scale document collections, is even more experimental and
> > > > > > requires very good low-dimensional encoders (I don't think
> > > > > > linear models such as SVD are good enough for reducing sparse
> > > > > > 10e6-component vectors to dense 64-component vectors; non-linear
> > > > > > encoders such as Stacked Restricted Boltzmann Machines are
> > > > > > probably a better choice).
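> > > > > >
> > > > > > To make the Hamming-ball idea concrete, here is a tiny untested
> > > > > > sketch (function name made up) that enumerates every 64-bit code
> > > > > > within Hamming distance 1 of a reference code; each resulting
> > > > > > code becomes one exact-match clause in the lookup query:
> > > > > >
> > > > > >  // the ball of radius 1 around code: itself plus 64 one-bit flips
> > > > > >  static long[] hammingBall1(long code) {
> > > > > >    long[] neighbors = new long[65];
> > > > > >    neighbors[0] = code;
> > > > > >    for (int bit = 0; bit < 64; bit++) {
> > > > > >      neighbors[bit + 1] = code ^ (1L << bit);
> > > > > >    }
> > > > > >    return neighbors;
> > > > > >  }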
> > > > > >
> > > > > > In any case, let us know about your results; I am really
> > > > > > interested in practical yet scalable solutions to this problem.
> > > > > >
> > > > > > --
> > > > > > Olivier
> > > > > > http://twitter.com/ogrisel - http://github.com/ogrisel
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Dr Kris Jack,
> > > > > http://www.mendeley.com/profiles/kris-jack/
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Dr Kris Jack,
> > > http://www.mendeley.com/profiles/kris-jack/
> > >
> >
>
>
>
> --
> Dr Kris Jack,
> http://www.mendeley.com/profiles/kris-jack/
>
