mahout-user mailing list archives

From Grant Ingersoll
Subject Re: mahout PLSI (with some lucene, thrown in)
Date Tue, 23 Jun 2009 10:39:31 GMT

On Jun 21, 2009, at 9:39 PM, Paul Jones wrote:

> I think I am starting to get a feel for what each of these  
> frameworks can achieve, however due to overlap in some of these  
> applications, I am curious about how each one exposes data to the  
> other, again trawled through the lists, best I can, and read the  
> Lucene in action book over the weekend.
> To me Nutch should be used as a crawler, rather than an indexer (but  
> I have read that Nutch is better at indexing than Lucene, and  
> hence Lucene should be used just for search).

Nutch uses Lucene for indexing.  The two aren't really comparable.   
Lucene is a search library.  Nutch is an application designed for  
large-scale crawling and search.  Nutch tends to be pretty  
monolithic.  You might be happier with Solr, as it is more flexible  
and easier to configure, but still gives you access to Lucene.

> Mahout seems to come into its element when you are playing with  
> various algorithms, whether for clustering, nearest neighbour or  
> whatever, but Lucene also seems to work with term-vectors (as does  
> Nutch), to work out the "distance" between words. If so, once this  
> is done, are the words then already ranked? If so, then would you  
> run other algos like PLSI on that data, or (at least to me) it makes  
> more sense to take the data from Nutch, use Mahout, and then push  
> it back into Lucene to search with.

One aspect of Mahout (or machine learning) that I find intriguing is  
using it to power "intelligent" search.  In this case, you use ML to  
extract/categorize/cluster, etc. all in an effort to make it easier  
for people to search/discover the information they are looking for.

There are, of course, many other uses that have nothing to do with  
search, and there is nothing search-specific in Mahout other than the  
LuceneIterable class in utils and a few helper classes that make  
working with text easier.  It is perfectly reasonable to use Mahout on  
numerical data or even mixed data as long as you can properly set up  
the problem.

> Another question on indexing:
> The vector calc's or term-freq are building the relationship between  
> words in a document/web page. e.g "red" is related to "crimson", but  
> how does this relate back to ranking the documents themselves in a  
> search query, so you search for "red" now it is related to "crimson"  
> but if doc1 has "red" in it it should be returned at pos1, and the  
> one with "crimson" at pos2. I am going to try to answer my own  
> question let me know if the answer sucks...
> Each word is related to a document word -> doc
> So as relationships between words are formed, then inherently  
> relationships between the docs are also deduced from here. Is this  
> kind of correct? so you need not worry about ranking the document  
> itself? Or are there two indexes, one which contains the  
> relationships between the words with a doc, and the other which  
> relates each word to each doc, if this is true can you run different  
> algos on each problem to get the end results.
> e.g red relates to crimson with value 1, and red relates to blue  
> with value 0.5 so we have relationships between words
> Now red related to doc 1 as +1, and relates to doc 2 +0.5, and  
> crimson relates to doc2 as +1, hence we have relationship between  
> words and the docs
> phew....

I'm not sure I'm following.  In traditional TF-IDF search (i.e.  
Lucene) red and crimson would both relate to one or more docs.   
Whether red or crimson comes first is going to depend on the  
statistics of the collection.  Presumably, if you have some a priori  
information about those words (maybe based on your analysis of the  
documents) you could boost one word even more such that red comes first.
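
As a rough illustration of that point, here is a minimal Python sketch  
of a simplified TF-IDF score with an optional per-term boost.  It is  
not Lucene's actual scoring formula or API; the function names, the toy  
documents and the boost value are all made up for the example.

```python
import math
from collections import Counter


def tfidf_scores(docs, query_terms, boosts=None):
    """Score each document against the query with a simplified TF-IDF sum.

    docs: dict mapping doc id -> list of tokens
    query_terms: list of terms to search for
    boosts: optional dict mapping term -> multiplier (assumed, default 1.0)
    """
    boosts = boosts or {}
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(n_docs / df[term]) + 1.0
            score += boosts.get(term, 1.0) * math.sqrt(tf[term]) * idf
        scores[doc_id] = score
    return scores


def rank(scores):
    """Return doc ids ordered by descending score, ties broken by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy collection: "red" occurs in two docs, "crimson" in only one.
docs = {
    "doc1": ["red", "apple"],
    "doc2": ["crimson", "apple"],
    "doc3": ["red", "car"],
    "doc4": ["blue", "car"],
}

# Without a boost, the rarer term "crimson" wins on collection statistics.
plain = rank(tfidf_scores(docs, ["red", "crimson"]))

# Boosting "red" (a priori knowledge) pushes the "red" docs to the top.
boosted = rank(tfidf_scores(docs, ["red", "crimson"], boosts={"red": 2.0}))
```

In this toy collection the "crimson" document ranks first without the  
boost simply because "crimson" is rarer, and the "red" documents move  
ahead once "red" is boosted.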

> So two more questions :-), I looked at integrating user feedback,  
> if we assume we have obtained the feedback, and a person thinks doc1  
> is actually about "crimson" how would this be integrated back into  
> the algos, would this be via the boost function in Lucene, or is  
> there a better way of doing it using Taste and dropping it into the  
> Mahout analysis.

I'd say you could likely do it with either.

> and ... how do you rate the "words" from a Title, Meta Tag, Image  
> Alt text higher than other words in the webpage, or even say user  
> defined Tags in say blogs

In Lucene, you can do this several ways.  If you want to boost the  
whole field, then do just that.  If you want to boost individual terms  
in a given field, you need to use Payloads and the BoostingTermQuery.
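
To make the difference between the two kinds of boost concrete, here  
is a small Python sketch.  It only models the idea and does not use  
Lucene's API: score_document, the field names and the boost values are  
hypothetical, and the term_boosts table merely stands in for what  
per-term payloads would carry.

```python
def score_document(doc_fields, query_term, field_boosts, term_boosts=None):
    """Score one document for a single query term.

    doc_fields: dict mapping field name -> list of tokens
    field_boosts: dict mapping field name -> multiplier applied to every
        match in that field (whole-field boost)
    term_boosts: optional dict mapping (field, term) -> multiplier applied
        only to that term in that field (stand-in for a per-term payload)
    """
    term_boosts = term_boosts or {}
    score = 0.0
    for field, tokens in doc_fields.items():
        matches = tokens.count(query_term)
        if matches == 0:
            continue
        field_weight = field_boosts.get(field, 1.0)
        term_weight = term_boosts.get((field, query_term), 1.0)
        score += matches * field_weight * term_weight
    return score


# Two toy pages: one mentions "red" once in the title, the other twice in the body.
page_a = {"title": ["red", "shoes"], "body": ["buy", "shoes", "online"]}
page_b = {"title": ["shoe", "shop"], "body": ["red", "red", "laces"]}

# No boosts: the page with more matches scores higher.
no_boost = {"title": 1.0, "body": 1.0}
a_plain = score_document(page_a, "red", no_boost)
b_plain = score_document(page_b, "red", no_boost)

# Whole-field boost: any match in the title counts three times as much.
title_boost = {"title": 3.0, "body": 1.0}
a_field = score_document(page_a, "red", title_boost)
b_field = score_document(page_b, "red", title_boost)

# Per-term boost: only "red" in the body of page_b is weighted up.
b_term = score_document(page_b, "red", title_boost, {("body", "red"): 2.0})
```

With the field boost, the title match on page_a outweighs the two body  
matches on page_b; the per-term boost then lifts page_b again without  
touching any other term in its body.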

Grant Ingersoll

