mahout-user mailing list archives

From Otis Gospodnetic <>
Subject Re: mahout PLSI (with some lucene, thrown in)
Date Thu, 25 Jun 2009 17:27:13 GMT

Hi Paul,

One clarification.  Nutch doesn't really index, nor does it search.  It simply uses Lucene
and makes Lucene do both.  Nutch has its own classes and other tools that use Lucene for
indexing/searching under the hood.  Think of Nutch as a biggish search engine application
that has everything that you need for web crawling, document/web page parsing[1], and such,
but doesn't do the indexing/searching itself, at least not directly.

[1] Even parsing is not done by Nutch itself - Nutch uses other libraries to do the actual
parsing.
Sematext -- -- Lucene - Solr - Nutch

----- Original Message ----
> From: Paul Jones <>
> To:
> Sent: Tuesday, June 23, 2009 9:45:40 AM
> Subject: Re: mahout PLSI (with some lucene, thrown in)
> tks Grant, more questions... I think it's better if I explain what I am trying to 
> do.
> 1. I want to crawl blogs which talk about "cars" - To me Nutch would do this
> 2. Of this I want then to be able to search for various words ... "red toyota" - 
> In order to do point 2, I would need to index all the data, and provide a 
> "rank/rating" to each result. 
> Nutch does this using a similar scoring mechanism to Lucene, and (based on what you 
> mentioned) I read that Nutch can boost the URL, anchor, title, etc.
> Nutch also allows search, BUT is Lucene better for a large-scale system, 
> since it seems to allow "better" searching, or at least access to it? If so, I 
> would need to give Lucene access to the index created by Nutch (I guess one of 
> my questions is what happens during indexing: is it the scoring/rating, or just 
> "indexing" to allow faster data retrieval?). Is this correct?
> 3. There is an inter-relationship between "words" in the documents, and a 
> relationship between the "word" and the webpage itself, so tf-idf works out 
> the "relationship" between the keywords and the documents, i.e. "red" is more 
> relevant to doc1 than doc2. 
> Lucene can do this, and gives a basic rating system based on the searched keyword 
> and the document returned... Hopefully so far so good :-)
> 4. But what if I wanted to understand the relationships between the keywords 
> themselves? Assume I had the word "red" and wanted to display those similar to 
> "red", like "crimson", i.e. if I have collected 100K keywords and wanted to build 
> clusters of these keywords, so that "red, crimson, ruby, magenta" formed 
> cluster 1, and "blue, azure, ultramarine" formed cluster 2. Then when someone 
> searched for "ruby", although the tf-idf calc would show "No results", I could 
> look up in my cluster and see what other colours are similar and fire a query 
> for "red or crimson or magenta"; hence it would return a value, based on the 
> cluster in which that colour was present.
> Use case: A user searches for "red cars", but my crawling has picked up crimson cars 
> only. Unless I know crimson and red are "related", I may have zero results. 
> I guess in the case of colours a manual cluster may need to be formed, but 
> surely there must be a way of clustering these words dynamically. Imagine we 
> have crawled 100K webpages, and we have 100 pages which show "red", 100 which 
> show "crimson", and then 100 which show both "red and crimson". Is there a way to 
> deduce that there may be an (albeit weak) relationship between red AND crimson? Of 
> course we can pre-seed this info, which then gets weighted by actual results.
> 5. And this is where Mahout comes in... or at least I think it does. Mahout has 
> lots of clever algos under the hood, some more relevant than others. Where 
> I am really getting confused is at what point in my pipeline to deploy these.
> Nutch ---> Mahout ---> Lucene ---> Taste ---> Mahout
> [crawl + index ---> algos for clustering, distance, rating ---> search ---> user feedback ---> algos ...]
> If I wanted to implement PLSI, for me the above scenario would work, BUT how 
> would the scoring done by Nutch affect the data fed into Mahout for this? Should 
> the data just be raw (parsed etc., but no rating), then processed, then opened for 
> search, and then user feedback dropped in?
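As a rough illustration of point 4 above, here is a minimal sketch of how page-level co-occurrence counts could be turned into a "relatedness" score between two words, and how a word cluster could then expand a query. This is not Mahout or Lucene code; the function names are hypothetical, Jaccard and PMI are just two common choices, and the counts assume 100 pages with only "red", 100 with only "crimson", and 100 with both, out of 100K pages.

```python
import math


def cooccurrence_scores(n_docs, df_a, df_b, df_both):
    """Return (jaccard, pmi) for two terms from their document frequencies.

    df_a / df_b are the numbers of pages containing each term at all;
    df_both is the number of pages containing both.
    """
    jaccard = df_both / (df_a + df_b - df_both)
    p_a = df_a / n_docs
    p_b = df_b / n_docs
    p_both = df_both / n_docs
    # Pointwise mutual information: > 0 means the terms co-occur
    # more often than they would if they were independent.
    pmi = math.log(p_both / (p_a * p_b))
    return jaccard, pmi


def expand_query(term, clusters):
    """Hypothetical helper: OR together every word in the term's cluster."""
    for cluster in clusters:
        if term in cluster:
            return " OR ".join(sorted(cluster))
    return term  # no cluster found, search for the term as-is


# Assumed reading of the example: 100 red-only + 100 both = 200 pages with "red",
# and likewise 200 pages with "crimson".
jaccard, pmi = cooccurrence_scores(n_docs=100_000, df_a=200, df_b=200, df_both=100)

clusters = [
    {"red", "crimson", "ruby", "magenta"},
    {"blue", "azure", "ultramarine"},
]
expanded = expand_query("ruby", clusters)
print(round(jaccard, 3), round(pmi, 3), expanded)
```

With these assumed counts the PMI is well above zero, which is one way of saying the two words appear together far more often than chance would predict. A clustering algorithm would work from many such pairwise scores rather than a single pair.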
> Hope that's a little clearer. Wondering what setups people have, i.e. the block 
> level order in which the data is processed. Maybe I am reading it wrong and it's 
> not a one-to-one process.
> tks for reading
> Paul
> ________________________________
> From: Grant Ingersoll 
> To:
> Sent: Tuesday, 23 June, 2009 11:39:31
> Subject: Re: mahout PLSI (with some lucene, thrown in)
> On Jun 21, 2009, at 9:39 PM, Paul Jones wrote:
> > I think I am starting to get a feel for what each of these frameworks can 
> achieve, however due to overlap in some of these applications, I am curious 
> about how each one exposes data to the other, again trawled through the lists, 
> best I can, and read the Lucene in action book over the weekend.
> > 
> > To me Nutch should be used as a crawler, rather than an indexer (but I have 
> read that Nutch is better at indexing than Lucene, and hence Lucene should be 
> used just for search).
> Nutch uses Lucene for indexing.  The two aren't really comparable.  Lucene is a 
> search library.  Nutch is an application designed for large scale crawling and 
> search.  Nutch tends to be pretty monolithic.  You might be happier with Solr, 
> as it is more flexible and easier to configure, but still gives you access to 
> Lucene.
> > Mahout seems to come into its element when you are playing with various 
> algorithms, whether for clustering, nearest neighbour or whatever, but Lucene 
> also seems to work with term vectors (as does Nutch) to work out the "distance" 
> between words. If so, once this is done, are the words then already ranked? If 
> so, then would you run other algos like PLSI on that data, or (at least to me) 
> does it make more sense to take the data from Nutch, use Mahout, and then push back 
> into Lucene to search with?
> One aspect of Mahout (or machine learning) that I find intriguing is using it to 
> power "intelligent" search.  In this case, you use ML to 
> extract/categorize/cluster, etc. all in an effort to make it easier for people 
> to search/discover the information they are looking for.
> There are, of course, many other uses that have nothing to do with search, and 
> there is nothing search-specific about Mahout other than the LuceneIterable class in utils and a 
> few helper classes to make working with text easier.  It is perfectly 
> reasonable to use Mahout on numerical data or even mixed data as long as you can 
> properly set up the problem.
> > 
> > Another question on indexing:
> > 
> > The vector calc's or term-freq are building the relationship between words in 
> a document/web page. e.g "red" is related to "crimson", but how does this relate 
> back to ranking the documents themselves in a search query, so you search for 
> "red" now it is related to "crimson" but if doc1 has "red" in it it should be 
> returned at pos1, and the one with "crimson" at pos2. I am going to try to 
> answer my own question let me know if the answer sucks...
> > 
> > Each word is related to a document word -> doc
> > So as relationships between words are formed, then inherently relationships 
> between the docs are also deduced from here. Is this kind of correct? So you 
> need not worry about ranking the document itself? Or are there two indexes, one 
> which contains the relationships between the words within a doc, and the other 
> which relates each word to each doc? If this is true, can you run different algos 
> on each problem to get the end results?
> > 
> > e.g red relates to crimson with value 1, and red relates to blue with value 
> 0.5 so we have relationships between words
> > Now red related to doc 1 as +1, and relates to doc 2 +0.5, and crimson relates 
> to doc2 as +1, hence we have relationship between words and the docs
> > 
> > phew....
> I'm not sure I'm following.  In traditional TF-IDF search (i.e. Lucene) red and 
> crimson would both relate to one or more docs.  Whether red or crimson comes 
> first is going to depend on the statistics of the collection.  Presumably, if 
> you have some a priori information about those words (maybe based on your 
> analysis of the documents) you could boost one word even more such that red 
> comes first.
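To make the point about collection statistics and boosting concrete, here is a small sketch of textbook tf-idf ranking with an optional per-term boost. It is only an illustration: Lucene's real scoring formula has more factors (length normalization, for one), and the function name, the toy documents, and the boost value are all made up for this example.

```python
import math
from collections import Counter


def tfidf_rank(docs, query_terms, boosts=None):
    """Rank document names by a plain tf * idf score, best first.

    docs maps a name to its text; boosts optionally maps a query term
    to a multiplier (1.0 when absent).
    """
    boosts = boosts or {}
    n = len(docs)
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    scores = {}
    for name, term_counts in counts.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: how many docs contain the term at all.
            df = sum(1 for c in counts.values() if term in c)
            if df == 0:
                continue
            idf = math.log(n / df) + 1.0
            score += term_counts[term] * idf * boosts.get(term, 1.0)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "doc1": "red toyota for sale",
    "doc2": "crimson toyota crimson paint",
    "doc3": "blue honda",
}

plain = tfidf_rank(docs, ["red", "crimson"])
boosted = tfidf_rank(docs, ["red", "crimson"], boosts={"red": 3.0})
print(plain)
print(boosted)
```

Without a boost, doc2 ranks first simply because "crimson" occurs twice in it; boosting "red" is enough to move doc1 to the top, which is the effect described above.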
> > 
> > So two more questions :-), I looked at integrating user feedback. If we 
> assume we have obtained the feedback, and a person thinks doc1 is actually about 
> "crimson", how would this be integrated back into the algos? Would this be via 
> the boost function in Lucene, or is there a better way of doing it using Taste 
> and dropping it into the Mahout analysis?
> I'd say you could likely do it with either.
> > 
> > and ... how do you rate the "words" from a Title, Meta Tag, Image Alt text 
> higher than other words in the webpage, or even say user defined Tags in say 
> blogs
> In Lucene, you can do this several ways.  If you want to boost the whole field, 
> then do just that.  If you want to boost individual terms in a given field, you 
> need to use Payloads and the BoostingTermQuery.
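The idea of boosting a whole field can be sketched outside Lucene as a weighted sum over fields. The snippet below is a toy model only, not the Lucene API: the field names, weights, and function name are assumptions, and per-term boosting via payloads is not modeled here.

```python
def field_boosted_score(doc, term, field_boosts):
    """Sum term hits per field, multiplying each field's hits by its boost.

    doc maps a field name to its text; fields missing from field_boosts
    get a neutral boost of 1.0.
    """
    score = 0.0
    for field, text in doc.items():
        hits = text.lower().split().count(term)
        score += hits * field_boosts.get(field, 1.0)
    return score


# Hypothetical page with three fields.
page = {
    "title": "Red Toyota review",
    "alt_text": "red car photo",
    "body": "the car is red and the seats are red",
}

# Assumed weights: title hits count three times, image alt text twice.
boosts = {"title": 3.0, "alt_text": 2.0}

score = field_boosted_score(page, "red", boosts)
unboosted = field_boosted_score(page, "red", {})
print(score, unboosted)
```

A hit in the title contributes three times as much as a hit in the body here, so pages that mention the term in prominent fields rise in the ranking.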
> --------------------------
> Grant Ingersoll
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
> Solr/Lucene:
