mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: plsi in pig
Date Wed, 11 Feb 2009 12:02:21 GMT
This is excellent, Prasen.

I see no reason not to include them.  We are about ML first,  
distributed/scalable ML second and Hadoop-based third, IMO.  Java  
would be a distant fourth in my mind.  In other words, I don't feel  
particularly strongly about us being Java only or even Hadoop only.  To  
me there is a significant need for community-developed machine  
learning capabilities with a commercial friendly license.  Add in the  
ability to scale/run efficiently and you have a home run.  In fact,  
those are the very reasons we founded Mahout.


On Feb 11, 2009, at 6:40 AM, prasenjit mukherjee wrote:

> Pig is a higher-level language (much like Sawzall for Google's
> MapReduce) on top of Hadoop that makes Hadoop easier to use.
>
> It has an SQL-like syntax and can break a script into separate
> MapReduce jobs and chain them. From an execution point of view,
> running a Pig script is as simple as running a shell script with
> very few operators/commands.
>
> Some of its commands are join, group, cogroup, load etc.
>
> For example, the following Pig script takes a logfile in the format
> <txid>,<txt>,<user> and outputs a user-term-freq file in the
> following format: <user>\t<term>\t<cnt>
>
> -- load the comma-separated log
> raw = LOAD 'tx_log.csv' USING PigStorage(',') AS
>     (transactionid:chararray, txt:chararray, user:chararray);
> -- one row per (user, term)
> tokenized = FOREACH raw GENERATE user, FLATTEN(TOKENIZE(txt)) AS attribute;
> -- count occurrences of each (user, term) pair
> grouped = GROUP tokenized BY (user, attribute);
> user_term_freq = FOREACH grouped GENERATE FLATTEN(group), COUNT(tokenized);
> STORE user_term_freq INTO 'user_term_freq.txt';
>
> At runtime, Pig takes the input and breaks the script into several
> map and reduce tasks. It picks up hadoop-site.xml from its classpath.
>
> -Prasen
>
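For readers who don't know Pig, here is a rough single-machine sketch in
Python of what the script above appears to compute: the number of times
each term occurs per user. The function names and sample data are made up
for illustration, and plain whitespace splitting is assumed in place of
Pig's TOKENIZE, which splits on a few more delimiters. It says nothing
about how Pig plans the actual map and reduce tasks.

```python
from collections import Counter


def user_term_freq(lines):
    """Count (user, term) pairs from '<txid>,<txt>,<user>' lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(",")
        if len(parts) != 3:
            # Skip blank or malformed lines (an assumption of this sketch).
            continue
        _txid, txt, user = parts
        # Assumption: whitespace tokenization stands in for Pig's TOKENIZE.
        for term in txt.split():
            counts[(user, term)] += 1
    return counts


def format_rows(counts):
    """Render the counts as '<user>\\t<term>\\t<cnt>' rows, sorted by key."""
    return [f"{user}\t{term}\t{cnt}"
            for (user, term), cnt in sorted(counts.items())]


if __name__ == "__main__":
    # Hypothetical sample log, not taken from the thread.
    sample = [
        "1,apple banana apple,alice",
        "2,banana,bob",
        "3,apple,alice",
    ]
    for row in format_rows(user_term_freq(sample)):
        print(row)
```

The Counter update plays the role of the GROUP ... BY (user, attribute)
followed by COUNT; in Pig that step is what would be spread across
reducers.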
> On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <srowen@gmail.com> wrote:
>> Needs to go somewhere like trunk/core/src/pig/main, right, versus
>> /java/ ?
>>
>> I also see no harm in adding it, other than that it would remain
>> pretty isolated right? isn't part of the build, can't be integrated
>> with the other code, etc.? Does it add value to package it with the
>> project then?
>>
>> Perhaps I misunderstand what Pig can do or how it can relate to Java?
>>
>> On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll
>> <gsingers@apache.org> wrote:
>>> Hmm, hadn't really thought about it, but I see no reason why we
>>> wouldn't accept it and add it.  I think our source tree can
>>> definitely handle it.
>>>
>>> I'd propose it go somewhere under:
>>> trunk/core/src/main/pig/plsi
>>>
>>> I'm not familiar with Pig, but I can learn, and I know others are.
>>> Is it a single file?
>>>
>>> See http://cwiki.apache.org/MAHOUT/howtocontribute.html for
>>> instructions on contributing.  Basically, just attach the file(s)
>>> to a JIRA issue.
>>>
>>> On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:
>>>
>>>> Hi,
>>>> I have implemented Hofmann's PLSI/EM algorithm in Pig, which I
>>>> would like to contribute back to the community for further
>>>> scrutiny/improvement.  Let me know if Mahout is the appropriate
>>>> forum or whether it should go to the Pig project.
>>>>
>>>> I haven't seen any non-Java contributions to Mahout yet, which
>>>> raises the question: is Mahout Java only?
>>>>
>>>> -Thanks,
>>>> Prasen
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

