mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: plsi in pig
Date Wed, 11 Feb 2009 16:16:42 GMT
Cool, will look at this after the release.



On Feb 11, 2009, at 10:09, prasenjit mukherjee <prasen.bea@gmail.com> wrote:

> So I created a JIRA issue:
> https://issues.apache.org/jira/browse/MAHOUT-106 and also submitted a
> patch along with README instructions. Please feel free to try it out with
> different input samples. The default behaviour is to run Pig in local
> mode. I'd appreciate any suggestions/reviews.
>
> -Prasen
>
> On Wed, Feb 11, 2009 at 5:32 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>> This is excellent, Prasen.
>>
>> I see no reason not to include them.  We are about ML first,
>> distributed/scalable ML second and Hadoop-based third, IMO.  Java would
>> be a distant fourth in my mind.  In other words, I don't feel
>> particularly strongly about us being Java-only or even Hadoop-only.  To
>> me there is a significant need for community-developed machine learning
>> capabilities with a commercially friendly license.  Add in the ability
>> to scale/run efficiently and you have a home run.  In fact, those are
>> the very reasons we founded Mahout.
>>
>>
>> On Feb 11, 2009, at 6:40 AM, prasenjit mukherjee wrote:
>>
>>> Pig is a higher-level language (much like Sawzall for Google's
>>> MapReduce) on top of Hadoop that makes Hadoop easier to use.
>>>
>>> It has SQL-like syntax and can break a script into separate
>>> MapReduce jobs and chain them. From an execution point of view, a Pig
>>> script is as simple as running a shell script with very few
>>> operators/commands.
>>>
>>> Some of its commands are join, group, cogroup, and load.
>>>
>>> For example, the following Pig script takes a logfile in the format
>>> <txid>,<txt>,<user> and outputs a user-term-freq file in the following
>>> format: <user>\t<term>\t<cnt>
>>>
>>> raw = load 'tx_log.csv' using PigStorage(',') AS
>>>     (transactionid:chararray, txt:chararray, user:chararray);
>>> -- one (user, attribute) row per token of txt
>>> tokenized = FOREACH raw GENERATE user, flatten(TOKENIZE(txt)) as attribute;
>>> -- group by (user, attribute), then count the rows in each group
>>> grouped = group tokenized by (user, attribute);
>>> user_term_freq = foreach grouped generate flatten(group), COUNT(tokenized);
>>> store user_term_freq into 'user_term_freq.txt';
>>>
>>> At runtime, Pig takes the input and breaks the work into several map
>>> and reduce tasks. It reads hadoop-site.xml from its classpath.
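For readers who do not know Pig, here is a rough single-machine Python sketch of what the script quoted above appears to compute: tokenize each transaction's text, group by (user, term), and count. It is only an illustration, not the submitted patch. The function names and the sample records are made up, and the whitespace-only tokenizer is a simplification (Pig's TOKENIZE also treats some other characters as delimiters).

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: split on whitespace only.
    return text.split()


def user_term_freq(lines):
    """Count how often each (user, term) pair occurs in <txid>,<txt>,<user> lines."""
    counts = Counter()
    for line in lines:
        # Assumes exactly three comma-separated fields and no commas inside txt.
        _txid, txt, user = line.rstrip("\n").split(",")
        for term in tokenize(txt):
            counts[(user, term)] += 1
    return counts


def format_rows(counts):
    # One tab-separated row per group: user, term, count (sorted for stable output).
    return [f"{user}\t{term}\t{cnt}" for (user, term), cnt in sorted(counts.items())]


if __name__ == "__main__":
    # Hypothetical sample records, invented for this sketch.
    sample = [
        "1,apple banana apple,alice",
        "2,banana,bob",
        "3,apple,alice",
    ]
    for row in format_rows(user_term_freq(sample)):
        print(row)
```

The difference in Pig is that the group and count steps are compiled into map and reduce tasks and run across the cluster, rather than in one in-memory loop.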
>>>
>>> -Prasen
>>>
>>> On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <srowen@gmail.com> wrote:
>>>>
>>>> Needs to go somewhere like trunk/core/src/pig/main, right, versus
>>>> /java/?
>>>>
>>>> I also see no harm in adding it, other than that it would remain
>>>> pretty isolated, right? It isn't part of the build, can't be integrated
>>>> with the other code, etc. Does it add value to package it with the
>>>> project then?
>>>>
>>>> Perhaps I misunderstand what Pig can do or how it can relate to Java?
>>>>
>>>> On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>
>>>>> Hmm, hadn't really thought about it, but I see no reason why we
>>>>> wouldn't accept it and add it.  I think our source tree can
>>>>> definitely handle it.
>>>>>
>>>>> I'd propose it go somewhere under:
>>>>> trunk/core/src/main/pig/plsi
>>>>>
>>>>> I'm not familiar with Pig, but I can learn, and I know others are.
>>>>> Is it a single file?
>>>>>
>>>>> See http://cwiki.apache.org/MAHOUT/howtocontribute.html for
>>>>> instructions on contributing.  Basically, just attach the file(s) to
>>>>> a JIRA issue.
>>>>>
>>>>> On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:
>>>>>
>>>>>> Hi,
>>>>>> I have implemented Hofmann's PLSI/EM algorithm in Pig, which I would
>>>>>> like to contribute back to the community for further
>>>>>> scrutiny/improvement.  Let me know if Mahout is the appropriate
>>>>>> forum or whether it should go to the Pig project.
>>>>>>
>>>>>> I haven't seen any non-Java contributions to Mahout yet, which raises
>>>>>> the question: is Mahout Java-only?
>>>>>>
>>>>>> -Thanks,
>>>>>> Prasen
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>> using Solr/Lucene:
>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
