mahout-user mailing list archives

From Andy Schlaikjer <andrew.schlaik...@gmail.com>
Subject Re: Mahout - Pig Hackday
Date Thu, 03 May 2012 02:00:59 GMT
Hi Tim, Ted,

I wanted to chime in here regarding Elephant Bird utilities for
Pig-Mahout integration. I'm the author of EB's SequenceFileLoader,
SequenceFileStorage, and all the supporting WritableConverters,
including the VectorWritableConverter which facilitates conversion of
Mahout Vector data to Pig tuple formats (and vice versa). Details are
in the javadocs here:

https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/mahout/VectorWritableConverter.java
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java

These utils are already heavily used here at Twitter. I've also put
together a number of UDFs for working with Vector data in Pig which I
hope to open source in the next month.

There's been recent discussion on EB's mailing list regarding
improvements to the project's build organization and artifact
publishing. We're interested in splitting the codebase into smaller
modules to produce thinner jar artifacts and simplify maven dependency
specification. In the process, I'll likely migrate
VectorWritableConverter to a separate Twitter OSS project, together
with the related Vector UDFs, and depend on those EB modules which
include SequenceFileStorage and friends.

I'll keep this list posted!

Cheers,
Andy
@sagemintblue


On Wed, May 2, 2012 at 5:20 PM, Timothy Potter <thelabdude@gmail.com> wrote:
> Thanks Ted! Removing the elephant-bird dependency / build problems
> sounds like a good task we should include in our plans for the hackday
> ... what are your thoughts on adding pig-vector to Mahout as a contrib
> module? Do you want to keep it separate or eventually make its way
> into the project?
>
> Praneet - thanks for throwing your hat in ;-) Sounds like you're doing
> some interesting things with Mahout and Pig already. Will definitely
> keep you in the loop as we work out the details ...
>
> Cheers,
> Tim
>
> On Wed, May 2, 2012 at 1:43 PM, praneet mhatre <praneetmhatre@gmail.com> wrote:
>> On Wed, May 2, 2012 at 11:13 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> On Wed, May 2, 2012 at 11:06 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>>>
>>> > We're really keen on Ted's pig-vector project
>>> > (https://github.com/tdunning/pig-vector) as we're building a number of
>>> > classifiers on Mahout's SGD framework, with the bulk of our data being
>>> > in Cassandra processed almost entirely with Pig. We'd love to hear
>>> > about any planned features for the pig-vector project we can help out
>>> > on. Any similar Pig-Mahout projects we should know about?
>>> >
>>>
>>> The huge problem with pig-vector is that dependency on elephant bird makes
>>> it really almost impossible to build.  Elephant bird has obscure
>>> dependencies on things like yaml-beans.  That is a problem because the
>>> yaml-beans maintainer has a forceful way of expressing his distaste for all
>>> things to do with Maven and thus refuses to publish any artifacts in
>>> standard ways.  Actually, the maintainer has a rather forceful manner that
>>> he applies to all interactions as far as I can tell.
>>>
>>> On the other hand, the necessary capabilities that pig-vector needs from
>>> Elephant bird are quite minor and could probably be reasonably extracted.  I
>>> am under-water, however, and thus cannot finish that right away.  I can and
>>> will assist anybody who has the necessary time and enthusiasm.  This might
>>> make a very nice pig day effort.
>>>
>>>
>>> > In general, we're reaching out today to see who else in the community
>>> > is interested in better Pig / Mahout integration and what types of
>>> > challenges they're facing? Any cool UDFs you'd like to share?
>>> >
>>>
>>> Praneet at UCI (praneetmhatre@gmail.com) has been doing some interesting
>>> work here to do with feature sharding in pig.  Perhaps he can speak up.
>>>
>>
>> Hello Timothy,
>>
>> I have tried writing sharded versions of classifiers and they seem to work
>> well. But my code requires a pre-processing step before the classification
>> and re-aggregation of results (which was easy when I worked with Weka).
>> However, to be able to do the same in Mahout, I need something like
>> pig-vector to take care of the pre-processing part.
>>
>> So yes, I am very interested in Pig / Mahout integration! But admittedly I
>> only have introductory knowledge of Pig. And as far as the integration part
>> goes, my contribution so far has been limited to testing the stuff Ted has
>> written.
>>
>> But the idea of Pig-Mahout hackday sounds great! And I would definitely
>> like to be involved in it.
>>
>>
>>
>> --
>> Praneet Mhatre
>> Graduate Student
>> Donald Bren School of ICS
>> University of California, Irvine
