mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Does the Feature Hashing and Collision in SGD harm the performance of the algorithm?
Date Mon, 25 Apr 2011 17:57:38 GMT
(typo corrected)

I am not sure I see the difficulty, but it is possible we are talking
about slightly different things.
Hadoop solves this through pluggable strategies such as InputFormat.
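
For concreteness, a minimal sketch of that Hadoop idiom (the property
name "example.input.format" is made up for illustration): the strategy
class is named declaratively in the Configuration and instantiated
reflectively at run time.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.InputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.util.ReflectionUtils;

  public class PluggableInput {
    // Look up the strategy class declared in the job configuration;
    // fall back to TextInputFormat if nothing was declared.
    public static InputFormat<?, ?> lookup(Configuration conf) {
      Class<? extends InputFormat> clazz = conf.getClass(
          "example.input.format", TextInputFormat.class, InputFormat.class);
      return ReflectionUtils.newInstance(clazz, conf);
    }
  }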

Strategies like these are parameterized (and perhaps also persisted)
through some form of declarative definition. To keep the analogy with
Hadoop, it uses its Configuration machinery to serialize that sort of
thing, though of course property-based definitions are probably quite
underwhelming for this case. Similarly, Lucene defines its
preprocessing strategies through Analyzer. Surely we could define
strategies that take rows of pre-standardized input and produce
vectorized, standardized input as a result.
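
To make that concrete, here is a rough sketch of such a row-level
strategy using Mahout's existing feature-hashing encoders; the
VectorizationStrategy interface itself, the field names, and the
cardinality are hypothetical.

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

  interface VectorizationStrategy {
    Vector vectorize(String[] row);   // one pre-standardized input row
  }

  class HashedEncodingStrategy implements VectorizationStrategy {
    private static final int CARDINALITY = 1000;    // hashed feature space size
    private final StaticWordValueEncoder categorical =
        new StaticWordValueEncoder("category");     // qualitative field
    private final ContinuousValueEncoder numeric =
        new ContinuousValueEncoder("measurement");  // quantitative field

    @Override
    public Vector vectorize(String[] row) {
      Vector v = new RandomAccessSparseVector(CARDINALITY);
      categorical.addToVector(row[0], v);  // hashes the category name into v
      numeric.addToVector(row[1], v);      // parses row[1] as a double, hashes it
      return v;
    }
  }

Collisions in the hashed space are exactly the concern in the subject
line; a larger CARDINALITY trades memory for fewer collisions.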

A somewhat larger question is what to use for pre-vectorized inputs,
since Vector obviously won't handle heterogeneous data types,
especially qualitative ones.
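
Purely as a hypothetical sketch of what such a pre-vectorized row type
could look like (nothing like this exists; it just names the gap that
Vector leaves):

  import java.util.List;

  // Hypothetical: a typed row that can carry both quantitative and
  // qualitative fields before vectorization.
  interface Row {
    List<String> fieldNames();
    boolean isNumeric(String field);        // quantitative vs. qualitative
    double numericValue(String field);      // defined only for numeric fields
    String categoricalValue(String field);  // defined only for qualitative fields
  }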

But perhaps we already have some of this; I am not sure. I have seen a
fair amount of classes that adapt various formats (what was it? TSV?
ARFF?), so perhaps we could turn those into strategies as well.

On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <> wrote:
> Yes.
> But how do we specify the input?  And how do we specify the encodings?
> This is what has always held me back in the past.  Should we just allow
> classes to be specified on the command line?
> On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <> wrote:
>> Maybe there is indeed room for an MR-based input-conversion job as a
>> command-line routine? I was thinking along the same lines. Maybe even
>> along with standardization of the values, and some formal definition
>> of the inputs being fed to it.
