mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?
Date Fri, 22 Apr 2011 15:47:56 GMT
Maybe there's space for an MR-based input conversion job as a command-line
routine? I was thinking about the same thing. Maybe even along with
standardisation of the values, and some formal definition of the inputs
being fed to it.

apologies for brevity.

Sent from my android.
-Dmitriy
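The multi-probe hashed encoding discussed in this thread can be sketched
roughly as follows. This is a simplified illustration, not Mahout's actual
encoder API; the hash function and the choice to split each feature's weight
evenly across probes are assumptions made for the sketch:

```python
import hashlib


def hash_slot(name, seed, dim):
    # Map a (probe seed, feature name) pair to a slot in [0, dim).
    digest = hashlib.md5(f"{seed}:{name}".encode()).hexdigest()
    return int(digest, 16) % dim


def encode(features, dim=2 ** 20, probes=2):
    """Encode {feature_name: weight} into a dense hashed vector.

    Each feature is added at `probes` different slots with weight
    value / probes, so a single-slot collision corrupts only part of
    the feature's signal (the multi-probe idea from this thread).
    """
    vec = [0.0] * dim
    for name, value in features.items():
        for seed in range(probes):
            vec[hash_slot(name, seed, dim)] += value / probes
    return vec
```

With `probes=2`, two features are fully confused only if both of their slots
coincide, which is far less likely than a single-slot collision.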
On Apr 21, 2011 3:05 PM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
> It is definitely a reasonable idea to convert data to hashed feature
> vectors using map-reduce.
>
> And yes, you can pick a vector length that is long enough so that you
> don't have to worry about collisions. You need to examine your data to
> decide how large that needs to be, but it isn't hard to do. The encoding
> framework handles the placement of features in the vector for you. You
> don't have to worry about that.
>
> On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
>> Thanks Ted. Since SGD is a sequential method, the Vector created for
>> each line could be very large and still won't consume too much memory.
>> Could I assume that if we have a limited number of features, or could
>> use map-reduce to pre-process the data to learn how many different
>> values a category could have, we could just create a long vector and
>> put different feature values into different slots to avoid possible
>> feature collisions?
>>
>> Thanks,
>> Stanley
>>
>>
>>
>> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>>
>> > Stanley,
>> >
>> > Yes. What you say is correct. Feature hashing can cause degradation.
>> >
>> > With multiple hashing, however, you do have a fairly strong guarantee
>> > that the feature hashing is very close to information preserving. This
>> > is related to the fact that the feature hashing operation is a random
>> > linear transformation. Since we are hashing to something that is still
>> > quite a high dimensional space, the information loss is likely to be
>> > minimal.
>> >
>> > On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>> >
>> > > Dear all,
>> > >
>> > > Per my understanding, what Feature Hashing does in SGD is compress
>> > > the feature dimensions into a fixed-length Vector. Won't that make
>> > > the training result incorrect if a feature hashing collision
>> > > happens? Won't two features hashed to the same slot be treated as
>> > > the same feature? Even if we have multiple probes to reduce the
>> > > total collisions, like a bloom filter, won't a slot with a collision
>> > > look like a combination feature?
>> > >
>> > > Thanks.
>> > >
>> > > Best wishes,
>> > > Stanley Xu
>> > >
>> >
>>
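Ted's advice to examine the data before picking a vector length can be made
concrete with a birthday-problem estimate. This is a back-of-the-envelope
sketch under a uniform-hashing, single-probe assumption, not anything from
Mahout itself:

```python
def expected_colliding_features(n_features, dim):
    """Expected number of features that share a slot with at least one
    other feature, assuming each feature hashes uniformly and
    independently into one of `dim` slots (single probe).
    """
    # Expected number of distinct slots occupied by n_features hashes.
    expected_distinct_slots = dim * (1 - (1 - 1 / dim) ** n_features)
    # Every feature beyond the distinct-slot count landed on an
    # already-occupied slot.
    return n_features - expected_distinct_slots
```

For example, hashing 10,000 features into a 2^20-slot vector leaves only a
few dozen of them sharing a slot, and since collisions grow roughly as
n^2 / (2 * dim), doubling `dim` roughly halves that count.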
