spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: [MLlib] Term Frequency in TF-IDF seems incorrect
Date Tue, 02 Aug 2016 09:25:46 GMT
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang <ybliang8@gmail.com> wrote:

> Hi Hao,
>
> HashingTF directly apply a hash function (Murmurhash3) to the features to
> determine their column index. It excluded any thought about the term
> frequency or the length of the document. It does similar work compared with
> sklearn FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and can
> not convert the output back to the original features. Actually we misnamed
> this transformer, it only does the work of feature hashing rather than
> computing hashing term frequency.
>
> CountVectorizer will select the top vocabSize words ordered by term
> frequency across the corpus to build the hash table of the features. So it
> will consume more memory than HashingTF. However, we can convert the output
> back to the original feature.
>
> Both of the transformers do not consider the length of each document. If
> you want to compute term frequency divided by the length of the document,
> you should write your own function based on transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren <invkrh@gmail.com>:
>
>> When computing term frequency, we can use either HashTF or
>> CountVectorizer feature extractors.
>> However, both of them just use the number of times that a term appears in
>> a document.
>> It is not a true frequency. Acutally, it should be divided by the length
>> of the document.
>>
>> Is this a wanted feature ?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>

Mime
View raw message