spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Pentreath <>
Subject Re: [MLlib] Term Frequency in TF-IDF seems incorrect
Date Tue, 02 Aug 2016 09:25:46 GMT
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition ( of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang <> wrote:

> Hi Hao,
> HashingTF directly apply a hash function (Murmurhash3) to the features to
> determine their column index. It excluded any thought about the term
> frequency or the length of the document. It does similar work compared with
> sklearn FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and can
> not convert the output back to the original features. Actually we misnamed
> this transformer, it only does the work of feature hashing rather than
> computing hashing term frequency.
> CountVectorizer will select the top vocabSize words ordered by term
> frequency across the corpus to build the hash table of the features. So it
> will consume more memory than HashingTF. However, we can convert the output
> back to the original feature.
> Both of the transformers do not consider the length of each document. If
> you want to compute term frequency divided by the length of the document,
> you should write your own function based on transformers provided by MLlib.
> Thanks
> Yanbo
> 2016-08-01 15:29 GMT-07:00 Hao Ren <>:
>> When computing term frequency, we can use either HashTF or
>> CountVectorizer feature extractors.
>> However, both of them just use the number of times that a term appears in
>> a document.
>> It is not a true frequency. Acutally, it should be divided by the length
>> of the document.
>> Is this a wanted feature ?
>> --
>> Hao Ren
>> Data Engineer @ leboncoin
>> Paris, France

View raw message