spark-user mailing list archives

From Yanbo Liang
Subject Re: [MLlib] Term Frequency in TF-IDF seems incorrect
Date Tue, 02 Aug 2016 03:44:04 GMT
Hi Hao,

HashingTF directly applies a hash function (MurmurHash3) to the features to
determine their column index. It pays no attention to the relative term
frequency or the length of the document. It works much like sklearn's
FeatureHasher. The result is increased speed and reduced memory
usage, but it does not remember what the input features looked like and
cannot convert the output back to the original features. Actually, we misnamed
this transformer: it only does feature hashing rather than
computing a hashed term frequency.
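
To make the idea of feature hashing concrete, here is a minimal sketch in
plain Python. It is not the Spark API: the names hashed_counts and
num_features are made up for illustration, and zlib.crc32 is only a
stand-in for MurmurHash3, so the indices will not match what Spark produces.

```python
import zlib
from collections import Counter


def hashed_counts(tokens, num_features=16):
    """Hash each token to a column index and count the hits per column.

    zlib.crc32 is a deterministic stand-in hash (an assumption for this
    sketch); Spark uses MurmurHash3, so real indices will differ.
    """
    counts = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


doc = ["spark", "mllib", "spark", "tf"]
vec = hashed_counts(doc)
# vec maps column index -> raw count. Nothing is divided by len(doc),
# and no index -> token mapping is stored, so the tokens cannot be recovered.
```

Because only indices are kept, two different tokens that land in the same
column become indistinguishable, which is the price paid for the lower
memory usage.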

CountVectorizer selects the top vocabSize words, ordered by term
frequency across the corpus, to build the vocabulary (the lookup table of
the features). So it consumes more memory than HashingTF. However, we can
convert the output back to the original features.
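
The vocabulary-based approach can be sketched in plain Python as follows.
This is only an illustration of the idea, not CountVectorizer itself: the
helper names fit_vocabulary and transform are hypothetical, and tie-breaking
between equally frequent words may differ from Spark's.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size):
    """Keep the vocab_size most frequent terms across the whole corpus."""
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    return [term for term, _ in corpus_counts.most_common(vocab_size)]


def transform(doc, vocab):
    """Count occurrences of each vocabulary term; other terms are dropped."""
    index = {term: i for i, term in enumerate(vocab)}
    vec = [0] * len(vocab)
    for tok in doc:
        if tok in index:
            vec[index[tok]] += 1
    return vec


docs = [["spark", "mllib", "spark"], ["spark", "tf", "mllib"]]
vocab = fit_vocabulary(docs, vocab_size=2)
vec = transform(docs[0], vocab)

# The stored vocabulary lets us map column indices back to terms.
recovered = {vocab[i]: c for i, c in enumerate(vec) if c}
```

The vocabulary has to be kept in memory, which is where the extra cost
comes from, but it is also what makes the output reversible.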

Neither transformer considers the length of each document. If
you want term frequency divided by the length of the document,
you should write your own function on top of the transformers provided by MLlib.
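
The normalization step itself is small. A minimal sketch in plain Python,
assuming you already have the raw per-term counts for one document (the
function name normalized_tf is made up for illustration):

```python
def normalized_tf(counts):
    """Divide each raw count by the total number of counted tokens.

    Assumption: the counts cover every token in the document, so their
    sum equals the document length. If some tokens were dropped (for
    example, words outside a limited vocabulary), pass in the true
    length instead of relying on the sum.
    """
    total = sum(counts)
    if total == 0:
        # Empty document: avoid division by zero.
        return [0.0] * len(counts)
    return [c / total for c in counts]


raw = [2, 1, 1]          # raw counts for a 4-token document
tf = normalized_tf(raw)  # [0.5, 0.25, 0.25]
```

In Spark the same division would have to be applied to each row's count
vector, for example through a user-defined function over the transformer's
output column.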


2016-08-01 15:29 GMT-07:00 Hao Ren:

> When computing term frequency, we can use either the HashingTF or the
> CountVectorizer feature extractor.
> However, both of them just use the number of times that a term appears in
> a document.
> That is not a true frequency. Actually, it should be divided by the length
> of the document.
> Is this a wanted feature?
> --
> Hao Ren
> Data Engineer @ leboncoin
> Paris, France
