spark-user mailing list archives

From Hao Ren <inv...@gmail.com>
Subject [MLlib] Term Frequency in TF-IDF seems incorrect
Date Mon, 01 Aug 2016 22:29:23 GMT
When computing term frequency, we can use either the HashingTF or the
CountVectorizer feature extractor.
However, both of them just use the number of times that a term appears in a
document.
That is a raw count, not a true frequency. Actually, it should be divided by
the length of the document (its total number of terms).
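
To make the difference concrete, here is a minimal sketch in plain Python. It is not Spark code and does not call any MLlib API; the function names raw_counts and normalized_tf and the sample sentence are made up for illustration only.

```python
from collections import Counter


def raw_counts(tokens):
    """Number of occurrences of each term in one document."""
    return dict(Counter(tokens))


def normalized_tf(tokens):
    """Occurrences of each term divided by the document length."""
    n = len(tokens)
    if n == 0:
        # An empty document has no terms, so there is nothing to normalize.
        return {}
    return {term: count / n for term, count in Counter(tokens).items()}


# Hypothetical sample document with 9 tokens; "spark" appears twice.
doc = "spark makes big data simple and spark is fast".split()

counts = raw_counts(doc)
freqs = normalized_tf(doc)

print(counts["spark"])  # raw count of "spark"
print(freqs["spark"])   # raw count divided by len(doc)
```

With the raw count, a long document gets larger values than a short one even when the term makes up the same share of both; dividing by the document length removes that effect.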

Is this the intended behavior?

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France
