spark-user mailing list archives

From Hao Ren
Subject [MLlib] Term Frequency in TF-IDF seems incorrect
Date Mon, 01 Aug 2016 22:29:23 GMT
When computing term frequency, we can use either the HashingTF or the
CountVectorizer feature extractor.
However, both of them just use the number of times that a term appears in a
document. That is a raw count, not a true frequency. Actually, it should be
divided by the length of the document.
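
To make the distinction concrete, here is a minimal plain-Python sketch. It is not Spark code and does not call HashingTF or CountVectorizer; the function names `raw_term_counts` and `normalized_term_frequency` and the sample sentence are made up for illustration. It only contrasts a raw count with a count divided by the document length.

```python
from collections import Counter


def raw_term_counts(tokens):
    """Number of times each term appears in one document (a raw count)."""
    return dict(Counter(tokens))


def normalized_term_frequency(tokens):
    """Raw count of each term divided by the document length.

    Hypothetical helper for illustration only; returns an empty dict
    for an empty document to avoid dividing by zero.
    """
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


# Illustrative document: 7 tokens, with "spark" appearing twice.
doc = "spark is fast and spark is simple".split()

counts = raw_term_counts(doc)
frequencies = normalized_term_frequency(doc)

print("raw count of 'spark':", counts["spark"])
print("length-normalized frequency of 'spark':", frequencies["spark"])
```

With the normalized version, the values for one document sum to 1, so a long document and a short one become comparable. With raw counts, a longer document gets larger values simply because it contains more tokens.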

Is this the intended behavior?

Hao Ren

Data Engineer @ leboncoin

Paris, France
