mahout-user mailing list archives

From Ani Tumanyan <...@bnotions.com>
Subject TF-IDF confusion
Date Tue, 03 Dec 2013 15:03:31 GMT
Hello everyone,

I'm working on a project in which I'm trying to extract topics from news articles. My dataset
contains around 500,000 articles. Here are the steps I'm following:

1. First, I do some preprocessing. For this I'm using Behemoth to annotate the documents and
get rid of the non-English ones.
2. Then I run Mahout's sparse vector command to generate TF-IDF vectors. The problem is that
the TF-IDF vector for a document contains far more words than the corresponding TF vector.
Moreover, the TF-IDF vector contains some words/terms that didn't appear in that specific
document at all. Is this the correct behaviour, or is there something wrong with my approach?
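
To make the expectation concrete, here is a minimal sketch in plain Python of what I understand the textbook definition to imply. It is not Mahout's implementation: the function names, the toy documents, and the weighting idf = log(N / df) are all assumptions chosen for illustration. Under that definition, a TF-IDF vector only reweights the terms already present in the document's TF vector, so it should never contain a term the TF vector lacks.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Map each tokenized document to a sparse {term: raw count} vector."""
    return [Counter(tokens) for tokens in docs]


def tfidf_vectors(tf_vecs):
    """Reweight each TF vector by idf = log(N / df).

    Only terms already present in a document's TF vector are touched,
    so no new terms can appear in the result.
    """
    n_docs = len(tf_vecs)
    df = Counter()
    for vec in tf_vecs:
        df.update(vec.keys())  # each document counts once per term
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in vec.items()}
        for vec in tf_vecs
    ]


# Toy corpus (hypothetical, for illustration only).
docs = [
    "the election results were announced".split(),
    "the team won the match".split(),
    "election officials counted the votes".split(),
]

tf = tf_vectors(docs)
tfidf = tfidf_vectors(tf)

# Sanity check: every TF-IDF term must also be a TF term for the same document.
for tf_vec, tfidf_vec in zip(tf, tfidf):
    assert set(tfidf_vec) <= set(tf_vec)
```

If a check like this fails on the real vectors, one thing I would look at is whether the TF and TF-IDF outputs are being decoded with the same dictionary and matched by the same document keys.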

Thanks in advance!

Ani