The patch for pruning words with high document frequencies is ready:
https://issues.apache.org/jira/browse/MAHOUT-688
On Thu, Apr 28, 2011 at 5:08 PM, Vasil Vasilev wrote:
> Also the topic regularization patch is ready:
> https://issues.apache.org/jira/browse/MAHOUT-684
>
>
> On Thu, Apr 28, 2011 at 10:53 AM, Vasil Vasilev wrote:
>
>> Hi all,
>>
>> The LDA Vectorization patch is ready. You can take a look at:
>> https://issues.apache.org/jira/browse/MAHOUT-683
>>
>> Regards, Vasil
>>
>> On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev wrote:
>>
>>> Ok. I am going to try out 1) as suggested by Jake, then write a couple of
>>> tests, and then I will file the JIRA issues.
>>>
>>>
>>> On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll wrote:
>>>
>>>>
>>>> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>>>>
>>>> > Hi Mahouters,
>>>> >
>>>> > I was experimenting with the LDA clustering algorithm on the Reuters data
>>>> > set and made several enhancements which, if you find them interesting, I
>>>> > could contribute to the project:
>>>> >
>>>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors, not
>>>> > the tf-idf ones that result from seq2sparse. Because of this, words like
>>>> > "and", "where", etc. also end up in the resulting topics. To prevent
>>>> > that, I run seq2sparse through the whole tf-idf sequence and then run the
>>>> > "pruner". It first calculates the standard deviation of the document
>>>> > frequencies of the words and then prunes all entries in the tf vectors
>>>> > whose document frequency is greater than 3 times the calculated standard
>>>> > deviation. This keeps most of the word population while still pruning
>>>> > the unnecessary words.
>>>> >
>>>> > 2. Implemented the alpha-estimation part of the LDA algorithm as described
>>>> > in the Blei, Ng, Jordan paper. This leads to better results in maximizing
>>>> > the log-likelihood for the same number of iterations. As an example, for
>>>> > 20 iterations on the Reuters data set the enhanced algorithm reaches a
>>>> > value of -6975124.693072233, compared to -7304552.275676554 with the
>>>> > original implementation.
>>>> >
>>>> > 3. Created an LDA Vectorizer. It executes only the inference part of the
>>>> > LDA algorithm, based on the last LDA state and the input document vectors,
>>>> > and for each vector produces a vector of the gammas that result from the
>>>> > inference. The idea is that the vectors produced in this way can be used
>>>> > for clustering with any of the existing algorithms (like canopy, kmeans,
>>>> > etc.).
>>>> >
>>>>
>>>> As Jake says, this all sounds great. Please see:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>>>>
>>>>
>>>
>>
>
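
For readers who want to see the pruning rule from point 1 of the original message in concrete form, here is a minimal sketch in plain Python. It is not the code from the MAHOUT-688 patch and does not use any Mahout API. The function names, the dict-based representation of tf vectors, the use of the population standard deviation, and the guard for a zero standard deviation are all assumptions made for illustration. The cutoff follows the description literally: a term is dropped when its document frequency exceeds 3 times the standard deviation of all document frequencies.

```python
import math
from collections import Counter


def document_frequencies(tf_vectors):
    """Count, for each term, the number of documents it appears in.

    tf_vectors is a list of dicts mapping term -> term frequency
    (a hypothetical stand-in for sparse tf vectors).
    """
    df = Counter()
    for vec in tf_vectors:
        df.update(term for term, count in vec.items() if count > 0)
    return df


def prune_high_df_terms(tf_vectors, max_sigma=3.0):
    """Drop entries whose document frequency exceeds max_sigma * std.

    std is the population standard deviation of the document
    frequencies (assumption: the original message does not say whether
    the population or the sample form is used).
    """
    df = document_frequencies(tf_vectors)
    if not df:
        return [dict(vec) for vec in tf_vectors]

    values = list(df.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

    # Assumption: if every term has the same document frequency there is
    # nothing to distinguish, so the vectors are returned unchanged.
    if std == 0:
        return [dict(vec) for vec in tf_vectors]

    threshold = max_sigma * std
    return [
        {term: count for term, count in vec.items() if df[term] <= threshold}
        for vec in tf_vectors
    ]
```

As a usage example, take 20 documents that each contain "the" plus one unique word. The document frequency of "the" is 20 and every other term has a document frequency of 1, so the cutoff works out to roughly 12.1; "the" is removed from every vector and the unique words are kept.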