mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Extracting the topics of documents (LDA, Mahout 0.7)
Date Thu, 06 Feb 2014 15:08:46 GMT
I can't comment on the specific question that you ask, but it should not
necessarily be expected that LDA will reconstruct the categories that you
have in mind.  It will develop categories that explain the data as well as
it can, but that won't necessarily match the categories you intend.

It is likely, however, that the topics that LDA derives would make a good
set of features for a classifier.
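
To make the mismatch concrete: LDA topic numbers are arbitrary, so a comparison against hand-assigned categories has to go through a cross-tabulation rather than expecting topic i to equal category i. Below is a minimal Python sketch, assuming each document has already been reduced to its most probable topic. The labels and topic numbers are made-up illustration data, not results from this thread, and `cross_tab` and `purity` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def cross_tab(categories, topics):
    """Count how often each (topic, category) pair occurs."""
    table = defaultdict(Counter)
    for category, topic in zip(categories, topics):
        table[topic][category] += 1
    return table

def purity(categories, topics):
    """Fraction of documents that fall in the majority category of their topic."""
    table = cross_tab(categories, topics)
    majority = sum(max(counts.values()) for counts in table.values())
    return majority / len(categories)

# Made-up example: six documents, three intended categories, three LDA topics.
categories = ["syria", "syria", "rihanna", "rihanna", "sales", "sales"]
topics = [2, 2, 0, 0, 1, 0]  # topic ids are arbitrary and need not line up with categories

for topic, counts in sorted(cross_tab(categories, topics).items()):
    print(topic, dict(counts))
print("purity:", purity(categories, topics))
```

A topic that mixes two categories (topic 0 here) lowers the purity score, which is roughly the kind of disagreement to expect when LDA groups the data differently from the intended labels.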




On Thu, Feb 6, 2014 at 2:56 PM, Stamatis Rapanakis
<stamrapanakis@gmail.com> wrote:

>   I am trying to run the LDA algorithm. I can create meaningful topics but
> the document/topic assignment is of very bad quality.
>
>   I have assigned 30 tweets to each of the following 10 topics:
>
> /grammy awards
> /greek crisis
> /greek islands
> /premier inn
> /premier league
> /rihanna
> /syria
> /terrorism
> /winter olympics
> /winter sales
>
>   I have a total of 300 tweets and my purpose is to run the LDA algorithm
> to see how well these tweets are assigned. For example, if the number of
> topics parameter is set to 10, how closely does the result match the
> original assignment?
>
> 1. I start by creating a file that will contain (in random order) the
> tweets (*tweets.tsv*). This file will be used to compare the final tweets
> topic assignment.
>
> 2. I remove stopwords, urls, replies and create a file with the tweets
> text only (*tweets_no_stopwords.tsv*). One tweet (document) per file
> line. This will be the LDA input file.
>
> 3. I use some Java code to create a sequence file from
> *tweets_no_stopwords.tsv*. I use a SequenceFile.Writer object with an
> integer key and the tweet text as the value (extract the attached
> tweets_no_stopwords.rar, which contains a chunk-0 file).
>
>  By executing the command: *mahout seqdumper -i
> tweets_no_stopwords/chunk-0*
> the chunk-0 file contents appear correctly:
>
> *Key: 1: Value: #nowplaying Rihanna - Unfaithful !! 💙 trop belle !!*
> *Key: 2: Value: Grammy Awards Hairstyles: Memorable Moments*
> *...*
> *Key: 299: Value: team scored goal matches! (Man City)*
> *Key: 300: Value: Rocsi Diaz Wearing 5th Mercer- Grammy Awards*
>
> 4. I convert the data to vectors:
>
> bin/mahout seq2sparse -i tweets_no_stopwords -o
> tweets_no_stopwords-vectors -ow
>
> (I review the file with the command: *bin/mahout seqdumper -i
> tweets_no_stopwords-vectors/tf-vectors/part-r-00000*)
>
> 5. I convert keys to IntWritables
>
> bin/mahout rowid -i tweets_no_stopwords-vectors/tf-vectors/ -o
> tweets_no_stopwords-vectors/tf-vectors-cvb
>
> The created tf-vectors-cvb/docIndex, tf-vectors-cvb/matrix files have keys
> from 0 - 299 (300 instances).
>
> 6. Finally I run the LDA algorithm:
>
> *bin/mahout cvb -i tweets_no_stopwords-vectors/tf-vectors-cvb/matrix/ -o
> lda_output/topicterm -mt lda_output/models -dt lda_output/docTopics -k 10
> -x 40 -dict tweets_no_stopwords-vectors/dictionary.file-0*
>
> Note: I have to press Ctrl+C to stop the command (after it has finished
> and the message "Program took XXXX ms" appears), but the folders are
> created as expected.
>
> The topics created (lda_output/topicterm) seem fine. I execute the command:
>
> *bin/mahout vectordump -i lda_output/topicterm -d
> tweets_no_stopwords-vectors/dictionary.file-0 -dt sequencefile -c csv -p
> true -o p_term_topic.txt -sort lda_output/topicterm -vs 10*
>
> and follow the steps described in this link (
> http://sujitpal.blogspot.gr/2013/10/topic-modeling-with-mahout-on-amazon-emr.html)
> to create a file *p_term_topic.txt* and show a report with the output.
>
> Topic 0: winter, sales, olympics, love, played, people, big, photo, sale, trail
> Topic 1: terrorism, grammy, awards, blaindianexus, 56th, balochistan, bla, rock, 2014, photos
> Topic 2: islands, greek, greece, travel, find, book, make, kea, days, holiday
> Topic 3: greek, crisis, β, lol, s, top, economic, tomorrow, job, eu
> Topic 4: grammys, found, style, red, hairdressers, room, mata, good, ty, walks
> Topic 5: sochi, team, time, all, usa, war, free, syria, sending, check
> Topic 6: syria, city, manchester, united, back, hit, watching, chelsea, week, matchday
> Topic 7: syria, support, olympic, economy, video, today, competition, arab, u.s, inn's
> Topic 8: rihanna, time, watch, unapologetic, follow, great, euro, congrats, bet, hotels
> Topic 9: premier, inn, league, stay, season, β, year, home, goals, won
>
>
>
> These results are good, if you have in mind the (10) categories they
> belonged to:
>
> /grammy awards
> /greek crisis
> /greek islands
> /premier inn
> /premier league
> /rihanna
> /syria
> /terrorism
> /winter olympics
> /winter sales
>
> But the results in the folder *lda_output/docTopics* are really bad!
>
> bin/mahout seqdumper -i lda_output/docTopics/part-m-00000  (Display the
> results)
>
> Key: 0: Value:
> {0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,3:0.16986766822778876,4:*0.5144069716184998*,5:6.134281324000599E-5,6:0.022817498374309925,7:1.2427551415773865E-4,8:4.7632128287483606E-4,9:7.909325497553191E-5}
> Key: 1: Value:
> {0:0.004101560509130678,1:0.02531905947518225,2:0.14528444920763148,3:*0.32904199007739116*,4:0.06024210378042988,5:0.15510210839789676,6:0.0364093686560865,7:0.13256015086012124,8:0.0613456311044372,9:0.05059357793169288}
> Key: 2: Value:
> {0:2.093051210521087E-4,1:0.0242076645518674,2:0.12014785226603218,3:0.15589333731396188,4:0.022516226489811282,5:0.015141667919690474,6:0.08494844406302673,7:0.150039462386397,8:0.15927498562672762,9:*0.2676210542614334*}
>
>
> Tweet  Topic  Tweet text
> 1      4      #nowplaying Rihanna Unfaithful !! 💙 trop belle !!
> 2      3      Grammy Awards Hairstyles: Memorable Moments
> 3      9      Preeminent #terrorism research center website. Check out: cc
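
The topic listed for each tweet above is the entry with the highest probability in the corresponding docTopics vector. A minimal Python sketch of that step follows, assuming the seqdumper value is a plain `{topic:probability,...}` string and that the `*...*` markers in the quoted output are email emphasis rather than part of the data. `parse_doc_topics` and `best_topic` are hypothetical helper names.

```python
def parse_doc_topics(value):
    """Parse a seqdumper vector string like '{0:0.1,1:0.9}' into {topic: probability}."""
    body = value.strip().strip("{}")
    dist = {}
    for entry in body.split(","):
        topic, prob = entry.split(":")
        dist[int(topic)] = float(prob)
    return dist

def best_topic(dist):
    """Return the topic id with the highest probability."""
    return max(dist, key=dist.get)

# Value for key 0, copied from the quoted seqdumper output (emphasis markers removed).
key0 = ("{0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,"
        "3:0.16986766822778876,4:0.5144069716184998,5:6.134281324000599E-5,"
        "6:0.022817498374309925,7:1.2427551415773865E-4,8:4.7632128287483606E-4,"
        "9:7.909325497553191E-5}")

dist = parse_doc_topics(key0)
print(best_topic(dist), dist[best_topic(dist)])
```

Keeping the whole distribution rather than only the best topic is also what would be needed to use the topics as classifier features.
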
>
>
>  Am I missing something? Doesn't key 0 correspond to the first tweet
> (document), key 1 to the second tweet, and so on?
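
One thing worth checking here, offered as an assumption rather than a confirmed diagnosis: rowid numbers the rows in the order in which it reads the tf-vectors, and if the original keys are stored as text and come out sorted as strings, that order is "1", "10", "100", ... rather than 1, 2, 3. The docIndex file created in step 5 records which original key each row number stands for, so dumping it with seqdumper should show the real correspondence. The Python sketch below only simulates the string-sort effect for keys 1 to 300; it does not read any Mahout files.

```python
# Simulate original document keys 1..300 stored as text.
original_keys = [str(i) for i in range(1, 301)]

# Assumption: rows are numbered 0..299 in string-sorted key order.
row_to_key = dict(enumerate(sorted(original_keys)))

for row in range(5):
    print(row, "->", row_to_key[row])
```

If that assumption holds, key 1 in docTopics would describe tweet 10 and key 2 would describe tweet 100, which would make the per-document topics look wrong even when the model itself is reasonable.
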
>
>   Thank you in advance for your responses.
>
>
>
