mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: mahout 0.5 to 0.7 commandline parameter of lda
Date Thu, 18 Oct 2012 17:27:47 GMT
On Thu, Oct 18, 2012 at 9:16 AM, Vineeth <vineethrakesh@gmail.com> wrote:

> I am running LDA for the first time. I gave the following command to
> test it on the Reuters dataset, but I got this error:
>
> lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 7000 -x
> 20 -ow
>
> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
> locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 12/10/18 12:11:17 ERROR driver.MahoutDriver: : Try the new Collapsed
> Variation Bayes LDA, try bin/mahout cvb or bin/mahout cvb0_local
>
> As I mentioned, this command seems to be for Mahout 0.5. Now if I have to
> use Collapsed Variational Bayes LDA, how do you give the parameters? Are
> there any websites describing the usage of CVB LDA?


If you want a summary of all the command-line options for the CVB impl, just do:

mahout cvb

A typical invocation looks like:

mahout cvb -i path/to/tf-vectors -o output_dir/lda_output -k <num_topics> \
    -x <num_iterations> -a <smoothing alpha param> -e <smoothing eta param> \
    -dict path/to/dictionary.file-0 -dt <"sequencefile" or "text"> \
    --topic_model_temp_dir path/to/store/temp_state

num_iterations can be something like 20-30. The results are not too
sensitive to alpha or eta, but both should be pretty small: 0.01 or so often
seems to be the right order of magnitude for both of them, but you have to
play with them, since we don't learn the hyperparameters in this impl.
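
As a concrete starting point, here is a minimal sketch of that invocation with
values filled in. The paths are the same placeholders as in the command above
(not real locations), -k 10 is borrowed from the original lda command, 0.01 for
both smoothing params is only a first guess to tune, and passing "sequencefile"
for -dt is an assumption about how the dictionary was written. It only prints
the command unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints a `mahout cvb` command line.
# All paths are placeholders; replace them with your own locations.
INPUT="path/to/tf-vectors"
OUTPUT="output_dir/lda_output"
DICT="path/to/dictionary.file-0"
TEMP_DIR="path/to/store/temp_state"

NUM_TOPICS=10       # -k: same topic count as the old lda command in this thread
NUM_ITERATIONS=20   # -x: 20-30 is the suggested range
ALPHA=0.01          # -a: small smoothing value, a starting point only
ETA=0.01            # -e: same order of magnitude as alpha

# Assemble the command using only the options listed in this message.
build_cvb_command() {
  echo "mahout cvb -i $INPUT -o $OUTPUT -k $NUM_TOPICS -x $NUM_ITERATIONS" \
       "-a $ALPHA -e $ETA -dict $DICT -dt sequencefile" \
       "--topic_model_temp_dir $TEMP_DIR"
}

CMD=$(build_cvb_command)
echo "$CMD"

# Only execute when explicitly requested, e.g. RUN=1 sh run_cvb.sh
if [ "${RUN:-0}" = "1" ]; then
  # Word splitting is intended; the placeholder paths contain no spaces.
  $CMD
fi
```

If the topics look poor, vary ALPHA and ETA around 0.01 and rerun, since the
hyperparameters are not learned.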

Let me know if that works for you.


>
> On 12-10-18 09:09 AM, Jake Mannix wrote:
>
>> For Mahout 0.7, the format of the model files for LDA is just a
>> SequenceFile<IntWritable, VectorWritable>, with the row numbers being the
>> topicIds, and the entries being the (un-normalized) probabilities for each
>> termId.
>>
>> bin/vectordump --dictionary <path to dictionary file> \
>>                --dictionaryType <either text or sequencefile> \
>>                --input <path to model files> \
>>                --vectorSize <num entries per topic you want to see> \
>>                --sortVectors
>>
>>
>> On Wed, Oct 17, 2012 at 10:11 PM, vineeth <vineethrakesh@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I am looking at this website:
>>> http://theglassicon.com/computing/machine-learning/running-lda-algorithm-mahout
>>> (Mahout 0.5). This website gives the complete procedure to get the
>>> probabilities of words and topics using LDA. However, these steps do not
>>> work on Mahout 0.7. Can someone give an updated website with the same
>>> steps, or provide me the alternative commands and parameters?
>>>
>>> Thank You
>>> Vineeth
>>>
>>>
>>
>>
>


-- 

  -jake
