mahout-user mailing list archives

From: Jake Mannix <jake.man...@gmail.com>
Subject: Re: Problem running new LDA algorithm (cvb) against the Reuters data
Date: Sat, 05 May 2012 05:28:57 GMT
I'm about to head to bed right now (long day, flight to and from SF in one
day, need sleep), but the short answer is that the new LDA requires
SequenceFile<IntWritable, VectorWritable> as input (the same disk format
as DistributedRowMatrix). You can get that from a SequenceFile<Text,
VectorWritable> by running the RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h"
for more details) before running CVB.
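
For reference, something like the following should slot in between the
seq2sparse and cvb steps (a minimal sketch, untested; the reuters-out-matrix
directory name is my own, and the paths assume the same ${WORK_DIR} layout
as the script quoted below):

  # Convert the Text-keyed tf-vectors into the IntWritable-keyed
  # matrix format that CVB expects.
  $MAHOUT rowid \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/reuters-out-matrix

  # RowIdJob writes two files under the output directory:
  #   matrix   -- SequenceFile<IntWritable, VectorWritable>, the CVB input
  #   docIndex -- SequenceFile<IntWritable, Text>, mapping the new integer
  #               row IDs back to the original document names
  $MAHOUT cvb \
    -i ${WORK_DIR}/reuters-out-matrix/matrix \
    -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb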

Let us know if that doesn't help!

On Fri, May 4, 2012 at 8:54 PM, DAN HELM <danielhelm@verizon.net> wrote:

> I am attempting to run the new LDA algorithm cvb (Mahout version 0.6)
> against the Reuters data. I just added another entry to the
> cluster-reuters.sh example script as follows:
>
> ******************************************************************************
> elif [ "x$clustertype" == "xcvb" ]; then
>   $MAHOUT seq2sparse \
>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>     -wt tf -seq -nr 3 --namedVector \
>   && \
>   $MAHOUT cvb \
>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>   && \
>   $MAHOUT ldatopics \
>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -dt sequencefile
>
> ******************************************************************************
> I successfully ran the previous LDA algorithm against Reuters, but I am
> most interested in this new implementation of LDA because I want the new
> feature that generates document-to-topic mappings (i.e., the -dt parameter).
>
> When I run the above code in Hadoop pseudo-distributed mode, as well as on
> a small cluster, I receive the same error from the "mahout cvb" command.
> All the pre-clustering logic, including sequence file and sparse vector
> generation, works fine, but when the cvb clustering is attempted the
> mappers fail with the following error in the Hadoop map task log:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
>  at
> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Any help with resolving the problem would be appreciated.
>
> Dan




-- 

  -jake
