mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: rowid conversion step to prepare input vectors for cvb clustering
Date Fri, 01 Jun 2012 13:53:20 GMT
On Thu, May 31, 2012 at 4:18 PM, DAN HELM <danielhelm@verizon.net> wrote:

> I have a question about using rowid to convert sparse vectors (generated
> via seq2sparse) to the form needed for cvb clustering (i.e., to change the
> Text key to an Integer).  Prior to running this step I had 3 “part” files
> in my tf-vectors folder.  After running rowid on the tf-vectors folder it
> generates one “Matrix” file and a “docIndex” file.


Yeah, RowIdJob is a non-distributed process: it reads all of the input and
writes a single output file, so there are never multiple part files for the
downstream mappers to split on.  It should produce them, though, and it
would be nice if the number were configurable.  It's easy enough to change:
if you look in org.apache.mahout.utils.vectors.RowIdJob, you can see that
it's writing to one output file.  Probably the right thing is to create a
directory of that name ("matrix") and, inside it, write part-{n} files
numbered from 00000 up to n.  A patch to change this would be welcome!  Then
if your 3-part input turned out to run too slowly, you could use RowIdJob to
turn it into 30 part files and get more parallelism.


> The result of this step is that when running the cvb clustering on the
> folder containing “Matrix” only a single mapper runs on one node.  For a
> large collection this takes an excessive amount of time to run.
>
> I assume cvb should be able to run in a distributed fashion on multiple
> nodes using many mappers/tasktrackers?


CVB is certainly able to run with lots of mappers, yes.


> If so, am I running rowid incorrectly on the entire tf-vectors folder as
> opposed to separately on each “part” file in tf-vectors?  Of course it
> generates the name “Matrix” in output so this implies it wants to generate
> a single file.
>

To get you moving faster, you can either modify RowIdJob (and submit a
patch of what you did, please!), or else run it separately on each part
file, rename each of the "matrix" output files, and use those.  The only
problem with the latter approach is that you'll have duplicate document
ids, since each run numbers its rows starting from 0.  That doesn't matter
for running LDA, but it will make it harder to read off your final
document -> topic assignments in the last step of the process, where the
output is SequenceFiles with <docId, p(topic | docId)> key/value pairs.
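
The per-part workaround could be scripted roughly as below.  This is only a
sketch, not something I've tested against a cluster: the function name
rowid_per_part, the "-tmp-" and "-index" directory suffixes, and the part
file names in the usage comment are all made up for illustration, and it
assumes rowid accepts a single part file as its -i argument and writes
"matrix" and "docIndex" into the -o directory.

```shell
#!/bin/sh
# Sketch: run rowid once per input part file and collect the outputs.
# MAHOUT and HADOOP default to the plain launcher names; override as needed.
MAHOUT=${MAHOUT:-mahout}
HADOOP=${HADOOP:-hadoop}

# rowid_per_part IN_DIR OUT_DIR PART...   (hypothetical helper)
#   IN_DIR   directory holding the seq2sparse part files (e.g. tf-vectors)
#   OUT_DIR  directory that will hold one matrix-<part> file per input part
#   PART...  names of the part files inside IN_DIR
rowid_per_part() {
  in_dir=$1
  out_dir=$2
  shift 2

  # One directory for the matrices (the cvb input), one for the doc indexes,
  # so that cvb never tries to read a docIndex file as a matrix.
  $HADOOP fs -mkdir "${out_dir}" "${out_dir}-index" || return 1

  for part in "$@"; do
    tmp="${out_dir}-tmp-${part}"
    # Each run restarts its row ids, so ids repeat across parts.
    $MAHOUT rowid -i "${in_dir}/${part}" -o "${tmp}" || return 1
    $HADOOP fs -mv "${tmp}/matrix" "${out_dir}/matrix-${part}" || return 1
    $HADOOP fs -mv "${tmp}/docIndex" "${out_dir}-index/docIndex-${part}" || return 1
  done
}

# Example call (part names are placeholders; list your own):
#   rowid_per_part ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
#     ${WORK_DIR}/sparse-vectors-cvb part-r-00000 part-r-00001 part-r-00002
```

Keeping a docIndex-<part> next to each matrix-<part> is what lets you undo
the duplicate ids afterwards: a row id is only meaningful together with the
part it came from.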


>
> Any advice on running cvb using multiple mappers would be appreciated.
> The following are some pertinent lines from my test shell script to process
> Reuters data:
>
> *******************************************
>   $MAHOUT2 seq2sparse \
>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>     -wt tf -seq -nr 3 --namedVector \
>   && \
>   $MAHOUT rowid \
>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>     -o ${WORK_DIR}/sparse-vectors-cvb \
>   && \
>   $HADOOP fs -mv ${WORK_DIR}/sparse-vectors-cvb/docIndex ${WORK_DIR}/sparse-vectors-index-cvb \
>   && \
>   $MAHOUT cvb \
>     -i ${WORK_DIR}/sparse-vectors-cvb \
>     -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb




-- 

  -jake
