mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Date Thu, 21 Feb 2013 17:05:21 GMT
This looks like you've got an old version of Mahout - are you running on
trunk?  This has been fixed on trunk; there was a bug in the 0.6 (roughly)
timeframe in which vectors for vectordump --sort were incorrectly assumed
to be of size MAX_INT, which led to heap problems no matter how much heap
you gave it.  Well, maybe you could have worked around it with 2^31 * (4 +
8) bytes ~ 24GB of heap, but really the solution is to upgrade and run off
of trunk.
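
As a back-of-the-envelope check (a sketch of mine, not thread material; the
4 + 8 bytes per entry is a rough figure for an int index plus a double
weight, not a measured number), a few lines of Java show why no practical
-Xmx would have saved the old code:

public class SortBugMath {
  public static void main(String[] args) {
    // A sparse vector can report a nominal dimension near Integer.MAX_VALUE,
    // and the pre-fix vectordump --sort sized its priority queue accordingly.
    long entries = Integer.MAX_VALUE;   // 2^31 - 1
    long bytesPerEntry = 4 + 8;         // rough: int index + double weight
    double gib = entries * bytesPerEntry / (double) (1L << 30);
    System.out.printf("queue storage alone: ~%.0f GiB%n", gib); // ~24 GiB
  }
}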


On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20525entradero@gmail.com> wrote:

> My trial is as below; however, it still doesn't get through...
>
> I increased MAHOUT_HEAPSIZE as below, and also removed the comment marks
> from the mahout shell script so that I can check it's actually taking effect.
> Added JAVA_HEAP_MAX=-Xmx4g (the default was 3g).
>
> ~bin/mahout~
> JAVA=$JAVA_HOME/bin/java
> JAVA_HEAP_MAX=-Xmx4g      # <- increased from the original 3g to 4g
> # check envvars which might override default args
> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
>   echo "run with heapsize $MAHOUT_HEAPSIZE"
>   JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
>   echo $JAVA_HEAP_MAX
> fi
>
> Also added the same 4G heap size in hadoop-env.sh, as
>
> ~hadoop-env.sh~
> # The maximum amount of heap to use, in MB. Default is 1000.
> export HADOOP_HEAPSIZE=4000
>
> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> --vectorSize 5 --printKey TRUE --sortVectors TRUE
> run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
> -Xmx4000m                       *<- Right?*
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>  at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>  at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>  at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>  at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>  at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>  at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> [hadoop@localhost NHTSA]$
> I've also confirmed through the VisualVM utility that at least all the
> Hadoop tasks are indeed taking 4GB of heap.
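> For a quick sanity check, the effective -Xmx can also be read directly
> from Java (a minimal sketch of mine, not from Mahout; VisualVM should
> agree with it):
>
> public class HeapCheck {
>   public static void main(String[] args) {
>     // maxMemory() reflects the -Xmx this JVM actually received
>     System.out.printf("max heap: %.1f GiB%n",
>         Runtime.getRuntime().maxMemory() / (double) (1L << 30));
>   }
> }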
>
> I have also run clusterdump to extract the top 10 terms from the K-Means
> result, as below, using exactly the same input data set; that task, however,
> requires no extra heap beyond the default.
>
> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> NHTSA-vectors01/dictionary.file-* -i
> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> -b 30 -n 10
>
> I believe the vectordump and clusterdump utilities derive from different
> roots in terms of their heap requirements.
>
> Still waiting for some advice from you people.
> Regards,,,
> Y.Mandai
> 2013/2/19 万代豊 <20525entradero@gmail.com>
>
> >
> > Well, the --sortVectors option for the vectordump utility, used to
> > evaluate the result of CVB clustering, unfortunately brought me an
> > OutOfMemory issue...
> >
> > Here is the case that seems to go well without the --sortVectors option.
> > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > --printKey TRUE
> > ...
> > WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE INSPECTING:3.854370531928256E-
> > ...
> >
> > Once I gave --sortVectors TRUE as below, I ran into an OutOfMemory
> > exception.
> > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > --printKey TRUE *--sortVectors TRUE*
> > Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> > MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> > 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> > {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> > --dictionaryType=[sequencefile], --endPhase=[2147483647],
> > --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> > --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> > 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> > *Exception in thread "main" java.lang.OutOfMemoryError: Java heap space*
> >  at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >  at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >  at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >  at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >  at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >  at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >  at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >  at java.lang.reflect.Method.invoke(Method.java:597)
> >  at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >  at java.lang.reflect.Method.invoke(Method.java:597)
> >  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > I see that there are several parameters that affect the heap given to a
> > Mahout job, some dependent on and some independent of Hadoop, such as
> > MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc.
> >
> > Can anyone advise me on which configuration files, shell scripts, or XMLs
> > I should give some additional heap in, and also on the proper way to
> > monitor the actual heap usage here?
> >
> > I'm running mahout-distribution-0.7 on Hadoop-0.20.203.0 in a
> > pseudo-distributed configuration, on a VMware Player partition running
> > CentOS 6.3 64-bit.
> >
> > Regards,,,
> > Y.Mandai
> > 2013/2/1 Jake Mannix <jake.mannix@gmail.com>
> >
> >> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com> wrote:
> >>
> >> > Thanks, Jake, for your guidance.
> >> > Good to know that I wasn't entirely wrong but was just not familiar
> >> > enough with the vectordump usage.
> >> > I'll try this out as soon as I can.
> >> > Hope that --sort doesn't eat up too much heap.
> >> >
> >>
> >> If you're using code on master, --sort should only be using an additional
> >> K objects of memory (where K is the value you passed to --vectorSize), as
> >> it's just using an auxiliary heap to grab the top K items of the vector.
> >> It was a bug previously that it tried to instantiate a vector.size()-sized
> >> list somewhere [and vector.size() in some cases was Integer.MAX_VALUE].
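> >>
> >> In sketch form, the bounded-heap approach looks like this (illustrative
> >> Java only, not Mahout's actual TDoublePQ; Map.Entry stands in for the
> >> vector's element type):
> >>
> >> import java.util.Comparator;
> >> import java.util.Map;
> >> import java.util.PriorityQueue;
> >>
> >> public class TopKEntries {
> >>   // Keep only the K largest (index, weight) pairs; memory stays O(K)
> >>   // regardless of the vector's nominal dimension.
> >>   static PriorityQueue<Map.Entry<Integer, Double>> topK(
> >>       Iterable<Map.Entry<Integer, Double>> nonZeroEntries, int k) {
> >>     PriorityQueue<Map.Entry<Integer, Double>> heap =
> >>         new PriorityQueue<Map.Entry<Integer, Double>>(k,
> >>             new Comparator<Map.Entry<Integer, Double>>() {
> >>               public int compare(Map.Entry<Integer, Double> a,
> >>                                  Map.Entry<Integer, Double> b) {
> >>                 return Double.compare(a.getValue(), b.getValue());
> >>               }
> >>             }); // min-heap ordered by weight
> >>     for (Map.Entry<Integer, Double> e : nonZeroEntries) {
> >>       if (heap.size() < k) {
> >>         heap.offer(e);
> >>       } else if (e.getValue() > heap.peek().getValue()) {
> >>         heap.poll();   // evict the smallest of the current top K
> >>         heap.offer(e);
> >>       }
> >>     }
> >>     return heap;
> >>   }
> >> }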
> >>
> >>
> >> >
> >> > Regards,,,
> >> > Yutaka
> >> >
> >> > Sent from my iPhone
> >> >
> >> > On 2013/01/31, at 23:33, Jake Mannix <jake.mannix@gmail.com> wrote:
> >> >
> >> > > Hi Yutaka,
> >> > >
> >> > >
> >> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entradero@gmail.com> wrote:
> >> > >
> >> > >> Hi
> >> > >> Here is a question around how to evaluate the result of Mahout 0.7
> >> > >> CVB (Collapsed Variational Bayes), which used to be LDA
> >> > >> (Latent Dirichlet Allocation) in Mahout versions up to 0.5.
> >> > >> I believe I have no problem running CVB itself, and this is purely a
> >> > >> question on the efficient way to visualize or evaluate the result.
> >> > >
> >> > >> Looks like result evaluation in Mahout-0.5, at least, could be done
> >> > >> using the utility called "LDAPrintTopics"; however, this is already
> >> > >> obsolete since Mahout 0.5. (See "Mahout in Action", p. 181, on LDA.)
> >> > >>
> >> > >> As said, I'm using Mahout-0.7. I believe I'm running CVB
> >> > >> successfully and have obtained results in two separate directories:
> >> > >> /user/hadoop/temp/topicModelState/model-1 through model-20, matching
> >> > >> the specified number of iterations, and
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009,
> >> > >> matching the specified number of topics that I wanted to
> >> > >> extract/decompose.
> >> > >>
> >> > >> Both of the files contained in the directories can be dumped using
> >> > >> Mahout vectordump; however, the output format is way different from
> >> > >> what you would have gotten using LDAPrintTopics in versions below
> >> > >> 0.5, which gave you back the result as the topic ID and its
> >> > >> associated top terms in a very direct format. (See "Mahout in
> >> > >> Action", p. 181, again.)
> >> > >>
> >> > >
> >> > > Vectordump should be exactly what you want, actually.
> >> > >
> >> > >
> >> > >>
> >> > >> Here is what I've done as below.
> >> > >> 1. Say I have already generated document vectors and used tf-vectors
> >> > >> to generate a document/term matrix as
> >> > >>
> >> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> >> > >> NHTSA-matrix03
> >> > >>
> >> > >> 2. and get rid of the matrix docIndex, as it would get in my way (as
> >> > >> I have been advised somewhere…)
> >> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> >> > >> NHTSA-matrix03-docIndex
> >> > >>
> >> > >> 3. confirmed that I have only what I need here:
> >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> >> > >> Found 1 items
> >> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> >> > >> /user/hadoop/NHTSA-matrix03/matrix
> >> > >>
> >> > >> 4. and kicked off CVB as
> >> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
> >> > >> -dict NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> >> > >> …
> >> > >> ….
> >> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> >> > >> (Minutes: 733.1281333333334)
> >> > >> (It took over 12 hrs to process 100k documents on my laptop with
> >> > >> pseudo-distributed Hadoop 0.20.203.)
> >> > >>
> >> > >> 5. Take a look at what I've got.
> >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> >> > >> Found 12 items
> >> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> >> > >> [hadoop@localhost NHTSA]$
> >> > >>
> >> > >
> >> > > Ok, these should be your model files, and to view them, you can do it
> >> > > the way you can view any SequenceFile<IntWritable, VectorWritable>,
> >> > > like this:
> >> > >
> >> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> >> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
> >> > > --dictionaryType sequencefile --vectorSize 5 --sort
> >> > >
> >> > > This will dump the top 5 terms (with weights - not sure if they'll be
> >> > > normalized properly) from each topic to the output file
> >> > > "topic_dump.txt".
> >> > >
> >> > > Incidentally, this same command can be run on the topicModelState
> >> > > directories as well, which lets you see how fast your topic model was
> >> > > converging (and thus shows you, on a smaller data set, how many
> >> > > iterations you may want to run with later on).
> >> > >
> >> > >
> >> > >>
> >> > >> and
> >> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> >> > >> Found 20 items
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> >> > >> /user/hadoop/temp/topicModelState/model-1
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> >> > >> /user/hadoop/temp/topicModelState/model-10
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> >> > >> /user/hadoop/temp/topicModelState/model-11
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> >> > >> /user/hadoop/temp/topicModelState/model-12
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> >> > >> /user/hadoop/temp/topicModelState/model-13
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> >> > >> /user/hadoop/temp/topicModelState/model-14
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> >> > >> /user/hadoop/temp/topicModelState/model-15
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> >> > >> /user/hadoop/temp/topicModelState/model-16
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> >> > >> /user/hadoop/temp/topicModelState/model-17
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> >> > >> /user/hadoop/temp/topicModelState/model-18
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> >> > >> /user/hadoop/temp/topicModelState/model-19
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> >> > >> /user/hadoop/temp/topicModelState/model-2
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> >> > >> /user/hadoop/temp/topicModelState/model-20
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> >> > >> /user/hadoop/temp/topicModelState/model-3
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> >> > >> /user/hadoop/temp/topicModelState/model-4
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> >> > >> /user/hadoop/temp/topicModelState/model-5
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> >> > >> /user/hadoop/temp/topicModelState/model-6
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> >> > >> /user/hadoop/temp/topicModelState/model-7
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> >> > >> /user/hadoop/temp/topicModelState/model-8
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> >> > >> /user/hadoop/temp/topicModelState/model-9
> >> > >>
> >> > >> Hope someone could help me out with this.
> >> > >> Regards,,,
> >> > >> Yutaka
> >> > >>
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > >  -jake
> >> >
> >>
> >>
> >>
> >> --
> >>
> >>   -jake
> >>
> >
> >
>



-- 

  -jake
