mahout-user mailing list archives

From 万代豊 <20525entrad...@gmail.com>
Subject Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Date Thu, 21 Feb 2013 04:47:47 GMT
My trial is below. However, it still doesn't get through...

I increased MAHOUT_HEAPSIZE as below, and also removed the comment marks
from the mahout shell script so that I can check it's actually taking effect.
I also raised JAVA_HEAP_MAX to -Xmx4g (the default was 3g).

~bin/mahout~
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx4g      # <- increased from the original 3g to 4g
# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
  echo $JAVA_HEAP_MAX
fi
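
To double-check that the override actually reaches the JVM that runs the dump
(a quick sketch of my own, not from the Mahout docs), I grep the live process
for its -Xmx flag:

# in another terminal while vectordump runs; [R]unJar keeps grep from matching itself
ps -ef | grep '[R]unJar' | grep -o '\-Xmx[0-9]*[mg]'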

Also added the same heap size as 4G in hadoop-env.sh as

~hadoop-env.sh~
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000
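
If I read bin/hadoop in 0.20.x correctly (this is an assumption on my part),
HADOOP_HEAPSIZE becomes the client JVM's -Xmx, and HADOOP_OPTS is appended
after it on the java command line, so a trailing -Xmx there should win, since
HotSpot takes the last flag it sees. That would make this another knob to try:

# hypothetical alternative in hadoop-env.sh: a later -Xmx overrides the one
# derived from HADOOP_HEAPSIZE
export HADOOP_OPTS="$HADOOP_OPTS -Xmx4g"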

[hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
[hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
  -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
  --vectorSize 5 --printKey TRUE --sortVectors TRUE
run with heapsize 4000    <- looks like RunJar is taking a 4GB heap?
-Xmx4000m                 <- right?
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
{--dictionary=[NHTSA-vectors01/dictionary.file-*],
--dictionaryType=[sequencefile], --endPhase=[2147483647],
--input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
--startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
 at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
 at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
 at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
 at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
 at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
[hadoop@localhost NHTSA]$
I've also confirmed through the VisualVM utility that, at least, all of the
Hadoop tasks are taking a 4GB heap.
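
For a terminal-only cross-check (again just a sketch of mine, assuming a
HotSpot JDK with jps/jmap on the PATH):

jps -l                        # find the pid of org.apache.hadoop.util.RunJar
jmap -heap <pid> | grep MaxHeapSize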

I have also run clusterdump, as below, to extract the top 10 terms from the
result of k-means using exactly the same input data set; however, that task
requires no extra heap beyond the default.

$ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile \
  -d NHTSA-vectors01/dictionary.file-* \
  -i NHTSA-kmeans-clusters01/clusters-9-final \
  -o NHTSA-kmeans-clusterdump01 -b 30 -n 10

I believe the vectordump and clusterdump utilities must differ at the root in
terms of their heap requirements; the stack trace points at the priority queue
set up in VectorHelper.topEntries on the --sortVectors path.
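
Since Jake mentions further down the thread that the sort path has been fixed
on master, maybe building from trunk is the realistic workaround (untested on
my side):

# hypothetical workaround: build current trunk, where --sort should only
# need a K-sized auxiliary heap
svn co http://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
cd mahout-trunk && mvn -DskipTests clean install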

Still waiting for some advice from you all.
Regards,,,
Y.Mandai
2013/2/19 万代豊 <20525entradero@gmail.com>

>
> Well, the --sortVectors option for the vectordump utility, used to evaluate
> the result of CVB clustering, unfortunately brought me an OutOfMemory issue...
>
> Here is the case that seems to go well without the --sortVectors option.
> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
>     -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
>     --vectorSize 5 --printKey TRUE
> ...
> WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
> FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
> GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> INSPECTING:3.854370531928256E-
> ...
>
> Once I give --sortVectors TRUE as below, I run into an OutOfMemory
> exception.
> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
>     -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
>     --vectorSize 5 --printKey TRUE --sortVectors TRUE
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>  at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>  at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>  at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>  at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>  at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>  at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> I see that there are several parameters that control how much heap a Mahout
> job gets, some dependent on and some independent of each other across Hadoop
> and Mahout, such as MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc.
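> As far as I can map them (my own reading of the scripts, so please correct
> me):
>
> export MAHOUT_HEAPSIZE=4000    # read by bin/mahout (in MB)
> export HADOOP_HEAPSIZE=4000    # read by bin/hadoop via hadoop-env.sh (in MB)
> export HADOOP_OPTS="-Xmx4g"    # extra JVM flags for the hadoop client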
>
> Can anyone advise me on which configuration files, shell scripts, or XMLs I
> should use to grant some additional heap, and also on the proper way to
> monitor the actual heap usage here?
>
> I'm running mahout-distribution-0.7 on Hadoop 0.20.203.0 in a
> pseudo-distributed configuration on a VMware Player partition running
> 64-bit CentOS 6.3.
>
> Regards,,,
> Y.Mandai
> 2013/2/1 Jake Mannix <jake.mannix@gmail.com>
>
>> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com>
>> wrote:
>>
>> > Thanks, Jake, for your guidance.
>> > Good to know that I wasn't entirely wrong but was just not familiar enough
>> > with the vectordump usage.
>> > I'll try this out as soon as I can.
>> > Hope that --sort doesn't eat up too much heap.
>> >
>>
>> If you're using code on master, --sort should only be using an additional K
>> objects of memory (where K is the value you passed to --vectorSize), as
>> it's just using an auxiliary heap to grab the top K items of the vector.
>> It was a bug previously that it tried to instantiate a vector.size()
>> [which in some cases was Integer.MAX_INT] sized list somewhere.
>>
>>
>> >
>> > Regards,,,
>> > Yutaka
>> >
>> > Sent from my iPhone
>> >
>> > On 2013/01/31, at 23:33, Jake Mannix <jake.mannix@gmail.com> wrote:
>> >
>> > > Hi Yutaka,
>> > >
>> > >
>> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entradero@gmail.com>
>> wrote:
>> > >
>> > >> Hi
>> > >> Here is a question about how to evaluate the result of Mahout 0.7 CVB
>> > >> (Collapsed Variational Bayes), which used to be LDA
>> > >> (Latent Dirichlet Allocation) in Mahout versions up to 0.5.
>> > >> I believe I have no problem running CVB itself, and this is purely a
>> > >> question about an efficient way to visualize or evaluate the result.
>> > >>
>> > >> Looks like result evaluation in Mahout 0.5, at least, could be done using
>> > >> the utility called "LDAPrintTopics"; however, this has been obsolete
>> > >> since Mahout 0.5. (See "Mahout in Action" p.181 on LDA.)
>> > >>
>> > >> I'm using, as said, Mahout 0.7. I believe I'm running CVB
>> > >> successfully and obtained results in two separate directories:
>> > >> /user/hadoop/temp/topicModelState/model-1 through model-20, as specified
>> > >> by the number of iterations, and
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009, as
>> > >> specified by the number of topics that I wanted to extract/decompose.
>> > >>
>> > >> Both of the files contained in these directories can be dumped using
>> > >> Mahout vectordump; however, the output format is way different
>> > >> from what you would have gotten using LDAPrintTopics below 0.5, which
>> > >> gave you back the result as the topic ID and its
>> > >> associated top terms in a very direct format. (See "Mahout in Action"
>> > >> p.181 again.)
>> > >>
>> > >
>> > > Vectordump should be exactly what you want, actually.
>> > >
>> > >
>> > >>
>> > >> Here is what I've done, as below.
>> > >> 1. Say I have already generated document vectors, and I use tf-vectors
>> > >> to generate a document/term matrix:
>> > >>
>> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
>> > >> NHTSA-matrix03
>> > >>
>> > >> 2. Get rid of the matrix docIndex, as it would get in my way (as has
>> > >> been advised somewhere…):
>> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
>> > >> NHTSA-matrix03-docIndex
>> > >>
>> > >> 3. Confirm that I have only what I need here:
>> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
>> > >> Found 1 items
>> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
>> > >> /user/hadoop/NHTSA-matrix03/matrix
>> > >>
>> > >> 4. And kick off CVB:
>> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
>> -dict
>> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
>> > >> …
>> > >> ….
>> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
>> > >> (Minutes: 733.1281333333334)
>> > >> (It took over 12 hours to process 100k documents on my laptop with
>> > >> pseudo-distributed Hadoop 0.20.203.)
>> > >>
>> > >> 5. Take a look at what I've got.
>> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
>> > >> Found 12 items
>> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
>> > >> [hadoop@localhost NHTSA]$
>> > >>
>> > >
>> > > Ok, these should be your model files, and to view them, you
>> > > can do it the way you can view any
>> > > SequenceFile<IntWritable, VectorWritable>, like this:
>> > >
>> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
>> > >   -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt \
>> > >   --dictionaryType sequencefile --vectorSize 5 --sort
>> > >
>> > > This will dump the top 5 terms (with weights; not sure if they'll be
>> > > normalized properly) from each topic to the output file "topic_dump.txt".
>> > >
>> > > Incidentally, this same command can be run on the topicModelState
>> > > directories as well, which lets you see how fast your topic model was
>> > > converging (and thus shows you, on a smaller data set, how many
>> > > iterations you may want to run with later on).
>> > >
>> > >
>> > >>
>> > >> and
>> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
>> > >> Found 20 items
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
>> > >> /user/hadoop/temp/topicModelState/model-1
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
>> > >> /user/hadoop/temp/topicModelState/model-10
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
>> > >> /user/hadoop/temp/topicModelState/model-11
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
>> > >> /user/hadoop/temp/topicModelState/model-12
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
>> > >> /user/hadoop/temp/topicModelState/model-13
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
>> > >> /user/hadoop/temp/topicModelState/model-14
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
>> > >> /user/hadoop/temp/topicModelState/model-15
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
>> > >> /user/hadoop/temp/topicModelState/model-16
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
>> > >> /user/hadoop/temp/topicModelState/model-17
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
>> > >> /user/hadoop/temp/topicModelState/model-18
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
>> > >> /user/hadoop/temp/topicModelState/model-19
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
>> > >> /user/hadoop/temp/topicModelState/model-2
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
>> > >> /user/hadoop/temp/topicModelState/model-20
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
>> > >> /user/hadoop/temp/topicModelState/model-3
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
>> > >> /user/hadoop/temp/topicModelState/model-4
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
>> > >> /user/hadoop/temp/topicModelState/model-5
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
>> > >> /user/hadoop/temp/topicModelState/model-6
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
>> > >> /user/hadoop/temp/topicModelState/model-7
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
>> > >> /user/hadoop/temp/topicModelState/model-8
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
>> > >> /user/hadoop/temp/topicModelState/model-9
>> > >>
>> > >> Hope someone can help me out with this.
>> > >> Regards,,,
>> > >> Yutaka
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > >  -jake
>> >
>>
>>
>>
>> --
>>
>>   -jake
>>
>
>
