mahout-user mailing list archives

From Yutaka Mandai <20525entrad...@gmail.com>
Subject Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Date Sat, 23 Feb 2013 01:53:35 GMT
Jake,
Now this is very clear, and I will work on building from the latest source.
Thank you.
Regards,,,
Y.Mandai


Sent from my iPhone

On 2013/02/23, at 3:14, Jake Mannix <jake.mannix@gmail.com> wrote:

> On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 <20525entradero@gmail.com> wrote:
> 
>> Thanks, Jake, for your attention to this.
>> I believe I have the trunk code from the official download site.
>> Well, my Mahout version is 0.7, and I downloaded it from a local mirror
>> site, http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/ ; I confirmed that the
>> timestamp on the mirror site is 12-Jun-2012 and that the timestamps of my
>> installed files are all identical to it.
>> Note that I'm using the precompiled jar files only and have not built from
>> source code locally on my machine.
>> I believe this should not have any negative effect.
>> 
>> Mahout 0.7 is the first and only version I have experience with. I have
>> never tried older ones, nor the newer 0.8 snapshot...
>> 
>> Can you think of any other possible workaround?
>> 
> 
> You should try to build from trunk source; this bug is fixed in trunk, and
> that's the correct workaround.  That, or wait for our next officially
> released version (0.8).
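>
> For reference, a rough sketch of that trunk build (assuming the standard
> Apache SVN trunk location of that era and a stock Maven install; the paths
> here are illustrative):
>
>   svn co http://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
>   cd mahout-trunk
>   mvn clean install -DskipTests
>
> The job jar produced under examples/target should then stand in for the
> 0.7 release job jar.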
> 
> 
>> 
>> Also, am I doing OK with the heap sizes I'm giving both Hadoop and Mahout
>> for this case?
>> I could confirm the heap assignment for the Hadoop jobs, since they are
>> resident processes, while the Mahout RunJar process dies immediately,
>> before the VisualVM utility can recognize it, so I'm not confident whether
>> RunJar really got as much as it wanted or not...
>> 
> 
> Heap is not going to help you here; you're dealing with a bug.  The correct
> code doesn't really need very much memory at all (less than 100MB to do
> the job you're talking about).
> 
> 
>> 
>> Regards,,,
>> Y.Mandai
>> 
>> 
>> 
>> 2013/2/22 Jake Mannix <jake.mannix@gmail.com>
>> 
>>> This looks like you've got an old version of Mahout - are you running on
>>> trunk?  This has been fixed on trunk; there was a bug in the 0.6 (roughly)
>>> timeframe in which vectors for vectordump --sort were assumed, incorrectly,
>>> to be of size MAX_INT, which led to heap problems no matter how much heap
>>> you gave it.  Well, maybe you could have worked around it with 2^32 *
>>> (4 + 8) bytes ~ 48GB, but really the solution is to upgrade to run off of
>>> trunk.
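>>>
>>> (Worked out: 2^32 entries * (4 + 8) bytes per entry = 51,539,607,552
>>> bytes, i.e. exactly 48 GiB, which is where the estimate above comes
>>> from.)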
>>> 
>>> 
>>> On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20525entradero@gmail.com> wrote:
>>> 
>>>> My trial is below; however, it still doesn't get through...
>>>>
>>>> I increased MAHOUT_HEAPSIZE as below, and also removed the comment marks
>>>> from the mahout shell script so that I could check it was actually taking
>>>> effect.
>>>> Added JAVA_HEAP_MAX=-Xmx4g (default was 3g)
>>>> 
>>>> ~bin/mahout~
>>>> JAVA=$JAVA_HOME/bin/java
>>>> JAVA_HEAP_MAX=-Xmx4g      * <- Increased from the original 3g to 4g*
>>>> # check envvars which might override default args
>>>> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
>>>>  echo "run with heapsize $MAHOUT_HEAPSIZE"
>>>>  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
>>>>  echo $JAVA_HEAP_MAX
>>>> fi
>>>> 
>>>> Also set the same 4 GB heap size in hadoop-env.sh:
>>>> 
>>>> ~hadoop-env.sh~
>>>> # The maximum amount of heap to use, in MB. Default is 1000.
>>>> export HADOOP_HEAPSIZE=4000
>>>> 
>>>> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
>>>> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
>>>> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
>>>> --vectorSize 5 --printKey TRUE --sortVectors TRUE
>>>> run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
>>>> -Xmx4000m                       *<- Right?*
>>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
>>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
>>>> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
>>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
>>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
>>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
>>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
>>>> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>   at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>>>>   at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>>>>   at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>>>>   at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>>>>   at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>>>>   at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>> [hadoop@localhost NHTSA]$
>>>> I've also confirmed through the VisualVM utility that at least all the
>>>> Hadoop tasks are taking 4GB of heap.
>>>> 
>>>> I have run clusterdump to extract the top 10 terms from the result of
>>>> k-means, as below, using exactly the same input data sets; this task,
>>>> however, requires no extra heap beyond the default.
>>>>
>>>> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
>>>> NHTSA-vectors01/dictionary.file-* -i
>>>> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
>>>> -b 30 -n 10
>>>> 
>>>> I believe the vectordump and clusterdump utilities derive from different
>>>> roots in terms of their heap requirements.
>>>>
>>>> Still waiting for some advice from you people.
>>>> Regards,,,
>>>> Y.Mandai
>>>> 2013/2/19 万代豊 <20525entradero@gmail.com>
>>>> 
>>>>> 
>>>>> Well, the --sortVectors option for the vectordump utility, used to
>>>>> evaluate the result of CVB clustering, unfortunately brought me an
>>>>> OutOfMemory issue...
>>>>>
>>>>> Here is the case that seems to go fine without the --sortVectors option:
>>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
>>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
>>>>> --printKey TRUE
>>>>> ...
>>>>> WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
>>>>> FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
>>>>> GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
>>>>> HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
>>>>> I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
>>>>> IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
>>>>> IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
>>>>> IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
>>>>> INSPECTING:3.854370531928256E-
>>>>> ...
>>>>> 
>>>>> Once you give --sortVectors TRUE, as below, I ran into an OutOfMemory
>>>>> exception.
>>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
>>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
>>>>> --printKey TRUE *--sortVectors TRUE*
>>>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
>>>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
>>>>> 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
>>>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
>>>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
>>>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
>>>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
>>>>> 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
>>>>> *Exception in thread "main" java.lang.OutOfMemoryError: Java heap space*
>>>>>   at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>>>>>   at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>>>>>   at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>>>>>   at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>>>>>   at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>>>>>   at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>> I see that there are several parameters that are sensitive to giving
>>>>> heap to a Mahout job, either dependently or independently across Hadoop
>>>>> and Mahout, such as MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc.
>>>>>
>>>>> Can anyone advise me which configuration files, shell scripts, or XMLs
>>>>> I should give some additional heap to, and also the proper way to
>>>>> monitor the actual heap usage here?
>>>>> 
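>>>>> (One way to check what a short-lived JVM actually received; this is a
>>>>> sketch using standard JDK tools, not something from this thread. While
>>>>> the job runs, list the JVMs together with the arguments they were
>>>>> launched with, -Xmx included:
>>>>>
>>>>>   jps -lvm
>>>>>
>>>>> Alternatively, Runtime.getRuntime().maxMemory() inside the process
>>>>> reports the effective heap ceiling in bytes, even for processes too
>>>>> short-lived for VisualVM to attach to.)
>>>>>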
>>>>> I'm running mahout-distribution-0.7 on Hadoop 0.20.203.0 with a
>>>>> pseudo-distributed configuration, on a VMware Player partition running
>>>>> CentOS 6.3 64-bit.
>>>>> 
>>>>> Regards,,,
>>>>> Y.Mandai
>>>>> 2013/2/1 Jake Mannix <jake.mannix@gmail.com>
>>>>> 
>>>>>> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Thanks, Jake, for your guidance.
>>>>>>> Good to know that I wasn't entirely wrong, but was just not familiar
>>>>>>> enough with the vectordump usage.
>>>>>>> I'll try this out as soon as I can.
>>>>>>> Hope that --sort doesn't eat up too much heap.
>>>>>>> 
>>>>>> 
>>>>>> If you're using code on master, --sort should only be using an
>>>>>> additional K objects of memory (where K is the value you passed to
>>>>>> --vectorSize), as it's just using an auxiliary heap to grab the top k
>>>>>> items of the vector.  It was a bug previously that it tried to
>>>>>> instantiate a vector.size()-sized list [which in some cases was
>>>>>> Integer.MAX_INT] somewhere.
>>>>>> 
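>>>>>> A minimal sketch of that bounded approach (illustrative only, not the
>>>>>> actual Mahout source): keep a fixed-size min-heap of k entries, so
>>>>>> memory stays O(k) rather than O(vector.size()):
>>>>>>
>>>>>>   import java.util.PriorityQueue;
>>>>>>
>>>>>>   class TopK {
>>>>>>     // Returns the k largest values seen; memory use is bounded by k.
>>>>>>     static PriorityQueue<Double> topK(Iterable<Double> values, int k) {
>>>>>>       PriorityQueue<Double> pq = new PriorityQueue<Double>(k); // min-heap
>>>>>>       for (double v : values) {
>>>>>>         if (pq.size() < k) {
>>>>>>           pq.offer(v);           // heap not yet full: just add
>>>>>>         } else if (v > pq.peek()) {
>>>>>>           pq.poll();             // evict the current minimum
>>>>>>           pq.offer(v);
>>>>>>         }
>>>>>>       }
>>>>>>       return pq;  // the top k values; peek() is the smallest of them
>>>>>>     }
>>>>>>   }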
>>>>>> 
>>>>>>> 
>>>>>>> Regards,,,
>>>>>>> Yutaka
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>> 
>>>>>>>> On 2013/01/31, at 23:33, Jake Mannix <jake.mannix@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Yutaka,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entradero@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi
>>>>>>>>> Here is a question around how to evaluate the result of Mahout 0.7
>>>>>>>>> CVB (Collapsed Variational Bayes), which used to be LDA
>>>>>>>>> (Latent Dirichlet Allocation) in Mahout versions up to 0.5.
>>>>>>>>> I believe I have no problem running CVB itself, and this is purely
>>>>>>>>> a question on the efficient way to visualize or evaluate the result.
>>>>>>>>>
>>>>>>>>> It looks like result evaluation in Mahout 0.5, at least, could be
>>>>>>>>> done using the utility called "LDAPrintTopic"; however, this has
>>>>>>>>> been obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on
>>>>>>>>> LDA.)
>>>>>>>>>
>>>>>>>>> As said, I'm using Mahout 0.7. I believe I'm running CVB
>>>>>>>>> successfully and obtained results in two separate directories:
>>>>>>>>> /user/hadoop/temp/topicModelState/model-1 through model-20, as
>>>>>>>>> specified by the number of iterations, and also
>>>>>>>>> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009, as
>>>>>>>>> specified by the number of topics that I wanted to
>>>>>>>>> extract/decompose.
>>>>>>>>>
>>>>>>>>> Either of these can be dumped using Mahout vectordump; however, the
>>>>>>>>> output format is way different from what you would have gotten
>>>>>>>>> using LDAPrintTopic below 0.5, which gave you back the result as
>>>>>>>>> the topic id and its associated top terms in a very direct format.
>>>>>>>>> (See "Mahout in Action" p.181 again.)
>>>>>>>> 
>>>>>>>> Vectordump should be exactly what you want, actually.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Here is what I've done, as below.
>>>>>>>>> 1. Say I have already generated document vectors; I use tf-vectors
>>>>>>>>> to generate a document/term matrix:
>>>>>>>>>
>>>>>>>>> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
>>>>>>>>> NHTSA-matrix03
>>>>>>>>>
>>>>>>>>> 2. and get rid of the matrix docIndex, as it would get in my way
>>>>>>>>> (as has been advised somewhere…):
>>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
>>>>>>>>> NHTSA-matrix03-docIndex
>>>>>>>>>
>>>>>>>>> 3. confirmed that I have only what I need here:
>>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
>>>>>>>>> Found 1 items
>>>>>>>>> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
>>>>>>>>> /user/hadoop/NHTSA-matrix03/matrix
>>>>>>>>>
>>>>>>>>> 4. and kicked off CVB:
>>>>>>>>> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
>>>>>>>>> -dict NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
>>>>>>>>> …
>>>>>>>>> ….
>>>>>>>>> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
>>>>>>>>> (Minutes: 733.1281333333334)
>>>>>>>>> (It took over 12 hrs to process 100k documents on my laptop with
>>>>>>>>> pseudo-distributed Hadoop 0.20.203.)
>>>>>>>>>
>>>>>>>>> 5. Take a look at what I've got:
>>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
>>>>>>>>> Found 12 items