mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Latent Semantic Analysis
Date Thu, 05 Apr 2012 18:51:55 GMT
Also, you are printing your input path -- what does it actually look
like? The path it complains about, SSVDOutput/data, should in fact be
the input path. That's what's perplexing.

We are talking about the Hadoop job setup process here, nothing
specific to the solution itself. Job setup/directory management is
failing for some reason.
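
One possible reading of the failing path, not confirmed anywhere in this thread: as far as I know, Hadoop's SequenceFileInputFormat treats any directory it finds under an input path as a MapFile and looks for a file named "data" inside it. The quoted code builds its output path as the row path plus "/SSVD-out", and the row path is also the input. The sketch below is a toy Python model of that listing rule, not Hadoop code. The function names, the `tree` structure and the directory layout are all made up for illustration.

```python
from pathlib import PurePosixPath

# Assumption: the name a MapFile directory uses for its data file.
DATA_FILE_NAME = "data"


def resolve_inputs(input_dir, tree):
    """Toy model of a SequenceFile-style input listing (not Hadoop code).

    `tree` maps a directory path to the names of its children; a child whose
    full path is also a key of `tree` is a directory. Entries starting with
    '_' or '.' are skipped, and a child directory is replaced by <dir>/data.
    """
    base = PurePosixPath(input_dir)
    resolved = []
    for name in sorted(tree[str(base)]):
        if name.startswith(("_", ".")):
            continue  # hidden entries are ignored
        child = base / name
        if str(child) in tree:
            # A directory is assumed to be a MapFile, so point at its data file.
            child = child / DATA_FILE_NAME
        resolved.append(str(child))
    return resolved


def is_nested(output_dir, input_dir):
    """True when output_dir is input_dir itself or lies underneath it."""
    out, inp = PurePosixPath(output_dir), PurePosixPath(input_dir)
    return out == inp or inp in out.parents


# Hypothetical layout: the solver output directory sits inside the input directory.
tree = {
    "/matrix/transpose": ["part-00000", "SSVD-out", "_logs"],
    "/matrix/transpose/SSVD-out": ["Q-job"],
    "/matrix/transpose/_logs": [],
}

print(resolve_inputs("/matrix/transpose", tree))
print(is_nested("/matrix/transpose/SSVD-out", "/matrix/transpose"))
```

In this model, a job that lists the input directory after an earlier job has created SSVD-out inside it would ask for SSVD-out/data, a file nobody wrote. If that is what happens here, an output directory that is a sibling of the input rather than a child would avoid it; `is_nested` is the kind of check one could run before submitting.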

On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Any chance you could test it with its current dependency, 0.20.204? Or
> would that be hard to stage?
>
> A newer Hadoop version is frankly the only cause I can think of here.
>
> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>> Hi Dmitriy,
>>
>> It is Clojure code from https://github.com/algoriffic/lsa4solr
>> which I modified to use the Mahout 0.6 distribution, running on
>> hadoop-0.20.205.0. Below is the Clojure code that I changed.
>> The lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>> modification b/c I'm not reading the eigenvalues/eigenvectors from the
>> solver correctly. Originally this code was based on Mahout 0.4. I'm
>> creating the matrix from Solr 3.1.0, very similar to what was done in
>> https://github.com/algoriffic/lsa4solr
>>
>> Thanks,
>>
>> (defn decompose-svd
>>   [mat k]
>>   ;(println "input path " (.getRowPath mat))
>>   ;(println "dd " (into-array [(.getRowPath mat)]))
>>   ;(println "numCol " (.numCols mat))
>>   ;(println "numrow " (.numRows mat))
>>   (let [eigenvalues (new java.util.ArrayList)
>>         eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>>         numCol (.numCols mat)
>>         config (.getConf mat)
>>         rawPath (.getRowPath mat)
>>         outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>>         inputPath (into-array [rawPath])
>>         ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>>         decomposer (doto (.run ssvdSolver))
>>         V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>>                                                (int-array [0 0])
>>                                                (int-array [(.numCols mat) k])))
>>         U (mmult mat V)
>>         S (diag (take k (reverse eigenvalues)))]
>>     {:U U
>>      :S S
>>      :V V}))
>>
>>
>>
>>
>>
>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>>> Yeah. i don't see how it may have arrived at that error.
>>>
>>>
>>> Peyman,
>>>
>>> I need to know more -- it looks like you are using the embedded API, not
>>> the command line, so I need to see how you initialize the solver and
>>> also which version of the Mahout libraries you are using (your stack
>>> trace line numbers do not correspond to anything reasonable on current
>>> trunk).
>>>
>>> thanks.
>>>
>>> -d
>>>
>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> wrote:
>>> > Hm. I never saw that and am not sure where this folder comes from. Which
>>> > Hadoop version are you using? This may be a result of incompatible
>>> > support for multiple outputs in the newer Hadoop versions. I tested
>>> > it with CDH3u0/u3 and it was fine. This folder should normally appear
>>> > in the conversation, i suspect it is an internal hadoop thing.
>>> >
>>> > This is without me actually looking at the code per stack trace.
>>> >
>>> >
>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>>> >> Hi Guys,
>>> >> I'm now using ssvd for my LSA code and get the following error. At the
>>> >> time of the error, all I have under the 'SSVD-out' folder
>>> >> (/lsa4solr/matrix/14099700861483/transpose-213/SSVD-out) is:
>>> >>   Q-job/QHat-m-00000
>>> >>   Q-job/R-m-00000
>>> >>   Q-job/_SUCCESS
>>> >>   Q-job/part-m-00000.deflate
>>> >>
>>> >> I'm not clear where the '/data' folder is supposed to be set. Is it part of
>>> >> the output of the QJob? I don't see any error in the QJob.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> SEVERE: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>> >>    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>> >>    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>> >>    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>> >>    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>> >>    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>> >>    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>> >>    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>> >>    at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>> >>    at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>> >>    at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>> >>    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>> >>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>> >>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>> >>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>> >>    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>> >>    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>> >>
>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >>
>>> >>> For the third time: in the context of LSA, a faster and hence perhaps
>>> >>> better alternative to Lanczos is SSVD. Is there any specific reason you
>>> >>> want to use the Lanczos solver in the context of LSA?
>>> >>>
>>> >>> -d
>>> >>>
>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>>> >>> > Hi Guys,
>>> >>> >
>>> >>> > Per your advice I upgraded to Mahout 0.6 and made a bunch of API
>>> >>> > changes. In the meantime I realized I had a bug with my input matrix:
>>> >>> > zero rows read from Solr b/c multiple fields in Solr were indexed and
>>> >>> > not just the one I was interested in. That issue is fixed and I have
>>> >>> > a matrix with these dimensions: (.numCols mat) 1000, (.numRows mat)
>>> >>> > 15932 (or the transpose).
>>> >>> > Unfortunately I'm getting the error below now. In the context of some
>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>>> >>> > causing this issue, but in this particular case the matrix is in
>>> >>> > memory!! I'm using this Google package: guava-r09.jar
>>> >>> >
>>> >>> > SEVERE: java.util.NoSuchElementException
>>> >>> >        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>> >>> >        at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>> >>> >        at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>> >>> >        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>> >>> >        at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>> >>> >
>>> >>> >
>>> >>> > Any suggestion?
>>> >>> > Thanks,
>>> >>> > Peyman
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >>> >> Peyman,
>>> >>> >>
>>> >>> >>
>>> >>> >> Yes, what Ted said. Please take the 0.6 release. Also try ssvd, it may
>>> >>> >> benefit you in some regards compared to Lanczos.
>>> >>> >>
>>> >>> >> -d
>>> >>> >>
>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>>> >>> >>> Hi Dmitriy & Others,
>>> >>> >>>
>>> >>> >>> Dmitriy, thanks for your previous response.
>>> >>> >>> I have a follow-up question about my LSA project. I have managed to
>>> >>> >>> upload 1,500 documents from two different newsgroups (one about
>>> >>> >>> graphics and one about atheism,
>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
>>> >>> >>> LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
>>> >>> >>> eigenvectors, as you see in the logs that follow).
>>> >>> >>> The only thing I'm doing differently from
>>> >>> >>> https://github.com/algoriffic/lsa4solr is that I'm not using the
>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>> >>> >>> assuming the issue is that the Summary field already removes the noise
>>> >>> >>> and makes the clustering work, and the raw index data does not do that.
>>> >>> >>> Am I correct, or are there other potential explanations? For the desired
>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters between
>>> >>> >>> 2-10 (different values for different trials), but always the same
>>> >>> >>> result comes out: no clusters found.
>>> >>> >>> If my issue is related to not having summarization done, how can that
>>> >>> >>> be done in Solr? I wasn't able to find a Summary field in Solr.
>>> >>> >>>
>>> >>> >>> Thanks
>>> >>> >>> Peyman
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: LanczosSolver finished.
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >>> >>>> In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse
>>> >>> >>>> and ssvd commands. The nuances are understanding the dictionary format
>>> >>> >>>> and LLR analysis of n-grams, and perhaps using a slightly better
>>> >>> >>>> lemmatizer than the default one.
>>> >>> >>>>
>>> >>> >>>> With the indexing part you are on your own at this point.
>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mohajeri@gmail.com> wrote:
>>> >>> >>>>
>>> >>> >>>>> Hi Guys,
>>> >>> >>>>>
>>> >>> >>>>> I'm interested in this work:
>>> >>> >>>>>
>>> >>> >>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>> >>> >>>>>
>>> >>> >>>>> I looked at some of the comments and noticed that there was interest
>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>> >>> >>>>> running this code due to dependencies on an older version of Mahout.
>>> >>> >>>>>
>>> >>> >>>>> I was wondering if LSA is now directly available in Mahout? Also, if I
>>> >>> >>>>> upgrade to the latest Mahout, would this Clojure code work?
>>> >>> >>>>>
>>> >>> >>>>> Thanks
>>> >>> >>>>> Peyman
>>> >>> >>>>>
>>> >>>
>>>
