mahout-user mailing list archives

From Peyman Mohajerian <mohaj...@gmail.com>
Subject Re: Latent Semantic Analysis
Date Thu, 05 Apr 2012 12:22:11 GMT
Hi Guys,
I'm now using ssvd for my LSA code and get the error below. At the time of
the error, all I have under the 'SSVD-out' folder
(/lsa4solr/matrix/14099700861483/transpose-213/SSVD-out) is Q-job/, which
contains:
  QHat-m-00000
  R-m-00000
  _SUCCESS
  part-m-00000.deflate

I'm not clear where the '/data' folder is supposed to be set. Is it part of
the output of the QJob? I don't see any error in the QJob.

Thanks,

SEVERE: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
    at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
    at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
    at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
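
For anyone hitting the same FileNotFoundException: the listing above is from
matrix/14099700861483/transpose-213, while the exception names
matrix/15835804941333/transpose-120, so it may be worth checking the exact
directory from the trace before the solver runs. Below is a minimal Python
sketch of such a check. It assumes the `hadoop` command is on the PATH and
relies on `hadoop fs -test -e`, which exits with status 0 when the path
exists. The helper names `hdfs_path_exists` and `missing_inputs` are made up
for illustration and are not part of Mahout or lsa4solr.

```python
import subprocess


def hdfs_path_exists(path, run=subprocess.call):
    """Return True when `hadoop fs -test -e <path>` exits with status 0.

    `run` takes an argv list and returns an exit status; it is injectable
    so the check can be exercised without a running cluster.
    """
    return run(["hadoop", "fs", "-test", "-e", path]) == 0


def missing_inputs(paths, exists=hdfs_path_exists):
    """Return the paths that do not exist, preserving the input order."""
    return [p for p in paths if not exists(p)]
```

Calling `missing_inputs([...])` with the path from the exception (and any
other inputs the job is given) returns the ones that are absent. It does not
explain why a 'data' directory is expected there; that presumably depends on
the input path handed to the solver.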

On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> For the third time: in the context of LSA, a faster and hence perhaps
> better alternative to Lanczos is ssvd. Is there any specific reason you
> want to use the Lanczos solver in the context of LSA?
>
> -d
>
> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com>
> wrote:
> > Hi Guys,
> >
> > Per your advice I upgraded to Mahout 0.6 and made a bunch of API
> > changes. In the meantime I realized I had a bug with my input matrix:
> > zero rows read from Solr because multiple fields in Solr were indexed
> > and not just the one I was interested in. That issue is fixed and I
> > have a matrix with these dimensions: (.numCols mat) 1000,
> > (.numRows mat) 15932 (or the transpose).
> > Unfortunately I'm now getting the error below. In the context of some
> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
> > causing this issue, but in this particular case the matrix is in
> > memory! I'm using this Google package: guava-r09.jar
> >
> > SEVERE: java.util.NoSuchElementException
> >        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
> >        at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
> >        at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
> >        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
> >        at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
> >
> >
> > Any suggestion?
> > Thanks,
> > Peyman
> >
> >
> >
> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> Peyman,
> >>
> >>
> >> Yes, what Ted said. Please take the 0.6 release. Also try ssvd; it may
> >> benefit you in some regards compared to Lanczos.
> >>
> >> -d
> >>
> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
> >>> Hi Dmitriy & Others,
> >>>
> >>> Dmitriy thanks for your previous response.
> >>> I have a follow-up question on my LSA project. I have managed to
> >>> upload 1,500 documents from two different newsgroups (one about
> >>> graphics and one about atheism,
> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
> >>> LanczosSolver in Mahout 0.4 does not find any nonzero eigenvalues (it
> >>> reports eigenvectors, as you can see in the logs that follow).
> >>> The only thing I'm doing differently from
> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
> >>> assuming the issue is that the Summary field already removes the noise
> >>> and makes the clustering work, and the raw index data does not. Am I
> >>> correct, or are there other potential explanations? For the desired
> >>> rank I'm using values between 10 and 100 and looking for between 2 and
> >>> 10 clusters (different values for different trials), but the result is
> >>> always the same: no clusters found.
> >>> If my issue is related to not having summarization done, how can that
> >>> be done in Solr? I wasn't able to find a Summary field in Solr.
> >>>
> >>> Thanks
> >>> Peyman
> >>>
> >>>
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
> >>> auxiliary matrix.
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: LanczosSolver finished.
> >>>
> >>>
> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >>>> In Mahout, an LSA pipeline is possible with the seqdirectory,
> >>>> seq2sparse and ssvd commands. The nuances are understanding the
> >>>> dictionary format and the LLR analysis of n-grams, and perhaps using
> >>>> a slightly better lemmatizer than the default one.
> >>>>
> >>>> With the indexing part you are on your own at this point.
> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mohajeri@gmail.com> wrote:
> >>>>
> >>>>> Hi Guys,
> >>>>>
> >>>>> I'm interested in this work:
> >>>>>
> >>>>>
> >>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >>>>>
> >>>>> I looked at some of the comments and noticed that there was interest
> >>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
> >>>>> running this code due to dependencies on an older version of Mahout.
> >>>>>
> >>>>> I was wondering if LSA is now directly available in Mahout? Also, if I
> >>>>> upgrade to the latest Mahout, would this Clojure code work?
> >>>>>
> >>>>> Thanks
> >>>>> Peyman
> >>>>>
>
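
The three-command LSA pipeline mentioned in the quoted thread (seqdirectory,
seq2sparse, ssvd) can be sketched as a small Python helper that only builds
the command lines and runs nothing. The option names are believed to match
the Mahout command-line help but should be verified with
`mahout <command> --help` for your version. The directory layout under
`work_dir`, the use of the `tfidf-vectors` subdirectory as ssvd input, and
the function name `lsa_pipeline_commands` are assumptions for illustration.

```python
def lsa_pipeline_commands(text_dir, work_dir, rank, max_ngram=None, min_llr=None):
    """Build argv lists for seqdirectory -> seq2sparse -> ssvd.

    text_dir: directory of raw text documents.
    work_dir: parent directory for intermediate and final output (assumed layout).
    rank: desired decomposition rank passed to ssvd.
    max_ngram, min_llr: optional n-gram settings for seq2sparse.
    """
    seq_dir = work_dir + "/seqfiles"
    vec_dir = work_dir + "/vectors"
    svd_dir = work_dir + "/ssvd"

    seq2sparse = ["mahout", "seq2sparse", "--input", seq_dir, "--output", vec_dir]
    if max_ngram is not None:
        seq2sparse += ["--maxNGramSize", str(max_ngram)]
    if min_llr is not None:
        seq2sparse += ["--minLLR", str(min_llr)]

    return [
        # 1. Convert a directory of text files into SequenceFiles.
        ["mahout", "seqdirectory", "--input", text_dir, "--output", seq_dir],
        # 2. Turn the SequenceFiles into sparse term vectors plus a dictionary.
        seq2sparse,
        # 3. Run stochastic SVD on the weighted vectors (assumed subdirectory).
        ["mahout", "ssvd", "--input", vec_dir + "/tfidf-vectors",
         "--output", svd_dir, "--rank", str(rank)],
    ]
```

Each returned list can be handed to `subprocess.check_call` in order once the
options have been checked against the installed Mahout version.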
