mahout-user mailing list archives

From Dmitriy Lyubimov
Subject Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()
Date Thu, 09 Jul 2015 22:20:32 GMT

0.10.x branch is for spark 1.2.x and master (0.11.0-snapshot) is for spark
My understanding is that 0.11.0 should mostly work, with the exception of the Spark shell,
which is disabled on HEAD. We are still working on a PR to re-enable it.

numNonZeroElementsPerRow is in RLikeDrmOps.

Operations is a Scala pattern (not sure of its name -- operation
decorator or something?)
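
For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of an implicit "Ops" decorator in Scala. It is not Mahout code: TinyMatrix, TinyMatrixOps and matrix2Ops are hypothetical names invented for illustration, and only the method name numNonZeroElementsPerRow is borrowed from the thread. The point is that the method lives on a wrapper class, and an implicit conversion lets it be called as if it were defined on the wrapped type.

```scala
import scala.language.implicitConversions

object OpsDecoratorSketch {
  // Hypothetical stand-in for a matrix type that has no such method of its own.
  final case class TinyMatrix(rows: Vector[Vector[Double]])

  // The "Ops" decorator: wraps a TinyMatrix and adds an operation to it.
  class TinyMatrixOps(m: TinyMatrix) {
    // Count the non-zero entries in each row.
    def numNonZeroElementsPerRow: Vector[Int] =
      m.rows.map(row => row.count(_ != 0.0))
  }

  // With this conversion in scope, the compiler rewrites
  //   m.numNonZeroElementsPerRow
  // into
  //   new TinyMatrixOps(m).numNonZeroElementsPerRow
  implicit def matrix2Ops(m: TinyMatrix): TinyMatrixOps = new TinyMatrixOps(m)

  val m = TinyMatrix(Vector(Vector(1.0, 0.0, 2.0), Vector(0.0, 0.0, 0.0)))

  // Resolved through the implicit conversion; TinyMatrix itself has no such member.
  val counts: Vector[Int] = m.numNonZeroElementsPerRow
}
```

Because the call is resolved only through an implicit conversion, an IDE that does not pick up the conversion can flag the member as unresolved even though the compiler accepts it. That may be what is happening with drmA.numNonZeroElementsPerRow, but it is a guess.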

On Thu, Jul 9, 2015 at 7:25 AM, Hegner, Travis wrote:

> Hello list,
> I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS()
> job to run. First some info on my environment:
> I'm running Hadoop on Cloudera 5.4.2 with their built-in Spark-on-YARN
> setup. It's pretty much an OOTB setup, but it has been upgraded many times
> since probably CDH4.8 or so. It's running Spark 1.3.0 (perhaps with some 1.3.1
> commits merged in, from what I've read about Cloudera's versioning). I have
> my own fork of mahout which is currently just a mirror of ''.
> I'm very comfortable making changes, compiling, and using my version of the
> library should your suggestions lead me in that direction. I am still
> pretty new to scala, so I have a hard time wrapping my head around what
> some of the syntactic sugars actually do, but I'm getting there.
> I'm successfully getting my data transformed to an RDD that essentially
> looks like (<document_id>, <tag>), creating an IndexedDataSet with that,
> and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able
> to narrow the issue down to a specific case:
> Let's say I have the following records (among others) in my RDD:
> ...
> (doc1, tag1)
> (doc2, tag1)
> ...
> doc1 and doc2 have no other tags, but tag1 may exist on many other
> documents. The rest of my dataset has many other doc/tag combinations, but
> I've narrowed down the issue to seemingly only occur in this case. I've
> been able to trace down that the java.lang.IllegalArgumentException is
> occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and
> "numInteractionsWithAandB = 1") when calling
> LogLikelihood.logLikelihoodRatio() from
> SimilarityAnalysis.logLikelihoodRatio().
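
To make the reported failure concrete, here is a minimal sketch of how the four contingency counts could be derived from the interaction totals. It assumes the usual 2x2 layout (k11 = both, k12 = A only, k21 = B only, k22 = neither). The object and helper names are hypothetical and this is not the Mahout source. The values for numInteractionsWithB and numInteractionsWithAandB are the ones quoted above; the other two are made up for illustration.

```scala
object LlrCountsSketch {
  // The four cells of the 2x2 contingency table passed to a log-likelihood ratio test.
  final case class Counts(k11: Long, k12: Long, k21: Long, k22: Long) {
    // A log-likelihood ratio is only defined for non-negative counts.
    def allNonNegative: Boolean = k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0
  }

  // Assumed derivation of the cells from the interaction totals.
  def counts(numInteractionsWithA: Long,
             numInteractionsWithB: Long,
             numInteractionsWithAandB: Long,
             numInteractions: Long): Counts =
    Counts(
      k11 = numInteractionsWithAandB,
      k12 = numInteractionsWithA - numInteractionsWithAandB,
      k21 = numInteractionsWithB - numInteractionsWithAandB,
      k22 = numInteractions - numInteractionsWithA - numInteractionsWithB +
        numInteractionsWithAandB)

  // B = 0 and AandB = 1 come from the message; A = 1 and the total of 10 are invented.
  val reported: Counts = counts(
    numInteractionsWithA = 1,
    numInteractionsWithB = 0,
    numInteractionsWithAandB = 1,
    numInteractions = 10)
  // reported.k21 == 0 - 1 == -1, so reported.allNonNegative is false
}
```

Under this assumption, a count of interactions with B that is smaller than the count of interactions with both A and B always makes k21 negative. That points at the per-row totals being wrong or missing, rather than at the ratio computation itself.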
> Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity() on the
> line (163 in my branch):
> val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)
> My IDE (IntelliJ) complains that it cannot resolve
> "drmA.numNonZeroElementsPerRow", however the library compiles successfully.
> Tracing the codepath shows that if that value is not being correctly
> populated, it would have a direct impact on the values used in
> logLikelihoodRatio(). That said, it seems to only fail in this very
> particular case.
> I should note that I can run SimilarityAnalysis.cooccurrencesIDSs()
> successfully with a single list of (<user_id>, <item_id>) pairs of my own
> data.
> I have 3 questions given this scenario:
> First, am I using the proper branch of code for attempting to run on a
> spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this
> was the only branch I could find for it.
> Second, is anyone able to shed some light on the above error? Is drmA not
> a correct type, or does that method no longer apply to that type?
> Third, what would be the mathematical implications if I run
> SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>)
> pairs. Would the results be sound, or does that make absolutely no sense?
> Would it be beneficial even as only a troubleshooting step?
> Thanks in advance for any help you may be able to provide!
> Travis Hegner
