mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()
Date Fri, 10 Jul 2015 18:40:01 GMT
The IndexedDataset creates two BiDictionaries (bidirectional dictionaries) of Int <->
String, so as long as an element id can be a String, it has no other restrictions.
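
As a rough illustration of why id length or content does not matter, here is a minimal Java sketch of a two-way Int <-> String mapping. The class name `IdDictionarySketch` and its methods are hypothetical; this is not Mahout's BiDictionary, only the general idea of assigning each distinct string the next free integer index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a bidirectional Int <-> String dictionary.
// Not Mahout's BiDictionary; it only illustrates the mapping idea.
final class IdDictionarySketch {
    private final Map<String, Integer> toIndex = new HashMap<>();
    private final List<String> toId = new ArrayList<>();

    // Returns the existing index for an id, or assigns the next free one.
    int indexOf(String id) {
        if (id == null) {
            throw new IllegalArgumentException("null id");
        }
        Integer existing = toIndex.get(id);
        if (existing != null) {
            return existing;
        }
        int next = toId.size();
        toIndex.put(id, next);
        toId.add(id);
        return next;
    }

    // Reverse lookup: index back to the original string id.
    String idOf(int index) {
        return toId.get(index);
    }

    int size() {
        return toId.size();
    }
}
```

Any non-null string works as a key in this sketch, whatever its length or characters; rejecting null here is a choice made for the sketch, not a statement about Mahout's behavior.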

It may indeed be a bug; I’ll look at it ASAP. Since it passes the Scala tests, any data you
can spare might help, but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <THegner@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the
data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an
IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly
when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>,
<doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an
md5 string as a precaution since it shouldn't change the final result. I will try and scan
the data for any nulls or other oddities. If I can't find anything obvious, then I'll try
to pare it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must
be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep
in if there is invalid data, like a null doc id or tag. Very little data validation is done,
which is something I need to address. I’ll try it on some sample data I have.

BTW, you understand that rowSimilarity input is a doc-id followed by a list of tags, where by
default a tab separates the doc-id from the list and a space separates items in the list.
Separators can be changed in the code but not the CLI.
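
To make that default layout concrete, here is a small Java sketch that splits one text line into a doc-id and its tags. `RowLineSketch` and `parse` are hypothetical names for illustration only; this is not Mahout's reader code, and it assumes exactly the defaults described above (tab between doc-id and list, spaces between tags).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for one line of the default text layout:
// <doc-id> TAB <tag> SPACE <tag> ... Not part of Mahout.
final class RowLineSketch {
    final String docId;
    final List<String> tags;

    private RowLineSketch(String docId, List<String> tags) {
        this.docId = docId;
        this.tags = tags;
    }

    static RowLineSketch parse(String line) {
        // Split on the first tab only; everything after it is the tag list.
        String[] parts = line.split("\t", 2);
        if (parts.length != 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("expected <doc-id>\\t<tags>: " + line);
        }
        String list = parts[1].trim();
        if (list.isEmpty()) {
            throw new IllegalArgumentException("no tags for doc-id: " + parts[0]);
        }
        // One or more spaces separate the tags.
        return new RowLineSketch(parts[0], Arrays.asList(list.split(" +")));
    }
}
```

For example, `RowLineSketch.parse("doc1\ttag1 tag2")` yields the doc-id `doc1` with the two tags `tag1` and `tag2`, while a line with no tab is rejected.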


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <THegner@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the
stack trace at the end of the message. As I mentioned in my original message, I've narrowed
it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition
I described, as I set up a test case with a small amount of data exhibiting the same condition
described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would
think that the latter would always have to be less than or equal to the former.
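
For reference, the arithmetic behind the failing check can be sketched in plain Java. The cell formulas below are the usual 2x2 contingency-table layout for a log-likelihood ratio test; they are an assumption inferred from the values reported in this thread (k21 < 0 when numInteractionsWithB = 0 and numInteractionsWithAandB = 1), not code copied from the Mahout source. `ContingencySketch` is a hypothetical name.

```java
// Sketch of how four event counts map onto a 2x2 contingency table.
// Assumed layout (not copied from Mahout):
//   k11 = A and B, k12 = A without B, k21 = B without A, k22 = neither.
final class ContingencySketch {
    // Returns {k11, k12, k21, k22}.
    static long[] cells(long withA, long withB, long withAandB, long total) {
        long k11 = withAandB;
        long k12 = withA - withAandB;
        long k21 = withB - withAandB;
        long k22 = total - withA - withB + withAandB;
        return new long[] {k11, k12, k21, k22};
    }

    // Mirrors the kind of precondition that rejects negative cells.
    static boolean allNonNegative(long[] cells) {
        for (long c : cells) {
            if (c < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Under this layout, numInteractionsWithB = 0 with numInteractionsWithAandB = 1 gives k21 = 0 - 1 = -1, so the non-negativity check fails. That is only possible when the per-row count and the co-occurrence count disagree with each other.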

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me
know if you still have the problem and include the stack trace. I’ve been using cooccurrence,
which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs()
> with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does
> that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are
talking about is A’, so you would be doing llr((A’)’(A’)) = llr(AA’), which should produce the
same results, but let’s get it working first. I’ll look at it either tomorrow or this weekend.
If you get a stack trace using the above branch, let me know.
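
The transpose identity can be checked on a toy matrix with a few lines of plain Java. This sketch uses small dense int arrays and hypothetical helper names (`TransposeSketch`, `transpose`, `multiply`); it only compares the raw co-occurrence counts that the LLR would be computed from, and it does not use Mahout's DRM types.

```java
// Toy check that feeding A' into an "A'A"-style computation gives AA'.
// Dense int arrays stand in for the real distributed matrices.
final class TransposeSketch {
    static int[][] transpose(int[][] m) {
        int[][] t = new int[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    static int[][] multiply(int[][] x, int[][] y) {
        int[][] out = new int[x.length][y[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < y[0].length; j++) {
                for (int k = 0; k < y.length; k++) {
                    out[i][j] += x[i][k] * y[k][j];
                }
            }
        }
        return out;
    }

    // What a cooccurrence-style job computes before LLR: input' * input.
    static int[][] cooccurrenceCounts(int[][] input) {
        return multiply(transpose(input), input);
    }

    // What a rowSimilarity-style job computes before LLR: input * input'.
    static int[][] rowSimilarityCounts(int[][] input) {
        return multiply(input, transpose(input));
    }
}
```

With A as a 2-document by 3-tag matrix, `cooccurrenceCounts(transpose(A))` and `rowSimilarityCounts(A)` return the same 2x2 matrix of document-to-document counts.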

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration
function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <THegner@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First
some info on my environment:

I'm running Hadoop on Cloudera 5.4.2 with their built-in Spark-on-YARN setup. It's pretty
much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's
running Spark 1.3.0 (perhaps with some 1.3.1 commits merged in, from what I've read about Cloudera's
versioning). I have my own fork of Mahout, which is currently just a mirror of 'github.com:pferrel/spark-1.3'.
I'm very comfortable making changes, compiling, and using my version of the library should
your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard
time wrapping my head around what some of the syntactic sugars actually do, but I'm getting
there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>,
<tag>), creating an IndexedDataset with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS().
I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of
my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly
only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException
is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB
= 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity() on the line (163 in my
branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however
the library compiles successfully. Tracing the codepath shows that if that value is not being
correctly populated, it would have a direct impact on the values used in logLikelihoodRatio().
That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single
list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster?
I've read about a "joint effort" for spark 1.3, and this was the only branch I could find
for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type,
or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs()
with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does
that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the
use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited
and may be unlawful. If you have received this communication in error, you should know that
you are bound to confidentiality, and should please immediately notify the sender.
