mahout-user mailing list archives

From: Michael Kelly <mich...@onespot.com>
Subject: Re: spark-itemsimilarity IndexException - outside allowable range
Date: Fri, 03 Apr 2015 18:36:52 GMT
Yes, I updated recently. When running on a cluster I checked out the
latest master of Mahout, and locally I've probably updated in the last
week or so.

On Fri, Apr 3, 2015 at 7:04 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> OK, got it to reproduce. This is not what I expected. It’s too many columns in a
> vector, hmm. Found the other user’s issue, which was null input, not a bug.
>
> BTW, when did you update Mahout? The ability to point to dirs was just put in, so I
> assume recently?
>
>
> On Apr 3, 2015, at 9:08 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>
> Yeah, that’s exactly what the other user is doing. This should be a common
> architecture in the future. I’m already looking at the other one, so I’ll add this
> too. Thanks a bunch for the data.
>
>
> On Apr 3, 2015, at 8:58 AM, Michael Kelly <michael@onespot.com> wrote:
>
> Yes, we are using a Spark Streaming job to create the input, and I
> wasn't repartitioning it, so there were a lot of parts. I'm testing it
> out now with repartitioning to see if that works.
> This is just a single interaction type.
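>
> For reference, this is roughly what I'm trying (a minimal sketch; the RDD name,
> partition count and output path are placeholders, not our real job):
>
>   // Coalesce the (userId,itemId) pairs into fewer partitions before writing,
>   // so the item-similarity input ends up in a handful of part files.
>   val lines = pairs.map { case (user, item) => s"$user,$item" }
>   lines.coalesce(16).saveAsTextFile("hdfs:///tmp/userItems")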
>
> Thanks again,
>
> Michael
>
> On Fri, Apr 3, 2015 at 4:52 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>> This sounds like a bug. Thanks for the sample input and narrowing it down. I’ll
>> look at it today.
>>
>> I got a similar question from another user with a lot of part files. A Spark
>> Streaming job creates the part files. Is that what you are doing?
>>
>> Is this a single interaction type?
>>
>>
>> On Apr 3, 2015, at 6:23 AM, Michael Kelly <michael@onespot.com> wrote:
>>
>> Hi Pat,
>>
>> I've done some further digging and it looks like the problem is
>> occurring when the input files are split up into parts. The input
>> to the item-similarity job is the output from a Spark job, and it
>> ends up in about 2000 parts (on the Hadoop file system). I have
>> reproduced the error locally using a small subset of the rows.
>>
>> This is a snippet of the file I am using -
>>
>> ...
>>
>> 5138353282348067470,1891081885
>> 4417954190713934181,1828065687
>> 133682221673920382,1454844406
>> 133682221673920382,1129053737
>> 133682221673920382,548627241
>> 133682221673920382,1048452021
>> 8547417492653230933,1121310481
>> 7693904559640861382,1333374361
>> 7204049418352603234,606209305
>> 139299176617553863,467181330
>> ...
>>
>>
>> When I run the item-similarity job against a single input file which
>> contains all the rows, the job succeeds without error.
>>
>> When I break up the input file into 100 parts and use the directory
>> containing them as input, I get the 'Index outside allowable
>> range' exception.
>>
>> Here are the input files that I used, tarred and gzipped -
>>
>> https://s3.amazonaws.com/static.onespot.com/mahout/passing_single_file.tar.gz
>> https://s3.amazonaws.com/static.onespot.com/mahout/failing_split_into_100_parts.tar.gz
>>
>> There are 44067 rows in total, 11858 unique userIds and 24166 unique itemIds.
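>>
>> For what it's worth, this is how I counted them (a rough sketch against the
>> sample data; the path is a placeholder). The upper bound of the exception,
>> [0,24166), matches the distinct itemId count, which suggests the out-of-range
>> index refers to a column beyond the item dictionary:
>>
>>   val pairs = sc.textFile("failing_split_into_100_parts")
>>     .map(_.split(","))
>>     .collect { case Array(user, item) => (user, item) }
>>   println(pairs.map(_._1).distinct().count())  // unique userIds -> 11858
>>   println(pairs.map(_._2).distinct().count())  // unique itemIds -> 24166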
>>
>> This is the exception that I see on the 100 part run -
>> 15/04/03 12:07:09 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 707)
>> org.apache.mahout.math.IndexException: Index 24190 is outside
>> allowable range of [0,24166)
>> at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
>> at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
>> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
>> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
>> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
>> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>> at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
>> at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
>> at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
>> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I tried splitting the file up into 10, 20 and 50 parts and the job completed.
>> Also, should the resulting similarity matrix be the same whether the
>> input is split up or not? I passed in the same random seed for the
>> Spark job, but the matrices were different.
>>
>> Thanks,
>>
>> Michael
>>
>>
>>
>> On Thu, Apr 2, 2015 at 6:56 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> The input must be tuples (if not using a filter), so the CLI you have expects
>>> user and item ids like this:
>>>
>>> user-id1,item-id1
>>> user-id500,item-id3000
>>> …
>>>
>>> The ids must be tokenized because it doesn’t use a full CSV parser, only lines
>>> of delimited text.
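>>>
>>> In effect each line just gets split on the delimiter, roughly like this (an
>>> illustration only, not the actual reader code):
>>>
>>>   val parsed = sc.textFile("userItems").map { line =>
>>>     val cols = line.split(",")
>>>     (cols(0), cols(1))   // (user-id, item-id)
>>>   }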
>>>
>>> If this doesn’t help, can you supply a snippet of the input?
>>>
>>>
>>> On Apr 2, 2015, at 10:39 AM, Michael Kelly <michael@onespot.com> wrote:
>>>
>>> Hi all,
>>>
>>> I'm running the spark-itemsimilarity job from the CLI on an AWS EMR
>>> cluster, and I'm running into an exception.
>>>
>>> The input file format is
>>> UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3......
>>>
>>> There is only one row per user, and a total of 97,000 rows.
>>>
>>> I also tried input with one row per UserId/ItemId pair, which had
>>> about 250,000 rows, but I saw a similar exception there too; that
>>> time the out-of-bounds index was around 110,000.
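>>>
>>> The pair-per-row input was generated along these lines (a rough sketch, not
>>> the exact job; paths are placeholders):
>>>
>>>   val flattened = sc.textFile("userItemsByRow").flatMap { line =>
>>>     val fields = line.split("\t")
>>>     fields.tail.map(item => s"${fields.head},$item")
>>>   }
>>>   flattened.saveAsTextFile("userItemPairs")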
>>>
>>> The input is stored in HDFS, and this is the command I used to start the job -
>>>
>>> mahout spark-itemsimilarity --input userItems --output output --master
>>> yarn-client
>>>
>>> Any idea what the problem might be?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>
>>>
>>> 15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0
>>> (TID 7631, ip-XX.XX.ec2.internal):
>>> org.apache.mahout.math.IndexException: Index 22050 is outside
>>> allowable range of [0,21997)
>>>
>>>     org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
>>>     org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
>>>     org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
>>>     org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
>>>     scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>>>     scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>>>     scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
>>>     scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
>>>     scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>>>     scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>>>     scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
>>>     scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
>>>     scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
>>>     scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>     org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
>>>     org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>>>     org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
>>>     org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>     org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>     org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     java.lang.Thread.run(Thread.java:745)
>>>
>>
>
>
