spark-user mailing list archives

From Sean Owen <>
Subject Re: ALS failure with size > Integer.MAX_VALUE
Date Sun, 30 Nov 2014 09:31:16 GMT
(It won't be that, since you see the error occur when reading a block
from disk. I think this is an instance of the 2GB block size limit.)
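
For context, a common way to avoid hitting the 2GB block limit is to split the factorization across more, smaller blocks. A minimal sketch, assuming the MLlib 1.1 ALS API; the block count of 2400 and the other parameter values are illustrative, not recommendations:

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// More blocks means smaller per-block payloads in the shuffle, which
// helps keep each serialized block under the 2GB ByteBuffer limit.
def trainWithMoreBlocks(ratings: RDD[Rating]) =
  ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01,
    blocks = 2400) // illustrative; tune to the data volume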

On Sun, Nov 30, 2014 at 4:36 AM, Ganelin, Ilya
<> wrote:
> Hi Bharath – I'm unsure if this is your problem, but the
> MatrixFactorizationModel in MLlib, which is the underlying component for
> ALS, expects your user/product fields to be integers. Specifically, the
> input to ALS is an RDD[Rating], and a Rating is an (Int, Int, Double). I
> am wondering if perhaps one of your identifiers exceeds Integer.MAX_VALUE;
> could you write a quick check for that?
> I have been running a very similar use case to yours (with more
> constrained hardware resources), and while I haven't seen this exact
> problem, we have seen similar issues. Please let me know if you have other
> questions.
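
As an illustration of the quick check suggested above, here is a minimal sketch. It assumes the raw input is comma-separated (user, item, rating) text; the path and delimiter are assumptions:

import org.apache.spark.SparkContext

// Parse identifiers as Long first so values beyond the Int range don't
// overflow, then verify everything fits before building the RDD[Rating].
def maxId(sc: SparkContext, path: String): Long =
  sc.textFile(path)
    .map { line =>
      val f = line.split(',')
      math.max(f(0).toLong, f(1).toLong)
    }
    .max()

// Usage (path is illustrative):
// require(maxId(sc, "hdfs:///data/ratings") <= Int.MaxValue,
//   "an identifier exceeds Int.MaxValue")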
> From: Bharath Ravi Kumar <>
> Date: Thursday, November 27, 2014 at 1:30 PM
> To: "" <>
> Subject: ALS failure with size > Integer.MAX_VALUE
> We're training a recommender with ALS in MLlib 1.1 against a dataset of
> 150M users and 4.5K items, with the total number of training records
> being 1.2 billion (~30GB of data). The input data is spread across 1200
> partitions on HDFS. For the training, rank=10, and we've configured
> {number of user data blocks = number of item data blocks}. The number of
> user/item blocks was varied between 50 and 1200. Irrespective of the
> block count (e.g. at 1200 blocks each), at least a couple of tasks end up
> shuffle reading > 9.7G each in the aggregate stage (ALS.scala:337) and
> fail with the following exception:
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>         at ...
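
For reference, the configuration described above corresponds roughly to the following sketch, assuming the MLlib 1.1 builder API and a spark-shell SparkContext; the input path, delimiter, and iteration count are assumptions:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// sc: SparkContext, as in spark-shell. Path and format are illustrative.
val ratings = sc.textFile("hdfs:///data/ratings")
  .map { line =>
    val f = line.split(',')
    Rating(f(0).toInt, f(1).toInt, f(2).toDouble)
  }

val model = new ALS()
  .setRank(10)          // rank=10, as described
  .setIterations(10)    // iteration count not stated in the thread
  .setBlocks(1200)      // sets user and product block counts together
  .run(ratings)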
