spark-user mailing list archives

From Julio Antonio Soto de Vicente <ju...@esbet.es>
Subject Re: Spark ML's RandomForestClassifier OOM
Date Tue, 10 Jan 2017 11:16:22 GMT
No. I am running Spark on YARN on a 3-node testing cluster.

My guess is that, given the number of splits produced by a hundred trees of depth 30 (which could be as many as roughly 100 * 2^30), either the executors or the driver die of OOM while trying to store all the split metadata. I suspect the same issue affects both local and distributed modes, but those are just conjectures.
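
To put a rough number on that conjecture (back-of-envelope only; the ~50 bytes per node is purely an assumption, and in practice the 20M rows would cap each tree well below 2^30 nodes):

val maxNodesPerTree = math.pow(2, 30) - 1    // full binary tree of depth 30: ~1.07e9 nodes
val totalNodes      = 100 * maxNodesPerTree  // across 100 trees: ~1.07e11 nodes
val roughBytes      = totalNodes * 50        // assuming ~50 bytes of metadata per node
println(f"~${roughBytes / 1e12}%.1f TB of node metadata")   // prints roughly 5.4 TB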

--
Julio

> On 10 Jan 2017, at 11:22, Marco Mistroni <mmistroni@gmail.com> wrote:
> 
> Are you running locally? I found exactly the same issue.
> Two solutions:
> - reduce data size
> - run on EMR
> HTH
> 
>> On 10 Jan 2017 10:07 am, "Julio Antonio Soto" <julio@esbet.es> wrote:
>> Hi, 
>> 
>> I am running into OOM problems while training a Spark ML RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).
>> 
>> My dataset is arguably pretty big given the executor count and size (8x5G), with approximately 20M rows and 130 features.
>> 
>> The "fun fact" is that a single DecisionTreeClassifier with the same specs (same
maxDepth and maxBins) is able to train without problems in a couple of minutes.
>> 
>> AFAIK the current random forest implementation grows each tree sequentially, which means that DecisionTreeClassifiers are fit one by one, and therefore the training process should be similar in terms of memory consumption. Am I missing something here?
>> 
>> Thanks
>> Julio
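
For reference, a minimal sketch of the setup described in the original message. The "label"/"features" column names and the "train" DataFrame are placeholders, and maxMemoryInMB is only a knob worth experimenting with, not a confirmed fix:

import org.apache.spark.ml.classification.{DecisionTreeClassifier, RandomForestClassifier}

// Hypothetical reconstruction of the reported configuration.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(30)          // 30 is the maximum depth Spark ML allows
  .setMaxBins(32)
  .setNumTrees(100)
  .setMaxMemoryInMB(1024)   // default is 256; controls how many nodes are split per pass

val dt = new DecisionTreeClassifier()   // same specs; reported to train in a couple of minutes
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(30)
  .setMaxBins(32)

// val rfModel = rf.fit(train)   // OOMs in the reported setup
// val dtModel = dt.fit(train)   // completes fine in the reported setup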
