spark-user mailing list archives

Subject Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly
Date Sat, 14 Jun 2014 10:19:46 GMT
Hi Manish,
Thanks for your reply.

I am attaching the logs here (regression, 5 levels); they contain the last
few hundred lines. I am also attaching a screenshot of the Spark UI. The
first 4 levels complete in less than 6 seconds, while the 5th level doesn't
complete even after several hours.
Since this is somebody else's data, I can't share it.

Could you check the code snippet attached in my first email and see whether
it needs anything to work for large data and >= 5 levels? It works for 3
levels on the same dataset, but not for 5 levels.

In the meantime, I will try to run it on the latest master and let you
know the results. If it runs fine there, the problem may be related to the
128 MB limit issue that you mentioned.

Thanks and Regards,
Suraj Sheth

On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde <> wrote:

> Hi Suraj,
> I can't answer 1) without knowing the data. However, the results for 2)
> are surprising indeed. We have tested with a billion samples for regression
> tasks, so I am perplexed by the behavior.
> Could you try the latest Spark master to see whether this problem goes
> away? It has code that limits memory consumption at the master and worker
> nodes to 128 MB by default, which ideally should not be needed given the
> amount of RAM on your cluster.
> Also, feel free to send the DEBUG logs. They might give me a better idea of
> where the algorithm is getting stuck.
> -Manish
> On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <> wrote:
>> Hi Filipus,
>> The train data is already oversampled.
>> The number of positives I mentioned above is for the test dataset: 12,028
>> (apologies for not making this clear earlier).
>> The train dataset has 61,264 positives out of 689,763 total rows; the
>> number of negatives is 628,499.
>> Oversampling was done on the train dataset to ensure that we have at
>> least 9-10% positives in the train part.
>> No oversampling was done for the test dataset.
>> So, the only difference that remains is the amount of data used for
>> building a tree.
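As a quick arithmetic check on the figures above (numbers taken from this thread, plain Python, no Spark required):

```python
# Train-set figures quoted in this thread.
positives = 61264
total = 689763
negatives = total - positives

# Negatives should match the 628,499 stated above.
print(negatives)  # 628499

# Positive ratio after oversampling: roughly 8.9%,
# consistent with the "at least 9-10%" target mentioned.
print(round(100 * positives / total, 1))  # 8.9
```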
>> But I have a few more questions:
>> Have we tested how much data can be used, at most, to build a single
>> Decision Tree?
>> Since I have enough RAM to fit all the data in memory (only 1.3 GB of
>> train data and 30x3 GB of RAM), I would expect it to build a single
>> Decision Tree on all the data without any issues. But for maxDepth >= 5,
>> it is not able to. I confirmed that while it keeps running for hours, the
>> amount of free memory available stays above 70%, so it doesn't seem to be
>> a memory issue either.
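As a small aside (back-of-the-envelope arithmetic, not from the thread): in a level-wise trainer, the work can grow with the number of tree nodes, and a binary tree doubles its node count per level, so later levels are inherently the expensive ones:

```python
# Upper bound on the number of nodes at each level of a binary
# decision tree: the count doubles with every level.
for level in range(6):
    print(level, 2 ** level)

# The total across levels 0..5 is 2**6 - 1 = 63 nodes.
print(sum(2 ** level for level in range(6)))  # 63
```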
>> Thanks and Regards,
>> Suraj Sheth
>> On Wed, Jun 11, 2014 at 10:19 PM, filipus <> wrote:
>>> Well, I guess your problem is quite unbalanced, and with information
>>> value as the splitting criterion I guess the algorithm stops after very
>>> few splits.
>>> A workaround is oversampling:
>>> build many training datasets, e.g. randomly take 50% of the positives
>>> and, from the negatives, the same amount or, let's say, double that
>>> => 6000 positives and 12000 negatives
>>> build a tree
>>> do this many times => many models (agents)
>>> and then make an ensemble model, i.e. let all the models vote
>>> in a way this is similar to a random forest, but built completely
>>> differently
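The balanced-bagging scheme sketched above could look roughly like this in plain Python (a minimal sketch; the function names and parameters are hypothetical stand-ins, not MLlib API, and the per-model tree training itself is left out):

```python
import random

def balanced_samples(positives, negatives, n_models, neg_ratio=2):
    """Draw many balanced training sets: 50% of the positives, plus
    neg_ratio times as many randomly drawn negatives per set."""
    for _ in range(n_models):
        pos = random.sample(positives, len(positives) // 2)
        neg = random.sample(negatives, min(len(negatives), neg_ratio * len(pos)))
        yield pos + neg

def majority_vote(predictions):
    """Combine the per-model 0/1 predictions by simple majority voting."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0
```

Each sampled set would be used to train one tree; the resulting models then vote per example, much like a random forest assembled by hand.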
