spark-user mailing list archives

From Christopher Thom <>
Subject RE: Does DecisionTree model in MLlib deal with missing values?
Date Sun, 11 Jan 2015 21:46:04 GMT
Is there any plan to extend the data types accepted by the tree models in Spark? For example,
many of the models we build contain a large number of string-based categorical factors.
Currently the only strategy is to map these string values to integers and store the mapping
so that the same encoding can be applied when the model is scored. This is a viable solution,
but cumbersome for models with hundreds of such factors.
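
A minimal sketch of that string-to-integer workaround, in plain Python with no Spark dependency. The helper names (`build_index`, `fit_mappings`, `apply_mappings`) and the sample rows are hypothetical and are not part of MLlib; the sketch assumes each row is a list and that categorical columns are identified by position.

```python
def build_index(values):
    """Map each distinct string to a 0-based integer code, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


def fit_mappings(rows, categorical_columns):
    """Build one string -> code mapping per categorical column (keyed by column position)."""
    return {col: build_index(row[col] for row in rows) for col in categorical_columns}


def apply_mappings(rows, mappings):
    """Encode rows as lists of floats, reusing the stored mappings.

    Raises KeyError for a category that was not seen when the mappings were built.
    """
    return [
        [float(mappings[col][value]) if col in mappings else float(value)
         for col, value in enumerate(row)]
        for row in rows
    ]


# Made-up training rows: columns 0 and 2 are string-valued categoricals.
training_rows = [["red", 1.5, "NSW"], ["blue", 0.2, "VIC"], ["red", 3.0, "NSW"]]
mappings = fit_mappings(training_rows, categorical_columns=[0, 2])
encoded_training = apply_mappings(training_rows, mappings)

# At scoring time the same stored mappings must be reused, not rebuilt.
encoded_scoring = apply_mappings([["blue", 9.9, "NSW"]], mappings)
```

The mappings have to be persisted alongside the model, which is the cumbersome part when there are hundreds of such columns.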

Concerning missing data, I haven't been able to figure out how to use NULL values in LabeledPoints,
and I'm not sure whether DecisionTrees handle missing data correctly. The only approach I've
been able to work out is to use a placeholder value, which is not really what is needed: I think
it will introduce bias into the model if a significant proportion of the data is missing.
For example, suppose we have a factor called "TimeSpentonX". If 20% of its values are missing,
what numeric value should replace them? Almost every choice will bias the final model. What we
really want is for the algorithm to simply ignore those values.
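
A small numeric illustration of the placeholder concern, using only the Python standard library. The numbers are made up (four observed values plus one missing, i.e. 20% missing), and `fill` and `summary` are hypothetical helpers; this is a sketch of how a placeholder distorts a feature's distribution, not a statement about what MLlib does internally.

```python
from statistics import mean, pstdev

# Made-up observed values for the feature; one further row (20% of 5) is missing.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 1


def fill(observed_values, missing_count, placeholder):
    """Return the feature column with every missing entry replaced by `placeholder`."""
    return observed_values + [placeholder] * missing_count


def summary(values):
    """Mean and population standard deviation of a column."""
    return {"mean": mean(values), "std": pstdev(values)}


baseline = summary(observed)
zero_filled = summary(fill(observed, n_missing, 0.0))
mean_filled = summary(fill(observed, n_missing, mean(observed)))

for name, stats in [("observed only", baseline),
                    ("zero placeholder", zero_filled),
                    ("mean placeholder", mean_filled)]:
    print(name, stats)
```

With these numbers the zero placeholder pulls the mean down, while the mean placeholder keeps the mean but shrinks the spread. In a tree, any fixed placeholder also forces all rows with missing values onto the same side of every threshold split on that feature.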


-----Original Message-----
From: Sean Owen []
Sent: Sunday, 11 January 2015 10:53 PM
To: Carter
Subject: Re: Does DecisionTree model in MLlib deal with missing values?

I do not recall seeing support for missing values.

Categorical values are encoded as 0.0, 1.0, 2.0, ... When training the model you indicate
which are interpreted as categorical with the categoricalFeaturesInfo parameter, which maps
feature offset to count of distinct categorical values for the feature.
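
As a rough sketch of how that parameter might be assembled: the `mappings` dict below is hypothetical illustration data (feature index to a string-to-code mapping), and `categorical_features_info` and `train` are hypothetical helpers. The only MLlib call used is `DecisionTree.trainClassifier` from `pyspark.mllib.tree`; it is imported lazily inside `train`, so the rest of the block runs without Spark installed.

```python
def categorical_features_info(mappings):
    """Map each categorical feature index to its number of distinct categories."""
    return {col: len(index) for col, index in mappings.items()}


# Hypothetical mappings: feature 0 has 2 categories, feature 2 has 3.
mappings = {
    0: {"red": 0, "blue": 1},
    2: {"NSW": 0, "VIC": 1, "QLD": 2},
}
info = categorical_features_info(mappings)


def train(labeled_points_rdd, info, num_classes=2):
    """Train a classifier on an RDD of LabeledPoint (requires a running Spark context)."""
    from pyspark.mllib.tree import DecisionTree  # lazy import: only needed when training

    # maxBins is assumed here to need to be at least the largest category count.
    max_bins = max([32] + list(info.values()))
    return DecisionTree.trainClassifier(
        labeled_points_rdd,
        numClasses=num_classes,
        categoricalFeaturesInfo=info,
        maxBins=max_bins,
    )
```

Features whose index does not appear in the dict are treated as continuous.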

On Sun, Jan 11, 2015 at 6:54 AM, Carter <> wrote:
> Hi, I am new to the MLlib in Spark. Can the DecisionTree model in
> MLlib deal with missing values? If so, what data structure should I use for the input?
> Moreover, my data has categorical features, but the LabeledPoint
> requires "double" data type, in this case what can I do?
> Thank you very much.


Christopher Thom



