spark-user mailing list archives

From Reed Villanueva <villanuevar...@gmail.com>
Subject Re: sparkml random forest classifier not learning (at all) compared to H2O implementation (on same data)?
Date Mon, 14 Jun 2021 04:11:20 GMT
I *think* I've solved the issue.
Will update w/ details after further testing / inspection.

On Sun, Jun 13, 2021 at 3:29 PM Reed Villanueva <villanuevareed@gmail.com>
wrote:

> I am trying to train a random forest classifier w/ sparkml
> <https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier>
> and am seeing that the *accuracy etc. are very bad (about the same as the
> dataset's response distribution itself), yet when using the same data in a
> random forest from the H2O
> <http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html#> module
> I actually do get OK results* (~80% accuracy, 0.90 F2, etc., which at least
> implies that it's learning *something*). This big difference makes me
> suspect that this is not just a hyperparameter tuning issue. What could be
> going on here?
>
> My dataset is mostly categorical features (16 categorical, 2 integer) with
> a binary response distribution of about 36/64%.
>
> *Both the sparkml and H2O implementations, coded in a similar way as for
> the actual dataset, seemed to perform well on my benchmarking dataset
> <https://archive.ics.uci.edu/ml/datasets/Car+Evaluation>* (note: if you
> download the car.data file there, it has no column names; I added those
> manually), else I would think that something was just wrong with the way I
> implemented the sparkml code itself. I find it odd that the sparkml
> implementation does not attribute any importance to any of the features
> (ie. the values of all the importances in the trained model's
> featureImportances are zero). *I would think there would at least be
> something in the featureImportances when mimicking the broad strokes of
> the H2O hyperparams (imbalanced classes or not)*. So not only is the
> sparkml implementation learning nothing vs. the H2O version, it does not
> even seem to consider any of the provided features important *at all* in
> making a decision about the samples. This big difference makes me suspect
> that this is not just a hyperparameter tuning issue.
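>
> For reference, this is roughly how I'm inspecting those importances with
> feature names attached (a sketch; it assumes the test_prediction dataframe
> and best_rf model from the pipeline code further below, and that
> VectorAssembler has written "ml_attr" metadata onto the "features" column):
>
> # sketch: map featureImportances back to column names via the metadata
> # that VectorAssembler attaches to the assembled "features" column
> meta = test_prediction.schema["features"].metadata["ml_attr"]["attrs"]
> name_by_idx = {attr["idx"]: attr["name"]
>                for group in meta.values()  # "numeric" / "nominal" groups
>                for attr in group}
> for idx, imp in enumerate(best_rf.featureImportances.toArray()):
>     print(f"{name_by_idx.get(idx, idx)}: {imp}")  # all 0.0 in my case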
>
> An example of how I'm building and training the sparkml pipeline can be
> found here: https://gist.github.com/reedv/409df80f516ec17e330510365f75f558 (my
> actual hyperparams are shown further down this post)
>
> An example of how I'm training the H2O implementation can be found here:
> https://gist.github.com/reedv/169856f9442354a404fc0e1e0d3e8aa8 (the
> example is using benchmarking data, but hyperparams for the actual
> H2ORandomForestClassifier are the same as with my actual dataset).
>
> My basic sparkml pipeline and training code looks like...
>
>
> import datetime
> import pprint
>
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
>
> pp = pprint.PrettyPrinter(indent=2)
>
> training_features = <list of all the training features in the dataset>
> print("Training features:")
> print(training_features)
>
> # convert response label to categorical index for spark
> label_idxer = StringIndexer(inputCol='outcome',
>                             outputCol="label").fit(dff)
> # convert all categorical features to indexes for spark
> feature_idxer = StringIndexer(inputCols=training_features,
>                               outputCols=[x + 'Index' for x in training_features],
>                               handleInvalid="keep").fit(dff)
>
> training_features = [x + 'Index' for x in training_features]
> print("Training features:")
> print(training_features)
>
> # convert all features per record into single vectors to feed to spark ML transformer
> assembler = VectorAssembler(inputCols=training_features,
>                             outputCol="features")
> # create a RDF estimator
> rf = RandomForestClassifier(labelCol="label",
>                             featuresCol="features",
>                             seed=1819511352808605668)
> pp.pprint(vars(rf))
> pp.pprint(rf.getRawPredictionCol())
> pp.pprint(rf.getProbabilityCol())
> pp.pprint(rf.getPredictionCol())
> # convert prediction output category indexes back to strings
> label_converter = IndexToString(inputCol=rf.getPredictionCol(),
>                                 outputCol="prediction_label",
>                                 labels=label_idxer.labels)
>
> pipeline = Pipeline(stages=[label_idxer, feature_idxer, assembler,
>                             rf,
>                             label_converter])  # type: pyspark.ml.Pipeline
> # we would normally then do something like...
> # pipeline_transformer = pipeline.fit(spark_df)
> # prediction_df = pipeline_transformer.transform(spark_df)
> # ...(see https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline-components)
> # ...but instead, we are going to use cross validation to optimize the RDF w/in the pipeline
>
> rfparamGrid = ParamGridBuilder() \
>     .addGrid(rf.maxDepth, [20, 30, 60, 90]) \
>     .addGrid(rf.maxBins, [10000, 30000, 100000, 300000]) \
>     .addGrid(rf.numTrees, [37, 64, 280, 370]) \
>     .addGrid(rf.minInstancesPerNode, [1]) \
>     .addGrid(rf.minInfoGain, [0.0, 0.25, 1.0]) \
>     .addGrid(rf.subsamplingRate, [0.5, 0.75, 1.0]) \
>     .addGrid(rf.bootstrap, [True]) \
>     .addGrid(rf.featureSubsetStrategy, ['auto']) \
>     .build()
>
> crossval = CrossValidator(estimator=pipeline,
>                           estimatorParamMaps=rfparamGrid,
>                           evaluator=MulticlassClassificationEvaluator(
>                               labelCol="label",
>                               predictionCol=rf.getPredictionCol(),
>                               # since my data is imbalanced, I'm using the F2 scoring metric
>                               metricName="weightedFMeasure", beta=2.0),
>                           numFolds=3)
>
> display(dff.head(n=3))
> (train_u, test_u) = dff.randomSplit([0.8, 0.2])
> assert train_u.dtypes == test_u.dtypes
>
> # fit the cross validation estimator to get the optimal pipeline transformer/model
> print(datetime.datetime.now())
> best_rf_pipeline = crossval.fit(train_u)  # type: pyspark.ml.tuning.CrossValidatorModel
> print(datetime.datetime.now())
>
> # now let's look at how it performs on the withheld test data, as well as
> # inspecting some aspects of the RDF model w/in the pipeline
> test_prediction = best_rf_pipeline.transform(test_u)
>
> evals = MulticlassClassificationEvaluator(labelCol="label",
>                                           predictionCol=rf.getPredictionCol())
> statistics = {
>     "acc": evals.evaluate(test_prediction, {evals.metricName: "accuracy"}),
>     "recall": evals.evaluate(test_prediction, {evals.metricName: "weightedRecall"}),
>     "precision": evals.evaluate(test_prediction, {evals.metricName: "weightedPrecision"}),
>     "f1": evals.evaluate(test_prediction, {evals.metricName: "f1"}),
>     "f2": evals.evaluate(test_prediction, {evals.metricName: "weightedFMeasure",
>                                            evals.beta: 2.0}),
> }
> print("Model Information")
> for stat in statistics:
>     print(stat + ": " + str(statistics[stat]))
>
> print("Model Feature Importance:")
> print(type(best_rf_pipeline))
> print(type(best_rf_pipeline.bestModel))
> for index, stage in enumerate(best_rf_pipeline.bestModel.stages):
>     print(f"{index}: {type(stage)}")
> best_rf = best_rf_pipeline.bestModel.stages[3]  # the RandomForestClassificationModel stage
> pp.pprint(type(best_rf.featureImportances))
> pp.pprint(best_rf.featureImportances)
>
>
>
> These are the configs for the H2O model (truncated to omit params that
> seemed highly non-relevant; again, since I don't know what the issue is,
> there may still be non-relevant things left in):
>
>
>
> {
>  'auc_type': {'actual': 'AUTO', 'default': 'AUTO', 'input': 'AUTO'},
>  'balance_classes': {'actual': True, 'default': False, 'input': True},
>  'binomial_double_trees': {'actual': True, 'default': False, 'input': True},
>  'build_tree_one_node': {'actual': False, 'default': False, 'input': False},
>  'calibrate_model': {'actual': False, 'default': False, 'input': False},
>  'calibration_frame': {'actual': None, 'default': None, 'input': None},
>  'categorical_encoding': {'actual': 'Enum', 'default': 'AUTO', 'input': 'AUTO'},
>  'check_constant_response': {'actual': True, 'default': True, 'input': True},
>  'checkpoint': {'actual': None, 'default': None, 'input': None},
>  'class_sampling_factors': {'actual': None, 'default': None, 'input': None},
>  'col_sample_rate_change_per_level': {'actual': 1.0,
>                                       'default': 1.0,
>                                       'input': 1.0},
>  'col_sample_rate_per_tree': {'actual': 1.0, 'default': 1.0, 'input': 1.0},
>  'custom_metric_func': {'actual': None, 'default': None, 'input': None},
>  'distribution': {'actual': 'multinomial',
>                   'default': 'AUTO',
>                   'input': 'multinomial'},
>  'export_checkpoints_dir': {'actual': None, 'default': None, 'input': None},
>  'fold_assignment': {'actual': None, 'default': 'AUTO', 'input': 'AUTO'},
>  'fold_column': {'actual': None, 'default': None, 'input': None},
>  'gainslift_bins': {'actual': -1, 'default': -1, 'input': -1},
>  'histogram_type': {'actual': 'UniformAdaptive',
>                     'default': 'AUTO',
>                     'input': 'AUTO'},
>  'ignore_const_cols': {'actual': True, 'default': True, 'input': True},
>  'keep_cross_validation_fold_assignment': {'actual': False,
>                                            'default': False,
>                                            'input': False},
>  'keep_cross_validation_models': {'actual': True,
>                                   'default': True,
>                                   'input': True},
>  'keep_cross_validation_predictions': {'actual': False,
>                                        'default': False,
>                                        'input': False},
>  'max_after_balance_size': {'actual': 5.0, 'default': 5.0, 'input': 5.0},
>  'max_confusion_matrix_size': {'actual': 20, 'default': 20, 'input': 20},
>  'max_depth': {'actual': 20, 'default': 20, 'input': 20},
>  'max_runtime_secs': {'actual': 10800.0, 'default': 0.0, 'input': 10800.0},
>  'min_rows': {'actual': 1.0, 'default': 1.0, 'input': 1.0},
>  'min_split_improvement': {'actual': 1e-05, 'default': 1e-05, 'input': 1e-05},
>  'mtries': {'actual': -1, 'default': -1, 'input': -1},
>  'nbins': {'actual': 32, 'default': 20, 'input': 32},
>  'nbins_cats': {'actual': 1024, 'default': 1024, 'input': 1024},
>  'nbins_top_level': {'actual': 1024, 'default': 1024, 'input': 1024},
>  'nfolds': {'actual': 0, 'default': 0, 'input': 0},
>  'ntrees': {'actual': 64, 'default': 50, 'input': 64},
>  'offset_column': {'actual': None, 'default': None, 'input': None},
>  'r2_stopping': {'actual': 1.7976931348623157e+308,
>                  'default': 1.7976931348623157e+308,
>                  'input': 1.7976931348623157e+308},
>  'response_column': {'actual': {'__meta': {'schema_name': 'ColSpecifierV3',
>                                            'schema_type': 'VecSpecifier',
>                                            'schema_version': 3},
>                                 'column_name': 'outcome',
>                                 'is_member_of_frames': None},
>                      'default': None,
>                      'input': {'__meta': {'schema_name': 'ColSpecifierV3',
>                                           'schema_type': 'VecSpecifier',
>                                           'schema_version': 3},
>                                'column_name': 'outcome',
>                                'is_member_of_frames': None}},
>  'sample_rate': {'actual': 0.632, 'default': 0.632, 'input': 0.632},
>  'sample_rate_per_class': {'actual': None, 'default': None, 'input': None},
>  'score_each_iteration': {'actual': False, 'default': False, 'input': False},
>  'score_tree_interval': {'actual': 0, 'default': 0, 'input': 0},
>  'seed': {'actual': 1819511352808605668, 'default': -1, 'input': -1},
>  'stopping_metric': {'actual': None, 'default': 'AUTO', 'input': 'AUTO'},
>  'stopping_rounds': {'actual': 0, 'default': 0, 'input': 0},
>  'stopping_tolerance': {'actual': 0.001, 'default': 0.001, 'input': 0.001},
>  'weights_column': {'actual': None, 'default': None, 'input': None}
> }
>
>
>
> These are the configs for the sparkml model (which I use in a 3-fold cross
> validation
> <https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation>).
> Note that I was not sure how to duplicate certain H2O hyperparams in spark
> (eg. H2O's random forest supports "binomial_double_trees
> <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/binomial_double_trees.html>"
> while sparkml's doesn't, and sparkml's RF requires maxBins to be at least
> as large as the number of categories in the highest-cardinality categorical
> feature, while H2O's does not).
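>
> (For context on the large maxBins values in the grid below: I'm sizing
> them from the cardinality of my categorical features, roughly like this;
> a sketch, where training_features are the raw, pre-indexed column names:)
>
> # sketch: find the largest categorical cardinality, since sparkml needs
> # maxBins to be at least that large
> from pyspark.sql.functions import countDistinct
>
> cards = dff.agg(*[countDistinct(c).alias(c)
>                   for c in training_features]).first().asDict()
> print(cards)
> print("min usable maxBins:", max(cards.values()))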
>
>
>
> rfparamGrid = ParamGridBuilder() \
>     .addGrid(rf.maxDepth, [20, 30, 60, 90]) \
>     .addGrid(rf.maxBins, [10000, 30000, 100000, 300000]) \
>     .addGrid(rf.numTrees, [37, 64, 280, 370]) \
>     .addGrid(rf.minInstancesPerNode, [1]) \
>     .addGrid(rf.minInfoGain, [0.0, 0.25, 1.0]) \
>     .addGrid(rf.subsamplingRate, [0.5, 0.75, 1.0]) \
>     .addGrid(rf.bootstrap, [True]) \
>     .addGrid(rf.featureSubsetStrategy, ['auto']) \
>     .build()
>
> (Even when using just the closest maxDepth, maxBins, and numTrees values
> to those in the H2O version, the results are still the same for the
> sparkml model: nothing learned beyond the distribution of the responses
> themselves, and the featureImportances of the sparkml model are still all
> zeros.)
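>
> One other H2O hyperparam I could not duplicate: balance_classes=True (see
> the config above); the closest sparkml equivalent I know of is a per-row
> weight column (RandomForestClassifier supports weightCol in Spark >= 3.0).
> A rough sketch of what that would look like, where the "yes" label value
> and the "weight" column name are just placeholders, not my actual data:
>
> # sketch: approximate H2O's balance_classes via per-row class weights
> # ("yes" and "weight" are placeholder names)
> import pyspark.sql.functions as F
>
> n_total = dff.count()
> n_pos = dff.filter(F.col('outcome') == 'yes').count()
> dff_w = dff.withColumn(
>     'weight',
>     F.when(F.col('outcome') == 'yes', n_total / (2.0 * n_pos))
>      .otherwise(n_total / (2.0 * (n_total - n_pos))))
> rf = RandomForestClassifier(labelCol='label', featuresCol='features',
>                             weightCol='weight',
>                             seed=1819511352808605668)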
>
> Anyone with more experience have any ideas what could be going on here?
> See any implementation / usage mistakes I'm making that could be causing
> the sparkml pipeline to train so poorly?
>
