Can you provide a sample of the expected and actual results?


The results in MulticlassMetrics are totally wrong; they are calculated improperly.
The confusion matrix may be correct, I don't know, but the per-label scores are wrong.
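
One way to produce an expected-versus-actual comparison is to recompute the per-label scores by hand from the confusion matrix and set them next to what MulticlassMetrics reports. The following is only a minimal sketch in plain Python, not the library's implementation. It assumes rows are actual labels and columns are predicted labels, and the `per_label_scores` helper and the `example` matrix are hypothetical, made up purely for illustration.

```python
def per_label_scores(confusion):
    """Return {label_index: (precision, recall, f1)} from a square confusion matrix.

    Assumption: confusion[i][j] counts samples whose actual label is i
    and whose predicted label is j.
    """
    n = len(confusion)
    scores = {}
    for k in range(n):
        true_positives = confusion[k][k]
        actual_total = sum(confusion[k])                          # row sum
        predicted_total = sum(confusion[i][k] for i in range(n))  # column sum
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[k] = (precision, recall, f1)
    return scores


# Hypothetical 3-class confusion matrix, not real output from any run.
example = [
    [2, 1, 0],
    [0, 3, 1],
    [1, 0, 2],
]

scores = per_label_scores(example)
for label, (p, r, f) in sorted(scores.items()):
    print(f"label {label}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

If the hand-computed values differ from the reported ones, posting both sets of numbers together with the confusion matrix would answer the question above.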

