spark-dev mailing list archives

From Chunnan Yao <>
Subject Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test
Date Thu, 12 Mar 2015 07:13:55 GMT
Hi everyone!
I am digging into MLlib of Spark 1.2.1 currently. When reading codes of
MLlib.stat.test, in the file ChiSqTest.scala under
/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused
by the usage of the mapPartitions API in the function

def chiSquaredFeatures(data: RDD[LabeledPoint],
    methodName: String = PEARSON.name): Array[ChiSqTestResult]

According to my statistical-testing knowledge, the Chi-square test requires
reasonably large counts (above 5 for at least 80% of entries) in its
contingency matrix for the approximation to hold. Thus the number of feature
and label categories cannot be too large; otherwise there would be too few
items in each category, which violates this constraint on using the
Chi-square test.
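To make the sparsity concern concrete, here is a rough arithmetic sketch (my own illustration, not code from MLlib): with n samples spread over r feature categories and c label categories, the average expected cell count is n / (r * c), and the usual rule of thumb asks for expected counts above 5 in most cells.

```scala
// Rough sketch: average expected count per contingency-matrix cell.
// Names and numbers here are illustrative, not from ChiSqTest.scala.
object ExpectedCountSketch {
  def avgExpected(n: Int, r: Int, c: Int): Double = n.toDouble / (r * c)

  def main(args: Array[String]): Unit = {
    // 1000 samples, 10 feature categories, 2 labels: plenty per cell.
    println(avgExpected(1000, 10, 2))   // 50.0
    // Same 1000 samples over 200 feature categories: too sparse.
    println(avgExpected(1000, 200, 2))  // 2.5
  }
}
```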

I do see that in the function above, Spark throws an exception when
distinctLabels.size or distinctFeatures.size exceeds maxCategories (defined
as 10000). However, the two HashSets distinctLabels and distinctFeatures are
initialized inside mapPartitions, which means Spark only checks the number of
feature and label categories within a single partition. The reduced
result, the contingency matrix, can therefore still contain an excessive
number of categories, and hence small matrix entries, which makes the
Chi-square test inaccurate. I've written a unit test on this function that
demonstrates the case.
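The gap between a per-partition check and a global check can be sketched without Spark at all, simulating partitions as nested sequences (a hypothetical stand-in, with a small maxCategories in place of MLlib's 10000):

```scala
// Sketch of the issue: a check built per partition (as HashSets inside
// mapPartitions would be) can pass even when the combined category count
// across all partitions exceeds the limit.
object CategoryCountSketch {
  val maxCategories = 4  // small stand-in for MLlib's 10000

  // Returns (does every partition pass the check individually,
  //          distinct category count over the whole dataset).
  def check(partitions: Seq[Seq[Double]]): (Boolean, Int) = {
    val perPartitionOk = partitions.forall(_.toSet.size <= maxCategories)
    val globalDistinct = partitions.flatten.toSet.size
    (perPartitionOk, globalDistinct)
  }

  def main(args: Array[String]): Unit = {
    // Each partition holds only 3 distinct values, but the union holds 6.
    val partitions = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0))
    val (perOk, global) = check(partitions)
    println(s"per-partition check passes: $perOk")   // true
    println(s"global distinct categories: $global")  // 6 > maxCategories
  }
}
```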

Maybe I am just trapped by a misunderstanding. Could anyone please give me a
hint on this issue?

Feel the sparking Spark!
