spark-user mailing list archives

From Yanbo Liang <>
Subject Re: Filtering RDD Using Spark.mllib's ChiSqSelector
Date Sat, 16 Jul 2016 09:53:07 GMT
Hi Tobi,

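The MLlib RDD-based API does support applying the transformation to both a
Vector and an RDD, but you did not call it in the appropriate way. Calling
model.transform inside a map function ships the model to the workers, and the
Python model wraps a JVM object that can only be reached through the
SparkContext on the driver, hence the SPARK-5063 error.
Suppose you have an RDD of LabeledPoint records. You can refer to the
following code snippet to train a ChiSqSelectorModel and apply it to the
features of the whole RDD:
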
from pyspark.mllib.linalg import SparseVector

from pyspark.mllib.regression import LabeledPoint

from pyspark.mllib.feature import ChiSqSelector

data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
LabeledPoint(1.0, [0.0, 9.0, 8.0]), LabeledPoint(2.0, [8.0, 9.0,

rdd = sc.parallelize(data)

model = ChiSqSelector(1).fit(rdd)

filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
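
For intuition about what a fitted selector does, here is a minimal pure-Python sketch of chi-squared feature selection. It is not Spark's implementation. The helper names (chi_sq_statistic, select_top_features, project) and the toy data are hypothetical, and it assumes that each distinct feature value is treated as a category when the statistic is computed.

```python
from collections import Counter


def chi_sq_statistic(values, labels):
    """Pearson chi-squared statistic of one feature column against the labels.

    Assumption: every distinct feature value is its own category.
    """
    n = len(values)
    observed = Counter(zip(values, labels))   # joint counts (value, label)
    value_totals = Counter(values)            # marginal counts per value
    label_totals = Counter(labels)            # marginal counts per label
    stat = 0.0
    for v, v_count in value_totals.items():
        for l, l_count in label_totals.items():
            expected = v_count * l_count / n  # count expected under independence
            stat += (observed[(v, l)] - expected) ** 2 / expected
    return stat


def select_top_features(rows, labels, num_top_features):
    """Return the sorted indices of the features with the largest statistics."""
    num_features = len(rows[0])
    stats = [chi_sq_statistic([row[j] for row in rows], labels)
             for j in range(num_features)]
    ranked = sorted(range(num_features), key=lambda j: stats[j], reverse=True)
    return sorted(ranked[:num_top_features])


def project(row, selected):
    """Keep only the selected feature indices of a single row."""
    return [row[j] for j in selected]


# Hypothetical toy data: only the last feature depends on the label.
rows = [[1.0, 0.0, 3.0],
        [1.0, 0.0, 4.0],
        [1.0, 1.0, 3.0],
        [1.0, 1.0, 4.0]]
labels = [0.0, 1.0, 0.0, 1.0]

selected = select_top_features(rows, labels, 1)
filtered = [project(row, selected) for row in rows]
```

The selection step (fit) looks at the whole dataset once, while the projection step (transform) is a per-row operation. This mirrors the split between ChiSqSelector.fit and model.transform in the snippet above.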


However, we strongly recommend that you migrate to the DataFrame-based API,
since the RDD-based API has been switched to maintenance mode.


2016-07-14 13:23 GMT-07:00 Tobi Bosede <>:

> Hi everyone,
> I am trying to filter my features based on the spark.mllib ChiSqSelector.
> filteredData = lp: LabeledPoint(lp.label,
> model.transform(lp.features)))
> However when I do the following I get the error below. Is there any other
> way to filter my data to avoid this error?
> filteredDataDF=filteredData.toDF()
> Exception: It appears that you are attempting to reference SparkContext from a broadcast
> variable, action, or transforamtion. SparkContext can only be used on the driver, not in code
> that it run on workers. For more information, see SPARK-5063.
> I would directly use the ChiSqSelector and work with DataFrames, but I am on
> Spark 1.4 and using PySpark, so that ChiSqSelector is not available to me. filteredData
> is of type PipelinedRDD, if that helps. It is not a regular RDD. I think that may be part of
> why calling toDF() is not working.
> Thanks,
> Tobi
