spark-user mailing list archives

From Yanbo Liang <yblia...@gmail.com>
Subject Re: Filtering RDD Using Spark.mllib's ChiSqSelector
Date Sat, 16 Jul 2016 09:53:07 GMT
Hi Tobi,

The MLlib RDD-based API does support applying the transformation to both a
Vector and an RDD, but you did not call it in the appropriate way.
Suppose you have an RDD with one LabeledPoint per element. You can refer to
the following code snippet to train a ChiSqSelectorModel and do the
transformation:

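from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
        LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
        LabeledPoint(1.0, [0.0, 9.0, 8.0]),
        LabeledPoint(2.0, [8.0, 9.0, 5.0])]

# sc is the existing SparkContext (for example, from the pyspark shell).
rdd = sc.parallelize(data)

# Fit on the RDD of LabeledPoint, keeping the single top feature.
model = ChiSqSelector(1).fit(rdd)

# Call transform on the driver with an RDD of feature vectors,
# not inside a map() closure.
filteredRDD = model.transform(rdd.map(lambda lp: lp.features))

filteredRDD.collect()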
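
If you also need to keep the labels next to the filtered features (as the original map over LabeledPoint was trying to do), one option is to transform the features as an RDD and zip the result back with the labels. The sketch below is only an illustration: the function name select_and_keep_labels and its parameters are made up for this example, and it assumes that sc is a live SparkContext. The likely reason the original attempt failed is that the model wraps a JVM object, so referencing it inside a lambda pulls the SparkContext into worker code.

```python
def select_and_keep_labels(sc, labeled_points, num_top_features=1):
    """Fit a ChiSqSelector and return an RDD of LabeledPoint with filtered features.

    Hypothetical helper for illustration; `sc` is assumed to be a live SparkContext
    and `labeled_points` a list of pyspark.mllib LabeledPoint objects.
    """
    # Imported lazily so that this module loads even without a Spark installation.
    from pyspark.mllib.feature import ChiSqSelector
    from pyspark.mllib.regression import LabeledPoint

    rdd = sc.parallelize(labeled_points)
    model = ChiSqSelector(num_top_features).fit(rdd)

    # transform() is called once on the driver with a whole RDD,
    # so the model is never captured in a closure shipped to workers.
    labels = rdd.map(lambda lp: lp.label)
    filtered = model.transform(rdd.map(lambda lp: lp.features))

    # Both RDDs are derived from `rdd` by map(), so their elements line up for zip().
    return labels.zip(filtered).map(lambda pair: LabeledPoint(pair[0], pair[1]))
```

The returned RDD contains plain LabeledPoint objects and no reference to the model, so a later conversion such as toDF() should not trigger the SPARK-5063 check.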

However, we strongly recommend that you migrate to the DataFrame-based API,
since the RDD-based API has been switched to maintenance mode.
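
For comparison, a DataFrame-based version might look like the sketch below. It assumes Spark 2.0 or later, where pyspark.ml.feature.ChiSqSelector is available in Python, so it does not apply to a Spark 1.4 installation. The function name, the column names "label", "features" and "selectedFeatures", and the spark argument (a SparkSession) are choices made for this example.

```python
def select_with_dataframe_api(spark, rows, num_top_features=1):
    """Select top features with the DataFrame-based ChiSqSelector.

    Hypothetical helper for illustration; `spark` is assumed to be a SparkSession
    and `rows` a list of (label, pyspark.ml.linalg vector) tuples.
    """
    # Imported lazily so that this module loads even without a Spark installation.
    from pyspark.ml.feature import ChiSqSelector

    df = spark.createDataFrame(rows, ["label", "features"])

    selector = ChiSqSelector(
        numTopFeatures=num_top_features,
        featuresCol="features",
        labelCol="label",
        outputCol="selectedFeatures",
    )

    # fit() returns a ChiSqSelectorModel; transform() appends the output column.
    return selector.fit(df).transform(df)


# Example call (requires a running SparkSession named `spark`):
#   from pyspark.ml.linalg import Vectors
#   rows = [(0.0, Vectors.dense([8.0, 7.0, 0.0])),
#           (1.0, Vectors.dense([0.0, 9.0, 6.0]))]
#   select_with_dataframe_api(spark, rows).show()
```

Here the labels stay in the same DataFrame as the selected features, so no zip step is needed.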

Thanks
Yanbo

2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.tobib@gmail.com>:

> Hi everyone,
>
> I am trying to filter my features based on the spark.mllib ChiSqSelector.
>
> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
> model.transform(lp.features)))
>
> However when I do the following I get the error below. Is there any other
> way to filter my data to avoid this error?
>
> filteredDataDF=filteredData.toDF()
>
> Exception: It appears that you are attempting to reference SparkContext from a broadcast
> variable, action, or transforamtion. SparkContext can only be used on the driver, not in code
> that it run on workers. For more information, see SPARK-5063.
>
>
> I would directly use the spark.ml ChiSqSelector and work with DataFrames, but I am on
> Spark 1.4 and using PySpark, so spark.ml's ChiSqSelector is not available to me. filteredData
> is of type PipelinedRDD, if that helps. It is not a regular RDD. I think that may be part of
> why calling toDF() is not working.
>
>
> Thanks,
>
> Tobi
>
>
