spark-user mailing list archives

From Yanbo Liang <yblia...@gmail.com>
Subject Re: Filtering RDD Using Spark.mllib's ChiSqSelector
Date Mon, 18 Jul 2016 03:26:08 GMT
Hi Tobi,

Thanks for clarifying the question. It's very straightforward to convert
the filtered RDD to a DataFrame; you can refer to the following code snippet:

from pyspark.sql import Row

rdd2 = filteredRDD.map(lambda v: Row(features=v))

df = rdd2.toDF()
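
If the label also needs to travel with the filtered features (spark.ml's
LogisticRegression reads a "label" column and a "features" column by
default), one possible approach is to zip the labels back onto the
transformed vectors before building the DataFrame. The following is only a
rough sketch: the helper name to_labeled_df and its parameters are
hypothetical, and it assumes rdd holds LabeledPoint objects, model is the
fitted ChiSqSelectorModel, and sql_context is an existing SQLContext.

```python
def to_labeled_df(sql_context, rdd, model):
    """Build a DataFrame with "label" and "features" columns.

    Hypothetical helper: rdd is assumed to hold LabeledPoint objects, model a
    fitted ChiSqSelectorModel, and sql_context an existing SQLContext.
    """
    # Pull the labels out as plain floats so the column is inferred as double.
    labels = rdd.map(lambda lp: float(lp.label))
    # Transform the whole feature RDD in one call from the driver, rather
    # than calling model.transform inside a map() lambda.
    filtered = model.transform(rdd.map(lambda lp: lp.features))
    # Both RDDs come from the same parent via map(), so their elements are
    # expected to line up one-to-one for zip().
    pairs = labels.zip(filtered)
    return sql_context.createDataFrame(pairs, ["label", "features"])
```

In the pyspark shell this could be called as
to_labeled_df(sqlContext, rdd, model), where sqlContext is the context the
shell creates for you.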


Thanks
Yanbo

2016-07-16 14:51 GMT-07:00 Tobi Bosede <ani.tobib@gmail.com>:

> Hi Yanbo,
>
> Appreciate the response. I might not have phrased this correctly, but I
> really wanted to know how to convert the pipeline rdd into a data frame. I
> have seen the example you posted. However I need to transform all my data,
> not just 1 line. So I did successfully use map to use the chisq selector to
> filter the chosen features of my data. I just want to convert it to a df so
> I can apply a logistic regression model from spark.ml.
>
> Trust me I would use the dataframes api if I could, but the chisq
> functionality is not available to me in the python spark 1.4 api.
>
> Regards,
> Tobi
>
> On Jul 16, 2016 4:53 AM, "Yanbo Liang" <ybliang8@gmail.com> wrote:
>
>> Hi Tobi,
>>
>> The MLlib RDD-based API does support applying the transformation to both a
>> single Vector and an RDD, but you did not call it in the appropriate way.
>> Suppose you have an RDD with a LabeledPoint in each line; you can refer to
>> the following code snippet to train a ChiSqSelectorModel and do the
>> transformation:
>>
>> from pyspark.mllib.linalg import SparseVector
>> from pyspark.mllib.regression import LabeledPoint
>> from pyspark.mllib.feature import ChiSqSelector
>>
>> data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
>>         LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
>>         LabeledPoint(1.0, [0.0, 9.0, 8.0]),
>>         LabeledPoint(2.0, [8.0, 9.0, 5.0])]
>>
>> rdd = sc.parallelize(data)
>>
>> model = ChiSqSelector(1).fit(rdd)
>>
>> filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
>>
>> filteredRDD.collect()
>>
>> However, we strongly recommend migrating to the DataFrame-based API,
>> since the RDD-based API is now in maintenance mode.
>>
>> Thanks
>> Yanbo
>>
>> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.tobib@gmail.com>:
>>
>>> Hi everyone,
>>>
>>> I am trying to filter my features based on the spark.mllib
>>> ChiSqSelector.
>>>
>>> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>>> model.transform(lp.features)))
>>>
>>> However when I do the following I get the error below. Is there any
>>> other way to filter my data to avoid this error?
>>>
>>> filteredDataDF=filteredData.toDF()
>>>
>>> Exception: It appears that you are attempting to reference SparkContext from
>>> a broadcast variable, action, or transforamtion. SparkContext can only be used
>>> on the driver, not in code that it run on workers. For more information, see
>>> SPARK-5063.
>>>
>>>
>>> I would directly use the spark.ml ChiSqSelector and work with dataframes, but
>>> I am on spark 1.4 and using pyspark. So spark.ml's ChiSqSelector is not
>>> available to me. filteredData is of type PipelinedRDD, if that helps. It is
>>> not a regular RDD. I think that may be part of why calling toDF() is not
>>> working.
>>>
>>>
>>> Thanks,
>>>
>>> Tobi
>>>
>>>
>>
