spark-user mailing list archives

From Tobi Bosede <ani.to...@gmail.com>
Subject Re: Filtering RDD Using Spark.mllib's ChiSqSelector
Date Wed, 20 Jul 2016 03:19:00 GMT
Thanks Yanbo, will try that!

On Sun, Jul 17, 2016 at 10:26 PM, Yanbo Liang <ybliang8@gmail.com> wrote:

> Hi Tobi,
>
> Thanks for clarifying the question. It's very straightforward to convert
> the filtered RDD to a DataFrame; you can refer to the following code snippet:
>
> from pyspark.sql import Row
>
> rdd2 = filteredRDD.map(lambda v: Row(features=v))
>
> df = rdd2.toDF()
>
>
> Thanks
> Yanbo
>
> 2016-07-16 14:51 GMT-07:00 Tobi Bosede <ani.tobib@gmail.com>:
>
>> Hi Yanbo,
>>
>> Appreciate the response. I might not have phrased this correctly, but I
>> really wanted to know how to convert the PipelinedRDD into a DataFrame. I
>> have seen the example you posted; however, I need to transform all my data,
>> not just one line. I did successfully use map with the ChiSqSelector to
>> filter my data down to the chosen features. I just want to convert the result
>> to a DataFrame so I can apply a logistic regression model from spark.ml.
>>
>> Trust me, I would use the DataFrames API if I could, but the ChiSqSelector
>> functionality is not available to me in the Python Spark 1.4 API.
>>
>> Regards,
>> Tobi
>>
>> On Jul 16, 2016 4:53 AM, "Yanbo Liang" <ybliang8@gmail.com> wrote:
>>
>>> Hi Tobi,
>>>
>>> The MLlib RDD-based API does support applying the transformation to both a
>>> single Vector and a whole RDD, but you did not call it the appropriate way.
>>> Suppose you have an RDD with a LabeledPoint in each line; you can refer to
>>> the following code snippet to train a ChiSqSelectorModel and do the
>>> transformation:
>>>
>>> from pyspark.mllib.linalg import SparseVector
>>>
>>> from pyspark.mllib.regression import LabeledPoint
>>>
>>> from pyspark.mllib.feature import ChiSqSelector
>>>
>>> data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
>>>         LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
>>>         LabeledPoint(1.0, [0.0, 9.0, 8.0]),
>>>         LabeledPoint(2.0, [8.0, 9.0, 5.0])]
>>>
>>> rdd = sc.parallelize(data)
>>>
>>> model = ChiSqSelector(1).fit(rdd)
>>>
>>> filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
>>>
>>> filteredRDD.collect()
>>>
>>> However, we strongly recommend you migrate to the DataFrame-based API,
>>> since the RDD-based API has been switched to maintenance mode.
>>>
>>> Thanks
>>> Yanbo
>>>
>>> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.tobib@gmail.com>:
>>>
>>>> Hi everyone,
>>>>
>>>> I am trying to filter my features based on the spark.mllib
>>>> ChiSqSelector.
>>>>
>>>> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>>>> model.transform(lp.features)))
>>>>
>>>> However when I do the following I get the error below. Is there any
>>>> other way to filter my data to avoid this error?
>>>>
>>>> filteredDataDF=filteredData.toDF()
>>>>
>>>> Exception: It appears that you are attempting to reference SparkContext
>>>> from a broadcast variable, action, or transformation. SparkContext can
>>>> only be used on the driver, not in code that it run on workers. For more
>>>> information, see SPARK-5063.
>>>>
>>>>
>>>> I would directly use the spark.ml ChiSqSelector and work with DataFrames,
>>>> but I am on Spark 1.4 using PySpark, so spark.ml's ChiSqSelector is not
>>>> available to me. filteredData is of type PipelinedRDD, not a regular RDD,
>>>> if that helps. I think that may be part of why calling toDF() is not working.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Tobi
>>>>
>>>>
>>>
>
