spark-user mailing list archives

From Andrew Redd <andrewwr...@gmail.com>
Subject Re: Recover RFormula Column Names
Date Tue, 29 Oct 2019 12:49:36 GMT
Thanks Alessandro!

That did the trick. All of the indices and interactions are in the
metadata. I also wanted to confirm that this solution works in PySpark, as
the metadata is carried over.
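
A minimal sketch of the PySpark side, assuming the features column carries
the usual "ml_attr" metadata (a dict with an "attrs" entry that groups
attributes by type, each with "idx" and "name"). The helper names
feature_names_from_metadata and label_coefficients are hypothetical, made up
for illustration; they are not Spark APIs.

```python
def feature_names_from_metadata(metadata):
    """Return feature names ordered by their index in the features vector.

    `metadata` is assumed to be the dict found on the features column, e.g.
    df.schema["features"].metadata in PySpark.
    """
    attrs = metadata["ml_attr"]["attrs"]
    # Attributes are grouped by type (e.g. "numeric", "binary", "nominal");
    # flatten every group into (idx, name) pairs.
    pairs = [(attr["idx"], attr["name"])
             for group in attrs.values()
             for attr in group]
    return [name for _, name in sorted(pairs, key=lambda p: p[0])]


def label_coefficients(names, coefficients, intercept):
    """Pair each coefficient with its name, appending the intercept last."""
    values = [float(c) for c in coefficients] + [float(intercept)]
    return list(zip(list(names) + ["(Intercept)"], values))


# Usage against the objects from the original question (not executed here):
# names = feature_names_from_metadata(
#     rform_regression_input.schema["features"].metadata)
# labelled = label_coefficients(names, lr_model.coefficients, lr_model.intercept)
```

The intercept is appended last so that the labels line up with a coefficient
list built as [*lr_model.coefficients, lr_model.intercept].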

Andrew

On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando <
alessandro.solimando@gmail.com> wrote:

> Hello Andrew,
> A few years ago I had the same need, and I found this SO answer
> <https://stackoverflow.com/a/36306784/898154> to be the way to go.
>
> Here is an extract of my (Scala) code (which was doing other things on
> top). I have removed the irrelevant parts but without testing it, so it
> might not work out of the box; nonetheless, it should help you get started:
>
>    import org.apache.spark.sql.DataFrame
>
>    // Builds a lookup from vector index to feature name using the
>    // "ml_attr" metadata attached to the features column.
>    private def getEncodedVectorLookupTable(df: DataFrame,
>                                            featuresColName: String): Map[Long, String] = {
>      val meta = df.select(featuresColName)
>        .schema.fields.head.metadata
>        .getMetadata("ml_attr")
>        .getMetadata("attrs")
>
>      // Read the keys of the underlying map via reflection.
>      /* REFLECTION START */
>      val field = meta.getClass.getDeclaredField("map")
>      field.setAccessible(true)
>      val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
>      field.setAccessible(false)
>      /* REFLECTION END */
>
>      // Each key holds an array of attributes carrying "idx" and "name".
>      keys.flatMap(
>        meta.getMetadataArray(_)
>          .map(m => m.getLong("idx") -> m.getString("name"))
>      ).toMap
>    }
>
>
> It looks like there is some support now for achieving this, but I have
> never tried it:
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html
>
> Best regards,
> Alessandro
>
> On Mon, 28 Oct 2019 at 21:01, Andrew Redd <andrewwredd@gmail.com> wrote:
>
>>
>> Hi All!
>>
>> I'm performing an econometric analysis over several billion rows of data
>> and would like to use the Pyspark SparkML implementation of linear
>> regression. In the example below I'm trying to interact hour of day and
>> month of year indicators. The StringIndexer documentation tells you what
>> it's doing when it's one-hot encoding string/factor columns (i.e. taking
>> out the most/least common value or first/last when sorted alphabetically)
>> but doesn't allow you to recover your coefficient names. This feels like
>> such a general case that I must be missing something. How can I get my
>> column names back post regression to map to coefficient values? Do I need
>> to basically rebuild the RFormula logic myself if this isn't already
>> implemented? I would be happy to use a different Spark language
>> (Scala/Java, etc.) if it is implemented there.
>>
>> Thanks in advance
>>
>> Andrew
>>
>>     rform = RFormula(formula="log_outcome ~ log_treatment + hour_of_day + "
>>                              "month_of_year + hour_of_day:month_of_year + "
>>                              "additional_column",
>>                      featuresCol="features",
>>                      labelCol="label")
>>
>>     rform_regression_input = rform.fit(regression_input).transform(regression_input)
>>
>>     lr = LinearRegression(featuresCol='features',
>>                           labelCol='label',
>>                           solver='normal')
>>
>>     lr_model = lr.fit(rform_regression_input)
>>     coefs = [*lr_model.coefficients, lr_model.intercept]
>>
>>     return pd.DataFrame(
>>         {"pvalues": lr_model.summary.pValues,
>>          "tvalues": lr_model.summary.tValues,
>>          "std_errs": lr_model.summary.coefficientStandardErrors,
>>          "coefs": coefs}
>>     )
>>
>>
