Hi All!

I'm running an econometric analysis over several billion rows of data and would like to use PySpark's Spark ML implementation of linear regression. In the example below I'm trying to interact hour-of-day and month-of-year indicators. The StringIndexer documentation explains what happens when string/factor columns are one-hot encoded (i.e. which level is dropped: the most/least common value, or the first/last when sorted alphabetically), but it doesn't give you a way to recover the coefficient names. This feels like such a general case that I must be missing something. How can I get my column names back after the regression so I can map them to coefficient values? If this isn't already implemented, do I basically need to rebuild the RFormula logic myself? I'd be happy to use a different Spark language (Scala, Java, etc.) if it's implemented there.

Thanks in advance


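    import pandas as pd
    from pyspark.ml.feature import RFormula
    from pyspark.ml.regression import LinearRegression

    def run_regression(regression_input):  # function name assumed; the def line was not in the post
        # RFormula one-hot encodes the string columns and builds the interaction terms
        rform = RFormula(formula="log_outcome ~ log_treatment + hour_of_day + month_of_year + hour_of_day:month_of_year + additional_column")
        rform_regression_input = rform.fit(regression_input).transform(regression_input)

        # labelCol is left at its default, "label", which is the column RFormula produces
        lr = LinearRegression(featuresCol='features')
        lr_model = lr.fit(rform_regression_input)
        # The summary statistics list the intercept last, so append it to the coefficients
        coefs = [*lr_model.coefficients, lr_model.intercept]

        return pd.DataFrame(
            {"pvalues": lr_model.summary.pValues,
             "tvalues": lr_model.summary.tValues,
             "std_errs": lr_model.summary.coefficientStandardErrors,
             "coefs": coefs})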
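
One thing that may work for getting the names back: the features column that RFormula produces usually carries ML attribute metadata under the `ml_attr` key of the column's schema metadata, with one entry per slot of the feature vector. Below is a minimal sketch that turns that metadata into an ordered list of names. The helper name `feature_names_from_metadata` is made up for illustration, and `example_metadata` is a hand-written stand-in for `rform_regression_input.schema["features"].metadata`; the attribute names in it are illustrative, not actual Spark output.

```python
def feature_names_from_metadata(metadata):
    """Return feature names ordered by their position in the feature vector.

    `metadata` is expected to look like the dict found at
    df.schema["features"].metadata (assumed layout: ml_attr -> attrs -> groups).
    """
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    indexed = []
    # Attributes are grouped by type (e.g. "numeric", "binary", "nominal");
    # each entry records the vector index it belongs to.
    for group in attrs.values():
        for attr in group:
            idx = attr["idx"]
            # Fall back to a positional name if an attribute has no name
            indexed.append((idx, attr.get("name", f"feature_{idx}")))
    return [name for _, name in sorted(indexed)]


# Hand-written stand-in for the real column metadata (illustrative names only)
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "log_treatment"}],
            "binary": [
                {"idx": 2, "name": "hour_of_day_1"},
                {"idx": 1, "name": "hour_of_day_0"},
            ],
        },
        "num_attrs": 3,
    }
}

names = feature_names_from_metadata(example_metadata)
# The intercept is appended last, matching the order of the coefs list above
labelled = list(zip(names + ["(Intercept)"], [0.5, -0.1, 0.2, 1.0]))
```

In the function above, this would amount to calling the helper on `rform_regression_input.schema["features"].metadata` and adding `"names": names + ["(Intercept)"]` to the DataFrame dict. That assumes the metadata is actually populated for every slot, which is worth checking by comparing `len(names)` with `len(lr_model.coefficients)`.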