spark-user mailing list archives

From "Du, Yi" <>
Subject Ask about Pyspark ML interaction
Date Mon, 09 Nov 2020 13:53:13 GMT

How are you doing?

Please allow me to introduce myself. I am Yi Du, and I work at a mortgage insurance company
called ‘Arch Capital Group’, in its Washington, DC office in the US. I found your profile under
the Spark repo on GitHub and would like to ask you about one particular coding issue in Spark
ML. I have read the Spark documentation and also asked on Stack Overflow, but I still have
no clue.

I am using PySpark and its ML library to build models. I have categorical variables as predictors
and would like to include interactions between two categorical variables in the model as well.

I was trying to follow the example here:
to create the interaction between two categorical variables.

Here is my snippet of code:

stringIndexer = StringIndexer(inputCols=['fico_group','ltv_group'], outputCols=['fico_groupIndex1','ltv_groupIndex1'],
trs_data_index =

interaction = Interaction(inputCols=['fico_groupIndex1','ltv_groupIndex1'], outputCol="interactedCol")
trs_data_interacted_temp = interaction.transform(trs_data_index)

encoder = OneHotEncoder(inputCols=['interactedCol'], outputCols=['interactedColVec'])
trs_data_interacted =

Basically, I index ‘fico_group’ and ‘ltv_group’ first, interact the two indexed columns, and
then use OneHotEncoder to create the final column ‘interactedColVec’.

However, the final results were not what I expected. ‘fico_group’ has 5 levels and
so does ‘ltv_group’, so there are 5*5 = 25 combinations. In the model estimates, one
level should be treated as the base, so I expected to see 25-1 = 24 interaction terms.
With the above code, however, I get 25 interaction terms in the model estimates.
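
To make the expected count concrete, here is a minimal sketch in plain Python, independent of Spark. The level names are hypothetical placeholders (the message only states that each variable has 5 levels), and dropping the last combination as the base is an assumption made purely for illustration.

```python
from itertools import product

# Hypothetical level names: only the number of levels (5 each) comes from the message.
fico_levels = [f"fico_{i}" for i in range(5)]
ltv_levels = [f"ltv_{i}" for i in range(5)]

# Every pairing of one fico_group level with one ltv_group level.
combinations = list(product(fico_levels, ltv_levels))

# Treat one combination as the base (reference) level and drop it.
# Which combination serves as the base is an arbitrary choice here.
base = combinations[-1]
estimated_terms = [c for c in combinations if c != base]

print(len(combinations))     # 25
print(len(estimated_terms))  # 24
```

The 24 remaining pairs are the interaction terms I expected to see estimated, with the dropped pair acting as the reference.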

This is my post on Stack Overflow.

I am not sure whether I have articulated my question clearly, but I would really appreciate
your help, or a pointer to someone who knows this area.

Again, thank you very much for your help.



