spark-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Ask about Pyspark ML interaction
Date Mon, 09 Nov 2020 14:58:06 GMT
I think you have this flipped around - you want to one-hot encode first, then
compute interactions. As it is, you are treating the product of {0,1,2,3,4}
x {0,1,2,3,4} as if it were a categorical index. That product takes only 10
distinct values (0, 1, 2, 3, 4, 6, 8, 9, 12, 16), nowhere near 25, and is
probably not what you intend.
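
To illustrate the point, here is a minimal sketch in plain Python. It is not the Spark API: `one_hot` and `interact` are hypothetical helpers written only for this example, and it assumes both columns have been indexed 0 through 4. It compares multiplying the raw indices with one-hot encoding first and then taking all pairwise products.

```python
from itertools import product

# Assumption: each column has 5 levels, indexed 0..4.
levels = range(5)

# Multiplying raw indices: many different (a, b) pairs collapse
# onto the same number, so the product is not a usable category id.
raw_products = {a * b for a, b in product(levels, levels)}


def one_hot(index, size):
    """Hypothetical helper: list of length `size` with a 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def interact(u, v):
    """Hypothetical helper: flattened outer product of two vectors."""
    return [x * y for x in u for y in v]


# One-hot encoding first: every (a, b) pair gets its own distinct vector.
encoded_pairs = {
    tuple(interact(one_hot(a, 5), one_hot(b, 5)))
    for a, b in product(levels, levels)
}

print(sorted(raw_products))  # distinct values of the raw index product
print(len(raw_products))     # how many distinct raw products there are
print(len(encoded_pairs))    # how many distinct encoded interactions there are
```

The raw product yields only 10 distinct values, while encoding first keeps all 25 combinations apart. This sketch keeps every level; dropping a base level is a separate choice made at the encoding step.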

On Mon, Nov 9, 2020 at 7:53 AM Du, Yi <YDu@archcapservices.com> wrote:

> Hi,
>
> How are you doing?
>
> Let me first introduce myself. I am Yi Du, and I work at a mortgage
> insurance company called ‘Arch Capital Group’, in its Washington, DC office
> in the US. I found your profile under the Spark repo on GitHub and would
> like to ask you about one particular coding issue in Spark ML. I have read
> the Spark documentation and also asked on Stack Overflow, but I still have
> no clue.
>
> I am using PySpark and its ML library to build models. I have categorical
> variables as predictors and would like to include interactions between two
> categorical variables in the model as well.
>
> I was trying to follow the example here:
> https://spark.apache.org/docs/latest/ml-features#interaction to create
> the interaction between two categorical variables.
>
> Here is my snippet of code:
>
> ```python
> from pyspark.ml.feature import Interaction, OneHotEncoder, StringIndexer
>
> # Index both categorical columns (least frequent label gets index 0).
> stringIndexer = StringIndexer(
>     inputCols=['fico_group', 'ltv_group'],
>     outputCols=['fico_groupIndex1', 'ltv_groupIndex1'],
>     stringOrderType='frequencyAsc')
> trs_data_index = stringIndexer.fit(trs_data).transform(trs_data)
>
> # Interact the two index columns.
> interaction = Interaction(
>     inputCols=['fico_groupIndex1', 'ltv_groupIndex1'],
>     outputCol="interactedCol")
> trs_data_interacted_temp = interaction.transform(trs_data_index)
>
> # One-hot encode the interacted column.
> encoder = OneHotEncoder(inputCols=['interactedCol'],
>                         outputCols=['interactedColVec'])
> trs_data_interacted = encoder.fit(
>     trs_data_interacted_temp).transform(trs_data_interacted_temp)
> ```
>
> I index ‘fico_group’ and ‘ltv_group’ first, interact them, and then use
> OneHotEncoder to create the final column ‘interactedColVec’.
>
> However, the final results are not what I expected. ‘fico_group’ has 5
> levels and so does ‘ltv_group’, so there are 5*5 = 25 combinations. In
> the model estimates, one level should be treated as the base, so I expected
> to see 25-1 = 24 interactions in the final estimates. With the above code,
> however, I get 25 interactions in the model estimates.
>
> This is my post on Stack Overflow:
> https://stackoverflow.com/questions/64602060/add-interaction-term-to-ml
>
> I don’t know if I have articulated my question clearly, but I would
> really appreciate your help, or a pointer to someone who knows about this.
>
> Again, thank you very much for your help.
>
> Best,
>
> Yi
