spark-user mailing list archives

From "Du, Yi" <...@archcapservices.com>
Subject RE: Ask about Pyspark ML interaction
Date Mon, 09 Nov 2020 19:51:03 GMT
Do you mean I need to index them, one-hot encode them, and then interact them?

I tried both ways:

Index -> interact -> one-hot encode: it gave me 25 combinations.

Index -> one-hot encode -> interact: it gave me 16 combinations.

Neither of them gave me the expected 24 combinations. Did I miss something?
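
One possible reading of these counts, sketched in plain Python rather than Spark. It assumes that OneHotEncoder uses its default dropLast=True (so each 5-level variable becomes a 4-element vector) and that Interaction one-hot encodes indexed nominal columns without dropping a level. The helpers `one_hot` and `interact` are hypothetical and only mimic the vector sizes, not the Spark implementation.

```python
def one_hot(index, n_levels, drop_last):
    """One-hot encode an index; with drop_last the last level maps to all zeros."""
    size = n_levels - 1 if drop_last else n_levels
    return [1.0 if i == index else 0.0 for i in range(size)]


def interact(u, v):
    """Flattened cross-product of two vectors."""
    return [a * b for a in u for b in v]


n_levels = 5  # both fico_group and ltv_group have 5 levels

# No level dropped on either side: 5 * 5 columns.
full = interact(one_hot(0, n_levels, False), one_hot(0, n_levels, False))

# Last level dropped on both sides before interacting: 4 * 4 columns.
reduced = interact(one_hot(0, n_levels, True), one_hot(0, n_levels, True))

# Main effects with one base level each: 4 + 4 columns.
main_effects = 2 * (n_levels - 1)

print(len(full))                    # 25
print(len(reduced))                 # 16
print(main_effects + len(reduced))  # 24
```

Under these assumptions, a count of 24 would correspond to 8 main-effect columns plus 16 interaction columns, rather than to the interaction column alone.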

Thanks,

From: Sean Owen [mailto:srowen@gmail.com]
Sent: Monday, November 9, 2020 9:58 AM
To: Du, Yi <YDu@archcapservices.com>
Cc: user@spark.apache.org
Subject: Re: Ask about Pyspark ML interaction

I think you have this flipped around - you want to one-hot encode, then compute interactions.
As it is you are treating the product of {0,1,2,3,4} x {0,1,2,3,4} as if it's a categorical
index. That doesn't have nearly 25 possible values and probably is not what you intend.
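
To illustrate the point about index products, here is a small sketch in plain Python (not Spark code). It assumes both variables are indexed 0 through 4 and simply multiplies every pair of indices; the name `distinct_products` is only for illustration.

```python
from itertools import product

levels = range(5)  # indices 0..4 for a 5-level categorical variable

# Multiply every pair of indices and keep the distinct results.
distinct_products = sorted({a * b for a, b in product(levels, levels)})

print(distinct_products)       # [0, 1, 2, 3, 4, 6, 8, 9, 12, 16]
print(len(distinct_products))  # 10
```

Different index pairs collapse to the same product (for example 1 x 4 and 2 x 2), so the raw product cannot identify the 25 combinations.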

On Mon, Nov 9, 2020 at 7:53 AM Du, Yi <YDu@archcapservices.com> wrote:
Hi,

How are you doing?

Please let me first introduce myself. I am Yi Du, and I work at a mortgage insurance company
called ‘Arch Capital Group’, in its Washington, DC office in the US. I found your profile under
the Spark repo on GitHub and would like to ask about one particular coding issue in Spark
ML. I have read the Spark documentation and also asked on Stack Overflow, but I still have
no clue.

I am using PySpark and its ML library to build models. I have categorical variables as predictors
and would like to include interactions between two categorical variables in the model as well.

I was trying to follow the example here: https://spark.apache.org/docs/latest/ml-features#interaction
to create the interaction between two categorical variables.

Here is my snippet of code:

```python
from pyspark.ml.feature import Interaction, OneHotEncoder, StringIndexer

# Step 1: index the two string columns of the input DataFrame trs_data.
stringIndexer = StringIndexer(inputCols=['fico_group', 'ltv_group'],
                              outputCols=['fico_groupIndex1', 'ltv_groupIndex1'],
                              stringOrderType='frequencyAsc')
trs_data_index = stringIndexer.fit(trs_data).transform(trs_data)

# Step 2: interact the two indexed columns.
interaction = Interaction(inputCols=['fico_groupIndex1', 'ltv_groupIndex1'], outputCol="interactedCol")
trs_data_interacted_temp = interaction.transform(trs_data_index)

# Step 3: one-hot encode the interaction column.
encoder = OneHotEncoder(inputCols=['interactedCol'], outputCols=['interactedColVec'])
trs_data_interacted = encoder.fit(trs_data_interacted_temp).transform(trs_data_interacted_temp)
```

Basically, I index ‘fico_group’ and ‘ltv_group’ first, interact them, and then
use OneHotEncoder to create the final column ‘interactedColVec’.

However, the final results did not come out as expected. My ‘fico_group’ has 5 levels and
so does ‘ltv_group’, so there are 5*5 = 25 combinations. In the model estimates, one
level should be treated as the base, so I expected to see 25-1 = 24 interactions in the final estimates.
With the above code, however, I get 25 interactions in the model estimates.

This is my post on Stack Overflow: https://stackoverflow.com/questions/64602060/add-interaction-term-to-ml

I don’t know if I have articulated my question clearly, but I would really appreciate
your help, or a pointer to someone who knows about this.

Again, thank you very much for your help.

Best,
Yi



________________________________

The information contained in this e-mail message may be privileged and confidential information
and is intended only for the use of the individual and/or entity identified in the alias address
of this message. If the reader of this message is not the intended recipient, or an employee
or agent responsible to deliver it to the intended recipient, you are hereby requested not
to distribute or copy this communication. If you have received this communication in error,
please notify us immediately by telephone or return e-mail and delete the original message
from your system.
