spark-user mailing list archives

From FireFly <zhaoming...@bankofamerica.com>
Subject Issue with using Generalized Linear Regression for Logistic Regression modeling
Date Fri, 09 Mar 2018 17:22:12 GMT
The Logistic Regression (LR) offered by Spark has rather limited model
statistics output. I would like to have access to the q-value, AIC, standard
error, etc. Generalized Linear Regression (GLR) does offer these statistics
in the model output, and it can be used as LR if one specifies
family="binomial", link="logit" in the GLR. The issue I ran into is that
some models converge nicely using Logistic Regression but not using
Generalized Linear Regression. For other models, I do see them converge to
the same result using either LR or GLR.
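
As a rough sketch of how those statistics can be read back, the helper
below collects them from a fitted GLR model's summary. The attribute names
(aic, deviance, nullDeviance, numIterations, coefficientStandardErrors,
tValues, pValues) are the ones PySpark exposes on the GLR training summary;
the helper name glr_statistics is hypothetical, and some of these values may
not be available depending on the solver that was actually used.

```python
def glr_statistics(glr_model):
    """Collect fit statistics from a fitted GeneralizedLinearRegressionModel.

    Only attribute access on `glr_model.summary` is used, so this does not
    trigger a refit. As I understand the PySpark docs, when fitIntercept is
    True the last entry of the per-coefficient lists is the intercept.
    """
    summary = glr_model.summary
    return {
        "aic": summary.aic,
        "deviance": summary.deviance,
        "null_deviance": summary.nullDeviance,
        "num_iterations": summary.numIterations,
        "std_errors": list(summary.coefficientStandardErrors),
        "t_values": list(summary.tValues),
        "p_values": list(summary.pValues),
    }

# Hypothetical usage with a fitted pipeline whose last stage is the GLR:
#     stats = glr_statistics(pipeline_model.stages[-1])
```

Comparing num_iterations against maxIter is one quick way to tell whether a
run stopped because it converged or because it hit the iteration cap.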

I played around with the solver options in GLR, but it didn't help. The option
that does make a difference is weightCol. Without it, both LR and GLR
converge to the same result, whether or not that result makes sense. With
weightCol included, LR converges in about 10 iterations to the same result
as we got using SAS; GLR just won't converge (I tried 10000 iterations), and
the model coefficients at the end of the run, when the maximum number of
iterations was hit, are in the 10^12 range, which is way off.
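
For reference, here is how observation weights enter a textbook IRLS
(Newton) fit of a binomial/logit model. This is a pure-Python sketch for one
feature plus an intercept, not Spark's implementation; the function names and
the toy data are made up for illustration. In this sketch, integer weights
behave like row counts, so the weighted fit and a fit on replicated rows
should agree.

```python
import math


def _score_and_information(b0, b1, xs, ys, ws):
    """Weighted score vector and Fisher information for logit(p) = b0 + b1*x."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y, w in zip(xs, ys, ws):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        resid = w * (y - p)      # the weight scales each row's residual ...
        var = w * p * (1.0 - p)  # ... and its binomial variance term
        g0 += resid
        g1 += resid * x
        h00 += var
        h01 += var * x
        h11 += var * x * x
    return g0, g1, h00, h01, h11


def fit_weighted_logistic(xs, ys, ws, max_iter=25, tol=1e-10):
    """Return (b0, b1, se_b0, se_b1, iterations) from a weighted Newton fit."""
    b0 = b1 = 0.0
    iterations = 0
    for iterations in range(1, max_iter + 1):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, xs, ys, ws)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    # Standard errors from the inverse information at the final estimate.
    _, _, h00, h01, h11 = _score_and_information(b0, b1, xs, ys, ws)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det), iterations


# Toy, non-separable data with integer weights (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
ws = [2, 1, 3, 1, 2, 1]
weighted = fit_weighted_logistic(xs, ys, ws)

# Same data with each row repeated `w` times and unit weights.
rep_xs = [x for x, w in zip(xs, ws) for _ in range(w)]
rep_ys = [y for y, w in zip(ys, ws) for _ in range(w)]
replicated = fit_weighted_logistic(rep_xs, rep_ys, [1] * len(rep_xs))

print(weighted)
print(replicated)
```

A check like this on a small extract of the data can help separate a problem
with the weights themselves from a problem with a particular solver.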

I am using Spark 2.2.0 currently. The relevant part of the code is pasted
below.

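    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
    from pyspark.ml.regression import GeneralizedLinearRegression

    trainingData = sqlContext.read.load(args.input_df_name).repartition(args.repartition)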

   
    catCol=['rwdproduct2','state_final','mixed','ocup','SECURED','o_channel',
            'season','sa_C_ten_buck','sa_C_fico_buck','sa_C_otb_buck']
    numCol=['PRIME_ma_6L36']

    colNameModStr="_class"
    catColClass=[colName + colNameModStr for colName in catCol]

    stages = []
    for col in catCol:
        stringIndexer = StringIndexer(inputCol=col, outputCol=col+"Index")
        encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                                outputCol=col+colNameModStr)
        stages += [stringIndexer, encoder]

    assembler = VectorAssembler(inputCols=catColClass + numCol,
                                outputCol='features')

    glr=GeneralizedLinearRegression(family="binomial", link="logit",
                                    solver="SGD", weightCol="wt", labelCol="bad",
                                    maxIter=20, tol=1.0E-12, regParam=0)

    pipeline = Pipeline(stages=stages + [assembler, glr])

    modelDF = pipeline.fit(trainingData)

    # --- Output some modeling results
    print("Model Betas is {}".format(modelDF.stages[-1].coefficients))

I would appreciate any help you can offer to resolve this.



