spark-user mailing list archives

From Sunny Khatri <sunny.k...@gmail.com>
Subject Re: return probability \ confidence instead of actual class
Date Mon, 06 Oct 2014 17:35:04 GMT
One difference I can find is that you may be using different kernel functions
for training. In Spark you end up using a linear kernel, whereas in scikit-learn
you are using the RBF kernel. That can explain the difference in the
coefficients you are getting.
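
To make the kernel point concrete, here is a minimal pure-Python sketch (no
Spark or scikit-learn involved). The weights, support vectors, dual
coefficients and gamma are made-up illustrative values, not numbers from
either run. A linear model scores a point with a single weight vector, while
an RBF model scores it through kernel evaluations against its support
vectors, so it has no weight vector that could be compared with the one from
MLlib.

```python
import math

def linear_score(w, b, x):
    """Linear-kernel decision value: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def rbf_kernel(u, v, gamma):
    """RBF kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

def rbf_score(support_vectors, dual_coefs, b, gamma, x):
    """Kernel decision value: sum_i coef_i * K(sv_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

# Hypothetical parameters, chosen only for illustration.
w, b_lin = [0.25, 0.25, 0.25], -0.375
svs = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
coefs, b_rbf, gamma = [-1.0, 1.0], 0.0, 1.0 / 3.0

for x in ([0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]):
    print(x, linear_score(w, b_lin, x), rbf_score(svs, coefs, b_rbf, gamma, x))
```

The linear score grows without bound as x moves away from the boundary, while
the RBF score stays bounded by the sum of the absolute dual coefficients plus
the intercept, which is one reason the two outputs look so different.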

On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <
adamantios.corais@gmail.com> wrote:

> Hi again,
>
> Finally, I found the time to play around with your suggestions.
> Unfortunately, I noticed some unusual behavior in the MLlib results, which
> is more obvious when I compare them against their scikit-learn equivalent.
> Note that I am currently using spark 0.9.2. Long story short: I find it
> difficult to interpret the result: scikit-learn SVM always returns a value
> between 0 and 1 which makes it easy for me to set-up a threshold in order
> to keep only the most significant classifications (this is the case for
> both short and long input vectors). On the other hand, Spark MLlib makes it
> impossible to interpret the results; results are hardly ever bounded
> between -1 and +1 and hence it is impossible to choose a good cut-off value
> - results are of no practical use. And here is the strangest thing ever:
> although - it seems that - MLlib does NOT generate the right weights and
> intercept, when I feed the MLlib with the weights and intercept from
> scikit-learn the results become pretty accurate!!!! Any ideas about what is
> happening? Any suggestion is highly appreciated.
>
> PS: to make things easier I have quoted both of my implementations as well
> as the results, below.
>
> //////////////////////////////////////////////////
>
> SPARK (short input):
> training_error: Double = 0.0
> res2: Array[Double] = Array(-1.4420684459128205E-19,
> -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749999999999999,
> 0.7499999999999998, 0.7499999999999998, 0.7499999999999998)
>
> SPARK (long input):
> training_error: Double = 0.0
> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
> -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
> 2.6866864968561632, 2.6866864968561632)
>
> PYTHON (short input):
> array([[-1.00000001],
>        [-1.00000001],
>        [-1.00000001],
>        [-0.        ],
>        [ 1.00000001],
>        [ 1.00000001],
>        [ 1.00000001]])
>
> PYTHON (long input):
> array([[-1.00000001],
>        [-1.00000001],
>        [-1.00000001],
>        [-0.        ],
>        [ 1.00000001],
>        [ 1.00000001],
>        [ 1.00000001]])
>
> //////////////////////////////////////////////////
>
> import analytics.MSC
>
> import java.util.Calendar
> import java.text.SimpleDateFormat
> import scala.collection.mutable
> import scala.collection.JavaConversions._
> import org.apache.spark.SparkContext._
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.optimization.L1Updater
> import com.datastax.bdp.spark.connector.CassandraConnector
> import com.datastax.bdp.spark.SparkContextCassandraFunctions._
>
> val sc = MSC.sc
> val lg = MSC.logger
>
> //val s_users_double_2 = Seq(
> //  (0.0,Seq(0.0, 0.0, 0.0)),
> //  (0.0,Seq(0.0, 0.0, 0.0)),
> //  (0.0,Seq(0.0, 0.0, 0.0)),
> //  (1.0,Seq(1.0, 1.0, 1.0)),
> //  (1.0,Seq(1.0, 1.0, 1.0)),
> //  (1.0,Seq(1.0, 1.0, 1.0))
> //)
> val s_users_double_2 = Seq(
>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
> )
> val s_users_double = sc.parallelize(s_users_double_2)
>
> val s_users_parsed = s_users_double.map{line=>
>   LabeledPoint(line._1, line._2.toArray)
> }.cache()
>
> val iterations = 100
>
> val model = SVMWithSGD.train(s_users_parsed, iterations)
>
> val predictions1 = s_users_parsed.map{point=>
>   (point.label, model.predict(point.features))
> }.cache()
>
> val training_error = predictions1.filter(r=> r._1 !=
> r._2).count().toDouble / s_users_parsed.count()
>
> val TP = predictions1.map(s=> if (s._1==1.0 && s._2==1.0) true else
> false).filter(t=> t).count()
> val FP = predictions1.map(s=> if (s._1==0.0 && s._2==1.0) true else
> false).filter(t=> t).count()
> val TN = predictions1.map(s=> if (s._1==0.0 && s._2==0.0) true else
> false).filter(t=> t).count()
> val FN = predictions1.map(s=> if (s._1==1.0 && s._2==0.0) true else
> false).filter(t=> t).count()
>
> val weights = model.weights
>
> val intercept = model.intercept
>
> //val m_users_double_2 = Seq(
> //  Seq(0.0, 0.0, 0.0),
> //  Seq(0.0, 0.0, 0.0),
> //  Seq(0.0, 0.0, 0.0),
> //  Seq(0.5, 0.5, 0.5),
> //  Seq(1.0, 1.0, 1.0),
> //  Seq(1.0, 1.0, 1.0),
> //  Seq(1.0, 1.0, 1.0)
> //)
> val m_users_double_2 = Seq(
>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>       Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5),
>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
> )
> val m_users_double = sc.parallelize(m_users_double_2)
>
> val predictions2 = m_users_double.map{point=>
>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> }.cache()
>
> predictions2.collect()
>
> //////////////////////////////////////////////////
>
> from sklearn import svm
>
> flag = 'short' # 'long'
>
> if flag == 'short':
>     X = [
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0]
>     ]
>     Y = [
>         0.0,
>         0.0,
>         0.0,
>         1.0,
>         1.0,
>         1.0
>     ]
>     T = [
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [0.5, 0.5, 0.5],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0]
>     ]
>
> if flag == 'long':
>     X = [
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>     ]
>     Y = [
>         0.0,
>         0.0,
>         0.0,
>         1.0,
>         1.0,
>         1.0
>     ]
>     T = [
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>     ]
>
> clf = svm.SVC()
> clf.fit(X, Y)
> svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
> gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
> shrinking=True, tol=0.001, verbose=False)
> clf.decision_function(T)
>
> ///////////////////////////////////////////////////
>
>
>
>
> On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <sunny.kh03@gmail.com>
> wrote:
>
>> For multi-class you can use the same SVMWithSGD (for binary
>> classification) with a one-vs-all approach: for each class i, construct a
>> training corpus with class i as the positive samples and the rest of the
>> classes as the negative ones, and then use the same method provided by Aris
>> as a measure of how far a point is from the decision boundary of class i.
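
A minimal pure-Python sketch of the one-vs-all idea described above. The
per-class weights and intercepts are hypothetical stand-ins for models that
would each be trained separately (for example with SVMWithSGD); the helper
names are illustrative and not part of any library. Raw scores from
independently trained models are not guaranteed to be on the same scale, so
treat the argmax as a heuristic.

```python
def score(w, b, x):
    """Raw linear score w . x + b for one binary model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def one_vs_all_predict(models, x):
    """models maps class label -> (weights, intercept).

    Returns the label whose binary model gives the highest raw score,
    together with all per-class scores.
    """
    scores = {label: score(w, b, x) for label, (w, b) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical per-class models, for illustration only.
models = {
    "a": ([1.0, 0.0], -0.5),
    "b": ([0.0, 1.0], -0.5),
    "c": ([-1.0, -1.0], 0.5),
}

label, scores = one_vs_all_predict(models, [0.9, 0.1])
print(label, scores)
```

The gap between the best and second-best score can serve as a rough
confidence measure for the multi-class decision.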
>>
>> On Wed, Sep 24, 2014 at 4:06 PM, Aris <arisofalaska@gmail.com> wrote:
>>
>>> Greetings, Adamantios Korais.... if that is indeed your name..
>>>
>>> Just to follow up on Liquan, you might be interested in removing the
>>> threshold, and then treating the predictions as a probability from 0..1
>>> inclusive. SVM with the linear kernel is a straightforward linear
>>> classifier -- so with model.clearThreshold() you can just get the
>>> raw predicted scores, removing the threshold which simply translates them
>>> into a positive/negative class.
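
A caveat on reading raw scores as probabilities: a linear score w.x + b is not
bounded, as the long-input Spark results quoted earlier in this message show.
One common workaround is a Platt-style sigmoid, 1 / (1 + exp(A*s + B)), where
A and B are normally fitted on held-out data. The sketch below uses
placeholder values A = -1 and B = 0, which are assumptions for illustration
and do not give calibrated probabilities.

```python
import math

def squash(score, a=-1.0, b=0.0):
    """Map a raw margin to (0, 1) with the sigmoid 1 / (1 + exp(a*score + b)).

    a and b are placeholders here; in Platt scaling they are fitted on
    held-out data.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))

# Three of the raw scores from the long-input Spark run quoted in this thread.
raw_scores = [-0.782207630902241, 0.9522394329769612, 2.6866864968561632]

for s in raw_scores:
    print(s, squash(s))
```

With a negative A the mapping is monotonic, so ranking by the squashed value
is the same as ranking by the raw score; only the scale changes.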
>>>
>>> API is here
>>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>
>>> Enjoy!
>>> Aris
>>>
>>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <liquanpei@gmail.com>
>>> wrote:
>>>
>>>> HI Adamantios,
>>>>
>>>> For your first question, after you train the SVM, you get a model with
>>>> a vector of weights w and an intercept b. Points x such that
>>>> w.dot(x) + b = 1 or w.dot(x) + b = -1 lie on the margins on either side
>>>> of the decision boundary. The quantity w.dot(x) + b for point x is a
>>>> confidence measure of classification.
>>>>
>>>> Code wise, suppose you trained your model via
>>>> val model = SVMWithSGD.train(...)
>>>>
>>>> and you can set a threshold by calling
>>>>
>>>> model.setThreshold(your threshold here)
>>>>
>>>> to set the threshold that separate positive predictions from negative
>>>> predictions.
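
A minimal pure-Python sketch of using w.dot(x) + b as a confidence filter, as
described above. It does not call MLlib; the weights, intercept and cutoff are
hypothetical values, and confident_predictions is an illustrative helper
rather than a library function. Points whose absolute score falls below the
cutoff are dropped as low-confidence.

```python
def margin(w, b, x):
    """Raw score w . x + b; its sign gives the class, its size the confidence."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def confident_predictions(w, b, points, cutoff):
    """Return (point, label, score) only for points with |score| >= cutoff."""
    kept = []
    for x in points:
        s = margin(w, b, x)
        if abs(s) >= cutoff:
            kept.append((x, 1.0 if s > 0 else 0.0, s))
    return kept

# Hypothetical model and test points, for illustration only.
w, b = [0.5, 0.5, 0.5], -0.75
points = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]

kept = confident_predictions(w, b, points, cutoff=0.5)
for x, label, s in kept:
    print(x, label, s)
```

Here the middle point sits exactly on the decision boundary (score 0.0) and is
filtered out, while the two outer points are kept.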
>>>>
>>>> For more info, please take a look at
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>
>>>> For your second question, SVMWithSGD only supports binary
>>>> classification.
>>>>
>>>> Hope this helps,
>>>>
>>>> Liquan
>>>>
>>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <
>>>> adamantios.corais@gmail.com> wrote:
>>>>
>>>>> Nobody?
>>>>>
>>>>> If that's not supported already, can you please, at least, give me a
>>>>> few hints on how to implement it?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
>>>>> adamantios.corais@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am working with the SVMWithSGD classification algorithm on Spark.
>>>>>> It works fine for me; however, I would like to distinguish the instances
>>>>>> that are classified with high confidence from those with low confidence.
>>>>>> How do we define the threshold here? Ultimately, I want to keep only
>>>>>> those for which the algorithm is very *very* certain about its decision!
>>>>>> How to do that? Is this feature supported already by any MLlib
>>>>>> algorithm? What if I had multiple categories?
>>>>>>
>>>>>> Any input is highly appreciated!
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Liquan Pei
>>>> Department of Physics
>>>> University of Massachusetts Amherst
>>>>
>>>
>>>
>>
>
