spark-dev mailing list archives

From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: how to construct parameter for model.transform() from datafile
Date Wed, 15 Mar 2017 03:50:15 GMT

Just found that you can specify the number of features when loading a libsvm
source:

val df = spark.read.option("numFeatures", "100").format("libsvm")
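
To make the option above concrete, here is a minimal sketch, assuming the
scoring data uses the same one-based feature indices as the libsvm training
file. The helper names loadLibsvm and lineToVector (the latter reworked from
the code quoted below) and their parameters are only illustrative. The libsvm
reader converts one-based indices to zero-based, so the sketch subtracts 1 as
well, and it splits on any whitespace so that a double space between tokens
does not break parsing.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Load a libsvm file with an explicit feature count instead of the inferred one.
def loadLibsvm(spark: SparkSession, path: String, numFeatures: Int): DataFrame =
  spark.read.format("libsvm").option("numFeatures", numFeatures.toString).load(path)

// Parse "idx:val idx:val ..." into a sparse vector of the given size.
// Assumption: indices are one-based, as in the libsvm training file.
def lineToVector(features: String, numFeatures: Int): Vector = {
  val elements = features.trim.split("\\s+").toSeq.map { token =>
    val Array(index, value) = token.split(":")
    (index.toInt - 1, value.toDouble) // shift to zero-based, like the libsvm reader
  }
  Vectors.sparse(numFeatures, elements)
}
```

If the training file is loaded with a numFeatures at least as large as the
biggest index in either data set, and the scoring vectors are built with that
same size, the model and the vectors passed to transform() should end up with
the same width.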


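The value passed as numFeatures has to cover every index in both the training
and the scoring data. One rough way to pick it, sketched in plain Scala on a
few sample lines from this thread (maxFeatureIndex is a hypothetical helper,
not a Spark API, and a real data set would need the same scan run over the
full files):

```scala
// Hypothetical helper: largest feature index found in lines of "idx:val" tokens.
// Tokens without a colon (labels such as "0", IDs such as "ID2") are skipped.
def maxFeatureIndex(lines: Seq[String]): Int =
  lines
    .flatMap(_.trim.split("\\s+").toSeq)
    .filter(_.contains(":"))
    .map(_.split(":")(0).toInt)
    .foldLeft(0)((a, b) => math.max(a, b))

// Sample lines copied from this thread.
val trainingLines = Seq("0 31607:17", "1 604:1 6401:1 6503:1 15207:4 31607:40")
val scoringLines = Seq("ID1\t509:2 5102:4 25909:1 31709:4 121905:19", "ID2\t800201:1")

// With one-based indices, the largest index is also the smallest safe vector size.
val numFeatures = maxFeatureIndex(trainingLines ++ scoringLines) // 800201 for these samples
```
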

Liang-Chi Hsieh wrote
> Because the libsvm format can't specify the number of features, and
> NaiveBayes doesn't seem to have such a parameter, the number of features
> inferred from the data files can be inconsistent if your training/testing
> data is sparse.
> 
> We may need to fix this.
> 
> Until a fix goes into NaiveBayes, a workaround is to align the number of
> features between the training and testing data before fitting the model.
> 
> jinhong lu wrote
>> After training the model, I got a result like this:
>> 
>> 	scala> predictionResult.show()
>> 	+-----+--------------------+--------------------+--------------------+----------+
>> 	|label|            features|       rawPrediction|         probability|prediction|
>> 	+-----+--------------------+--------------------+--------------------+----------+
>> 	|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
>> 	|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
>> 	|  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
>> 
>> Then I call transform() on the data with this code:
>> 
>> 	import org.apache.spark.ml.linalg.Vectors
>> 	import org.apache.spark.ml.linalg.Vector
>> 	import scala.collection.mutable
>> 
>> 	def lineToVector(line: String): Vector = {
>> 	  val seq = new mutable.Queue[(Int, Double)]
>> 	  val content = line.split(" ")
>> 	  for (s <- content) {
>> 	    val index = s.split(":")(0).toInt
>> 	    val value = s.split(":")(1).toDouble
>> 	    seq += ((index, value))
>> 	  }
>> 	  return Vectors.sparse(144109, seq)
>> 	}
>> 
>> 	val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text](
>> 	  "/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0"
>> 	).map(line => line._2).map(line =>
>> 	  (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1)))
>> 	).toDF("udid", "features")
>> 	 val predictionResult = model.transform(df)
>> 	 predictionResult.show()
>> 
>> 
>> But I got an error like this:
>> 
>>  Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
>>   at scala.Predef$.require(Predef.scala:224)
>>   at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
>>   at lineToVector(<console>:55)
>>   at $anonfun$4.apply(<console>:50)
>>   at $anonfun$4.apply(<console>:50)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
>>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>> 
>> So I changed
>> 
>> 	return Vectors.sparse(144109, seq)
>> 
>> to
>> 
>> 	return Vectors.sparse(804202, seq)
>> 
>> and another error occurred:
>> 
>> 	Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
>> 	  at scala.Predef$.require(Predef.scala:224)
>> 	  at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
>> 	  at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
>> 	  at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
>> 
>> What should I do?
>> 
>>> On March 13, 2017, at 16:31, jinhong lu <lujinhong2@> wrote:
>>> 
>>> Hi, all:
>>> 
>>> I have this training data:
>>> 
>>> 	0 31607:17
>>> 	0 111905:36
>>> 	0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
>>> 	0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>>> 	0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
>>> 	0 19109:7 29705:4 123305:32
>>> 	0 15309:1 43005:1 108509:1
>>> 	1 604:1 6401:1 6503:1 15207:4 31607:40
>>> 	0 1807:19
>>> 	0 301:14 501:1 1502:14 2507:12 123305:4
>>> 	0 607:14 19109:460 123305:448
>>> 	0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
>>> 	1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
>>> 
>>> Then I train the model with Spark:
>>> 
>>> 	import org.apache.spark.ml.classification.NaiveBayes
>>> 	import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>>> 	import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>>> 	import org.apache.spark.sql.SparkSession
>>> 
>>> 	val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
>>> 	val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
>>> 	val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
>>> 	//val model = new NaiveBayes().fit(trainingData)
>>> 	val model = new NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData)
>>> 	val predictions = model.transform(testData)
>>> 	predictions.show()
>>> 
>>> 
>>> OK, I have my model from the code above, but how can I use it to
>>> predict the classification of other data like this:
>>> 
>>> 	ID1	509:2 5102:4 25909:1 31709:4 121905:19
>>> 	ID2	800201:1
>>> 	ID3	116005:4
>>> 	ID4	800201:1
>>> 	ID5	19109:1  21708:1 23208:1 49809:1 88609:1
>>> 	ID6	800201:1
>>> 	ID7	43505:7 106405:7
>>> 
>>> I know I can use the transform() method, but how do I construct the
>>> parameter for it?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Thanks,
>>> lujinhong
>>> 
>> 
>> Thanks,
>> lujinhong
>> 
>> 
-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

