spark-user mailing list archives

From Meeraj Kunnumpurath <mee...@servicesymphony.com>
Subject Re: Logistic Regression Match Error
Date Sat, 19 Nov 2016 18:59:11 GMT
Digging through, it looks like an issue with reading the CSV. Some of the
records have embedded commas in them, and those fields are rightly quoted.
However, the CSV reader seems to be getting into a pickle when the records
contain a mix of quoted and unquoted data: fields are only quoted when there
are commas within them, and are unquoted otherwise.
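For anyone hitting the same thing: a CSV reader has to track whether it is inside quotes before treating a comma as a delimiter, and Spark's CSV reader exposes `quote` and `escape` options on the `DataFrameReader` that are worth checking against the file's actual quoting convention. As a minimal pure-Scala sketch of the state the reader tracks (the function name `splitCsvLine` is hypothetical, and this deliberately ignores escaped quotes inside quoted fields):

```scala
// Minimal sketch of quote-aware CSV splitting: a comma is only a field
// delimiter when we are outside quotes. This does NOT handle escaped
// quotes ("") within quoted fields -- it only illustrates why mixed
// quoted/unquoted records need a stateful parser, not a plain split(",").
object CsvSketch {
  def splitCsvLine(line: String): Seq[String] = {
    val fields = Seq.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    for (c <- line) c match {
      case '"'              => inQuotes = !inQuotes // toggle quote state
      case ',' if !inQuotes => fields += current.toString; current.clear()
      case other            => current.append(other)
    }
    fields += current.toString // last field has no trailing comma
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // Mixed quoted and unquoted fields, with an embedded comma in one field
    CsvSketch.splitCsvLine("""toy,"Great product, love it",5""").foreach(println)
  }
}
```

A naive `line.split(",")` would break the quoted review into two fields, shifting every column after it, which is consistent with the downstream corruption described above.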

Regards
Meeraj

On Sat, Nov 19, 2016 at 10:10 PM, Meeraj Kunnumpurath <
meeraj@servicesymphony.com> wrote:

> Hello,
>
> I have the following code that trains a mapping of review text to ratings.
> I use a tokenizer to get all the words from the review, and a count
> vectorizer to build the feature vectors. However, when I train the
> classifier I get a match error. Any pointers will be very helpful.
>
> The code is below,
>
> val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
> import spark.implicits._
>
> val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
> val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
> val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
>
> val isGood = udf((x: Int) => if (x >= 4) 1 else 0)
>
> val words = tk.transform(df.withColumn("label", isGood('rating)))
> val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
>
> val classifier = new LogisticRegression()
>
> training.show(10)
>
> val simpleModel = classifier.fit(training)
> simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)
>
>
> And the error I get is below.
>
> 16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
> scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,208,219,221,235,249,255,260,353,355,371,431,442,641,711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
> at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
> at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
> at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>
> Many thanks
> --
> *Meeraj Kunnumpurath*
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
> *00 971 50 409 0169*
> meeraj@servicesymphony.com
>
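A note on the error itself: the first element of the `MatchError` row is `null`, which suggests some labels come through as null once the CSV rows are mis-parsed (presumably `LogisticRegression` pattern-matches the row as a `Double` label plus a feature `Vector`, and a null label cannot match). Dropping rows with null ratings before fitting, e.g. `df.na.drop(Seq("rating", "review"))`, should avoid that. A pure-Scala sketch of the null-safe labelling idea (the name `isGoodOpt` is hypothetical):

```scala
// Null-safe labelling: represent a possibly-missing rating as Option[Int]
// so a null never reaches the classifier as a label.
object LabelSketch {
  def isGoodOpt(rating: Option[Int]): Option[Int] =
    rating.map(r => if (r >= 4) 1 else 0)

  def main(args: Array[String]): Unit = {
    println(isGoodOpt(Some(5))) // a good review maps to label 1
    println(isGoodOpt(None))    // a missing rating stays missing
  }
}
```

Rows whose label remains `None` would then be filtered out before training rather than handed to the fit as nulls.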



-- 
*Meeraj Kunnumpurath*
*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*
*00 971 50 409 0169*
meeraj@servicesymphony.com
