I tried .option("quote", "\""), which I believe is the default, but I still get the same error. This is the offending record.

Primo 4-In-1 Soft Seat Toilet Trainer and Step Stool White with Pastel Blue Seat,"I chose this potty for my son because of the good reviews. I do not like it. I'm honestly baffled by all the great reviews now that I have this thing in front of me.1)It is made of cheap material, feels flimsy, the grips on the bottom of the thing do nothing to keep it in place when the child sits on it.2)It comes apart into 5 or 6 different pieces and all my son likes to do is take it apart. I did not want a potty that would turn into a toy, and this has just become like a puzzle for him, with all the different pieces.3)It is a little big for him. He is young still but he's a big boy for his age. I looked at one of the pictures posted and he looks about the same size as the curly haired kid reading the book, but the potty in that picture is NOT this potty! This one is a little bigger and he can't get quite touch his feet on the ground, which is important.4)And one final thing, maybe most importantly, the ""soft"" seat is not so soft. Doesn't seem very comfortable to me. It's just plastic on top of plastic... and after my son sits on it for just a few minutes his butt has horrible red marks all over it! Definitely not comfortable.So, overall, i'm not impressed at all.I gave it 2 stars because... it gets the job done I suppose, and for a child a little bit older than my son it might fit a little better. Also I really liked the idea that it was 4-in-1.Overall though, I do not suggest getting this potty. Look elseware!It's probably best to actually go to a store and look at them first hand, and not order online. That's what I should have done in the first place.",2
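
The doubled quotes around ""soft"" in this record are the usual CSV way of putting a quote character inside a quoted field. Spark's CSV reader documents backslash as its default escape character, so the doubled quotes may be what trips the parser. One thing that might be worth trying is setting the escape character to a double quote as well. This is a minimal sketch rather than a confirmed fix; it assumes the same file path as the code quoted further down, and the object name QuoteEscapeCheck is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration
object QuoteEscapeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV quote escape check")
      .master("local")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("quote", "\"")
      // Treat "" inside a quoted field as an escaped quote
      // (the reader's default escape character is a backslash)
      .option("escape", "\"")
      .csv("data/amazon_baby.csv")

    // Inspect the inferred column types and a few parsed rows
    df.printSchema()
    df.show(5)

    spark.stop()
  }
}
```

If the schema printed here shows rating as an integer column and the long review lands in a single field, the doubled quotes were probably the cause.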


On Sat, Nov 19, 2016 at 10:59 PM, Meeraj Kunnumpurath <meeraj@servicesymphony.com> wrote:
Digging through it, this looks like an issue with reading the CSV. Some of the fields have embedded commas, and those fields are correctly quoted. However, the CSV reader seems to get into a pickle when a record mixes quoted and unquoted fields. Fields are quoted only when they contain commas; otherwise they are unquoted.
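
One way to check whether the reader really is mis-parsing some records is to read the file with an explicit schema and a strict parse mode, so that a malformed record may raise an error instead of silently producing nulls. The following is only a sketch under assumptions: the column names review and rating come from the code in this thread, the first column name ("name") is a guess, and the object name StrictCsvRead is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical object name, used only for this illustration
object StrictCsvRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Strict CSV read")
      .master("local")
      .getOrCreate()

    // "review" and "rating" are the columns used in this thread;
    // "name" for the first column is an assumption
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("review", StringType, nullable = true),
      StructField("rating", IntegerType, nullable = true)
    ))

    val df = spark.read
      .option("header", "true")
      // FAILFAST asks the reader to throw on records it considers malformed
      .option("mode", "FAILFAST")
      .schema(schema)
      .csv("data/amazon_baby.csv")

    // Reading is lazy, so force a full pass over the file
    println(s"rows read: ${df.rdd.count()}")

    spark.stop()
  }
}
```

Whether a given record counts as malformed depends on the Spark version, so a clean run here does not prove that every record was parsed as intended.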

Regards
Meeraj

On Sat, Nov 19, 2016 at 10:10 PM, Meeraj Kunnumpurath <meeraj@servicesymphony.com> wrote:
Hello,

I have the following code that trains a classifier mapping review text to ratings. I use a tokenizer to split each review into words, and a count vectorizer to turn the words into feature vectors. However, when I train the classifier I get a match error. Any pointers would be very helpful.

The code is below,

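import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
import spark.implicits._

// Read the reviews, inferring column types from the data
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
// Split each review into words, then turn the words into term-count vectors
val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")

// Label a review as good (1) when its rating is 4 or higher
val isGood = udf((x: Int) => if (x >= 4) 1 else 0)

val words = tk.transform(df.withColumn("label", isGood('rating)))
val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)

val classifier = new LogisticRegression()

training.show(10)

// The MatchError is thrown while fitting the model
val simpleModel = classifier.fit(training)
simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)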

And the error I get is below.

16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,208,219,221,235,249,255,260,353,355,371,431,442,641,711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
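
The row in the MatchError starts with null, followed by 1.0 and the sparse feature vector. That first element appears to be the label, which would mean at least one row ends up with a null label, most likely because its rating did not parse. A minimal sketch for finding and dropping such rows before training, assuming the same file and column names as the code above (the object name NullLabelCheck is made up for illustration, and the label is built with when/otherwise instead of the udf):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Hypothetical object name, used only for this illustration
object NullLabelCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Null label check")
      .master("local")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/amazon_baby.csv")

    // Rows where the review or the rating came back as null
    val bad = df.filter(col("review").isNull || col("rating").isNull)
    println(s"rows with a null review or rating: ${bad.count()}")
    bad.show(5, truncate = false)

    // Drop those rows, then derive a double-valued label
    val clean = df.na.drop(Seq("review", "rating"))
      .withColumn("label", when(col("rating") >= 4, 1.0).otherwise(0.0))
    clean.printSchema()

    spark.stop()
  }
}
```

Dropping the rows only hides the symptom. If the count is large, the underlying CSV parsing problem still needs fixing.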

Many thanks
--
Meeraj Kunnumpurath
Director and Executive Principal
Service Symphony Ltd
00 44 7702 693597
00 971 50 409 0169
meeraj@servicesymphony.com



