spark-user mailing list archives

From Joshua Cason <joshua.aaron.ca...@gmail.com>
Subject Fwd: NoSuchElementException in ChiSqSelector fit method (version 1.6.0)
Date Mon, 28 Mar 2016 01:50:40 GMT
Hi All,

I'm running into an error that doesn't make much sense to me, and I
couldn't find enough information on the web to answer it myself.

BTW, you can also reply on Stack Overflow:
http://stackoverflow.com/questions/36254005/nosuchelementexception-in-chisqselector-fit-method-version-1-6-0


I've written code to generate a list of (String, ArrayBuffer[String]) pairs
and then use HashingTF to convert the features column to vectors (because
it's for NLP research on parsing, where I end up with a very large number
of unique features; long story). Then I convert the string labels to
indices using StringIndexer. I get the "key not found" error when running
ChiSqSelector.fit on the training data. The stack trace points to a hashmap
lookup in ChiSqTest for labels. This struck me as strange: I could
understand the error if I had been transforming new data and had failed to
account for unseen labels, but this is the fit method on the training data
itself.

Anyway, here's the interesting bit of my code followed by the important
part of the stack trace. Any help would be very much appreciated!!


>     val parSdp = sc.parallelize(sdp.take(100)) // it dies on a small amount of data
>     val insts: RDD[(String, ArrayBuffer[String])] =
>         parSdp.flatMap(x => TrainTest.transformGraphSpark(x))
>
>     val indexer = new StringIndexer()
>         .setInputCol("labels")
>         .setOutputCol("labelIndex")
>
>     val instDF = sqlContext.createDataFrame(insts)
>         .toDF("labels","feats")
>     val hash = new HashingTF()
>         .setInputCol("feats")
>         .setOutputCol("hashedFeats")
>         .setNumFeatures(1000000)
>     val readyDF = hash.transform(indexer
>         .fit(instDF)
>         .transform(instDF))
>
>     val selector = new ChiSqSelector()
>         .setNumTopFeatures(100)
>         .setFeaturesCol("hashedFeats")
>         .setLabelCol("labelIndex")
>         .setOutputCol("selectedFeatures")
>
>     // weights sum to 1.1; randomSplit normalizes them
>     val Array(training, dev, test) =
>         readyDF.randomSplit(Array(0.9, 0.1, 0.1), seed = 12345)
>
>     val chisq = selector.fit(training)


And the stack trace:

    java.util.NoSuchElementException: key not found: 23.0
>         at scala.collection.MapLike$class.default(MapLike.scala:228)
>         at scala.collection.AbstractMap.default(Map.scala:58)
>         at scala.collection.MapLike$class.apply(MapLike.scala:141)
>         at scala.collection.AbstractMap.apply(Map.scala:58)
>         at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:131)
>         at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:129)
>         at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
>         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>         at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:129)
>         at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>         at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
>         at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
>         at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
>         at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
>         at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:89)
>         at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:122)
>         ... etc etc
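
For reference, the top four frames of the trace are just Scala's Map.apply
failing on a key that is not in the map. Here is a minimal sketch of that
behaviour in plain Scala, with no Spark involved. The object name and the
labelIndex map are hypothetical stand-ins for illustration; this is not the
actual map built inside ChiSqTest, and it does not explain why 23.0 is
missing there.

```scala
import scala.util.{Failure, Try}

object KeyNotFoundSketch {
  // Hypothetical stand-in for a label -> index map (NOT the one in ChiSqTest).
  val labelIndex: Map[Double, Int] = Map(0.0 -> 0, 1.0 -> 1, 2.0 -> 2)

  // Map.apply throws NoSuchElementException when the key is absent;
  // wrapping it in Try lets us inspect the failure instead of crashing.
  def lookup(label: Double): Try[Int] = Try(labelIndex(label))

  def main(args: Array[String]): Unit = {
    lookup(23.0) match {
      case Failure(e: NoSuchElementException) => println(e.getMessage)
      case other                              => println(other)
    }
    // Map.get returns an Option instead of throwing.
    println(labelIndex.get(23.0))
  }
}
```

So whatever map ChiSqTest is indexing at line 131, it was built without an
entry for 23.0, while the code iterating at line 129 still asks for it.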


--
Best Wishes,
Joshua Cason
