spark-dev mailing list archives

From Devl Devel <devl.developm...@gmail.com>
Subject Re: K-Means And Class Tags
Date Fri, 09 Jan 2015 10:41:54 GMT
Hi Joseph

Thanks for the suggestion. However, retag is a private method, and when I
call it from Scala:

val retaggedInput = parsedData.retag(classOf[Vector])

I get:

Symbol retag is inaccessible from this place

However, I can call it from Java, and the resulting RDD then works in Scala:

return words.rdd().retag(Vector.class);

Dev
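[Editor's note: the cast failure discussed in this thread comes from Java
type erasure. A minimal illustration in plain Java, with no Spark involved
(the class name is ours): a generic collection cannot report its element
type at runtime, so `toArray()` with no argument can only build an
`Object[]`, which is exactly why an RDD built from Java carries an `Object`
class tag unless it is retagged.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        // Generics are erased at runtime, so a List<String> cannot tell
        // anyone its element type; the no-argument toArray() therefore
        // returns Object[], not String[].
        List<String> words = new ArrayList<>(Arrays.asList("a", "b"));

        Object[] untyped = words.toArray();            // runtime class: Object[]
        String[] typed = words.toArray(new String[0]); // runtime class: String[]

        System.out.println(untyped.getClass().getSimpleName()); // Object[]
        System.out.println(typed.getClass().getSimpleName());   // String[]
    }
}
```

Casting the `Object[]` to `String[]` would throw the same kind of
ClassCastException reported below for `[Ljava.lang.Object;`.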



On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley <joseph@databricks.com>
wrote:

> I believe you're running into an erasure issue which we found in
> DecisionTree too.  Check out:
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134
>
> That retags RDDs which were created from Java to prevent the exception
> you're running into.
>
> Hope this helps!
> Joseph
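[Editor's note: the retag trick Joseph points to can be sketched in plain
Java without Spark. The `retag` helper below is hypothetical (the name
merely mirrors RDD.retag): it copies an `Object[]` produced by erased
generic code into an array whose runtime class really is `T[]`, which is
the same idea Spark applies to the RDD's class tag.]

```java
import java.lang.reflect.Array;
import java.util.Arrays;

public class RetagSketch {
    // Hypothetical analogue of RDD.retag: rebuild the array with the
    // correct runtime component type. If some element is not actually a T,
    // the copy throws ArrayStoreException, surfacing the bug early.
    @SuppressWarnings("unchecked")
    static <T> T[] retag(Object[] untyped, Class<T> clazz) {
        Class<? extends T[]> arrayType =
                (Class<? extends T[]>) Array.newInstance(clazz, 0).getClass();
        return Arrays.copyOf(untyped, untyped.length, arrayType);
    }

    public static void main(String[] args) {
        Object[] raw = {"x", "y"};  // element type lost; runtime class Object[]
        String[] fixed = retag(raw, String.class);
        System.out.println(fixed.getClass().getSimpleName()); // String[]
    }
}
```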
>
> On Thu, Jan 8, 2015 at 12:48 PM, Devl Devel <devl.development@gmail.com>
> wrote:
>
>> Thanks for the suggestion. Can anyone offer advice on the
>> ClassCastException that occurs going from Java to Scala? Why does
>> JavaRDD.rdd() followed by collect() result in this exception?
>>
>> On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska <yana.kadiyska@gmail.com>
>> wrote:
>>
>> > How about
>> >
>> >
>> > data.map(_.split(",")).filter(_.length > 1).map(good_entry =>
>> >   Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
>> > (full disclosure, I didn't actually run this). But after the first map
>> > you should have an RDD[Array[String]]; then you'd discard everything
>> > shorter than 2 and convert the rest to dense vectors. In fact, if
>> > you're expecting length exactly 2, you might want to filter on == 2...
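[Editor's note: the split/filter/parse shape suggested above can be checked
outside Spark with plain Java streams. This is a sketch with made-up input;
the class name and sample lines are ours, not from the thread.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParseFilterDemo {
    public static void main(String[] args) {
        // Hypothetical input: two well-formed rows and one malformed row.
        List<String> lines = Arrays.asList("1.0,2.0", "bad", "3.0,4.0");

        // Same pipeline shape as the Spark suggestion: split each line,
        // drop rows that are too short, then parse the two fields.
        List<double[]> vectors = lines.stream()
                .map(s -> s.split(","))
                .filter(parts -> parts.length == 2)
                .map(parts -> new double[] {
                        Double.parseDouble(parts[0]),
                        Double.parseDouble(parts[1]) })
                .collect(Collectors.toList());

        System.out.println(vectors.size()); // 2: the malformed row is dropped
    }
}
```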
>> >
>> >
>> > On Thu, Jan 8, 2015 at 10:58 AM, Devl Devel <devl.development@gmail.com
>> >
>> > wrote:
>> >
>> >> Hi All,
>> >>
>> >> I'm trying a simple K-Means example as per the website:
>> >>
>> >> val parsedData = data.map(s =>
>> >> Vectors.dense(s.split(',').map(_.toDouble)))
>> >>
>> >> but I'm trying to write a Java based validation method first so that
>> >> missing values are omitted or replaced with 0.
>> >>
>> >> public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
>> >>     JavaRDD<Vector> words = data.flatMap(
>> >>             new FlatMapFunction<String, Vector>() {
>> >>         public Iterable<Vector> call(String s) {
>> >>             String[] split = s.split(",");
>> >>             ArrayList<Vector> add = new ArrayList<Vector>();
>> >>             if (split.length != 2) {
>> >>                 add.add(Vectors.dense(0, 0));
>> >>             } else {
>> >>                 add.add(Vectors.dense(Double.parseDouble(split[0]),
>> >>                         Double.parseDouble(split[1])));
>> >>             }
>> >>             return add;
>> >>         }
>> >>     });
>> >>     return words.rdd();
>> >> }
>> >>
>> >> When I then call from scala:
>> >>
>> >> val parsedData=dc.prepareKMeans(data);
>> >> val p=parsedData.collect();
>> >>
>> >> I get Exception in thread "main" java.lang.ClassCastException:
>> >> [Ljava.lang.Object; cannot be cast to
>> >> [Lorg.apache.spark.mllib.linalg.Vector;
>> >>
>> >> Why is the class tag Object rather than Vector?
>> >>
>> >> 1) How do I get this working correctly using the Java validation
>> >> example above? Or
>> >> 2) How can I modify val parsedData = data.map(s =>
>> >> Vectors.dense(s.split(',').map(_.toDouble))) so that lines where the
>> >> split size is < 2 are ignored? Or
>> >> 3) Is there a better way to do input validation first?
>> >>
>> >> Using Spark and MLlib:
>> >> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
>> >> libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
>> >>
>> >> Many thanks in advance
>> >> Dev
>> >>
>> >
>> >
>>
>
>
