spark-user mailing list archives

From Yanbo Liang <yanboha...@gmail.com>
Subject Re: IDF model error
Date Thu, 27 Nov 2014 06:48:45 GMT
Hi Shivani,

In Spark, transformations are lazy operations that define a new RDD, while
actions launch a computation to return a value or write data to external
storage.
So your code only starts executing when it reaches an action, which here is
when DocumentFrequencyAggregator in org.apache.spark.mllib.feature.IDF is
called. The exception is thrown, and the stack trace printed, at that point.
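
To make the lazy behaviour concrete, here is a minimal sketch. The local
master, the app name, the object name and the thrown exception are my own
choices for illustration and do not come from your code. The map is defined
without complaint, and the failure only shows up once count() runs:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkException}

object LazyFailureSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative, not taken from the thread.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-failure-sketch")
    val sc = new SparkContext(conf)
    try {
      // Transformation only: the function is recorded but not executed yet.
      val mapped = sc.parallelize(Seq(1, 3, 5, 7)).map { i =>
        if (i >= 4) throw new IndexOutOfBoundsException(s"$i not in [0,4)")
        i
      }
      println("map defined, nothing has failed yet")

      // Action: the job runs now, and the task failure reaches the driver.
      try {
        mapped.count()
      } catch {
        case e: SparkException =>
          println(s"failed at the action: ${e.getMessage}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The same thing happens in your example: building the vectors and calling map
only describes the work, and fit is where it is actually carried out.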

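If you want the mistake reported before any job is submitted, one option is
to check the arguments yourself on the driver. The helper below is
hypothetical (SparseArgsCheck is not part of MLlib) and is plain Scala, so it
makes no assumption about what the SparseVector constructor in your Spark
version validates:

```scala
object SparseArgsCheck {
  // Hypothetical helper, not an MLlib API: returns a description of the first
  // problem found, or None when the arguments are consistent.
  def check(size: Int, indices: Array[Int], values: Array[Double]): Option[String] = {
    if (indices.length != values.length)
      Some(s"${indices.length} indices but ${values.length} values")
    else
      // Every index must address a slot in a vector of length `size`.
      indices.find(i => i < 0 || i >= size)
        .map(i => s"index $i not in [0,$size)")
  }

  def main(args: Array[String]): Unit = {
    // Size 4 cannot hold index 5, so a problem is reported.
    println(check(4, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0)))
    // Size 22 covers the largest index used in the thread (21).
    println(check(22, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0)))
  }
}
```

Running such a check before constructing each SSV would point at the bad
vector directly instead of failing later inside IDF.
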
2014-11-27 10:30 GMT+08:00 Shivani Rao <raoshivani@gmail.com>:

> Thanks Yanbo,
>
> I wonder why SSV does not complain when I create one using "new SSV(4,
> Array(1, 3, 5, 7), ...)". Is there no error check for this, even in the
> breeze sparse vector's constructor? That is very strange.
>
> Shivani
>
> On Tue, Nov 25, 2014 at 7:25 PM, Yanbo Liang <yanbohappy@gmail.com> wrote:
>
>> Hi Shivani,
>>
>> You have misunderstood the first parameter of SparseVector.
>>
>> class SparseVector(
>>     override val size: Int,
>>     val indices: Array[Int],
>>     val values: Array[Double]) extends Vector {
>> }
>>
>> The first parameter is the total length of the Vector rather than the
>> number of non-zero elements.
>> So it needs to be greater than the maximum non-zero element index, which
>> is 21 in your case.
>> The following code works:
>>
>> val doc1s = new IndexedRow(1L, new SSV(22, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0)))
>> val doc2s = new IndexedRow(2L, new SSV(22, Array(1, 2, 4, 13), Array(0.0, 1.0, 2.0, 0.0)))
>> val doc3s = new IndexedRow(3L, new SSV(22, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0)))
>> val doc4s = new IndexedRow(4L, new SSV(22, Array(3, 7, 13, 20), Array(2.0, 0.0, 2.0, 1.0)))
>>
>> 2014-11-26 10:09 GMT+08:00 Shivani Rao <raoshivani@gmail.com>:
>>
>>> Hello Spark fans,
>>>
>>> I am trying to use the IDF model available in Spark MLlib to create
>>> a tf-idf representation of an RDD[Vector]. I have attached my MWE below.
>>>
>>> I get the following error:
>>>
>>> "java.lang.IndexOutOfBoundsException: 7 not in [-4,4)
>>> at breeze.linalg.DenseVector.apply$mcI$sp(DenseVector.scala:70)
>>> at breeze.linalg.DenseVector.apply(DenseVector.scala:69)
>>> at org.apache.spark.mllib.feature.IDF$DocumentFrequencyAggregator.add(IDF.scala:81)
>>> "
>>>
>>> Any ideas?
>>>
>>> Regards,
>>> Shivani
>>>
>>> import org.apache.spark.mllib.feature.VectorTransformer
>>> import com.box.analytics.ml.dms.vector.{SparkSparseVector,SparkDenseVector}
>>> import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV}
>>> import org.apache.spark.mllib.linalg.{Vector => SparkVector}
>>> import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
>>> import org.apache.spark.mllib.feature._
>>>
>>> val doc1s = new IndexedRow(1L, new SSV(4, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0)))
>>> val doc2s = new IndexedRow(2L, new SSV(4, Array(1, 2, 4, 13), Array(0.0, 1.0, 2.0, 0.0)))
>>> val doc3s = new IndexedRow(3L, new SSV(4, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0)))
>>> val doc4s = new IndexedRow(4L, new SSV(4, Array(3, 7, 13, 20), Array(2.0, 0.0, 2.0, 1.0)))
>>>
>>> val indata = sc.parallelize(List(doc1s,doc2s,doc3s,doc4s)).map(e=>e.vector)
>>>
>>> (new IDF()).fit(indata).idf
>>>
>>> --
>>> Software Engineer
>>> Analytics Engineering Team@ Box
>>> Mountain View, CA
>>>
>>
>>
>
>
> --
> Software Engineer
> Analytics Engineering Team@ Box
> Mountain View, CA
>
