>
> not quite sure why it is called zipWithIndex since zipping is not involved
>
It isn't?
http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming
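Conceptually, zipWithIndex *is* a zip: it pairs each element with the sequence 0, 1, 2, ... A minimal plain-Java sketch of the semantics (no Spark involved; the class and method names here are illustrative, not part of any API):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ZipWithIndexDemo {
    // zipWithIndex(xs) == zip(xs, [0, 1, 2, ...]): pair each element with
    // its position. Spark's RDD.zipWithIndex does the same thing, plus the
    // distributed bookkeeping of per-partition offsets.
    static <T> List<Map.Entry<T, Long>> zipWithIndex(List<T> xs) {
        List<Map.Entry<T, Long>> out = new ArrayList<>();
        long i = 0;
        for (T x : xs) {
            out.add(new AbstractMap.SimpleEntry<>(x, i++));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : zipWithIndex(Arrays.asList("a", "b", "c"))) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints: a -> 0, b -> 1, c -> 2
    }
}
```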
On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:
>
> ---------- Forwarded message ----------
> From: Steve Lewis <lordjoe2000@gmail.com>
> Date: Wed, Mar 11, 2015 at 9:13 AM
> Subject: Re: Numbering RDD members Sequentially
> To: "Daniel, Ronald (ELS-SDG)" <R.Daniel@elsevier.com>
>
>
> Perfect - exactly what I was looking for. Not quite sure why it is called zipWithIndex,
> since zipping is not involved.
> My code does something like this, where IMeasuredSpectrum is a large class
> we want to set an index for:
>
> public static JavaRDD<IMeasuredSpectrum> indexSpectra(JavaRDD<IMeasuredSpectrum> pSpectraToScore) {
>     JavaPairRDD<IMeasuredSpectrum, Long> indexed = pSpectraToScore.zipWithIndex();
>     return indexed.map(new AddIndexToSpectrum());
> }
>
> public class AddIndexToSpectrum implements Function<Tuple2<IMeasuredSpectrum, Long>, IMeasuredSpectrum> {
>     @Override
>     public IMeasuredSpectrum call(final Tuple2<IMeasuredSpectrum, Long> v1) throws Exception {
>         IMeasuredSpectrum spec = v1._1();
>         long index = v1._2();
>         spec.setIndex(index + 1);
>         return spec;
>     }
> }
>
>
>
> On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <
> R.Daniel@elsevier.com> wrote:
>
>> Have you looked at zipWithIndex?
>>
>>
>>
>> *From:* Steve Lewis [mailto:lordjoe2000@gmail.com]
>> *Sent:* Tuesday, March 10, 2015 5:31 PM
>> *To:* user@spark.apache.org
>> *Subject:* Numbering RDD members Sequentially
>>
>>
>>
>> I have a Hadoop InputFormat which reads records and produces
>>
>> JavaPairRDD<String,String> locatedData where
>>
>> _1() is a formatted version of the file location - like
>>
>> "000012690", "000024386", "000027523" ...
>>
>> _2() is the data to be processed
>>
>>
>>
>> For historical reasons I want to convert _1() into an integer
>> representing the record number,
>>
>> so keys become "00000001", "00000002" ...
>>
>>
>>
>> (Yes, I know this cannot be done in parallel.) The PairRDD may be too large
>> to collect into the driver's memory, but small enough to iterate through on a single
>> machine.
>> I could use toLocalIterator to guarantee execution on one machine, but
>> last time I tried this all kinds of jobs were launched to fetch the next
>> element of the iterator, and I was not convinced that approach was efficient.
>>
>>
>>
>
>