spark-user mailing list archives

From Sudhindra Magadi <smag...@gmail.com>
Subject Re: filling missing values in a sequence
Date Mon, 19 Sep 2016 06:39:57 GMT
That is correct.
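
To make the definition concrete, here is a minimal sketch of the gap check in plain Java, not Spark, for the example quoted below (1, 3, 4, 6 gives 2 and 5). The names `MissingIds` and `findMissing` are made up for illustration. It assumes the ids are already sorted ascending with no duplicates, and it only reports gaps between the first and the last id present.

```java
import java.util.ArrayList;
import java.util.List;

public class MissingIds {
    /**
     * Returns every id absent from a strictly ascending, duplicate-free array.
     * Assumption: only gaps between the first and the last id are reported.
     */
    static List<Long> findMissing(long[] sortedIds) {
        List<Long> missing = new ArrayList<>();
        for (int i = 1; i < sortedIds.length; i++) {
            // Each id is compared only with its predecessor, so no shared
            // counter is needed; every id strictly between the two is missing.
            for (long id = sortedIds[i - 1] + 1; id < sortedIds[i]; id++) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(findMissing(new long[] {1, 3, 4, 6})); // prints [2, 5]
    }
}
```

On a cluster, the same neighbour comparison could probably be expressed with the `lag` window function ordered by id instead of a shared counter. That variant is untested here, and a window without partitioning may pull all rows into a single partition.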

On Mon, Sep 19, 2016 at 12:09 PM, ayan guha <guha.ayan@gmail.com> wrote:

> OK, so if you see
>
> 1, 3, 4, 6, ...
>
> would you say 2 and 5 are missing?
>
> On Mon, Sep 19, 2016 at 4:15 PM, Sudhindra Magadi <smagadi@gmail.com>
> wrote:
>
>> Each record has a sequence id. There are no duplicates.
>>
>> On Mon, Sep 19, 2016 at 11:42 AM, ayan guha <guha.ayan@gmail.com> wrote:
>>
>>> And how do you define a missing sequence? Can you give an example?
>>>
>>> On Mon, Sep 19, 2016 at 3:48 PM, Sudhindra Magadi <smagadi@gmail.com>
>>> wrote:
>>>
>>>> Hi Jörn,
>>>> We have a file with a billion records. We want to find out whether there
>>>> are any missing sequences here and, if so, what they are.
>>>> Thanks
>>>> Sudhindra
>>>>
>>>> On Mon, Sep 19, 2016 at 11:12 AM, Jörn Franke <jornfranke@gmail.com>
>>>> wrote:
>>>>
>>>>> I am not sure what you are trying to achieve here. Can you please tell
>>>>> us what the goal of the program is, maybe with some example data?
>>>>>
>>>>> Besides this, I suspect that it will fail once it runs on more than a
>>>>> single node, because of the reference to the global counter variable.
>>>>>
>>>>> It is also unclear why you collect the data first, only to parallelize
>>>>> it again.
>>>>>
>>>>> On 18 Sep 2016, at 14:26, sudhindra <smagadi@gmail.com> wrote:
>>>>>
>>>>> Hi, I have coded something like this. Please tell me how bad it is.
>>>>>
>>>>> package Spark.spark;
>>>>> import java.util.List;
>>>>> import java.util.function.Function;
>>>>>
>>>>> import org.apache.spark.SparkConf;
>>>>> import org.apache.spark.SparkContext;
>>>>> import org.apache.spark.api.java.JavaRDD;
>>>>> import org.apache.spark.api.java.JavaSparkContext;
>>>>> import org.apache.spark.sql.DataFrame;
>>>>> import org.apache.spark.sql.Dataset;
>>>>> import org.apache.spark.sql.Row;
>>>>> import org.apache.spark.sql.SQLContext;
>>>>>
>>>>>
>>>>>
>>>>> public class App
>>>>> {
>>>>>    static long counter=1;
>>>>>    public static void main( String[] args )
>>>>>    {
>>>>>
>>>>>
>>>>>
>>>>>        SparkConf conf = new SparkConf().setAppName("sorter")
>>>>>                .setMaster("local[2]").set("spark.executor.memory","1g");
>>>>>        JavaSparkContext sc = new JavaSparkContext(conf);
>>>>>
>>>>>        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
>>>>>
>>>>>        DataFrame df = sqlContext.read().json("path");
>>>>>        DataFrame sortedDF = df.sort("id");
>>>>>        //df.show();
>>>>>        //sortedDF.printSchema();
>>>>>
>>>>>        System.out.println(sortedDF.collectAsList().toString());
>>>>>        JavaRDD<Row> distData = sc.parallelize(sortedDF.collectAsList());
>>>>>
>>>>>
>>>>>        List<String> missingNumbers = distData.map(
>>>>>                new org.apache.spark.api.java.function.Function<Row, String>() {
>>>>>
>>>>>
>>>>>            public String call(Row arg0) throws Exception {
>>>>>                // TODO Auto-generated method stub
>>>>>
>>>>>
>>>>>                if(counter!=new Integer(arg0.getString(0)).intValue())
>>>>>                {
>>>>>                    StringBuffer misses = new StringBuffer();
>>>>>                    long newCounter=counter;
>>>>>                    while(newCounter!=new Integer(arg0.getString(0)).intValue())
>>>>>                    {
>>>>>                        // Append the id being skipped (newCounter, not counter).
>>>>>                        misses.append(newCounter).append(' ');
>>>>>                        newCounter++;
>>>>>
>>>>>                    }
>>>>>                    counter=new Integer(arg0.getString(0)).intValue()+1;
>>>>>                    return misses.toString();
>>>>>
>>>>>                }
>>>>>                counter++;
>>>>>                return null;
>>>>>
>>>>>
>>>>>
>>>>>            }
>>>>>        }).collect();
>>>>>
>>>>>
>>>>>
>>>>>        for (String name: missingNumbers) {
>>>>>              System.out.println(name);
>>>>>            }
>>>>>
>>>>>
>>>>>
>>>>>    }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/filling-missing-values-in-a-sequence-tp5708p27748.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Sudhindra S Magadi
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> Sudhindra S Magadi
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Thanks & Regards
Sudhindra S Magadi
