spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: UseCase_Design_Help
Date Wed, 05 Oct 2016 15:12:04 GMT
Hi

You can "generate" the SQL programmatically. A Python example:

>>> schema
['id', 'Mammals', 'Birds', 'Fish', 'Reptiles', 'Amphibians']
>>>

>>> count_stmt = ["count(distinct <tag>) as <tag>".replace("<tag>", x)
...               for x in schema]
>>> count_stmt
['count(distinct id) as id', 'count(distinct Mammals) as Mammals',
'count(distinct Birds) as Birds', 'count(distinct Fish) as Fish',
'count(distinct Reptiles) as Reptiles', 'count(distinct Amphibians) as
Amphibians']

>>> sql = "select "
>>> sql = sql +  ", ".join(count_stmt) + " from t "
>>> sql
'select count(distinct id) as id, count(distinct Mammals) as Mammals,
count(distinct Birds) as Birds, count(distinct Fish) as Fish,
count(distinct Reptiles) as Reptiles, count(distinct Amphibians) as
Amphibians from t '
>>>

>>> df.show()
+---+--------+----------+--------+--------+----------+
| id| Mammals|     Birds|    Fish|Reptiles|Amphibians|
+---+--------+----------+--------+--------+----------+
|  1|    Dogs|     Eagle|Goldfish|    NULL|      Frog|
|  2|    Cats|   Peacock|   Guppy|  Turtle|Salamander|
|  3|Dolphins|     Eagle|  Zander|    NULL|      Frog|
|  4|  Whales|    Parrot|   Guppy|   Snake|      Frog|
|  5|  Horses|       Owl|   Guppy|   Snake|      Frog|
|  6|Dolphins|Kingfisher|  Zander|  Turtle|      Frog|
|  7|    Dogs|   Sparrow|Goldfish|    NULL|Salamander|
+---+--------+----------+--------+--------+----------+

>>> cnr = sqlContext.sql(sql)
>>> cnr.show()
+---+-------+-----+----+--------+----------+
| id|Mammals|Birds|Fish|Reptiles|Amphibians|
+---+-------+-----+----+--------+----------+
|  7|      5|    6|   3|       3|         2|
+---+-------+-----+----+--------+----------+
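As an aside, the generation step above can be packaged as a small helper so it works for any column list. This is plain Python string building (same table name "t" as above), no Spark needed; the helper name is just for illustration:

```python
def count_distinct_sql(columns, table):
    # Build one "count(distinct col) as col" expression per column,
    # then join them into a single SELECT statement.
    exprs = ", ".join("count(distinct {0}) as {0}".format(c) for c in columns)
    return "select {0} from {1}".format(exprs, table)

schema = ['id', 'Mammals', 'Birds', 'Fish', 'Reptiles', 'Amphibians']
sql = count_distinct_sql(schema, "t")
```

The resulting string can then be passed to sqlContext.sql as shown above.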

Does that help?

On Thu, Oct 6, 2016 at 1:44 AM, Ajay Chander <itschevva@gmail.com> wrote:

> Hi Ayan,
>
> My Schema for DF2 is fixed but it has around 420 columns (70 Animal type
> columns and 350 other columns).
>
> Thanks,
> Ajay
>
> On Wed, Oct 5, 2016 at 10:37 AM, ayan guha <guha.ayan@gmail.com> wrote:
>
>> Is your schema for df2 fixed? i.e., do you have 70 category columns?
>>
>> On Thu, Oct 6, 2016 at 12:50 AM, Daniel Siegmann <
>> dsiegmann@securityscorecard.io> wrote:
>>
>>> I think it's fine to read animal types locally because there are only 70
>>> of them. It's just that you want to execute the Spark actions in parallel.
>>> The easiest way to do that is to have only a single action.
>>>
>>> Instead of grabbing the result right away, I would just add a column for
>>> the animal type and union the datasets for the animal types. Something like
>>> this (not sure if the syntax is correct):
>>>
>>> val animalCounts: DataFrame = animalTypes.map { anmtyp =>
>>>   sqlContext.sql("select '" + anmtyp + "' as animal_type, " +
>>>     "count(distinct " + anmtyp + ") from TEST1")
>>> }.reduce(_.union(_))
>>>
>>> animalCounts.foreach( /* print the output */ )
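The same union idea can also be expressed as one generated SQL statement, so everything still runs as a single action. A sketch of the string-building part in plain Python (TEST1 as the table name, per the thread; the helper name is mine):

```python
def union_counts_sql(columns, table):
    # One SELECT per column, each tagged with the column's name,
    # glued together with UNION ALL so the whole statement is one query.
    tmpl = "select '{0}' as animal_type, count(distinct {0}) as cnt from {1}"
    parts = [tmpl.format(c, table) for c in columns]
    return " union all ".join(parts)

sql = union_counts_sql(["Mammals", "Birds"], "TEST1")
```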
>>>
>>> On Wed, Oct 5, 2016 at 12:42 AM, Daniel <daniel.tizon@gmail.com> wrote:
>>>
>>>> First of all, if you want to read a txt file in Spark, you should use
>>>> sc.textFile. You are currently using "Source.fromFile", which reads the
>>>> file with the Scala standard API, so it is read sequentially.
>>>>
>>>> Furthermore, you are going to need to create a schema if you want to
>>>> use DataFrames.
>>>>
>>>> El 5/10/2016 1:53, "Ajay Chander" <itschevva@gmail.com> escribió:
>>>>
>>>>> Right now, I am doing it like below,
>>>>>
>>>>> import scala.io.Source
>>>>>
>>>>> val animalsFile = "/home/ajay/dataset/animal_types.txt"
>>>>> val animalTypes = Source.fromFile(animalsFile).getLines.toArray
>>>>>
>>>>> for ( anmtyp <- animalTypes ) {
>>>>>       val distinctAnmTypCount =
>>>>>         sqlContext.sql("select count(distinct("+anmtyp+")) from TEST1")
>>>>>       println("Calculating Metrics for Animal Type: "+anmtyp)
>>>>>       if( distinctAnmTypCount.head().getAs[Long](0) <= 10 ){
>>>>>         println("Animal Type: "+anmtyp+" has <= 10 distinct values")
>>>>>       } else {
>>>>>         println("Animal Type: "+anmtyp+" has > 10 distinct values")
>>>>>       }
>>>>>     }
>>>>>
>>>>> But the problem is that it runs sequentially.
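One common workaround for the sequential loop (my sketch, not something proposed earlier in the thread) is to submit each per-column action from its own driver thread, since Spark's scheduler accepts jobs from multiple threads concurrently. A driver-side sketch, with a toy stand-in function in place of the real sqlContext.sql call:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in data; in the real job each lookup would be a
# sqlContext.sql("select count(distinct <col>) from TEST1") action.
sample_data = {
    "Mammals": ["Dogs", "Cats", "Dolphins", "Dogs"],
    "Birds": ["Eagle", "Peacock", "Eagle"],
}

def distinct_count(anmtyp):
    # Stand-in for the per-column Spark action.
    return len(set(sample_data[anmtyp]))

# Driver threads submit the jobs concurrently; Spark can then
# schedule them in parallel if cluster resources allow.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = dict(zip(sample_data, pool.map(distinct_count, sample_data)))
```

Note that the threads only parallelize job submission on the driver; the heavy lifting still happens on the executors.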
>>>>>
>>>>> Any inputs are appreciated. Thank you.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Ajay
>>>>>
>>>>>
>>>>> On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itschevva@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I have a use-case where I have two Dataframes like below,
>>>>>>
>>>>>> 1) First Dataframe(DF1) contains,
>>>>>>
>>>>>> *    ANIMALS    *
>>>>>> Mammals
>>>>>> Birds
>>>>>> Fish
>>>>>> Reptiles
>>>>>> Amphibians
>>>>>>
>>>>>> 2) Second Dataframe(DF2) contains,
>>>>>>
>>>>>> *    ID, Mammals, Birds, Fish, Reptiles, Amphibians    *
>>>>>> 1,      Dogs,      Eagle,      Goldfish,      NULL,      Frog
>>>>>> 2,      Cats,      Peacock,      Guppy,     Turtle,      Salamander
>>>>>> 3,      Dolphins,      Eagle,      Zander,      NULL,      Frog
>>>>>> 4,      Whales,      Parrot,      Guppy,      Snake,      Frog
>>>>>> 5,      Horses,      Owl,      Guppy,      Snake,      Frog
>>>>>> 6,      Dolphins,      Kingfisher,      Zander,      Turtle,      Frog
>>>>>> 7,      Dogs,      Sparrow,      Goldfish,      NULL,      Salamander
>>>>>>
>>>>>> Now I want to take each row from DF1 and find out its distinct count
>>>>>> in DF2. For example, pick Mammals from DF1, then find
>>>>>> count(distinct(Mammals)) in DF2, i.e. 5.
>>>>>>
>>>>>> DF1 has 70 distinct rows/Animal types
>>>>>> DF2 has some million rows
>>>>>>
>>>>>> What's the best way to achieve this efficiently using parallelism?
>>>>>>
>>>>>> Any inputs are helpful. Thank you.
>>>>>>
>>>>>> Regards,
>>>>>> Ajay
>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha
