spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com>
Subject Re: PySpark - Expand rows into dataframes via function
Date Mon, 02 Oct 2017 23:52:22 GMT
It's possible with array function combined with struct construct. Below is
a SQL example

select Array(struct(ip1,hashkey), struct(ip2,hashkey))
from (select substr(col1,1,2) as ip1, substr(col1,3,3) as ip2, etc, hashkey
from object) a

If you want dynamic ip ranges; you need to dynamically construct structs
based on the range values. Hope this helps.


Thanks

Sathish

On Mon, Oct 2, 2017 at 9:01 AM Patrick McCarthy <pmccarthy@dstillery.com>
wrote:

> Hello,
>
> I'm trying to map ARIN registry files into more explicit IP ranges. They
> provide a number of IPs in the range (here it's 8192) and a starting IP,
> and I'm trying to map it into all the included /24 subnets. For example,
>
> Input:
>
> array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0, 'allocated',
>
>        'ff26920a408f15613096aa7fe0ddaa57'], dtype=object)
>
>
> Output:
>
> array([['23', '239', '160', 'ff26920a408f15613096aa7fe0ddaa57'],
>        ['23', '239', '161', 'ff26920a408f15613096aa7fe0ddaa57'],
>        ['23', '239', '162', 'ff26920a408f15613096aa7fe0ddaa57'],
>
> ...
>
>
> I have the input lookup table in a pyspark DF, and a python function to do the conversion
into the mapped output. I think to produce the full mapping I need a UDTF but this concept
doesn't seem to exist in PySpark. What's the best approach to do this mapping and recombine
into a new DataFrame?
>
>
> Thanks,
>
> Patrick
>
>

Mime
View raw message