spark-user mailing list archives

From Patrick McCarthy <pmccar...@dstillery.com>
Subject PySpark - Expand rows into dataframes via function
Date Mon, 02 Oct 2017 14:01:19 GMT
Hello,

I'm trying to map ARIN registry files into more explicit IP ranges. Each
record provides a starting IP and the number of IPs in the range (here it's
8192), and I'm trying to expand it into all the included /24 subnets. For example,

Input:

array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0, 'allocated',
       'ff26920a408f15613096aa7fe0ddaa57'], dtype=object)


Output:

array([['23', '239', '160', 'ff26920a408f15613096aa7fe0ddaa57'],
       ['23', '239', '161', 'ff26920a408f15613096aa7fe0ddaa57'],
       ['23', '239', '162', 'ff26920a408f15613096aa7fe0ddaa57'],

...


I have the input lookup table in a PySpark DataFrame, and a Python function
to do the conversion into the mapped output. I think producing the full
mapping calls for a UDTF, but that concept doesn't seem to exist in
PySpark. What's the best approach to do this one-to-many mapping and
recombine the results into a new DataFrame?
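[Editor's note: one common way to sketch this, assuming the record layout shown above (starting IP at index 3, address count at index 4, the hash at index 7 — these positions are inferred from the example, not confirmed by the poster), is a plain Python function that expands one record into its /24 rows; PySpark can then apply it row-wise with `rdd.flatMap`, which plays the role of a UDTF.]

```python
def expand_to_slash24s(record):
    """Expand one ARIN record into (octet1, octet2, octet3, id) rows,
    one per /24 subnet covered by the allocation.

    Assumes: record[3] is the starting IPv4 address, record[4] is the
    number of addresses (a multiple of 256), record[7] is the opaque id.
    """
    start_ip, count, opaque_id = record[3], record[4], record[7]
    o1, o2, o3, _ = (int(x) for x in start_ip.split("."))
    # Treat the first three octets as a single 24-bit integer so that
    # incrementing it carries correctly across octet boundaries.
    base = (o1 << 16) | (o2 << 8) | o3
    n_blocks = int(count) // 256  # each /24 holds 256 addresses
    rows = []
    for i in range(n_blocks):
        v = base + i
        rows.append((str(v >> 16), str((v >> 8) & 0xFF), str(v & 0xFF),
                     opaque_id))
    return rows
```

In Spark this would be applied as something like `df.rdd.flatMap(expand_to_slash24s).toDF(["o1", "o2", "o3", "id"])` (column names here are placeholders); an alternative on the DataFrame API is a UDF returning an `ArrayType` of structs followed by `explode`.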


Thanks,

Patrick
