spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aakash Basu <aakash.spark....@gmail.com>
Subject RE: Split column with dynamic data
Date Mon, 30 Oct 2017 21:27:43 GMT
Hey buddy,


Thanks a TON! Issue resolved.

Thanks again,
Aakash.

On 30-Oct-2017 11:44 PM, "Hondros, Constantine (ELS-AMS)" <
C.Hondros@elsevier.com> wrote:

> You should just use regexp_replace to remove all the leading number
> information (assuming it ends with a full-stop, and catering for the
> possibility of a capital letter).
>
>
>
> This is untested, but it shoud do the trick based on your examples so far:
>
>
>
> df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”,
> “”))
>
>
>
>
>
> *From:* Aakash Basu [mailto:aakash.spark.raj@gmail.com]
> *Sent:* 30 October 2017 18:53
> *To:* user
> *Subject:* Split column with dynamic data
>
>
>
> **** External email: use caution ****
>
>
>
> Hi all,
>
>
>
> I've a requirement to split a column and fetch only the description where
> I have numbers appended before that for some rows whereas other rows have
> only the description -
>
>
>
> Eg - (Description is the column header)
>
>
>
> *Description*
>
> Inventory Tree
>
> Products
>
> 1. AT&T Services
>
> 2. Accessories
>
> 4. Miscellaneous
>
> 5. Service Center Items
>
> 3. 3rd Party Services
>
> Integrated Service
>
> 6. Demo Devices
>
> IT Department
>
> Merchandising
>
> 2A. Impulse
>
> 2. Handsets
>
> 3. Strategic Products
>
> 1. Opportunities
>
> 5. Features
>
> 6. Rate Plans
>
> 7. Other AT&T Incentives
>
> 8. Wired
>
> 4. Rate Plan Tier Change
>
> Integration SKUs (March 2016)
>
> 9. Financing SKUs
>
> 1. Smartphone
>
>
>
>
> From the above, I only want the words of description and remove the
> numbers. But since they're abrupt, it is difficult to use a single logic to
> have it done.
>
>
>
> Trying with the following options -
>
>
>
> 1) Using split by *fullstop*, together -
>
>
>
> *split_col = split(CategoryInp_DF['description'], '.')*
>
>
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('unneeded',
> split_col.getItem(0))*
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('description',
> split_col.getItem(1))*
>
>
>
>
>
> Result -> But both the output columns come as nulls.
>
>
>
> 2) Using split by *fullstop and space*, together -
>
> *split_col = split(CategoryInp_DF['description'], '. ')*
>
>
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('unneeded',
> split_col.getItem(0))*
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('description',
> split_col.getItem(1))*
>
>
>
> Result -> This is perfectly working for the numbered rows, but is
> unnecessarily splitting the non-numbered rows too.
>
>
>
> Any help would be greatly appreciated.
>
>
>
>
>
> Thanks,
>
> Aakash.
>
> ------------------------------
>
> Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The
> Netherlands
> <https://maps.google.com/?q=Radarweg+29,+1043+NX+Amsterdam,+The+Netherlands&entry=gmail&source=g>,
> Registration No. 33156677, Registered in The Netherlands.
>

Mime
View raw message