spark-user mailing list archives

From Aakash Basu <aakash.spark....@gmail.com>
Subject Re: Best way to process this dataset
Date Tue, 19 Jun 2018 07:40:18 GMT
Georg, just asking: can pandas handle such a big dataset, especially if the
data is then passed into any of the sklearn modules?
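For what it's worth, pandas cannot hold a 3.6 GB CSV in 2 GB of RAM at once, but it can stream the file in chunks, and some sklearn estimators accept data incrementally via `partial_fit`. A minimal sketch of that pattern; the column names, the synthetic stand-in data, and the choice of `SGDClassifier` are assumptions for illustration, not from this thread:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Build a small synthetic CSV standing in for the real 3.6 GB file
# (hypothetical columns modeled on the dataset described below).
rng = np.random.default_rng(0)
pd.DataFrame({
    "user_id": rng.integers(0, 1_000, 10_000),
    "item_id": rng.integers(0, 5_000, 10_000),
    "category_id": rng.integers(0, 100, 10_000),
    "behavior": rng.choice(["pv", "buy", "cart", "fav"], 10_000),
}).to_csv("behavior.csv", index=False)

clf = SGDClassifier(random_state=0)
classes = np.array(["pv", "buy", "cart", "fav"])  # must be known up front

# Stream the CSV in chunks; each chunk is an ordinary DataFrame,
# so peak memory is bounded by the chunk size, not the file size.
for chunk in pd.read_csv("behavior.csv", chunksize=2_000):
    X = chunk[["user_id", "item_id", "category_id"]].to_numpy()
    y = chunk["behavior"].to_numpy()
    clf.partial_fit(X, y, classes=classes)

pred = clf.predict(X[:5])  # X here is from the last chunk
```

Only estimators that implement `partial_fit` (e.g. `SGDClassifier`, `MiniBatchKMeans`) can be trained this way; most other sklearn models need the whole matrix in memory.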

On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <georg.kf.heiler@gmail.com>
wrote:

> Use pandas or dask.
>
> If you do want to use Spark, store the dataset as Parquet/ORC and then
> continue to perform analytical queries on that dataset.
>
> Raymond Xie <xie3208080@gmail.com> wrote on Tue, 19 Jun 2018 at
> 04:29:
>
>> I have a 3.6 GB CSV dataset (4 columns, 100,150,807 rows); my environment
>> is a 20 GB SSD hard disk and 2 GB of RAM.
>>
>> The dataset contains:
>> User ID: 987,994
>> Item ID: 4,162,024
>> Category ID: 9,439
>> Behavior type ('pv', 'buy', 'cart', 'fav')
>> Unix Timestamp: spanning November 25 to December 3, 2017
>>
>> I would like to hear any suggestions on how I should process this
>> dataset in my current environment.
>>
>> Thank you.
>>
>> *------------------------------------------------*
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>
