spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Efficient sampling from a Hive table
Date Wed, 26 Aug 2015 16:36:36 GMT
Have you tried TABLESAMPLE? You'll find the exact syntax in the
documentation, but it does exactly what you want.
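
A minimal sketch of what that could look like from Spark (assuming a
HiveContext named sqlContext and a hypothetical table my_table; how much
Hive's block sampling actually prunes depends on the input format):

    // Block sampling: Hive samples at HDFS block granularity, so only
    // roughly 10% of the input splits are read rather than the whole
    // table (my_table is a hypothetical name).
    val sample = sqlContext.sql(
      "SELECT * FROM my_table TABLESAMPLE(10 PERCENT) s")

    // If the table is bucketed, bucket sampling on the bucketing column
    // prunes the scan to the matching buckets instead:
    //   SELECT * FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 10 ON id) s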

On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak <tomdzk@gmail.com> wrote:

> Sorry, I meant without reading from all splits. This is a single partition
> in the table.
>
> On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak <tomdzk@gmail.com> wrote:
>
>> I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows
>> from and I don't particularly care which rows. Doing a LIMIT unfortunately
>> results in two stages where the first stage reads the whole table, and the
>> second then performs the limit with a single worker, which is not very
>> efficient.
>> Is there a better way to sample a subset of rows in Spark, ideally in a
>> single stage and without reading all partitions?
>>
>> cheers,
>> Tom
>>
>
>
