datafu-dev mailing list archives

From "Eyal Allweil (JIRA)" <>
Subject [jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number
Date Sun, 19 Nov 2017 13:12:00 GMT


Eyal Allweil commented on DATAFU-63:

I wonder if the gradlew script is there because it can be "just used" on some systems. On
one computer I couldn't find a gradle installation at all - maybe I skipped the bootstrapping
step and the script worked for me anyway. I'm no gradle expert, so I don't know if this is feasible.
If not, it sounds like a good idea to open a new issue for removing the gradlew script, following
your suggestion.

I looked at the code you linked, and I'll try again to answer your questions (in this issue
and in the comments in the code).

The way the Algebraic interface works in Pig is that Pig calls the UDF's getInitial method
(defined on the Algebraic interface) to get the name of the (inner) class with the Initial
implementation. You can see this code in POUserFunc in the Apache Pig project; it's not part
of DataFu. The Initial class is then instantiated and its exec is called. The output is gathered,
and the same process happens for Intermediate (via getIntermed): the output from Initial becomes
the input for Intermediate. Finally, all these outputs from the various mappers are sent to a
reducer, where the getFinal method is used to instantiate the Final class.
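To make that lifecycle concrete, here is a plain-Java sketch of the three-phase flow for a COUNT-style aggregate. This is my own simplification, not DataFu or Pig code: the real interface is org.apache.pig.Algebraic, where getInitial/getIntermed/getFinal return class names and POUserFunc instantiates them; here the driver loop and class names are illustrative assumptions, and the phases are called directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of Pig's Algebraic three-phase flow for a COUNT-style
// aggregate. In real Pig, getInitial()/getIntermed()/getFinal() return class
// *names* and POUserFunc instantiates them; here we call the phases directly.
public class AlgebraicCountSketch {
    // Initial: runs map-side on one input value, emits a partial count of 1.
    static long initial(Object value) { return 1L; }

    // Intermediate: runs in the combiner, merging partial counts.
    static long intermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) sum += p;
        return sum;
    }

    // Final: runs in the reducer over combiner outputs from all mappers.
    static long finalPhase(List<Long> combined) {
        return intermed(combined); // for COUNT the merge logic is identical
    }

    // Simulate two mappers feeding one reducer.
    public static long countAll(List<List<Object>> mapperInputs) {
        List<Long> combinerOutputs = new ArrayList<>();
        for (List<Object> oneMapper : mapperInputs) {
            List<Long> partials = new ArrayList<>();
            for (Object v : oneMapper) partials.add(initial(v));
            combinerOutputs.add(intermed(partials));
        }
        return finalPhase(combinerOutputs);
    }

    public static void main(String[] args) {
        List<List<Object>> inputs = Arrays.asList(
            Arrays.asList((Object) "a", "b", "c"),
            Arrays.asList((Object) "d", "e"));
        System.out.println(countAll(inputs)); // 5
    }
}
```

The key point for this issue is that each phase runs in a separate, independently created instance, so state kept in the outer UDF object does not survive from Initial to Final.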

This means the implementations you've suggested would only work if this UDF implemented
the EvalFunc or AccumulatorEvalFunc interfaces. But in that case we already have a UDF in
DataFu that gives us a comparable implementation - [ReservoirSample|] -
so I don't think it's what is desired in this particular Jira issue.

This brings us back to what Matthew wrote way back in 2014 - implementing algorithm 6 from
[this paper|]. I think what that algorithm
provides is a way to generate a _k_-sized sample without ever holding all _k_ items in memory
in the mapper (which is the limitation of ReservoirSample: it won't work if _k_ is too big).
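For context on that limitation, here is a minimal sketch of classic reservoir sampling (Algorithm R), the general approach behind DataFu's ReservoirSample; the class and method names here are my own simplification, not the DataFu API. The reservoir itself holds all _k_ chosen items at once, which is exactly why the approach breaks down when _k_ is too large for a mapper's heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal reservoir sampling (Algorithm R). The reservoir holds all k
// sampled items in memory at once, which is why this approach breaks down
// when k itself is too large to fit in a mapper's heap.
public class ReservoirSketch {
    public static <T> List<T> sample(Iterable<T> items, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);          // fill phase: keep the first k items
            } else {
                // replace a random reservoir slot with probability k / seen
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) reservoir.set((int) j, item);
            }
        }
        return reservoir;
    }
}
```

After the stream ends, each input item has survived with probability k/n, but the O(k) reservoir is unavoidable in this scheme - hence the interest in an algorithm that bounds mapper memory independently of k.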

Does that make sense?

> SimpleRandomSample by a fixed number
> ------------------------------------
>                 Key: DATAFU-63
>                 URL:
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
> SimpleRandomSample currently supports random sampling by probability; it does not support
randomly sampling a fixed number of items. ReservoirSample may do the work, but since it relies
on an in-memory priority queue, memory issues may happen if we are going to sample a huge number
of items, e.g. sample 100M from 100G data. 
> Suggested approach is to create a new class "SimpleRandomSampleByCount" that uses Manuver's
rejection threshold to reject items whose weight exceeds the threshold as we go from mapper
to combiner to reducer. The majority of the algorithm will be very similar to SimpleRandomSample,
except that we do not use Bernstein's theory to accept items, and we replace the probability
with p = k / n, where k is the number of items to sample and n is the total number of items
seen locally in the mapper, combiner and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a specified
number of rows from grouped data? I’m currently doing this, since it appears that the SAMPLE
operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the ORDER/LIMIT
isn’t great for performance). Is there a less awkward way to do this?
> Thanks,
> "

This message was sent by Atlassian JIRA
