datafu-dev mailing list archives

From "OlgaK (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DATAFU-63) SimpleRandomSample by a fixed number
Date Mon, 13 Nov 2017 17:48:00 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249870#comment-16249870
] 

OlgaK edited comment on DATAFU-63 at 11/13/17 5:47 PM:
-------------------------------------------------------

I'm on Linux/Fedora. I have not modified the gradlew file manually; I just ran
`gradle -b bootstrap.gradle`, as the docs instruct, using my Gradle 3.1. To remove a file from the repo:
move the file somewhere else, commit, then move it back and add it to .gitignore. This applies especially
if the file is platform/version dependent and, per the docs, should be generated locally.
The changes in the file are substantial, about 1/3 of the file. For example:
{noformat}git diff gradlew
diff --git a/gradlew b/gradlew
index 16bbbbf..9aa616c 100755
--- a/gradlew
+++ b/gradlew
@@ -6,12 +6,30 @@
 ##
 ##############################################################################
 
-# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to p
-DEFAULT_JVM_OPTS="-XX:MaxPermSize=512m"
+# Attempt to set APP_HOME
+# Resolve links: $0 may be a link
+PRG="$0"
+# Need this for relative symlinks.
+while [ -h "$PRG" ] ; do
+    ls=`ls -ld "$PRG"`
+    link=`expr "$ls" : '.*-> \(.*\)$'`
+    if expr "$link" : '/.*' > /dev/null; then
+        PRG="$link"
+    else
+        PRG=`dirname "$PRG"`"/$link"
+    fi
+done
.....
{noformat}


was (Author: cur4so):
I'm on Linux/Fedora. I have not modified the gradlew file manually; I just ran
`gradle -b bootstrap.gradle`, as the docs instruct, using my Gradle 3.1. To remove a file from the repo:
move the file somewhere else, commit, then move it back and add it to .gitignore. This applies especially
if the file is platform/version dependent and, per the docs, should be generated locally.
The changes in the file are substantial, about 1/3 of the file. For example:
{quote}git diff gradlew
diff --git a/gradlew b/gradlew
index 16bbbbf..9aa616c 100755
--- a/gradlew
+++ b/gradlew
@@ -6,12 +6,30 @@
 ##
 ##############################################################################
 
-# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to p
-DEFAULT_JVM_OPTS="-XX:MaxPermSize=512m"
+# Attempt to set APP_HOME
+# Resolve links: $0 may be a link
+PRG="$0"
+# Need this for relative symlinks.
+while [ -h "$PRG" ] ; do
+    ls=`ls -ld "$PRG"`
+    link=`expr "$ls" : '.*-> \(.*\)$'`
+    if expr "$link" : '/.*' > /dev/null; then
+        PRG="$link"
+    else
+        PRG=`dirname "$PRG"`"/$link"
+    fi
+done
.....
{quote}  

> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does not support
randomly sampling a fixed number of items. ReservoirSample may do the work, but since it relies
on an in-memory priority queue, memory issues may arise if we sample a huge number
of items, e.g. sampling 100M items from 100G of data.
> The suggested approach is to create a new class, "SimpleRandomSampleByCount", that uses Manuver's
rejection threshold to reject items whose weight exceeds the threshold as we go from mapper
to combiner to reducer. Most of the algorithm will be very similar to SimpleRandomSample,
except that we do not use Bernstein's theory to accept items, and we replace the probability with p = k
/ n, where k is the number of items to sample and n is the total number of items local to the mapper,
combiner or reducer.
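
As a rough illustration of the rejection idea described above, here is a minimal Python sketch, not the DataFu implementation. Each "mapper" assigns a uniform random weight to every item and drops items whose weight exceeds a threshold computed from the local p = k / n; the "reducer" then keeps the k smallest weights among the survivors. The function names, the `delta` parameter and the threshold formula (derived from a standard Chernoff bound) are assumptions made for this sketch; the issue does not specify them, and the combiner stage is not modeled.

```python
import heapq
import math
import random


def rejection_threshold(k, n, delta=1e-9):
    """Hypothetical threshold q: with probability >= 1 - delta, at least k of
    n uniform weights fall below q (assumed formula, from a Chernoff bound)."""
    if n <= k:
        return 1.0  # too few items to reject anything
    p = k / n
    g = -math.log(delta) / n
    return min(1.0, p + g + math.sqrt(g * g + 2 * g * p))


def map_partition(items, k, rng):
    """Mapper: weight each item, reject weights above the local threshold."""
    items = list(items)  # the local count n must be known before filtering
    q = rejection_threshold(k, len(items))
    keyed = ((rng.random(), item) for item in items)
    return [(weight, item) for weight, item in keyed if weight <= q]


def reduce_partitions(partials, k):
    """Reducer: merge the survivors and keep the k smallest weights."""
    merged = (pair for part in partials for pair in part)
    smallest = heapq.nsmallest(k, merged, key=lambda pair: pair[0])
    return [item for _, item in smallest]


def sample_by_count(partitions, k, seed=None):
    rng = random.Random(seed)
    partials = [map_partition(part, k, rng) for part in partitions]
    return reduce_partitions(partials, k)
```

For example, `sample_by_count([range(0, 1000), range(1000, 2500)], 50)` would return 50 distinct items, while each mapper forwards only a small fraction of its partition. The threshold is probabilistic: with probability at most `delta` per partition, a mapper could reject an item that belongs in the final sample.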
> Quoting this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a specified
number of rows from grouped data? I’m currently doing this, since it appears that the SAMPLE
operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the ORDER/LIMIT
isn’t great for performance). Is there a less awkward way to do this?
> Thanks,
> "
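
For readers less familiar with Pig, the quoted script corresponds roughly to the following Python sketch: tag every row in a group with a random number, sort by it, and keep the first `limit` rows. The function name `sample_per_group` and its parameters are hypothetical and only mirror the script's steps; this is not a DataFu or Pig API.

```python
import random
from collections import defaultdict


def sample_per_group(rows, group_key, limit, seed=None):
    """Per group: attach a random number to each row, order by it, keep `limit` rows."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:                      # GROUP photos BY farm
        groups[group_key(row)].append(row)

    result = {}
    for group, members in groups.items():
        # GENERATE *, RANDOM() as rnd; the index breaks ties so rows are never compared.
        keyed = [(rng.random(), i, row) for i, row in enumerate(members)]
        keyed.sort()                      # ORDER rnds BY rnd
        result[group] = [row for _, _, row in keyed[:limit]]  # LIMIT
    return result
```

In this rendering every group is fully sorted just to keep its first `limit` rows, which is the kind of cost the question is concerned about.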



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
