Subject: svn commit: r1827525 [23/49] - in /datafu/site/docs: datafu/1.4.0/ datafu/1.4.0/datafu/ datafu/1.4.0/datafu/pig/ datafu/1.4.0/datafu/pig/bags/ datafu/1.4.0/datafu/pig/geo/ datafu/1.4.0/datafu/pig/hash/ datafu/1.4.0/datafu/pig/hash/lsh/ datafu/1.4.0/dat... 
Date: Thu, 22 Mar 2018 19:01:10 -0000 To: commits@datafu.apache.org From: mhayes@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20180322190113.4AA693A0D3E@svn01-us-west.apache.org> Added: datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/CondEntropy.html URL: http://svn.apache.org/viewvc/datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/CondEntropy.html?rev=1827525&view=auto ============================================================================== --- datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/CondEntropy.html (added) +++ datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/CondEntropy.html Thu Mar 22 19:01:04 2018 @@ -0,0 +1,485 @@ + + + + + +CondEntropy (datafu-pig 1.4.0 API) + + + + + + + + + + + +
+
datafu.pig.stats.entropy
+

Class CondEntropy

+
+
+
    +
  • java.lang.Object
  • +
  • +
      +
    • org.apache.pig.EvalFunc<T>
    • +
    • +
        +
      • org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
      • +
      • +
          +
        • datafu.pig.stats.entropy.CondEntropy
        • +
        +
      • +
      +
    • +
    +
  • +
+
+
    +
  • +
    +
    All Implemented Interfaces:
    +
    org.apache.pig.Accumulator<java.lang.Double>
    +
    +
    +
    +
    public class CondEntropy
    +extends org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
    +
    Calculate the conditional entropy H(Y|X) of random variables X and Y, following conditional entropy's + wiki definition. + X is the conditioning variable and Y is the variable conditioned on X. + +

    + Each tuple of the input bag must have 2 fields: the 1st field is an object instance of variable X and + the 2nd field is an object instance of variable Y. An exception is thrown if the number of fields is not 2. +

    + +

    + This UDF's constructor definition and parameters are the same as those of Entropy. +

    + + Note: +
      +
    • The input bag to this UDF must be sorted on X and Y, with X as the primary sort key. + An exception will be thrown if the input bag is not sorted. +
    • The returned entropy value is of double type. +
    + +

    + How to use: +

    + +

    + This UDF calculates conditional entropy directly from raw data tuples of X and Y, without the need to pre-compute per-tuple occurrence frequencies. +

    + +

    + It can be used in a nested FOREACH after a GROUP BY: sort the inner bag, then pass the sorted bag to this UDF as its input. +

    + + Example: +
    + --define empirical conditional entropy with Euler's number as the logarithm base
    + define CondEntropy datafu.pig.stats.entropy.CondEntropy();
    +
    + input = LOAD 'input' AS (grp: chararray, valX: double, valY: double);
    +
    + -- calculate conditional entropy H(Y|X) in each group
    + input_group_g = GROUP input BY grp;
    + entropy_group = FOREACH input_group_g {
    +   input_val = input.(valX, valY);
    +   input_ordered = ORDER input_val BY $0, $1;
    +   GENERATE FLATTEN(group) AS group, CondEntropy(input_ordered) AS cond_entropy; 
    + }
    + 
    + 
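The sorted-input requirement can be illustrated with a small standalone Java sketch. It is not the DataFu implementation: the class name `CondEntropySketch`, the `double[]` pair representation, and the zero result for empty input are all assumptions made for illustration. It uses the identity H(Y|X) = H(X,Y) - H(X), which in terms of run lengths over sorted data becomes (sum of c_x ln c_x minus sum of c_xy ln c_xy) divided by N.

```java
import java.util.List;

// Hypothetical sketch, not the DataFu source: one pass over pairs sorted by x, then y.
final class CondEntropySketch {
    /**
     * Returns H(Y|X) in nats for (x, y) pairs sorted by x, then by y.
     * Uses H(Y|X) = H(X,Y) - H(X) = (sum_x c_x ln c_x - sum_xy c_xy ln c_xy) / N.
     */
    static double condEntropy(List<double[]> sortedPairs) {
        long n = 0;          // total number of pairs seen
        long cx = 0;         // length of the current run of equal x values
        long cxy = 0;        // length of the current run of equal (x, y) values
        double sumX = 0.0;   // running sum of c_x * ln(c_x)
        double sumXY = 0.0;  // running sum of c_xy * ln(c_xy)
        double[] prev = null;
        for (double[] p : sortedPairs) {
            boolean newX = prev != null && p[0] != prev[0];
            boolean newXY = prev != null && (newX || p[1] != prev[1]);
            if (newXY) {     // a run of identical (x, y) pairs just ended
                sumXY += cxy * Math.log(cxy);
                cxy = 0;
            }
            if (newX) {      // a run of identical x values just ended
                sumX += cx * Math.log(cx);
                cx = 0;
            }
            cx++;
            cxy++;
            n++;
            prev = p;
        }
        if (n == 0) {
            return 0.0;      // assumption of this sketch: empty input yields 0
        }
        sumXY += cxy * Math.log(cxy); // close the final runs
        sumX += cx * Math.log(cx);
        return (sumX - sumXY) / n;
    }
}
```

Because the sketch only counts consecutive runs, unsorted input would split a run into several pieces and give a wrong result. That is presumably why the UDF rejects unsorted bags rather than computing on them.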
    + + Use case to calculate mutual information: +
    + ------------
    + -- calculate mutual information I(X, Y) using conditional entropy UDF and entropy UDF
    + -- I(X, Y) = H(Y) - H(Y|X)
    + ------------
    +
    + define CondEntropy datafu.pig.stats.entropy.CondEntropy();
    + define Entropy datafu.pig.stats.entropy.Entropy();
    +
    + input = LOAD 'input' AS (grp: chararray, valX: double, valY: double);
    +
    + -- calculate the I(X,Y) in each group
    + input_group_g = GROUP input BY grp;
    + mutual_information = FOREACH input_group_g {
    +      input_val_x_y = input.(valX, valY);
    +      input_val_x_y_ordered = ORDER input_val_x_y BY $0,$1;
    +      input_val_y = input.valY;
    +      input_val_y_ordered = ORDER input_val_y BY $0;
    +      cond_h_x_y = CondEntropy(input_val_x_y_ordered);
    +      h_y = Entropy(input_val_y_ordered);
    +      GENERATE FLATTEN(group), h_y - cond_h_x_y;
    + }
    + 
    + 
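As a cross-check on the identity I(X, Y) = H(Y) - H(Y|X) used in the script above, the following Java sketch computes the same quantity in memory from raw pairs. The class and method names are hypothetical, integer-valued variables are assumed for simplicity, and hash-map counting stands in for the sorted, accumulative evaluation the UDFs perform.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the DataFu source: mutual information from raw (x, y) pairs.
final class MutualInformationSketch {
    /** Entropy in nats of the distribution given by positive occurrence counts. */
    static double entropy(Iterable<Long> counts) {
        long n = 0;
        double sumCLogC = 0.0;
        for (long c : counts) {
            n += c;
            sumCLogC += c * Math.log(c);
        }
        return n == 0 ? 0.0 : Math.log(n) - sumCLogC / n;
    }

    /** I(X, Y) = H(Y) - H(Y|X), with H(Y|X) obtained as H(X,Y) - H(X). */
    static double mutualInformation(int[][] pairs) {
        Map<Integer, Long> xCounts = new HashMap<>();
        Map<Integer, Long> yCounts = new HashMap<>();
        Map<List<Integer>, Long> xyCounts = new HashMap<>();
        for (int[] p : pairs) {
            xCounts.merge(p[0], 1L, Long::sum);
            yCounts.merge(p[1], 1L, Long::sum);
            xyCounts.merge(Arrays.asList(p[0], p[1]), 1L, Long::sum);
        }
        double hY = entropy(yCounts.values());
        double hYGivenX = entropy(xyCounts.values()) - entropy(xCounts.values());
        return hY - hYGivenX;
    }
}
```

When Y is a copy of X, H(Y|X) is 0 and the result equals H(Y). When X and Y are independent, H(Y|X) equals H(Y) and the result is 0.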
    +
    See Also:
    Entropy
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Nested Class Summary

      +
        +
      • + + +

        Nested classes/interfaces inherited from class org.apache.pig.EvalFunc

        +org.apache.pig.EvalFunc.SchemaType
      • +
      +
    • +
    + +
      +
    • + + +

      Field Summary

      +
        +
      • + + +

        Fields inherited from class org.apache.pig.EvalFunc

        +log, pigLogger, reporter, returnType
      • +
      +
    • +
    + +
      +
    • + + +

      Constructor Summary

      + + + + + + + + + + + + + + +
      Constructors 
      Constructor and Description
      CondEntropy() 
      CondEntropy(java.lang.String type) 
      CondEntropy(java.lang.String type, + java.lang.String base) 
      +
    • +
    + +
      +
    • + + +

      Method Summary

      + + + + + + + + + + + + + + + + + + + + + + +
      Methods 
      Modifier and TypeMethod and Description
      voidaccumulate(org.apache.pig.data.Tuple input) 
      voidcleanup() 
      java.lang.DoublegetValue() 
      org.apache.pig.impl.logicalLayer.schema.SchemaoutputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input) 
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.AccumulatorEvalFunc

        +exec
      • +
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.EvalFunc

        +allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
      • +
      +
        +
      • + + +

        Methods inherited from class java.lang.Object

        +clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • +
      +
    • +
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Constructor Detail

      + + + +
        +
      • +

        CondEntropy

        +
        public CondEntropy()
        +            throws org.apache.pig.backend.executionengine.ExecException
        +
        Throws:
        +
        org.apache.pig.backend.executionengine.ExecException
        +
      • +
      + + + +
        +
      • +

        CondEntropy

        +
        public CondEntropy(java.lang.String type)
        +            throws org.apache.pig.backend.executionengine.ExecException
        +
        Throws:
        +
        org.apache.pig.backend.executionengine.ExecException
        +
      • +
      + + + +
        +
      • +

        CondEntropy

        +
        public CondEntropy(java.lang.String type,
        +           java.lang.String base)
        +            throws org.apache.pig.backend.executionengine.ExecException
        +
        Throws:
        +
        org.apache.pig.backend.executionengine.ExecException
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Method Detail

      + + + +
        +
      • +

        accumulate

        +
        public void accumulate(org.apache.pig.data.Tuple input)
        +                throws java.io.IOException
        +
        +
        Specified by:
        +
        accumulate in interface org.apache.pig.Accumulator<java.lang.Double>
        +
        Specified by:
        +
        accumulate in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
        +
        Throws:
        +
        java.io.IOException
        +
      • +
      + + + +
        +
      • +

        getValue

        +
        public java.lang.Double getValue()
        +
        +
        Specified by:
        +
        getValue in interface org.apache.pig.Accumulator<java.lang.Double>
        +
        Specified by:
        +
        getValue in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
        +
        +
      • +
      + + + +
        +
      • +

        cleanup

        +
        public void cleanup()
        +
        +
        Specified by:
        +
        cleanup in interface org.apache.pig.Accumulator<java.lang.Double>
        +
        Specified by:
        +
        cleanup in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
        +
        +
      • +
      + + + +
        +
      • +

        outputSchema

        +
        public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
        +
        +
        Overrides:
        +
        outputSchema in class org.apache.pig.EvalFunc<java.lang.Double>
        +
        +
      • +
      +
    • +
    +
  • +
+
+
+ + + + + + + Added: datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Final.html URL: http://svn.apache.org/viewvc/datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Final.html?rev=1827525&view=auto ============================================================================== --- datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Final.html (added) +++ datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Final.html Thu Mar 22 19:01:04 2018 @@ -0,0 +1,318 @@ + + + + + +EmpiricalCountEntropy.Final (datafu-pig 1.4.0 API) + + + + + + + + + + + +
+
datafu.pig.stats.entropy
+

Class EmpiricalCountEntropy.Final

+
+
+
    +
  • java.lang.Object
  • +
  • +
      +
    • org.apache.pig.EvalFunc<java.lang.Double>
    • +
    • +
        +
      • datafu.pig.stats.entropy.EmpiricalCountEntropy.Final
      • +
      +
    • +
    +
  • +
+
+
    +
  • +
    +
    Enclosing class:
    +
    EmpiricalCountEntropy
    +
    +
    +
    +
    public static class EmpiricalCountEntropy.Final
    +extends org.apache.pig.EvalFunc<java.lang.Double>
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Nested Class Summary

      +
        +
      • + + +

        Nested classes/interfaces inherited from class org.apache.pig.EvalFunc

        +org.apache.pig.EvalFunc.SchemaType
      • +
      +
    • +
    + +
      +
    • + + +

      Field Summary

      +
        +
      • + + +

        Fields inherited from class org.apache.pig.EvalFunc

        +log, pigLogger, reporter, returnType
      • +
      +
    • +
    + + + +
      +
    • + + +

      Method Summary

      + + + + + + + + + + +
      Methods 
      Modifier and TypeMethod and Description
      java.lang.Doubleexec(org.apache.pig.data.Tuple input) 
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.EvalFunc

        +allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
      • +
      +
        +
      • + + +

        Methods inherited from class java.lang.Object

        +clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • +
      +
    • +
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Constructor Detail

      + + + +
        +
      • +

        EmpiricalCountEntropy.Final

        +
        public EmpiricalCountEntropy.Final()
        +
      • +
      + + + +
        +
      • +

        EmpiricalCountEntropy.Final

        +
        public EmpiricalCountEntropy.Final(java.lang.String base)
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Method Detail

      + + + +
        +
      • +

        exec

        +
        public java.lang.Double exec(org.apache.pig.data.Tuple input)
        +                      throws java.io.IOException
        +
        +
        Specified by:
        +
        exec in class org.apache.pig.EvalFunc<java.lang.Double>
        +
        Throws:
        +
        java.io.IOException
        +
      • +
      +
    • +
    +
  • +
+
+
+ + + + + + + Added: datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Initial.html URL: http://svn.apache.org/viewvc/datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Initial.html?rev=1827525&view=auto ============================================================================== --- datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Initial.html (added) +++ datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Initial.html Thu Mar 22 19:01:04 2018 @@ -0,0 +1,318 @@ + + + + + +EmpiricalCountEntropy.Initial (datafu-pig 1.4.0 API) + + + + + + + + + + + +
+
datafu.pig.stats.entropy
+

Class EmpiricalCountEntropy.Initial

+
+
+
    +
  • java.lang.Object
  • +
  • +
      +
    • org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
    • +
    • +
        +
      • datafu.pig.stats.entropy.EmpiricalCountEntropy.Initial
      • +
      +
    • +
    +
  • +
+
+
    +
  • +
    +
    Enclosing class:
    +
    EmpiricalCountEntropy
    +
    +
    +
    +
    public static class EmpiricalCountEntropy.Initial
    +extends org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Nested Class Summary

      +
        +
      • + + +

        Nested classes/interfaces inherited from class org.apache.pig.EvalFunc

        +org.apache.pig.EvalFunc.SchemaType
      • +
      +
    • +
    + +
      +
    • + + +

      Field Summary

      +
        +
      • + + +

        Fields inherited from class org.apache.pig.EvalFunc

        +log, pigLogger, reporter, returnType
      • +
      +
    • +
    + + + +
      +
    • + + +

      Method Summary

      + + + + + + + + + + +
      Methods 
      Modifier and TypeMethod and Description
      org.apache.pig.data.Tupleexec(org.apache.pig.data.Tuple input) 
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.EvalFunc

        +allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
      • +
      +
        +
      • + + +

        Methods inherited from class java.lang.Object

        +clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • +
      +
    • +
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Constructor Detail

      + + + +
        +
      • +

        EmpiricalCountEntropy.Initial

        +
        public EmpiricalCountEntropy.Initial()
        +
      • +
      + + + +
        +
      • +

        EmpiricalCountEntropy.Initial

        +
        public EmpiricalCountEntropy.Initial(java.lang.String base)
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Method Detail

      + + + +
        +
      • +

        exec

        +
        public org.apache.pig.data.Tuple exec(org.apache.pig.data.Tuple input)
        +                               throws java.io.IOException
        +
        +
        Specified by:
        +
        exec in class org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
        +
        Throws:
        +
        java.io.IOException
        +
      • +
      +
    • +
    +
  • +
+
+
+ + + + + + + Added: datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Intermediate.html URL: http://svn.apache.org/viewvc/datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Intermediate.html?rev=1827525&view=auto ============================================================================== --- datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Intermediate.html (added) +++ datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.Intermediate.html Thu Mar 22 19:01:04 2018 @@ -0,0 +1,318 @@ + + + + + +EmpiricalCountEntropy.Intermediate (datafu-pig 1.4.0 API) + + + + + + + + + + + +
+
datafu.pig.stats.entropy
+

Class EmpiricalCountEntropy.Intermediate

+
+
+
    +
  • java.lang.Object
  • +
  • +
      +
    • org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
    • +
    • +
        +
      • datafu.pig.stats.entropy.EmpiricalCountEntropy.Intermediate
      • +
      +
    • +
    +
  • +
+
+
    +
  • +
    +
    Enclosing class:
    +
    EmpiricalCountEntropy
    +
    +
    +
    +
    public static class EmpiricalCountEntropy.Intermediate
    +extends org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Nested Class Summary

      +
        +
      • + + +

        Nested classes/interfaces inherited from class org.apache.pig.EvalFunc

        +org.apache.pig.EvalFunc.SchemaType
      • +
      +
    • +
    + +
      +
    • + + +

      Field Summary

      +
        +
      • + + +

        Fields inherited from class org.apache.pig.EvalFunc

        +log, pigLogger, reporter, returnType
      • +
      +
    • +
    + + + +
      +
    • + + +

      Method Summary

      + + + + + + + + + + +
      Methods 
      Modifier and TypeMethod and Description
      org.apache.pig.data.Tupleexec(org.apache.pig.data.Tuple input) 
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.EvalFunc

        +allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
      • +
      +
        +
      • + + +

        Methods inherited from class java.lang.Object

        +clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • +
      +
    • +
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Constructor Detail

      + + + +
        +
      • +

        EmpiricalCountEntropy.Intermediate

        +
        public EmpiricalCountEntropy.Intermediate()
        +
      • +
      + + + +
        +
      • +

        EmpiricalCountEntropy.Intermediate

        +
        public EmpiricalCountEntropy.Intermediate(java.lang.String base)
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Method Detail

      + + + +
        +
      • +

        exec

        +
        public org.apache.pig.data.Tuple exec(org.apache.pig.data.Tuple input)
        +                               throws java.io.IOException
        +
        +
        Specified by:
        +
        exec in class org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
        +
        Throws:
        +
        java.io.IOException
        +
      • +
      +
    • +
    +
  • +
+
+
+ + + + + + + Added: datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.html URL: http://svn.apache.org/viewvc/datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.html?rev=1827525&view=auto ============================================================================== --- datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.html (added) +++ datafu/site/docs/datafu/1.4.0/datafu/pig/stats/entropy/EmpiricalCountEntropy.html Thu Mar 22 19:01:04 2018 @@ -0,0 +1,573 @@ + + + + + +EmpiricalCountEntropy (datafu-pig 1.4.0 API) + + + + + + + + + + + +
+
datafu.pig.stats.entropy
+

Class EmpiricalCountEntropy

+
+
+
    +
  • java.lang.Object
  • +
  • +
      +
    • org.apache.pig.EvalFunc<T>
    • +
    • +
        +
      • org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
      • +
      • +
          +
        • datafu.pig.stats.entropy.EmpiricalCountEntropy
        • +
        +
      • +
      +
    • +
    +
  • +
+
+
    +
  • +
    +
    All Implemented Interfaces:
    +
    org.apache.pig.Accumulator<java.lang.Double>, org.apache.pig.Algebraic
    +
    +
    +
    +
    public class EmpiricalCountEntropy
    +extends org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
    +implements org.apache.pig.Algebraic
    +
    Calculate the empirical entropy of random variable X given its occurrence frequencies, following entropy's + wiki definition. + +

    + This UDF's constructor takes 1 argument: the logarithm base, which is defined the same way as in Entropy. +

    + + Note: +
      +
    • Unlike Entropy, which calculates entropy from a sorted bag of raw data in accumulative mode, + this UDF calculates entropy from the data's occurrence frequencies, which do not need to be sorted, in either accumulative or algebraic mode.
    • +
    • Each tuple of the UDF's input bag must have exactly 1 field, the occurrence frequency of a data instance, + and the data type of this field must be int or long. Otherwise, an exception will be thrown.
    • +
    • Negative frequencies are discarded, and a warning message is logged in the job's log file.
    • +
    • The returned entropy value is of double type.
    • +
    + +

    + How to use: +

    + +

    + To use this UDF, first pre-compute the occurrence frequency of each data instance, typically in an outer GROUP BY, + and then use this UDF to calculate entropy from those frequencies in another outer GROUP BY. +

    + +

    + Compared with Entropy, this UDF is more scalable on very large data sets, + since it can distribute computation onto mappers and take advantage of combiners to reduce the intermediate output sent from mappers to reducers. +

    + + Example: +
    + define Entropy datafu.pig.stats.entropy.EmpiricalCountEntropy();
    +
    + input = LOAD 'input' AS (val: double);
    +
    + -- calculate the occurrence of each instance
    + counts_g = GROUP input BY val;
    + counts = FOREACH counts_g GENERATE COUNT(input) AS cnt;
    +
    + -- calculate entropy
    + input_counts_g = GROUP counts ALL;
    + entropy = FOREACH input_counts_g GENERATE Entropy(counts) AS entropy;
    + 
    + 
    + + Use case to calculate mutual information using EmpiricalCountEntropy: + +
    + define Entropy datafu.pig.stats.entropy.EmpiricalCountEntropy();
    +
    + input = LOAD 'input' AS (valX: double, valY: double);
    +
    + ------------
    + -- calculate mutual information I(X, Y) using entropy
    + -- I(X, Y) = H(X) + H(Y) -  H(X, Y)
    + ------------
    +
    + input_x_y_g = GROUP input BY (valX, valY);
    + input_x_y_cnt = FOREACH input_x_y_g GENERATE flatten(group) as (valX, valY), COUNT(input) AS cnt;
    +
    + input_x_g = GROUP input_x_y_cnt BY valX;
    + input_x_cnt = FOREACH input_x_g GENERATE flatten(group) as valX, SUM(input_x_y_cnt.cnt) AS cnt;
    +
    + input_y_g = GROUP input_x_y_cnt BY valY;
    + input_y_cnt = FOREACH input_y_g GENERATE flatten(group) as valY, SUM(input_x_y_cnt.cnt) AS cnt;
    +
    + input_x_y_entropy_g = GROUP input_x_y_cnt ALL;
    + input_x_y_entropy = FOREACH input_x_y_entropy_g {
    +                         input_x_y_entropy_cnt = input_x_y_cnt.cnt;
    +                         GENERATE Entropy(input_x_y_entropy_cnt) AS x_y_entropy;
    +                     }
    +
    + input_x_entropy_g = GROUP input_x_cnt ALL;
    + input_x_entropy = FOREACH input_x_entropy_g {
    +                         input_x_entropy_cnt = input_x_cnt.cnt;
    +                         GENERATE Entropy(input_x_entropy_cnt) AS x_entropy;
    +                   }
    +
    + input_y_entropy_g = GROUP input_y_cnt ALL;
    + input_y_entropy = FOREACH input_y_entropy_g {
    +                         input_y_entropy_cnt = input_y_cnt.cnt;
    +                         GENERATE Entropy(input_y_entropy_cnt) AS y_entropy;
    +                   }
    +
    + input_mi_cross = CROSS input_x_y_entropy, input_x_entropy, input_y_entropy;
    + input_mi = FOREACH input_mi_cross GENERATE (input_x_entropy::x_entropy +
    +                                             input_y_entropy::y_entropy - 
    +                                             input_x_y_entropy::x_y_entropy) AS mi;
    + 
    + 
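The reason a count-based entropy can run in algebraic mode can be sketched in Java as follows. This is a hypothetical illustration, not the DataFu source: the names `initial`, `combine`, and `finish` only loosely mirror the Initial, Intermediate, and Final stages, and the two-element partial state is an assumption. The entropy of counts c_i with total N is ln N - (sum of c_i ln c_i) / N, so each partition only needs to emit its partial sum of counts and partial sum of c ln c, and partial states merge by addition.

```java
// Hypothetical sketch, not the DataFu source: entropy from counts via mergeable partial sums.
final class CountEntropySketch {
    /** Builds the partial state {sum of counts, sum of c*ln(c)} for one partition. */
    static double[] initial(long[] counts) {
        double n = 0.0;
        double sumCLogC = 0.0;
        for (long c : counts) {
            if (c <= 0) {
                continue; // negative counts are discarded, as the notes above state; zeros add nothing
            }
            n += c;
            sumCLogC += c * Math.log(c);
        }
        return new double[] {n, sumCLogC};
    }

    /** Merges two partial states; addition is associative, so a combiner can apply it in any order. */
    static double[] combine(double[] a, double[] b) {
        return new double[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Turns the merged state into entropy in nats: ln(N) - sum(c*ln(c)) / N. */
    static double finish(double[] state) {
        // Assumption of this sketch: an empty state yields 0.
        return state[0] == 0.0 ? 0.0 : Math.log(state[0]) - state[1] / state[0];
    }
}
```

For example, `finish(combine(initial(new long[]{2}), initial(new long[]{1, 1})))` gives the same value as `finish(initial(new long[]{2, 1, 1}))`, which is why partial results from mappers can be merged in combiners before reaching the reducers.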
    +
    See Also:
    Entropy
    +
  • +
+
+
+
    +
  • + + + +
      +
    • + + +

      Field Summary

      +
        +
      • + + +

        Fields inherited from class org.apache.pig.EvalFunc

        +log, pigLogger, reporter, returnType
      • +
      +
    • +
    + + + +
      +
    • + + +

      Method Summary

      + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
      Methods 
      Modifier and TypeMethod and Description
      voidaccumulate(org.apache.pig.data.Tuple input) 
      voidcleanup() 
      protected static org.apache.pig.data.Tuplecombine(org.apache.pig.data.DataBag values) 
      java.lang.StringgetFinal() 
      java.lang.StringgetInitial() 
      java.lang.StringgetIntermed() 
      java.lang.DoublegetValue() 
      org.apache.pig.impl.logicalLayer.schema.SchemaoutputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input) 
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.AccumulatorEvalFunc

        +exec
      • +
      +
        +
      • + + +

        Methods inherited from class org.apache.pig.EvalFunc

        +allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
      • +
      +
        +
      • + + +

        Methods inherited from class java.lang.Object

        +clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • +
      +
    • +
    +
  • +
+
+
+
    +
  • + +
      +
    • + + +

      Constructor Detail

      + + + +
        +
      • +

        EmpiricalCountEntropy

        +
        public EmpiricalCountEntropy()
        +                      throws org.apache.pig.backend.executionengine.ExecException
        +
        Throws:
        +
        org.apache.pig.backend.executionengine.ExecException
        +
      • +
      + + + +
        +
      • +

        EmpiricalCountEntropy

        +
        public EmpiricalCountEntropy(java.lang.String base)
        +                      throws org.apache.pig.backend.executionengine.ExecException
        +
        Throws:
        +
        org.apache.pig.backend.executionengine.ExecException
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Method Detail

      + + + +
        +
      • +

        getFinal

        +
        public java.lang.String getFinal()
        +
        +
        Specified by:
        +
        getFinal in interface org.apache.pig.Algebraic
        +
        +
      • +
      + + + +
        +
      • +

        getInitial

        +
        public java.lang.String getInitial()
        +
        +
        Specified by:
        +
        getInitial in interface org.apache.pig.Algebraic
        +
        +
      • +
      + + + +
        +
      • +

        getIntermed

        +
        public java.lang.String getIntermed()
        +
        +
        Specified by:
        +
        getIntermed in interface org.apache.pig.Algebraic
        +
        +
      • +
      + + + +
        +
      • +

        combine

        +
        protected static org.apache.pig.data.Tuple combine(org.apache.pig.data.DataBag values)
        +                                            throws org.apache.pig.backend.executionengine.ExecException
        +
        Throws:
        +
        org.apache.pig.backend.executionengine.ExecException
        +
      • +
      + + + +
        +
      • +

        accumulate

        +
        public void accumulate(org.apache.pig.data.Tuple input)
        +                throws java.io.IOException
        +
        +
        Specified by:
        +
        accumulate in interface org.apache.pig.Accumulator<java.lang.Double>
        +
        Specified by:
        +
        accumulate in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
        +
        Throws:
        +
        java.io.IOException
        +
      • +
      + + + +
        +
      • +

        getValue

        +
        public java.lang.Double getValue()
        +
        +
        Specified by:
        +
        getValue in interface org.apache.pig.Accumulator<java.lang.Double>
        +
        Specified by:
        +
        getValue in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
        +
        +
      • +
      + + + +
        +
      • +

        cleanup

        +
        public void cleanup()
        +
        +
        Specified by:
        +
        cleanup in interface org.apache.pig.Accumulator<java.lang.Double>
        +
        Specified by:
        +
        cleanup in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
        +
        +
      • +
      + + + +
        +
      • +

        outputSchema

        +
        public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
        +
        +
        Overrides:
        +
        outputSchema in class org.apache.pig.EvalFunc<java.lang.Double>
        +
        +
      • +
      +
    • +
    +
  • +
+
+
+ + + + + + +