pig-user mailing list archives

From Mário Sérgio Fujikawa Ferreira <lioux...@gmail.com>
Subject Re: Re: Submitting multiple Pig Scripts on the same Session
Date Mon, 04 Feb 2019 08:05:19 GMT
We're wondering if there is something like Apache Hive LLAP for Pig: 
https://cwiki.apache.org/confluence/display/Hive/LLAP

We submit scripts asynchronously throughout the day: never more than 20 
at a time, and up to a thousand a day. Input file sizes vary from less 
than a megabyte to a couple of terabytes.

1. Hadoop distribution is Hortonworks HDP 2.6.3

2. Apache Pig 0.16 using Tez.

3. SQL database is Pivotal HAWQ 2.3.0.0. Data is sent to the database 
for both inserts and joins using Pivotal HAWQ external tables (CSV 
files), and retrieved from the database through external tables as well 
(a sketch of the DDL follows this list).

    3.1.
    https://hdb.docs.pivotal.io/230/hawq/datamgmt/load/g-working-with-file-based-ext-tables.html

    3.2.
    https://hdb.docs.pivotal.io/230/hawq/pxf/PXFExternalTableandAPIReference.html

4. All processing is done on HDFS, and all intermediate files are 
compressed with LZO.
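
For concreteness, a hedged sketch of the kind of readable external 
table we mean (host, port, path, and columns are all placeholders; 
HdfsTextSimple is the stock PXF profile from the docs in 3.1/3.2):

    -- hedged sketch: readable external table over CSV files on HDFS via PXF
    -- (host, path, and columns are illustrative placeholders)
    CREATE EXTERNAL TABLE ext_pig_output (id int, amount numeric)
    LOCATION ('pxf://namenode:51200/user/etl/pig_output?PROFILE=HdfsTextSimple')
    FORMAT 'CSV';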

We orchestrate everything using Python (not Jython); a sketch of the 
loop follows the numbered steps below.

1. A Python script detects new input files.

2. Prepares a Pig script according to rules parameterized in a SQL database.

3. Submits the Pig script via the pig command-line client (-exectype tez).

4. Uses the output files (CSV files generated by the Pig script in step 
3) for join operations on the SQL database.

5. Prepares another Pig script against the result of the join operation 
(CSV file generated by Pivotal HAWQ in step 4).

6. Submits that Pig script via the pig command-line client (-exectype tez).

7. Finally, loads the table (CSV file generated by the script in step 
5) into the SQL database.
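
A minimal sketch of that loop, assuming placeholder paths, template 
names, and SQL (our real driver reads its rules from the SQL database):

    #!/usr/bin/env python
    # Hedged sketch of the driver loop above; every name here is illustrative.
    import subprocess

    def render_script(template_path, **params):
        # hypothetical helper: fill $-placeholders in a Pig script template
        text = open(template_path).read()
        for key, value in params.items():
            text = text.replace("$" + key, value)
        script_path = template_path.replace(".tpl", "")
        with open(script_path, "w") as out:
            out.write(text)
        return script_path

    def run_pig(script_path):
        # steps 3 and 6: submit via the pig command-line client on Tez
        subprocess.check_call(["pig", "-exectype", "tez", "-f", script_path])

    def run_sql(statement):
        # steps 4 and 7: HAWQ is Postgres-compatible, so psql works here
        subprocess.check_call(["psql", "-d", "etl", "-c", statement])

    def handle_new_file(input_path):
        run_pig(render_script("prepare.pig.tpl", INPUT=input_path))   # steps 2-3
        run_sql("INSERT INTO joined SELECT ... FROM ext_pig_output")  # step 4
        run_pig(render_script("post_join.pig.tpl"))                   # steps 5-6
        run_sql("INSERT INTO final SELECT * FROM ext_post_join")      # step 7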

We're considering some optimizations:

1. Share AM/Tez sessions across different scripts using something 
similar to Hive LLAP: a continuously running YARN daemon that can share 
resources across different Pig scripts. I haven't found anything 
similar, and unfortunately I have no idea where to begin if we were to 
code this; it's just an out-there idea. Any pointers/suggestions would 
be appreciated. The closest workaround we can think of is sketched below.
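
That workaround is batching: collect the scripts queued within a short 
window and run them through one pig process, separated by grunt's exec, 
so consecutive DAGs can at least reuse one Tez session inside that JVM 
(reuse within a single JVM is what the reply quoted below describes). A 
hedged sketch; the queueing mechanism is illustrative:

    # hedged sketch: run several queued Pig scripts in ONE pig process so
    # the Tez session can be reused across them (reuse is per-JVM)
    import subprocess, tempfile

    def run_batch(script_paths):
        with tempfile.NamedTemporaryFile(
                mode="w", suffix=".pig", delete=False) as combined:
            for path in script_paths:
                # grunt's exec runs a nested script in batch mode
                combined.write("exec %s\n" % path)
            batch_name = combined.name
        subprocess.check_call(["pig", "-exectype", "tez", "-f", batch_name])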

2. Write a Pig UDF that submits arbitrary SQL statements to the 
database, so that we don't have to run 2 separate Pig scripts with SQL 
statements in between. It would be a single script as follows:

       1st_pig_script_statements;

       exec;

       sql_udf_run;

       exec;

       2nd_script_statements;

       exec;

       sql_udf_run;

    2.1. This would submit everything under a single AM, sharing
    resources and reducing overall run time (less start/stop overhead
    per script). Is the sql_udf_run idea feasible? Should I just bite
    the bullet and use Jython instead, at least for the Pig scripts?
    Can I just write a standard UDF and run it against a fake one-line
    input file? (A Jython sketch follows this item.)
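
On the Jython question: the UDF itself can stay in Python syntax by 
registering it with USING jython, even if the orchestration stays 
CPython. A hedged sketch of sql_udf_run over JDBC; it assumes the 
PostgreSQL JDBC driver jar is on Pig's classpath (HAWQ speaks the 
Postgres wire protocol), e.g. via pig.additional.jars, and all names 
and the URL are placeholders. Registered with REGISTER 'sql_udf.py' 
USING jython AS sqludf; and applied in a FOREACH over a one-line dummy 
file, it would run exactly once per script:

    # sql_udf.py -- hedged sketch of a Jython UDF that submits one SQL
    # statement over JDBC; assumes the Postgres driver is on Pig's classpath.
    from java.sql import DriverManager

    @outputSchema("ok:int")   # decorator injected by Pig's Jython binding
    def run_sql(jdbc_url, user, password, statement):
        conn = DriverManager.getConnection(jdbc_url, user, password)
        try:
            stmt = conn.createStatement()
            stmt.execute(statement)
            stmt.close()
            return 1
        finally:
            conn.close()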

3. Set pig.auto.local.enabled to true to reduce overhead on small input 
files. Unfortunately, I haven't seen much gain on ~100 megabyte input 
files when testing with exectype tez_local. Furthermore, the Pig script 
in tez_local mode wouldn't find the input files; I had to prefix file 
paths with hdfs:/// (the properties involved are shown below).
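
For reference, the two properties involved, set at the top of a script 
(the byte threshold shown is roughly the 100 MB we tested with):

    -- auto local mode: run jobs with small enough input in-process
    SET pig.auto.local.enabled 'true';
    -- input-size cutoff in bytes below which a job runs locally (~100 MB)
    SET pig.auto.local.input.maxbytes '104857600';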

Any help is appreciated. We've been using Apache Pig for ETL purposes 
for more than a year, and we're very satisfied with its performance and 
ease of use.

Best regards,
   Mário Sérgio

On 22/01/2019 16:49, Rohini Palaniswamy wrote:
> If you are using PigServer and submitting programmatically via the same
> JVM, it should automatically reuse the application if the requested AM
> resources are the same.
>
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java#L242-L245
>
> On Fri, Jan 18, 2019 at 12:20 PM Diego Pereira <diego.ns.pereira@gmail.com>
> wrote:
>
>> Hi!
>>
>> We are developing an application that is looking for new files on a folder,
>> running a few Pig Scripts to prepare those files and, finally, loading them
>> into our database.
>>
>> The problem is that, for small files, the time Pig / Tez / YARN take to
>> create a new application master and spawn new containers is far longer
>> than the processing time itself.
>>
>> Since Tez sessions already allow a single Pig script to run multiple DAGs
>> against the same application master, is there a way to reuse that
>> application master and its containers for multiple Pig script
>> submissions?
>>
>> Regards,
>>
>> Diego
>>
