airflow-dev mailing list archives

From Laura Lorenz <llor...@industrydive.com>
Subject Re: parsing task instance log files
Date Fri, 10 Feb 2017 19:49:30 GMT
Ah I see - sorry, I'm not a Hive user, so I didn't realize important info was
being stored exclusively in its logs. Seems annoying 😅

On Fri, Feb 10, 2017 at 2:22 PM, Boris Tyukin <boris@boristyukin.com> wrote:

> Hi Max, I was hoping you would chime in - thanks!
>
> This is exactly why I am doing it - I want to check row counts before I
> swap production table partitions with staging ones.
>
> Wow, I did not realize that run_cli returns the log - and I thought I had
> checked the source code! This definitely makes it easier for me to parse
> and looks like the best option for me at the moment.
>
> `return_stats=False` in `HiveCliHook.run_cli` - I like this idea, but I am
> still learning Python. Let me see if I can contribute to this! The only
> issue I see is that one can run multiple queries at a time, and then it
> would be hard to figure out the stats. Also, some queries might launch
> more than one job, so the question would be which job log to capture (or
> just the very last one, which would make it much easier).
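>
> To make sure I follow, parsing the returned log would look roughly like
> this - just a sketch, and the numRows pattern is my assumption, since the
> exact log format varies by Hive version:
>
> import re
>
> from airflow.hooks.hive_hooks import HiveCliHook
>
> hook = HiveCliHook(hive_cli_conn_id='hive_cli_default')
> # made-up staging query, just for illustration
> log = hook.run_cli(hql="INSERT OVERWRITE TABLE stg_t SELECT * FROM src_t")
> # newer Hive versions print table stats like [numRows=..., totalSize=...];
> # take the last match in case several jobs were launched
> matches = re.findall(r'numRows=(\d+)', log)
> row_count = int(matches[-1]) if matches else None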
>
> Great idea about using Presto for the count check - I don't have it, but
> I do have Impala, so I will give it a try. With Impala, though, I have to
> invalidate metadata for the table to get a proper count, which can
> sometimes be time consuming.
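>
> For the Impala check, I am thinking of something like this - a sketch
> assuming the impyla client and made-up host/table names:
>
> from impala.dbapi import connect
>
> conn = connect(host='impala-host.example.com', port=21050)
> cur = conn.cursor()
> # Impala caches metastore metadata, so refresh it first or the count
> # may miss the partitions Hive just wrote (REFRESH my_db.my_table is
> # cheaper when the table already exists)
> cur.execute('INVALIDATE METADATA my_db.my_table')
> cur.execute('SELECT COUNT(*) FROM my_db.my_table')
> (row_count,) = cur.fetchone()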
>
> I love the idea of a stats collection bot and can see how valuable it
> could be - I hope we will see some advances soon!
>
> Oh, and I just noticed you are the key contributor to Superset - I need
> to check it out asap then :) I hope it supports Impala.
>
>
> On Fri, Feb 10, 2017 at 1:00 PM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > A few related thoughts:
> >
> > * it sucks to have to parse Hive log files, though it seems like there
> > are no clean ways around it sometimes. We're doing something similar in
> > Superset to get an idea of the % of progress of a query
> > * really the HiveServer2 thrift client should allow returning structured
> > stats about the job after its execution, but that's not the case and
> > probably never will be...
> > * `HiveCliHook.run_cli` does return the log after it's done (now I'm
> > noticing it could yield as it goes instead btw...), which means you
> > don't need the `on_success_callback`; you can do what you need to do in
> > your PythonOperator, which will simplify things a little
> > * knowing that parsing Hive logs is a common, necessary pattern, I
> > wouldn't be opposed to adding some of the key elements to HiveCliHook.
> > Maybe there's a new arg `return_stats=False` in `HiveCliHook.run_cli`
> > that returns a dict with info about the query that just ran (rows
> > processed, cpu time, mappers / reducers, number of phases, ...)
> > * if you have Presto, you could issue a quick COUNT(*) query right after
> > your job; there's a sketch of the check step below this list. At Airbnb
> > we have a common subdag pattern that stages the data in Hive, runs a set
> > of data quality checks in Presto, and exchanges the partition in Hive
> > when it passes the tests
> >
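> > To make the Presto check concrete, a minimal sketch assuming Airflow's
> > PrestoHook, a `presto_default` connection, and made-up table names:
> >
> > from airflow.hooks.presto_hook import PrestoHook
> >
> > def check_staged_counts(**context):
> >     ph = PrestoHook(presto_conn_id='presto_default')
> >     # get_first returns the first result row as a tuple
> >     (count,) = ph.get_first(
> >         "SELECT COUNT(*) FROM stg_db.my_table WHERE ds = '2017-02-10'")
> >     if count == 0:
> >         raise ValueError('staged partition is empty, aborting exchange')
> >     return count
> >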
> > A total side note: we have a stats collection bot at Airbnb that finds
> > all the tables/partitions that have changed recently in the metastore
> > (every 5 minutes), and issues a single-scan Presto query that is
> > dynamically generated to get tons of stats for each column (% of null,
> > min, max, avg, count distinct, number of characters, ...) and stores the
> > results in MySQL. This is super useful for capacity planning, debugging,
> > data quality checks, anomaly detection, ... I was talking with Carlos
> > Bueno from Lyft yesterday, who might be interested in taking this code
> > (and perhaps other Airbnb projects), generalizing it, cleaning it up,
> > documenting it and open sourcing it. Most of these little projects are
> > stack specific and only useful to companies that happen to be running on
> > the same stack as we are.
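> >
> > The query generation part is roughly this - not the actual bot code,
> > just a sketch with made-up column names:
> >
> > def stats_query(table, columns):
> >     # one table scan, many aggregates per column
> >     exprs = []
> >     for c in columns:
> >         exprs += [
> >             "COUNT({c}) AS {c}__count".format(c=c),
> >             "COUNT(DISTINCT {c}) AS {c}__distinct".format(c=c),
> >             "MIN({c}) AS {c}__min".format(c=c),
> >             "MAX({c}) AS {c}__max".format(c=c),
> >         ]
> >     return "SELECT {} FROM {}".format(", ".join(exprs), table)
> >
> > print(stats_query("my_db.my_table", ["user_id", "created_at"]))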
> >
> > Max
> >
> >
> > On Fri, Feb 10, 2017 at 7:43 AM, Laura Lorenz <llorenz@industrydive.com>
> > wrote:
> >
> > > I don't use the HiveCliHook so I'm not sure how it works, but are the
> > > log files the only place you can retrieve these counts? If you have
> > > them at the time of the query in your python callable, you could push
> > > them anywhere you like inline at the conclusion of the task. Or, you
> > > may prefer to have your PythonOperator `return` some data structure
> > > with those counts, which will be stored by default in the airflow
> > > metadata database per the XCom system
> > > <https://airflow.incubator.apache.org/concepts.html#xcoms>. Then,
> > > depending on what you want to do with that, you could query those out
> > > of the metadata database with the ad-hoc querying or charting UIs
> > > right within Airflow, or in a later task altogether.
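> > >
> > > For example - a sketch with hypothetical task ids, assuming an
> > > existing `dag` object:
> > >
> > > from airflow.operators.python_operator import PythonOperator
> > >
> > > def load_data(**context):
> > >     # run the Hive queries here, then return the counts; the return
> > >     # value is stored as an XCom automatically
> > >     return {'my_table': 12345}
> > >
> > > def audit_counts(**context):
> > >     counts = context['ti'].xcom_pull(task_ids='load_data')
> > >     # write to an audit table, compare to thresholds, etc.
> > >     print(counts)
> > >
> > > load = PythonOperator(task_id='load_data', python_callable=load_data,
> > >                       provide_context=True, dag=dag)
> > > audit = PythonOperator(task_id='audit_counts',
> > >                        python_callable=audit_counts,
> > >                        provide_context=True, dag=dag)
> > > audit.set_upstream(load)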
> > >
> > > On Fri, Feb 10, 2017 at 8:58 AM, Boris Tyukin <boris@boristyukin.com>
> > > wrote:
> > >
> > > > please...?
> > > >
> > > > On Thu, Feb 9, 2017 at 8:35 AM, Boris Tyukin <boris@boristyukin.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am using HiveCliHook called from a PythonOperator to run a
> > > > > series of queries and want to capture record counts for auditing
> > > > > and validation purposes.
> > > > >
> > > > > I am thinking of using on_success_callback to invoke a python
> > > > > function that will read the log file produced by airflow and then
> > > > > parse it out using regex.
> > > > >
> > > > > I am going to use this method from models to get to the log file:
> > > > >
> > > > > def log_filepath(self):
> > > > >     iso = self.execution_date.isoformat()
> > > > >     log = os.path.expanduser(
> > > > >         configuration.get('core', 'BASE_LOG_FOLDER'))
> > > > >     return (
> > > > >         "{log}/{self.dag_id}/{self.task_id}/{iso}.log".format(
> > > > >             **locals()))
> > > > > Is this a good strategy, or is there an easier way? I am wondering
> > > > > if someone has done something similar.
> > > > >
> > > > > Another challenge is that the same log file contains multiple
> > > > > attempts and reruns of the same task, so I guess I need to parse
> > > > > the file backwards.
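> > > > >
> > > > > For the parsing, I have something like this in mind - a sketch,
> > > > > where the numRows pattern is my assumption about what my Hive
> > > > > version writes to the log:
> > > > >
> > > > > import re
> > > > >
> > > > > # log_path would come from log_filepath() above
> > > > > with open(log_path) as f:
> > > > >     log = f.read()
> > > > > # findall returns matches in order, so the last element belongs
> > > > > # to the most recent attempt/rerun
> > > > > matches = re.findall(r'numRows=(\d+)', log)
> > > > > row_count = int(matches[-1]) if matches else None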
> > > > >
> > > > > thanks,
> > > > > Boris
> > > > >
> > > >
> > >
> >
>
