From: David Lyle
Date: Mon, 16 Jan 2017 11:28:13 -0500
Subject: Re:
[DISCUSS] Moving GeoIP management away from MySQL
To: dev@metron.incubator.apache.org

I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
HBase enrichment. If our current caching isn't enough to mitigate the above
issues, we have a problem, don't we? Or do we not recommend HBase enrichment
for per-message enrichment in general?

Also, can you elaborate on how MapDB would not require a network hop?
Doesn't this mean we would have to sync the enrichment data to each Storm
supervisor? HDFS could (probably would) have a network hop too, no?

Fwiw -
"In its place, I've looked at using MapDB, which is a really easy to use
library for creating Java collections backed by a file (This is NOT a
separate installation of anything, it's just a jar that manages interaction
with the file system). Given the slow churn of the GeoIP files (I believe
they get updated once a week), we can have a script that can be run when
needed, downloads the MaxMind tar file, builds the MapDB file that will be
used by the bolts, and places it into HDFS. Finally, we update a config to
point to the new file, the bolts get the updated config callback and can
update their db files. Inside the code, we wrap the MapDB portions to make
it transparent to downstream code."

Seems a bit more complex than "refresh the hbase table". Afaik, either
approach would require some sort of translation between GeoIP source format
and target format, so that part is a wash imo.

So, I'd really like to see, at least, an attempt to leverage HBase
enrichment.

-D...

On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella wrote:

> I think that it's a sensible thing to use MapDB for the geo enrichment.
> Let me state my reasoning:
>
>    - An HBase implementation would necessitate an HBase scan possibly
>    hitting HDFS, which is expensive per-message.
>    - An HBase implementation would necessitate a network hop and MapDB
>    would not.
>
> I also think this might be the beginning of a more general purpose support
> in Stellar for locally shipped, read-only MapDB lookups, which might be
> interesting.
>
> In short, all quotes about premature optimization are sure to apply to my
> reasoning, but I can't help but have my spidey senses tingle when we
> introduce a scan-per-message architecture.
>
> Casey
>
> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov wrote:
>
> > Hello Justin,
> >
> > Considering that Metron uses hbase tables for storing enrichment and
> > threatintel feeds, can we use Hbase for geo enrichment as well?
> > Or MapDB can be used for enrichment and threatintel feeds instead of
> > hbase?
> >
> > - Dima
> >
> > On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > Hi all,
> > >
> > > As a bit of background, right now, GeoIP data is loaded into and
> > > managed by MySQL (the connectors are LGPL licensed and we need to
> > > sever our Maven dependency on it before next release). We currently
> > > depend on and install an instance of MySQL (in each of the Management
> > > Pack, Ansible, and Docker installs). In the topology, we use the
> > > JDBCAdapter to connect to MySQL and query for a given IP.
> > > Additionally, it's a single point of failure for that particular
> > > enrichment right now. If MySQL is down, geo enrichment can't occur.
> > >
> > > I'm proposing that we eliminate the use of MySQL entirely, through all
> > > installation paths (which, unless I missed some, includes Ansible, the
> > > Ambari Management Pack, and Docker). We'd do this by dropping all the
> > > various MySQL setup and management through the code, along with all
> > > the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to
> > > set up their own databases for enrichments and install connectors is
> > > able to do so.
> > >
> > > In its place, I've looked at using MapDB, which is a really easy to
> > > use library for creating Java collections backed by a file (This is
> > > NOT a separate installation of anything, it's just a jar that manages
> > > interaction with the file system). Given the slow churn of the GeoIP
> > > files (I believe they get updated once a week), we can have a script
> > > that can be run when needed, downloads the MaxMind tar file, builds
> > > the MapDB file that will be used by the bolts, and places it into
> > > HDFS. Finally, we update a config to point to the new file, the bolts
> > > get the updated config callback and can update their db files. Inside
> > > the code, we wrap the MapDB portions to make it transparent to
> > > downstream code.
> > >
> > > The particularly nice parts about using MapDB are its ease of use,
> > > plus it offers the utilities we need out of the box to be able to
> > > support the operations we need on this (Keep in mind the GeoIP files
> > > use IP ranges and we need to be able to easily grab the appropriate
> > > range).
> > >
> > > The main point of concern I have about this is that when we grab the
> > > HDFS file during an update, given that multiple JVMs can be running,
> > > we don't want them to clobber each other. I believe this can be
> > > avoided by simply using each worker's working directory to store the
> > > file (and appropriately ensure threads on the same JVM manage
> > > multithreading). This should keep the JVMs (and the underlying DB
> > > files) entirely independent.
> > >
> > > This script would get called by the various installations during
> > > startup to do the initial setup. After install, it can then be called
> > > on demand in order.
> > >
> > > At this point, we should be all set, with everything running and
> > > updatable.
> > >
> > > Justin
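
For anyone weighing the "grab the appropriate range" point in the proposal above, here is a rough, language-neutral sketch of what a range lookup over a sorted, read-only collection might look like. It is not MapDB code and not anything from the Metron codebase: the `RangeTable` class, the row layout, and the record contents are all made up for illustration, and it assumes the ranges do not overlap. The idea is a "floor" search: find the last range whose start is at or below the IP, then check the IP does not fall past that range's end.

```python
import bisect
import ipaddress


class RangeTable:
    """Hypothetical read-only lookup from IP ranges to location records."""

    def __init__(self, rows):
        # rows: iterable of (first_ip, last_ip, record); ranges are assumed
        # not to overlap. Sort by the numeric start of each range.
        self._rows = sorted(
            (
                (int(ipaddress.ip_address(first)),
                 int(ipaddress.ip_address(last)),
                 record)
                for first, last, record in rows
            ),
            key=lambda row: row[0],
        )
        self._starts = [row[0] for row in self._rows]

    def lookup(self, ip):
        """Return the record whose range contains ip, or None."""
        key = int(ipaddress.ip_address(ip))
        # Index of the last range starting at or before the key.
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            return None
        _, last, record = self._rows[i]
        # The key may sit in a gap between two ranges.
        return record if key <= last else None
```

A sorted map with a floor-style lookup gives the same behaviour in Java; whichever store ends up holding the data, the lookup stays local to the worker as long as the file is on local disk.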
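
On the concern in the proposal above about workers clobbering each other during an update, one possible shape is sketched below, under loudly stated assumptions: `LocalDbHolder` is a hypothetical name, `source_path` stands in for a file already fetched from HDFS (the HDFS fetch itself is not shown), and each worker is handed its own directory. The copy goes to a temporary file in that directory first and is then renamed into place, so readers in the same process never see a half-written file; a lock guards the swap between threads.

```python
import os
import shutil
import tempfile
import threading


class LocalDbHolder:
    """Hypothetical per-worker holder for a locally copied DB file."""

    def __init__(self, work_dir):
        self._work_dir = work_dir      # assumed unique per worker
        self._lock = threading.Lock()  # guards the swap between threads
        self._current = None

    def update(self, source_path):
        """Copy source_path into the worker directory and swap it in."""
        os.makedirs(self._work_dir, exist_ok=True)
        # Write to a temp file in the same directory so the final rename
        # stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=self._work_dir, suffix=".tmp")
        os.close(fd)
        shutil.copyfile(source_path, tmp_path)
        final_path = os.path.join(
            self._work_dir, os.path.basename(source_path))
        with self._lock:
            os.replace(tmp_path, final_path)  # rename over any old copy
            self._current = final_path
        return final_path

    def current(self):
        """Return the path of the file currently in use, or None."""
        with self._lock:
            return self._current
```

In the flow described in the thread, `update` would be driven from the config callback once the new file location is known. Whether a plain rename is safe while a memory-mapped DB file is still open depends on the store and the platform, so treat that part as an open question rather than a recommendation.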