sqoop-dev mailing list archives

From "Attila Szabo (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (SQOOP-2906) Optimization of AvroUtil.toAvroIdentifier
Date Wed, 11 May 2016 14:15:13 GMT

     [ https://issues.apache.org/jira/browse/SQOOP-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Szabo updated SQOOP-2906:
    Comment: was deleted

(was: Hi Joeri,

I joined the Sqoop community only a few weeks ago, so I may not see all of the pitfalls,
but let me raise a few suggestions/concerns:
Your fix seems to be okay, but I would suggest a few more changes on the processing side:
- First of all, I would not run the conversion for every column name, but rather create
a Map<String, String> holding the "original" vs. "converted" names, so that
in most cases we would only have to look up the name in O(1) time, rather than doing
the conversion every time (even if it is now much faster and cheaper).
- I would also skip converting those entry.getKey() values that get a hit in the schema
(schema.getField returns non-null), as in that case they are already valid names, but this
optimization may be negligible if you implement the first proposal.
- I was also considering doing the mapping in advance, before the import (after we have got the
DB metadata and the Avro schema), but for non-RDBMS systems it might cause problems (e.g. different
sets of columns for each row), so I'm not sure that would help, but from an algorithmic/clean
code POV that would be the cleanest solution if possible.

Could you tell me what you think about these suggestions?
My 2 cents,
Attila (Maugli))
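
The first suggestion in the deleted comment above (keeping a map of "original" vs. "converted" names) could look roughly like the sketch below. This is only an illustration, not code from Sqoop or from the attached patch: the class IdentifierCache, its methods, and the converter passed to it are all hypothetical names. In Sqoop the converter would presumably be AvroUtil.toAvroIdentifier.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of Sqoop): remembers already converted column names. */
class IdentifierCache {
    // "original" name -> "converted" name
    private final Map<String, String> converted = new HashMap<>();
    // The actual conversion; assumed to be a pure function of the name.
    private final Function<String, String> converter;
    // Counts how often the converter really ran (for illustration only).
    private int conversions = 0;

    IdentifierCache(Function<String, String> converter) {
        this.converter = converter;
    }

    /** Returns the converted name, running the converter only on the first request. */
    String lookup(String original) {
        String cached = converted.get(original);
        if (cached == null) {
            cached = converter.apply(original);
            conversions++;
            converted.put(original, cached);
        }
        return cached;
    }

    int conversionCount() {
        return conversions;
    }
}
```

With a cache like this, a column name that recurs in every record is converted once and then served from the map. HashMap is not thread-safe, so the sketch assumes one cache per single-threaded task.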

> Optimization of AvroUtil.toAvroIdentifier
> -----------------------------------------
>                 Key: SQOOP-2906
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2906
>             Project: Sqoop
>          Issue Type: Improvement
>            Reporter: Joeri Hermans
>            Assignee: Joeri Hermans
>              Labels: avro, hadoop, optimization
>         Attachments: diff.txt
> Hi all
> Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier
> method, more specifically, the use of Regex patterns. This can be directly observed from the
> FlameGraph generated by this profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg).
> We implemented an optimization, and compared this with the original method. On our testing
> machine, the optimization by itself is about 500% (on average) more efficient compared to
> the original implementation. We have yet to test how this optimization will influence the
> performance of user jobs.
> Any suggestions or remarks are welcome.
> Kind regards,
> Joeri
> https://github.com/apache/sqoop/pull/18
> Writeup:
> https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction
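
For readers who do not have the attachment at hand, the kind of change the issue description refers to (replacing regex patterns with a plain character scan) can be illustrated as follows. This is a sketch under assumptions, not the code from diff.txt or the pull request: the class IdentifierSketch and both methods are hypothetical, and the conversion rule used here (replace every character outside [A-Za-z0-9_] with an underscore, and prefix an underscore when the name starts with a digit) is chosen only because Avro names must match [A-Za-z_][A-Za-z0-9_]*.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration: the same name clean-up done with a regex and with a loop. */
final class IdentifierSketch {
    private static final Pattern INVALID = Pattern.compile("[^A-Za-z0-9_]");

    /** Regex-based variant: every invalid code point becomes a single underscore. */
    static String withRegex(String candidate) {
        String cleaned = INVALID.matcher(candidate).replaceAll("_");
        return startsWithDigit(cleaned) ? "_" + cleaned : cleaned;
    }

    /** Loop-based variant: one pass over the code points, no regex machinery. */
    static String withLoop(String candidate) {
        StringBuilder out = new StringBuilder(candidate.length() + 1);
        if (startsWithDigit(candidate)) {
            out.append('_');
        }
        for (int i = 0; i < candidate.length(); ) {
            int cp = candidate.codePointAt(i);
            boolean valid = (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
                    || (cp >= '0' && cp <= '9') || cp == '_';
            // Valid code points are ASCII, so the cast to char is safe.
            out.append(valid ? (char) cp : '_');
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean startsWithDigit(String s) {
        return !s.isEmpty() && s.charAt(0) >= '0' && s.charAt(0) <= '9';
    }
}
```

Both variants are meant to return the same string for the same input. Any speed difference between them would have to be measured, as the reporter did for the actual patch.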

This message was sent by Atlassian JIRA
