beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3184) Not able to access GCS API when submitting Python jobs behind corporate firewall
Date Tue, 02 Jan 2018 19:51:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308610#comment-16308610 ]

ASF GitHub Bot commented on BEAM-3184:
--------------------------------------

chamikaramj closed pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
URL: https://github.com/apache/beam/pull/4136
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/sdks/python/apache_beam/io/gcp/gcsio.py b/sdks/python/apache_beam/io/gcp/gcsio.py
index 68ca0265601..4a4702d0d6a 100644
--- a/sdks/python/apache_beam/io/gcp/gcsio.py
+++ b/sdks/python/apache_beam/io/gcp/gcsio.py
@@ -87,6 +87,44 @@
 MAX_BATCH_OPERATION_SIZE = 100
 
 
+def proxy_info_from_environment_var(proxy_env_var):
+  """Reads proxy info from the environment and converts to httplib2.ProxyInfo.
+
+  Args:
+    proxy_env_var: environment variable string to read, http_proxy or
+       https_proxy (in lower case).
+       Example: http://myproxy.domain.com:8080
+
+  Returns:
+    httplib2.ProxyInfo constructed from the environment string.
+  """
+  proxy_url = os.environ.get(proxy_env_var)
+  if not proxy_url:
+    return None
+  proxy_protocol = proxy_env_var.lower().split('_')[0]
+  if not re.match('^https?://', proxy_url, flags=re.IGNORECASE):
+    logging.warn("proxy_info_from_url requires a protocol, which is always "
+                 "http or https.")
+    proxy_url = proxy_protocol + '://' + proxy_url
+  return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol)
+
+
+def get_new_http():
+  """Creates and returns a new httplib2.Http instance.
+
+  Returns:
+    An initialized httplib2.Http instance.
+  """
+  proxy_info = None
+  for proxy_env_var in ['http_proxy', 'https_proxy']:
+    if os.environ.get(proxy_env_var):
+      proxy_info = proxy_info_from_environment_var(proxy_env_var)
+      break
+  # Use a non-infinite SSL timeout to avoid hangs during network flakiness.
+  return httplib2.Http(proxy_info=proxy_info,
+                       timeout=DEFAULT_HTTP_TIMEOUT_SECONDS)
+
+
 def parse_gcs_path(gcs_path):
   """Return the bucket and object names of the given gs:// path."""
   match = re.match('^gs://([^/]+)/(.+)$', gcs_path)
@@ -118,7 +156,7 @@ def __new__(cls, storage_client=None):
         storage_client = storage.StorageV1(
             credentials=credentials,
             get_credentials=False,
-            http=httplib2.Http(timeout=DEFAULT_HTTP_TIMEOUT_SECONDS))
+            http=get_new_http())
         local_state.gcsio_instance = (
             super(GcsIO, cls).__new__(cls, storage_client))
         local_state.gcsio_instance.client = storage_client
@@ -234,7 +272,7 @@ def delete_batch(self, paths):
       request = storage.StorageObjectsDeleteRequest(
           bucket=bucket, object=object_path)
       batch_request.Add(self.client.objects, 'Delete', request)
-    api_calls = batch_request.Execute(self.client._http)  # pylint: disable=protected-access
+    api_calls = batch_request.Execute(self.client._http) # pylint: disable=protected-access
     result_statuses = []
     for i, api_call in enumerate(api_calls):
       path = paths[i]
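
The scheme handling and variable precedence in the diff above can be tried in isolation with a small standard-library sketch. The names normalize_proxy_url and first_proxy_url are hypothetical and are not part of the pull request, and the environment is passed in as a plain dict instead of being read from os.environ. The sketch stops before the httplib2.proxy_info_from_url() call and only mirrors the string logic that comes before it.

```python
import re


def normalize_proxy_url(proxy_env_var, environ):
  """Returns the proxy URL stored under proxy_env_var, or None if unset.

  Hypothetical helper: mirrors the string handling in
  proxy_info_from_environment_var() from the diff, without httplib2.
  """
  proxy_url = environ.get(proxy_env_var)
  if not proxy_url:
    return None
  # 'http_proxy' -> 'http', 'https_proxy' -> 'https'
  proxy_protocol = proxy_env_var.lower().split('_')[0]
  # Prepend the scheme implied by the variable name when it is missing.
  if not re.match('^https?://', proxy_url, flags=re.IGNORECASE):
    proxy_url = proxy_protocol + '://' + proxy_url
  return proxy_url


def first_proxy_url(environ):
  """Returns the first configured proxy URL, checking http_proxy first.

  Hypothetical helper: same lookup order as get_new_http() in the diff.
  """
  for proxy_env_var in ['http_proxy', 'https_proxy']:
    proxy_url = normalize_proxy_url(proxy_env_var, environ)
    if proxy_url:
      return proxy_url
  return None


if __name__ == '__main__':
  print(first_proxy_url({'https_proxy': 'myproxy.domain.com:8080'}))
```

As in the diff, http_proxy takes precedence when both variables are set, and a value without a scheme gets the scheme implied by the variable name.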


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Not able to access GCS API when submitting Python jobs behind corporate firewall
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-3184
>                 URL: https://issues.apache.org/jira/browse/BEAM-3184
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: David Sabater
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> We should modify the gcsio.py module in the Python SDK to add methods that pick up proxy settings from environment variables for the httplib2 library. This will allow submitting jobs from behind a corporate proxy.
> I have the fix implemented in my forked repository:
> https://github.com/dsdinter/beam/commit/83a54b5b5695783967a175c4623af31997e52b35



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
