nutch-commits mailing list archives

From sna...@apache.org
Subject [nutch] branch master updated: NUTCH-2676 Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver NUTCH-2460 use the headless option of firefox and chrome in protocol-selenium - upgrade of Selenium plugin related packages - added the use of headless mode when using Selenium nodes(chrome & firefox) - obsolete code for Selenium plugin removed - fix of a bug occurring during the build of the Nutch docker container - added possibility to use a Selenium Hub orchestrator [...]
Date Sat, 23 Feb 2019 23:06:50 GMT
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
     new 8f421a4  NUTCH-2676 Update to the latest selenium and add code to use chrome and
firefox headless mode with the remote web driver NUTCH-2460 use the headless option of firefox
and chrome in protocol-selenium - upgrade of Selenium plugin related packages - added the
use of headless mode when using Selenium nodes(chrome & firefox) - obsolete code for Selenium
plugin removed - fix of a bug occurring during the build of  the Nutch docker container -
added possibility to use a Seleniu [...]
     new dfd8602  Merge pull request #430 from sbatururimi/NUTCH-2676
8f421a4 is described below

commit 8f421a4114f2d3e5be8726ca735766c6b9b19dbb
Author: Stas Batururimi <s.batururimi@gmail.com>
AuthorDate: Thu Nov 15 12:12:58 2018 +0000

    NUTCH-2676 Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
    NUTCH-2460 use the headless option of firefox and chrome in protocol-selenium
    - upgrade of Selenium plugin related packages
    - added the use of headless mode when using Selenium nodes (chrome & firefox)
    - obsolete code for Selenium plugin removed
    - fix of a bug occurring during the build of the Nutch docker container
    - added possibility to use a Selenium Hub orchestrator in multi-containers docker mode
    - added several examples of using Nutch+Solr+Selenium Hub+Selenium Nodes in a network of Docker containers
---
 .gitignore                                         |   1 +
 conf/nutch-default.xml                             |  26 +-
 src/plugin/lib-selenium/README.md                  |  13 +
 src/plugin/lib-selenium/build-ivy.xml              |   2 +-
 src/plugin/lib-selenium/ivy.xml                    |  11 +-
 src/plugin/lib-selenium/plugin.xml                 | 120 ++-----
 .../nutch/protocol/selenium/HttpWebClient.java     | 352 +++++++++++++--------
 7 files changed, 286 insertions(+), 239 deletions(-)
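
For readers skimming the file list: when `selenium.driver` is `remote`, the updated `HttpWebClient.java` assembles the hub address from the four `selenium.hub.*` properties, falling back to `http`, `localhost`, `4444` and `/wd/hub`. The following is a minimal sketch of that assembly, not the committed code. The class name `HubUrlSketch`, the helper `hubUrl`, the plain `Map` standing in for Hadoop's `Configuration`, and the host name `selenium-hub` are all illustrative assumptions.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;

// Illustrative only: a plain Map stands in for org.apache.hadoop.conf.Configuration.
class HubUrlSketch {

  // Builds the hub URL from the selenium.hub.* settings, using the same
  // property names and defaults that getDriverForPage() reads in the diff.
  static URL hubUrl(Map<String, String> conf) {
    String protocol = conf.getOrDefault("selenium.hub.protocol", "http");
    String host = conf.getOrDefault("selenium.hub.host", "localhost");
    int port = Integer.parseInt(conf.getOrDefault("selenium.hub.port", "4444"));
    String path = conf.getOrDefault("selenium.hub.path", "/wd/hub");
    try {
      return new URL(protocol, host, port, path);
    } catch (MalformedURLException e) {
      // Surface a bad protocol as an unchecked error to keep the sketch short.
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    // No overrides: all four defaults apply.
    System.out.println(hubUrl(Map.of()));
    // "selenium-hub" is a hypothetical container host name.
    System.out.println(hubUrl(Map.of("selenium.hub.host", "selenium-hub")));
  }
}
```

With no overrides this yields `http://localhost:4444/wd/hub`; in a multi-container setup one would presumably override only `selenium.hub.host` in `nutch-site.xml`.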

diff --git a/.gitignore b/.gitignore
index 732ca05..61e42e0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,3 +13,4 @@ ivy/ivy-2.3.0.jar
 ivy/ivy-2.4.0.jar
 ivy/ivy-2.5.0-rc1.jar
 naivebayes-model
+.gitconfig
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 97e1801..dadf30d 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2525,10 +2525,11 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
   <description>
     A String value representing the flavour of Selenium 
     WebDriver() to use. Currently the following options
-    exist - 'firefox', 'chrome', 'safari', 'opera', 'phantomjs' and 'remote'.
+    exist - 'firefox', 'chrome', 'safari', 'opera' and 'remote'.
     If 'remote' is used it is essential to also set correct properties for
     'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host',
-    'selenium.hub.protocol', 'selenium.grid.driver' and 'selenium.grid.binary'.
+    'selenium.hub.protocol', 'selenium.grid.driver', 'selenium.grid.binary'
+    and 'selenium.enable.headless'.
   </description>
 </property>
 
@@ -2560,8 +2561,9 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
   <name>selenium.grid.driver</name>
   <value>firefox</value>
   <description>A String value representing the flavour of Selenium 
-    WebDriver() used on the selenium grid. Currently the following options
-    exist - 'firefox', 'phantomjs' </description>
+    WebDriver() used on the selenium grid. We must set `selenium.driver` to `remote` first.
+    Currently the following options
+    exist - 'firefox', 'chrome', 'random' </description>
 </property>
 
 <property>
@@ -2572,6 +2574,14 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
  </description>
 </property>
 
+<!-- headless options for Firefox and Chrome-->
+<property>
+  <name>selenium.enable.headless</name>
+  <value>false</value>
+  <description>A Boolean value representing the headless option
+    for Firefox and Chrome drivers
+  </description>
+</property>
 <!-- selenium firefox configuration; 
      applies to protocol-selenium and protocol-interactiveselenium plugins -->
 <property>
@@ -2622,6 +2632,14 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
   Currently this option exist for - 'firefox' </description>
 </property>
 
+<!-- selenium chrome configurations -->
+<property>
+  <name>webdriver.chrome.driver</name>
+  <value>/root/chromedriver</value>
+  <description>The path to the ChromeDriver binary</description>
+</property>
+<!-- end of selenium chrome configurations -->
+
 <!-- protocol-interactiveselenium configuration -->
 <property>
   <name>interactiveselenium.handlers</name>
diff --git a/src/plugin/lib-selenium/README.md b/src/plugin/lib-selenium/README.md
new file mode 100644
index 0000000..1c6b37c
--- /dev/null
+++ b/src/plugin/lib-selenium/README.md
@@ -0,0 +1,13 @@
+# Updates
+* The use of phantomjs has been deprecated. Check [Wikipedia](https://en.wikipedia.org/wiki/PhantomJS) for more info.
+* The updated code for the Safari webdriver is under development: starting with Safari 10 on OS X El Capitan and macOS Sierra, Safari comes bundled with a new driver implementation.
+* The Opera driver is now derived from ChromeDriver and adapted by Opera to enable programmatic automation of Chromium-based Opera products, but it hasn't been updated since April 5, 2017. Its support has been suspended and it has been removed from the code. ([link](https://github.com/operasoftware/operachromiumdriver))
+* Headless mode has been added for Chrome and Firefox. Set `selenium.enable.headless` to `true` in nutch-default.xml or nutch-site.xml to use it.
+
+
+You can run Nutch in Docker. Check some examples at https://github.com/sbatururimi/nutch-test.
+Don't forget to update the Dockerfile to point to the original Nutch repository when updated.
+
+# Contributors
+Stas Batururimi [s.batururimi@gmail.com]
+
diff --git a/src/plugin/lib-selenium/build-ivy.xml b/src/plugin/lib-selenium/build-ivy.xml
index 3abcf6d..fe919e5 100644
--- a/src/plugin/lib-selenium/build-ivy.xml
+++ b/src/plugin/lib-selenium/build-ivy.xml
@@ -17,7 +17,7 @@
 -->
 <project name="lib-selenium" default="deps-jar" xmlns:ivy="antlib:org.apache.ivy.ant">
 
-    <property name="ivy.install.version" value="2.1.0" />
+    <property name="ivy.install.version" value="2.4.0" />
     <condition property="ivy.home" value="${env.IVY_HOME}">
       <isset property="env.IVY_HOME" />
     </condition>
diff --git a/src/plugin/lib-selenium/ivy.xml b/src/plugin/lib-selenium/ivy.xml
index 701b725..d70dfaf 100644
--- a/src/plugin/lib-selenium/ivy.xml
+++ b/src/plugin/lib-selenium/ivy.xml
@@ -37,16 +37,13 @@
 
   <dependencies>
     <!-- begin selenium dependencies -->
-    <dependency org="org.seleniumhq.selenium" name="selenium-java" rev="2.48.2" />
-    
+    <dependency org="org.seleniumhq.selenium" name="selenium-java" rev="3.141.5" />
+    <!-- 
     <dependency org="com.opera" name="operadriver" rev="1.5">
       <exclude org="org.seleniumhq.selenium" name="selenium-remote-driver" />
     </dependency>
-    <dependency org="com.codeborne" name="phantomjsdriver" rev="1.2.1" >
-      <exclude org="org.seleniumhq.selenium" name="selenium-remote-driver" />
-      <exclude org="org.seleniumhq.selenium" name="selenium-java" />
-    </dependency>
+    -->
     <!-- end selenium dependencies -->
   </dependencies>
-  
+
 </ivy-module>
diff --git a/src/plugin/lib-selenium/plugin.xml b/src/plugin/lib-selenium/plugin.xml
index a86d665..bf50ca0 100644
--- a/src/plugin/lib-selenium/plugin.xml
+++ b/src/plugin/lib-selenium/plugin.xml
@@ -29,147 +29,65 @@
         <export name="*"/>
      </library>
      <!-- all classes from dependent libraries are exported -->
-     <library name="cglib-nodep-2.1_3.jar">
+     <library name="animal-sniffer-annotations-1.14.jar">
        <export name="*"/>
      </library>
-     <library name="commons-codec-1.10.jar">
+     <library name="byte-buddy-1.8.15.jar">
        <export name="*"/>
      </library>
-     <library name="commons-collections-3.2.1.jar">
+     <library name="checker-compat-qual-2.0.0.jar">
        <export name="*"/>
      </library>
      <library name="commons-exec-1.3.jar">
        <export name="*"/>
      </library>
-     <library name="commons-io-2.4.jar">
+     <library name="error_prone_annotations-2.1.3.jar">
        <export name="*"/>
      </library>
-     <library name="commons-jxpath-1.3.jar">
+     <library name="guava-25.0-jre.jar">
        <export name="*"/>
      </library>
-     <library name="commons-lang3-3.4.jar">
+     <library name="j2objc-annotations-1.1.jar">
        <export name="*"/>
      </library>
-     <library name="commons-logging-1.2.jar">
+     <library name="jsr305-1.3.9.jar">
        <export name="*"/>
      </library>
-     <library name="cssparser-0.9.16.jar">
+     <library name="okhttp-3.11.0.jar">
        <export name="*"/>
      </library>
-     <library name="gson-2.3.1.jar">
+     <library name="okio-1.14.0.jar">
        <export name="*"/>
      </library>
-     <library name="guava-18.0.jar">
+     <library name="selenium-api-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="htmlunit-2.18.jar">
+     <library name="selenium-chrome-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="htmlunit-core-js-2.17.jar">
+     <library name="selenium-edge-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="httpclient-4.5.1.jar">
+     <library name="selenium-firefox-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="httpcore-4.4.3.jar">
+     <library name="selenium-ie-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="httpmime-4.5.jar">
+     <library name="selenium-java-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="ini4j-0.5.2.jar">
+     <library name="selenium-opera-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="jetty-io-9.2.12.v20150709.jar">
+     <library name="selenium-remote-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="jetty-util-9.2.12.v20150709.jar">
+     <library name="selenium-safari-driver-3.141.5.jar">
        <export name="*"/>
      </library>
-     <library name="jna-4.1.0.jar">
-       <export name="*"/>
-     </library>
-     <library name="jna-platform-4.1.0.jar">
-       <export name="*"/>
-     </library>
-     <library name="nekohtml-1.9.22.jar">
-       <export name="*"/>
-     </library>
-     <library name="netty-3.5.2.Final.jar">
-       <export name="*"/>
-     </library>
-     <library name="operadriver-1.5.jar">
-       <export name="*"/>
-     </library>
-     <library name="operalaunchers-1.1.jar">
-       <export name="*"/>
-     </library>
-     <library name="phantomjsdriver-1.2.1.jar">
-       <export name="*"/>
-     </library>
-     <library name="protobuf-java-2.4.1.jar">
-       <export name="*"/>
-     </library>
-     <library name="sac-1.3.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-api-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-chrome-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-edge-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-firefox-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-htmlunit-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-ie-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-java-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-leg-rc-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-remote-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-safari-driver-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="selenium-support-2.48.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="serializer-2.7.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="webbit-0.4.14.jar">
-       <export name="*"/>
-     </library>
-     <library name="websocket-api-9.2.12.v20150709.jar">
-       <export name="*"/>
-     </library>
-     <library name="websocket-client-9.2.12.v20150709.jar">
-       <export name="*"/>
-     </library>
-     <library name="websocket-common-9.2.12.v20150709.jar">
-       <export name="*"/>
-     </library>
-     <library name="xalan-2.7.2.jar">
-       <export name="*"/>
-     </library>
-     <library name="xercesImpl-2.11.0.jar">
-       <export name="*"/>
-     </library>
-     <library name="xml-apis-1.4.01.jar">
+     <library name="selenium-support-3.141.5.jar">
        <export name="*"/>
      </library>
    </runtime>
-
 </plugin>
diff --git a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
index 6e137f9..6af20b0 100644
--- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
+++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
@@ -24,182 +24,274 @@ import java.io.InputStream;
 import java.io.OutputStream;
 import java.net.URL;
 import java.util.concurrent.TimeUnit;
+import java.util.Random;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IOUtils;
+
 import org.openqa.selenium.By;
+import org.openqa.selenium.Capabilities;
 import org.openqa.selenium.OutputType;
 import org.openqa.selenium.TakesScreenshot;
 import org.openqa.selenium.TimeoutException;
 import org.openqa.selenium.WebDriver;
+
 import org.openqa.selenium.chrome.ChromeDriver;
-import org.openqa.selenium.firefox.FirefoxBinary;
+import org.openqa.selenium.chrome.ChromeOptions;
+
+//import org.openqa.selenium.firefox.FirefoxBinary;
 import org.openqa.selenium.firefox.FirefoxDriver;
-import org.openqa.selenium.firefox.FirefoxProfile;
+//import org.openqa.selenium.firefox.FirefoxProfile;
+import org.openqa.selenium.firefox.FirefoxOptions;
+
 import org.openqa.selenium.io.TemporaryFilesystem;
+
 import org.openqa.selenium.remote.DesiredCapabilities;
 import org.openqa.selenium.remote.RemoteWebDriver;
-import org.openqa.selenium.safari.SafariDriver;
-import org.openqa.selenium.phantomjs.PhantomJSDriver;
-import org.openqa.selenium.phantomjs.PhantomJSDriverService;
+
+//import org.openqa.selenium.safari.SafariDriver;
+
+//import org.openqa.selenium.phantomjs.PhantomJSDriver;
+//import org.openqa.selenium.phantomjs.PhantomJSDriverService;
+
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-import com.opera.core.systems.OperaDriver;
+import org.openqa.selenium.opera.OperaOptions;
+import org.openqa.selenium.opera.OperaDriver;
+//import com.opera.core.systems.OperaDriver;
 
 public class HttpWebClient {
 
   private static final Logger LOG = LoggerFactory
       .getLogger(MethodHandles.lookup().lookupClass());
 
-  public static ThreadLocal<WebDriver> threadWebDriver = new ThreadLocal<WebDriver>() {
-
-    @Override
-    protected WebDriver initialValue()
-    {
-      FirefoxProfile profile = new FirefoxProfile();
-      profile.setPreference("permissions.default.stylesheet", 2);
-      profile.setPreference("permissions.default.image", 2);
-      profile.setPreference("dom.ipc.plugins.enabled.libflashplayer.so", "false");
-      profile.setPreference(FirefoxProfile.ALLOWED_HOSTS_PREFERENCE, "localhost");
-      WebDriver driver = new FirefoxDriver(profile);
-      return driver;          
-    };
-  };
-
   public static WebDriver getDriverForPage(String url, Configuration conf) {
-      WebDriver driver = null;
-      DesiredCapabilities capabilities = null;
-      long pageLoadWait = conf.getLong("page.load.delay", 3);
+    WebDriver driver = null;
+    long pageLoadWait = conf.getLong("page.load.delay", 3);
 
-      try {
-        String driverType  = conf.get("selenium.driver", "firefox");
-        switch (driverType) {
-          case "firefox":
-          	String allowedHost = conf.get("selenium.firefox.allowed.hosts", "localhost");
-          	long firefoxBinaryTimeout = conf.getLong("selenium.firefox.binary.timeout", 45);
-          	boolean enableFlashPlayer = conf.getBoolean("selenium.firefox.enable.flash", false);
-          	int loadImage = conf.getInt("selenium.firefox.load.image", 1);
-          	int loadStylesheet = conf.getInt("selenium.firefox.load.stylesheet", 1);
-    		    FirefoxProfile profile = new FirefoxProfile();
-    		    FirefoxBinary binary = new FirefoxBinary();
-    		    profile.setPreference(FirefoxProfile.ALLOWED_HOSTS_PREFERENCE, allowedHost);
-    		    profile.setPreference("dom.ipc.plugins.enabled.libflashplayer.so", enableFlashPlayer);
-    		    profile.setPreference("permissions.default.stylesheet", loadStylesheet);
-  	      	profile.setPreference("permissions.default.image", loadImage);
-    		    binary.setTimeout(TimeUnit.SECONDS.toMillis(firefoxBinaryTimeout));
-            driver = new FirefoxDriver(binary, profile);
-            break;
-          case "chrome":
-            driver = new ChromeDriver();
-            break;
-          case "safari":
-            driver = new SafariDriver();
-            break;
-          case "opera":
-            driver = new OperaDriver();
-            break;
-          case "phantomjs":
-            driver = new PhantomJSDriver();
-            break;
-          case "remote":
-            String seleniumHubHost = conf.get("selenium.hub.host", "localhost");
-            int seleniumHubPort = Integer.parseInt(conf.get("selenium.hub.port", "4444"));
-            String seleniumHubPath = conf.get("selenium.hub.path", "/wd/hub");
-            String seleniumHubProtocol = conf.get("selenium.hub.protocol", "http");
-            String seleniumGridDriver = conf.get("selenium.grid.driver","firefox");
-            String seleniumGridBinary = conf.get("selenium.grid.binary");
-
-            switch (seleniumGridDriver){
-              case "firefox":
-                capabilities = DesiredCapabilities.firefox();
-                capabilities.setBrowserName("firefox");
-                capabilities.setJavascriptEnabled(true);
-                capabilities.setCapability("firefox_binary",seleniumGridBinary);
-                System.setProperty("webdriver.reap_profile", "false");
-                driver = new RemoteWebDriver(new URL(seleniumHubProtocol, seleniumHubHost, seleniumHubPort, seleniumHubPath), capabilities);
-                break;
-              case "phantomjs":
-                capabilities = DesiredCapabilities.phantomjs();
-                capabilities.setBrowserName("phantomjs");
-                capabilities.setJavascriptEnabled(true);
-                capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,seleniumGridBinary);
-                driver = new RemoteWebDriver(new URL(seleniumHubProtocol, seleniumHubHost, seleniumHubPort, seleniumHubPath), capabilities);
-                break;
-              default:
-                LOG.error("The Selenium Grid WebDriver choice {} is not available... defaulting to FirefoxDriver().", driverType);
-                driver = new RemoteWebDriver(new URL(seleniumHubProtocol, seleniumHubHost, seleniumHubPort, seleniumHubPath), DesiredCapabilities.firefox());
-                break;
-            }
-            break;
-          default:
-            LOG.error("The Selenium WebDriver choice {} is not available... defaulting to FirefoxDriver().", driverType);
-            driver = new FirefoxDriver();
-            break;
+    try {
+      String driverType = conf.get("selenium.driver", "firefox");
+      boolean enableHeadlessMode = conf.getBoolean("selenium.enable.headless",
+          false);
+
+      switch (driverType) {
+      case "firefox":
+        String geckoDriverPath = conf.get("selenium.grid.binary",
+            "/root/geckodriver");
+        driver = createFirefoxWebDriver(geckoDriverPath, enableHeadlessMode);
+        break;
+      case "chrome":
+        String chromeDriverPath = conf.get("selenium.grid.binary",
+            "/root/chromedriver");
+        driver = createChromeWebDriver(chromeDriverPath, enableHeadlessMode);
+        break;
+      // case "opera":
+      // // This class is provided as a convenience for easily testing the
+      // Chrome browser.
+      // String operaDriverPath = conf.get("selenium.grid.binary",
+      // "/root/operadriver");
+      // driver = createOperaWebDriver(operaDriverPath, enableHeadlessMode);
+      // break;
+      case "remote":
+        String seleniumHubHost = conf.get("selenium.hub.host", "localhost");
+        int seleniumHubPort = Integer
+            .parseInt(conf.get("selenium.hub.port", "4444"));
+        String seleniumHubPath = conf.get("selenium.hub.path", "/wd/hub");
+        String seleniumHubProtocol = conf.get("selenium.hub.protocol", "http");
+        URL seleniumHubUrl = new URL(seleniumHubProtocol, seleniumHubHost,
+            seleniumHubPort, seleniumHubPath);
+
+        String seleniumGridDriver = conf.get("selenium.grid.driver", "firefox");
+
+        switch (seleniumGridDriver) {
+        case "firefox":
+          driver = createFirefoxRemoteWebDriver(seleniumHubUrl,
+              enableHeadlessMode);
+          break;
+        case "chrome":
+          driver = createChromeRemoteWebDriver(seleniumHubUrl,
+              enableHeadlessMode);
+          break;
+        case "random":
+          driver = createRandomRemoteWebDriver(seleniumHubUrl,
+              enableHeadlessMode);
+          break;
+        default:
+          LOG.error(
+              "The Selenium Grid WebDriver choice {} is not available... defaulting to FirefoxDriver().",
+              driverType);
+          driver = createDefaultRemoteWebDriver(seleniumHubUrl,
+              enableHeadlessMode);
+          break;
         }
-        LOG.debug("Selenium {} WebDriver selected.", driverType);
-  
-        driver.manage().timeouts().pageLoadTimeout(pageLoadWait, TimeUnit.SECONDS);
-        driver.get(url);
-      } catch (Exception e) {
-			  if(e instanceof TimeoutException) {
-          LOG.debug("Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...");
-          return driver;
-			  }
-			  cleanUpDriver(driver);
-		    throw new RuntimeException(e);
-	    } 
-
-      return driver;
-  }
+        break;
+      default:
+        LOG.error(
+            "The Selenium WebDriver choice {} is not available... defaulting to FirefoxDriver().",
+            driverType);
+        FirefoxOptions options = new FirefoxOptions();
+        driver = new FirefoxDriver(options);
+        break;
+      }
+      LOG.debug("Selenium {} WebDriver selected.", driverType);
 
-  public static String getHTMLContent(WebDriver driver, Configuration conf) {
-      if (conf.getBoolean("take.screenshot", false)) {
-        takeScreenshot(driver, conf);
+      driver.manage().timeouts().pageLoadTimeout(pageLoadWait,
+          TimeUnit.SECONDS);
+      driver.get(url);
+    } catch (Exception e) {
+      if (e instanceof TimeoutException) {
+        LOG.error(
+            "Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...");
+        return driver;
+      } else {
+        LOG.error(e.toString());
       }
+      cleanUpDriver(driver);
+      throw new RuntimeException(e);
+    }
+
+    return driver;
+  }
+
+  public static WebDriver createFirefoxWebDriver(String firefoxDriverPath,
+      boolean enableHeadlessMode) {
+    System.setProperty("webdriver.gecko.driver", firefoxDriverPath);
+    FirefoxOptions firefoxOptions = new FirefoxOptions();
+    if (enableHeadlessMode) {
+      firefoxOptions.addArguments("--headless");
+    }
+    WebDriver driver = new FirefoxDriver(firefoxOptions);
+    return driver;
+  }
 
-      return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+  public static WebDriver createChromeWebDriver(String chromeDriverPath,
+      boolean enableHeadlessMode) {
+    // if not specified, WebDriver will search your path for chromedriver
+    System.setProperty("webdriver.chrome.driver", chromeDriverPath);
+    ChromeOptions chromeOptions = new ChromeOptions();
+    chromeOptions.addArguments("--no-sandbox");
+    chromeOptions.addArguments("--disable-extensions");
+    // be sure to set selenium.enable.headless to true if no monitor attached
+    // to your server
+    if (enableHeadlessMode) {
+      chromeOptions.addArguments("--headless");
+    }
+    WebDriver driver = new ChromeDriver(chromeOptions);
+    return driver;
+  }
+
+  public static WebDriver createOperaWebDriver(String operaDriverPath,
+      boolean enableHeadlessMode) {
+    // if not specified, WebDriver will search your path for operadriver
+    System.setProperty("webdriver.opera.driver", operaDriverPath);
+    OperaOptions operaOptions = new OperaOptions();
+    // operaOptions.setBinary("/usr/bin/opera");
+    operaOptions.addArguments("--no-sandbox");
+    operaOptions.addArguments("--disable-extensions");
+    // be sure to set selenium.enable.headless to true if no monitor attached
+    // to your server
+    if (enableHeadlessMode) {
+      operaOptions.addArguments("--headless");
+    }
+    WebDriver driver = new OperaDriver(operaOptions);
+    return driver;
+  }
+
+  public static RemoteWebDriver createFirefoxRemoteWebDriver(URL seleniumHubUrl,
+      boolean enableHeadlessMode) {
+    FirefoxOptions firefoxOptions = new FirefoxOptions();
+    if (enableHeadlessMode) {
+      firefoxOptions.setHeadless(true);
+    }
+    RemoteWebDriver driver = new RemoteWebDriver(seleniumHubUrl,
+        firefoxOptions);
+    return driver;
+  }
+
+  public static RemoteWebDriver createChromeRemoteWebDriver(URL seleniumHubUrl,
+      boolean enableHeadlessMode) {
+    ChromeOptions chromeOptions = new ChromeOptions();
+    if (enableHeadlessMode) {
+      chromeOptions.setHeadless(true);
+    }
+    RemoteWebDriver driver = new RemoteWebDriver(seleniumHubUrl, chromeOptions);
+    return driver;
+  }
+
+  public static RemoteWebDriver createRandomRemoteWebDriver(URL seleniumHubUrl,
+      boolean enableHeadlessMode) {
+    // we consider a possibility of generating only 2 types of browsers: Firefox
+    // and
+    // Chrome only
+    Random r = new Random();
+    int min = 0;
+    // we have actually hardcoded the maximum number of types of web driver that
+    // can
+    // be created
+    // but this must be later moved to the configuration file in order to be
+    // able
+    // to randomly choose between much more types(ex: Edge, Opera, Safari)
+    int max = 1; // for 3 types, change to 2 and update the if-clause
+    int num = r.nextInt((max - min) + 1) + min;
+    if (num == 0) {
+      return createFirefoxRemoteWebDriver(seleniumHubUrl, enableHeadlessMode);
+    }
+
+    return createChromeRemoteWebDriver(seleniumHubUrl, enableHeadlessMode);
+  }
+
+  public static RemoteWebDriver createDefaultRemoteWebDriver(URL seleniumHubUrl,
+      boolean enableHeadlessMode) {
+    return createFirefoxRemoteWebDriver(seleniumHubUrl, enableHeadlessMode);
   }
 
   public static void cleanUpDriver(WebDriver driver) {
     if (driver != null) {
       try {
-	      driver.close();
+        // driver.close();
         driver.quit();
         TemporaryFilesystem.getDefaultTmpFS().deleteTemporaryFiles();
       } catch (Exception e) {
-        throw new RuntimeException(e);
+        LOG.error(e.toString());
+        // throw new RuntimeException(e);
       }
     }
   }
 
   /**
-   * Function for obtaining the HTML BODY using the selected
-   * <a href='https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html'>selenium webdriver</a>
-   * There are a number of configuration properties within
-   * <code>nutch-site.xml</code> which determine whether to
-   * take screenshots of the rendered pages and persist them
-   * as timestamped .png's into HDFS.
-   * @param url the URL to fetch and render
-   * @param conf the {@link org.apache.hadoop.conf.Configuration}
+   * Function for obtaining the HTML BODY using the selected <a href=
+   * 'https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html'>selenium
+   * webdriver</a> There are a number of configuration properties within
+   * <code>nutch-site.xml</code> which determine whether to take screenshots of
+   * the rendered pages and persist them as timestamped .png's into HDFS.
+   * 
+   * @param url
+   *          the URL to fetch and render
+   * @param conf
+   *          the {@link org.apache.hadoop.conf.Configuration}
    * @return the rendered inner HTML page
    */
   public static String getHtmlPage(String url, Configuration conf) {
     WebDriver driver = getDriverForPage(url, conf);
-    
+
     try {
       if (conf.getBoolean("take.screenshot", false)) {
         takeScreenshot(driver, conf);
       }
 
-      String innerHtml = driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+      String innerHtml = driver.findElement(By.tagName("body"))
+          .getAttribute("innerHTML");
       return innerHtml;
 
-      // I'm sure this catch statement is a code smell ; borrowing it from lib-htmlunit
+      // I'm sure this catch statement is a code smell ; borrowing it from
+      // lib-htmlunit
     } catch (Exception e) {
       TemporaryFilesystem.getDefaultTmpFS().deleteTemporaryFiles();
+      // throw new RuntimeException(e);
+      LOG.error("getHtmlPage(url, conf): " + e.toString());
       throw new RuntimeException(e);
     } finally {
       cleanUpDriver(driver);
@@ -213,24 +305,32 @@ public class HttpWebClient {
   private static void takeScreenshot(WebDriver driver, Configuration conf) {
     try {
       String url = driver.getCurrentUrl();
-      File srcFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
+      File srcFile = ((TakesScreenshot) driver)
+          .getScreenshotAs(OutputType.FILE);
       LOG.debug("In-memory screenshot taken of: {}", url);
       FileSystem fs = FileSystem.get(conf);
       if (conf.get("screenshot.location") != null) {
-        Path screenshotPath = new Path(conf.get("screenshot.location") + "/" + srcFile.getName());
+        Path screenshotPath = new Path(
+            conf.get("screenshot.location") + "/" + srcFile.getName());
         OutputStream os = null;
         if (!fs.exists(screenshotPath)) {
-          LOG.debug("No existing screenshot already exists... creating new file at {} {}.", screenshotPath, srcFile.getName());
+          LOG.debug(
+              "No existing screenshot already exists... creating new file at {} {}.",
+              screenshotPath, srcFile.getName());
           os = fs.create(screenshotPath);
         }
         InputStream is = new BufferedInputStream(new FileInputStream(srcFile));
         IOUtils.copyBytes(is, os, conf);
-        LOG.debug("Screenshot for {} successfully saved to: {} {}", url, screenshotPath, srcFile.getName());
+        LOG.debug("Screenshot for {} successfully saved to: {} {}", url,
+            screenshotPath, srcFile.getName());
       } else {
-        LOG.warn("Screenshot for {} not saved to HDFS (subsequently disgarded) as value for "
-            + "'screenshot.location' is absent from nutch-site.xml.", url);
+        LOG.warn(
+            "Screenshot for {} not saved to HDFS (subsequently disgarded) as value for "
+                + "'screenshot.location' is absent from nutch-site.xml.",
+            url);
       }
     } catch (Exception e) {
+      LOG.error("Error taking screenshot: ", e);
       cleanUpDriver(driver);
       throw new RuntimeException(e);
     }
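
A note on the `random` grid driver added above: `createRandomRemoteWebDriver` draws `r.nextInt((max - min) + 1) + min` with `min = 0` and `max = 1`, which is a uniform choice between 0 (Firefox) and 1 (Chrome). The sketch below isolates that selection so it can be run without a Selenium hub. It returns a browser name rather than a `RemoteWebDriver`; the class `RandomGridDriverSketch` and the method `pickBrowser` are hypothetical names introduced only for illustration.

```java
import java.util.Random;

// Illustrative only: returns a browser name instead of creating a RemoteWebDriver.
class RandomGridDriverSketch {

  // Mirrors the arithmetic in createRandomRemoteWebDriver(): with min = 0 and
  // max = 1, nextInt(2) yields 0 or 1 with equal probability.
  static String pickBrowser(Random r) {
    int min = 0;
    int max = 1; // two browser types, as hardcoded in the commit
    int num = r.nextInt((max - min) + 1) + min;
    return num == 0 ? "firefox" : "chrome";
  }

  public static void main(String[] args) {
    Random r = new Random();
    int firefox = 0;
    int draws = 1000;
    for (int i = 0; i < draws; i++) {
      if (pickBrowser(r).equals("firefox")) {
        firefox++;
      }
    }
    // Counts vary from run to run; both should be close to half of the draws.
    System.out.println("firefox=" + firefox + " chrome=" + (draws - firefox));
  }
}
```

As the in-code comment in the commit says, supporting a third browser type would mean raising `max` to 2 and extending the branch; moving the list of candidates into configuration is left there as future work.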

