nutch-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: CSCI - 572: Team 18 : Questions
Date Sun, 27 Sep 2015 18:08:08 GMT
Hi Team 18,

This is a good question and discussion for the
dev@nutch.apache.org list, so I'm moving it there.
Mike Joyce and Kim Whitehall, who are working on Nutch and
Selenium, can help there.

Cheers,
Chris

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattmann@usc.edu
WWW: http://sunset.usc.edu/
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Charan Shampur <shampur@usc.edu>
Date: Saturday, September 26, 2015 at 7:19 PM
To: jpluser <mattmann@usc.edu>
Subject: CSCI - 572: Team 18 : Questions

>
>
>Hello Professor,
>
>
>We started building the handler for the interactive-selenium plugin and
>figured out how to write the processDriver() method of the
>InteractiveSeleniumHandler class. However, we are unable to figure out
>how to pass a list of URLs to the shouldProcessURL() method of
>InteractiveSeleniumHandler.
>
>
>We made the necessary configuration changes in nutch-site.xml, along with
>the other changes described in several online tutorials. After a fresh
>"ant runtime" we started crawling the URLs; a Firefox browser opens for
>some of them, but the crawler displays "java.net.SocketTimeoutException:
>Read timed out" and continues with the next set of URLs. We believe this
>means no request is ever made, since the browser is given no URL: the
>next time the browser opened, we manually typed a random URL, and the
>crawler then continued execution with the newly fetched data.
>
>
>When the browser opens, its URL field is always empty. We cannot work out
>how to pass the URL to the browser once it opens, so that the whole
>process is automated.
>
>
>Thanks
>Team 18
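
[Editor's note: the flow Team 18 is describing can be illustrated with a
minimal sketch. This is hypothetical Python pseudocode, not Nutch's actual
(Java) API: the names Handler, FakeDriver, and crawl are invented for
illustration. The point it captures is that the fetcher, not the handler,
navigates the browser to each URL via the driver's get() call, and that
shouldProcessURL() is invoked once per URL to decide whether a handler
runs, rather than receiving a list of URLs.]

```python
class Handler:
    """Mirrors the two-method shape of InteractiveSeleniumHandler."""

    def should_process_url(self, url):
        # Called once per URL by the fetcher; return True to opt in.
        return "example.com" in url

    def process_driver(self, driver):
        # Runs only after the fetcher has already loaded the page.
        return driver.page_source


class FakeDriver:
    """Stand-in for a Selenium WebDriver, for illustration only."""

    def __init__(self):
        self.current_url = None
        self.page_source = ""

    def get(self, url):
        # This is the step that fills the browser's URL bar --
        # the fetcher calls it; the handler never has to.
        self.current_url = url
        self.page_source = "<html>fetched %s</html>" % url


def crawl(urls, handler, driver):
    """Sketch of the per-URL loop the fetcher drives."""
    results = {}
    for url in urls:
        if not handler.should_process_url(url):
            continue
        driver.get(url)  # the fetcher navigates the browser to the URL
        results[url] = handler.process_driver(driver)
    return results
```

If the URL bar stays empty, the navigation step (driver.get) is not being
reached for those URLs, which would also explain the read timeout: the
fetcher gives up waiting on a page that was never requested.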
>
>
>
>
>
>On Fri, Sep 25, 2015 at 9:32 PM, Christian Alan Mattmann
><mattmann@usc.edu> wrote:
>
>Hi Charan,
>
>You should get status codes like DB_UNFETCHED, DB_GONE, etc.
>via nutchpy. Roughly, you can map those to HTTP status
>codes, e.g. DB_GONE corresponds to a 404.
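>
[Editor's note: the rough mapping described above can be written down
explicitly. This is a minimal Python sketch assuming the DB_* status names
as they appear in Nutch's CrawlDatum status set; the HTTP pairings are
approximate analogues, as the reply says, not exact equivalences.]

```python
# Approximate mapping from Nutch CrawlDb status names to HTTP analogues.
# The db_* names mirror Nutch's CrawlDatum statuses; the HTTP codes are a
# rough correspondence only.
DB_STATUS_TO_HTTP = {
    "db_unfetched": None,   # never fetched, so no HTTP response yet
    "db_fetched": 200,      # fetched successfully
    "db_gone": 404,         # permanently gone / not found
    "db_redir_temp": 302,   # temporary redirect
    "db_redir_perm": 301,   # permanent redirect
    "db_notmodified": 304,  # not modified since the last fetch
}


def http_code_for(status_name):
    """Return the rough HTTP analogue for a CrawlDb status name."""
    return DB_STATUS_TO_HTTP.get(status_name.lower())
```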
>
>Does that help?
>
>Cheers,
>Chris
>
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department
>University of Southern California
>Los Angeles, CA 90089 USA
>Email: mattmann@usc.edu
>WWW: http://sunset.usc.edu/
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: Charan Shampur <shampur@usc.edu>
>Date: Friday, September 25, 2015 at 9:22 PM
>To: jpluser <mattmann@usc.edu>
>Subject: Question with assignment 1
>
>>Hello professor,
>>
>>
>>I examined the three Nutch datasets using nutchpy and was able to
>>extract the different image MIME types encountered while fetching the
>>image URLs. However, I was unable to find the HTTP response codes of
>>the URLs that were being fetched.
>>
>>
>>The hadoop.log files have a list of URLs that were not fetched due to
>>response code 403 and other issues. Is this the place to find those 100
>>URLs?
>>
>>
>>Professor, kindly guide us in the right direction.
>>
>>
>>Thanks,
>>Charan
>>
