nutch-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler
Date Thu, 05 Nov 2009 10:41:06 GMT
Hi Bartosz,

I've updated the wiki, and others who attended might add/edit as  
necessary.

No video/podcast - it wasn't so high tech as that, just three of us in  
a spare room with Thorsten on Skype.

We're still waiting for some input from the Heritrix team, I think,  
before moving forward.

-- Ken


On Nov 5, 2009, at 12:01am, Bartosz Gadzimski wrote:

> Hello,
>
> Will there be any materials after the meeting? Wiki pages,
> slides, video, podcasts? That would be great!
>
> Thanks,
> Bartosz
>
> Apache Wiki wrote:
>> Dear Wiki user,
>>
>> You have subscribed to a wiki page or wiki category on "Nutch Wiki"  
>> for change notification.
>>
>> The "ApacheConUs2009MeetUp" page has been changed by KenKrugler.
>> http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6
>>
>> --------------------------------------------------
>>
>> - We were planning to have a "Web Crawler Developer" !MeetUp at  
>> this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon  
>> US]] in Oakland.
>> + We had a "Web Crawler Developer" !MeetUp at this year's
>> [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
>>  - Unfortunately the only time slot where people would be around
>> was Thursday night, which wound up conflicting with the Hadoop
>> !MeetUp.
>> + It wound up being an !UnMeetUp (!MeetDown?) on Wednesday,
>> November 4th from 11am - 1pm.
>>  - So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday,
>> November 4th from 11am - 1pm. Location is TBD, hopefully we can get
>> some space at the event but might be a lunch meeting :)
>> + == Attendees ==
>> +
>> +  * Andrzej Bialecki - Apache Nutch
>> +  * Thorsten xxx - Apache Droids
>> +  * Michael Stack - Formerly with Heritrix, now HBase
>> +  * Ken Krugler - Bixo
>> +
>> + == Topics ==
>> +
>> + === Roadmaps ===
>> +
>> + Nutch - become more component based.
>> + Droids - get more people involved.
>> +
>> + === Sharable Components ===
>> +
>> +  * robots.txt parsing
>> +  * URL normalization
>> +  * URL filtering
>> +  * Page cleansing
>> +   * General purpose
>> +   * Specialized
>> +  * Sub-page parsing (portlets)
>> +  * AJAX-ish page interactions
>> +  * Document parsing (via Tika)
>> +  * HttpClient (configuration)
>> +  * Text similarity
>> +  * Mime/charset/language detection
>> +
>> + === Tika ===
>> +
>> +  * Needs help to become really usable
>> +  * Would benefit from large test corpus
>> +  * Could do comparison with Nutch parser
>> +  * Needs option for direct DOM querying (screen scraping tasks)
>> +  * Handles mime & charset detection now (some issues)
>> +  * Could be extended to include language detection (wrap other  
>> impl)
>> +
>> + === URL Normalization ===
>> +
>> +  * Includes both domain (www.x.com == x.com), path, and query
>> portions of URL
>> +  * Often site-specific rules
>> +   * Option to derive rules using URLs to similar documents.
>> +
>> + === AJAX-ish Page Interaction ===
>> +
>> +  * Not applicable for broad/general crawling
>> +  * Can be very important for specific web sites
>> +  * Use Selenium or headless Mozilla
>> +
>> + === Component API Issues ===
>> +
>> +  * Want to avoid using an API that's tied too closely to any
>> implementation.
>> +  * One option is to have simple (e.g. URL param) API that takes  
>> meta-data.
>> +   * Similar to Tika passing in of meta-data.
>> +
>> + === Hosting Options ===
>> +
>> +  * As part of Nutch - but easy to get lost in Nutch codebase,
>> and can be associated too closely with Nutch.
>> +  * As part of Droids - but Droids is both a framework (queue- 
>> based) and set of components.
>> +  * New sub-project under Lucene TLP - but overhead to set up/ 
>> maintain, and then confusion between it and Droids.
>> +  * Google code - seems like a good short-term solution, to judge  
>> level of interest and help shake out issues.
>> +
>> + == Next Steps ==
>> +
>> +  * Get input from Gordon re Heritrix. Stack to follow up with
>> him. Ideally he'd add his comments to this page.
>> +  * Get input from Thorsten on Google code option. If OK as  
>> starting point, then Andrzej to set up.
>> +  * Make decision about build system (and then move on to code  
>> formatting debate :))
>> +   * I'm going to propose ant + maven ant tasks for dependency  
>> management. I'm using this with Bixo, and so far it's been pretty  
>> good.
>> +  * Start contributing code
>> +   * Ken will put in robots.txt parser.
>> +
>> + == Original Discussion Topic List ==
>>    Below are some potential topics for discussion - feel free to  
>> add/comment.
>>
>>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
