nutch-dev mailing list archives

From "Sami Siren (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-688) Fix missing/wrong headers in source files
Date Tue, 17 Feb 2009 14:05:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674216#action_12674216 ] 

Sami Siren commented on NUTCH-688:
----------------------------------

Buildfile: build.xml

rat-sources-typedef:
Trying to override old definition of task javadoc

rat-sources:
[rat:report] 
[rat:report] *****************************************************
[rat:report] Summary
[rat:report] -------
[rat:report] Notes: 0
[rat:report] Binaries: 0
[rat:report] Archives: 0
[rat:report] Standards: 242
[rat:report] 
[rat:report] Apache Licensed: 175
[rat:report] Generated Documents: 8
[rat:report] 
[rat:report] JavaDocs are generated and so license header is optional
[rat:report] Generated files do not required license headers
[rat:report] 
[rat:report] 59 Unknown Licenses
[rat:report] 
[rat:report] *******************************
[rat:report] 
[rat:report] Unapproved licenses:
[rat:report] 
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/NutchWritable.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchDocument.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/AnchorFields.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/BasicFields.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/CustomFields.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldFilter.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldFilters.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldIndexer.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldType.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldWritable.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/Fields.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldsWritable.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/lucene/LuceneConstants.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/lucene/LuceneWriter.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LoopReader.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/Loops.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/Node.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/NodeReader.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSearchBean.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RPCSearchBean.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RPCSegmentBean.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SearchBean.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SegmentBean.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/RequestUtils.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/ResponseWriter.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/ResponseWriters.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/SearchResults.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/SearchServlet.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/ResolveUrls.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/SearchLoadTester.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/compat/ReprUrlFixer.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/EncodingDetector.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/FSUtils.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/GenericWritableConfigurable.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/NodeWalker.java
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/package.html
[rat:report]   /home/sam/workspace/nutch-trunk-eu/src/java/overview.html
[rat:report] 
[rat:report] *******************************
[rat:report] 
[rat:report] Archives (+ indicates readable, $ unreadable): 
[rat:report] 
[rat:report]  
[rat:report] *****************************************************
[rat:report]   Files with Apache License headers will be marked AL
[rat:report]   Binary files (which do not require AL headers) will be marked B
[rat:report]   Compressed archives will be marked A
[rat:report]   Notices, licenses etc will be marked N
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/CharStream.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/CommonGrams.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/FastCharStream.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchAnalysis.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchAnalysis.jj
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchAnalysisConstants.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchAnalysisTokenManager.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchAnalyzer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/NutchDocumentTokenizer.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/ParseException.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/Token.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/TokenManager.java
[rat:report]   GEN   /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/TokenMgrError.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/package.html
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/clustering/HitsCluster.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/clustering/OnlineClusterer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/clustering/OnlineClustererFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/Crawl.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/CrawlDatum.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/CrawlDb.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/CrawlDbFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/CrawlDbReader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/FetchSchedule.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/FetchScheduleFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/Generator.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/Injector.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/Inlink.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/Inlinks.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/LinkDb.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/LinkDbFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/LinkDbMerger.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/LinkDbReader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/MD5Signature.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/MapWritable.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/NutchWritable.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/PartitionUrlByHost.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/Signature.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/SignatureComparator.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/SignatureFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/TextProfileSignature.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/package.html
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/Fetcher.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/Fetcher2.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/FetcherOutput.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/package.html
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/html/Entities.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/FsDirectory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/HighFreqTerms.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexMerger.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexSorter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/Indexer.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexingException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexingFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexingFilters.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchDocument.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchSimilarity.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/AnchorFields.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/BasicFields.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/CustomFields.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldFilter.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldFilters.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldIndexer.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldType.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldWritable.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/Fields.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldsWritable.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/lucene/LuceneConstants.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/lucene/LuceneWriter.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/package.html
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/CreativeCommons.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/DublinCore.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/Feed.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/HttpHeaders.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/MetaWrapper.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/Metadata.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/Nutch.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/Office.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/package.html
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLFilterChecker.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLFilterException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLFilters.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLNormalizer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLNormalizerChecker.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/URLNormalizers.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/protocols/HttpDateFormat.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/protocols/ProtocolException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/net/protocols/Response.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/ontology/Ontology.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/ontology/OntologyFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/HTMLMetaTags.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/HtmlParseFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/HtmlParseFilters.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/Outlink.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/OutlinkExtractor.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/Parse.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseData.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseImpl.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseOutputFormat.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParsePluginList.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParsePluginsReader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseResult.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseSegment.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseStatus.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseText.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParseUtil.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/Parser.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParserChecker.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParserFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/parse/ParserNotFound.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/CircularDependencyException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/Extension.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/ExtensionPoint.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/MissingDependencyException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/Pluggable.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/Plugin.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/PluginClassLoader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/PluginDescriptor.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/PluginManifestParser.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/PluginRepository.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/PluginRuntimeException.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/package.html
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/Content.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/Protocol.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/ProtocolException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/ProtocolFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/ProtocolNotFound.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/ProtocolOutput.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/ProtocolStatus.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/protocol/RobotRules.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/ScoringFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/ScoringFilterException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/ScoringFilters.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LoopReader.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/Loops.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/Node.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/NodeReader.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSearch.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSearchBean.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/FetchedSegments.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/FieldQueryFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/Hit.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/HitContent.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/HitDetailer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/HitDetails.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/HitInlinks.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/HitSummarizer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/Hits.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/IndexSearcher.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/LinkDbInlinks.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/LuceneSearchBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/NutchBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/OpenSearchServlet.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/Query.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/QueryException.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/QueryFilter.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/QueryFilters.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RPCSearchBean.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RPCSegmentBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RawFieldQueryFilter.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SearchBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/Searcher.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SegmentBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SolrSearchBean.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/Summarizer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SummarizerFactory.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/Summary.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/package.html
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/RequestUtils.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/ResponseWriter.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/ResponseWriters.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/SearchResults.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/SearchServlet.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/segment/SegmentMerger.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/segment/SegmentPart.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/segment/SegmentReader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/servlet/Cached.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/DmozParser.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/FreeGenerator.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/PruneIndexTool.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/ResolveUrls.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/SearchLoadTester.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/arc/ArcInputFormat.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/compat/CrawlDbConverter.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/compat/ReprUrlFixer.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/CommandRunner.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/DeflateUtils.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/DomUtil.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/EncodingDetector.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/FSUtils.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/GZIPUtils.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/GenericWritableConfigurable.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/HadoopFSUtil.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/LockUtil.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/LogUtil.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/MimeUtil.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/NodeWalker.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/NutchConfiguration.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/NutchJob.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/ObjectCache.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/PrefixStringMatcher.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/StringUtil.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/SuffixStringMatcher.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/TrieStringMatcher.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/URLUtil.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/DomainStatistics.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/DomainSuffix.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/DomainSuffixes.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/DomainSuffixesReader.java
[rat:report]   AL    /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/TopLevelDomain.java
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/package.html
[rat:report]  !????? /home/sam/workspace/nutch-trunk-eu/src/java/overview.html
[rat:report]  
[rat:report]  *****************************************************
[rat:report]  Printing headers for files without AL header...
[rat:report]  
[rat:report]  
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/analysis/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] Tokenizer for documents and query parser.
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/NutchWritable.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.crawl;
[rat:report] 
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.nutch.util.GenericWritableConfigurable;
[rat:report] 
[rat:report] public class NutchWritable extends GenericWritableConfigurable {
[rat:report]   
[rat:report]   private static Class<? extends Writable>[] CLASSES = null;
[rat:report]   
[rat:report]   static {
[rat:report]     CLASSES = (Class<? extends Writable>[]) new Class[] {
[rat:report]       org.apache.hadoop.io.NullWritable.class, 
[rat:report]       org.apache.hadoop.io.LongWritable.class,
[rat:report]       org.apache.hadoop.io.BytesWritable.class,
[rat:report]       org.apache.hadoop.io.FloatWritable.class,
[rat:report]       org.apache.hadoop.io.IntWritable.class,
[rat:report]       org.apache.hadoop.io.Text.class,
[rat:report]       org.apache.hadoop.io.MD5Hash.class,
[rat:report]       org.apache.nutch.crawl.CrawlDatum.class,
[rat:report]       org.apache.nutch.crawl.Inlink.class,
[rat:report]       org.apache.nutch.crawl.Inlinks.class,
[rat:report]       org.apache.nutch.crawl.MapWritable.class,
[rat:report]       org.apache.nutch.fetcher.FetcherOutput.class,
[rat:report]       org.apache.nutch.metadata.Metadata.class,
[rat:report]       org.apache.nutch.parse.Outlink.class,
[rat:report]       org.apache.nutch.parse.ParseText.class,
[rat:report]       org.apache.nutch.parse.ParseData.class,
[rat:report]       org.apache.nutch.parse.ParseImpl.class,
[rat:report]       org.apache.nutch.parse.ParseStatus.class,
[rat:report]       org.apache.nutch.protocol.Content.class,
[rat:report]       org.apache.nutch.protocol.ProtocolStatus.class,
[rat:report]       org.apache.nutch.searcher.Hit.class,
[rat:report]       org.apache.nutch.searcher.HitDetails.class,
[rat:report]       org.apache.nutch.searcher.Hits.class
[rat:report]     };
[rat:report]   }
[rat:report] 
[rat:report]   public NutchWritable() { }
[rat:report]   
[rat:report]   public NutchWritable(Writable instance) {
[rat:report]     set(instance);
[rat:report]   }
[rat:report] 
[rat:report]   @Override
[rat:report]   protected Class<? extends Writable>[] getTypes() {
[rat:report]     return CLASSES;
[rat:report]   }
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/crawl/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] Crawl control code.
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/fetcher/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] The Nutch robot.
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.Collection;
[rat:report] import java.util.Iterator;
[rat:report] 
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.nutch.crawl.CrawlDatum;
[rat:report] import org.apache.nutch.crawl.CrawlDb;
[rat:report] import org.apache.nutch.crawl.Inlinks;
[rat:report] import org.apache.nutch.crawl.LinkDb;
[rat:report] import org.apache.nutch.crawl.NutchWritable;
[rat:report] import org.apache.nutch.metadata.Metadata;
[rat:report] import org.apache.nutch.metadata.Nutch;
[rat:report] import org.apache.nutch.parse.Parse;
[rat:report] import org.apache.nutch.parse.ParseData;
[rat:report] import org.apache.nutch.parse.ParseImpl;
[rat:report] import org.apache.nutch.parse.ParseText;
[rat:report] import org.apache.nutch.scoring.ScoringFilterException;
[rat:report] import org.apache.nutch.scoring.ScoringFilters;
[rat:report] 
[rat:report] public class IndexerMapReduce extends Configured
[rat:report] implements Mapper<Text, Writable, Text, NutchWritable>,
[rat:report]           Reducer<Text, NutchWritable, Text, NutchDocument> {
[rat:report] 
[rat:report]   public static final Log LOG = LogFactory.getLog(IndexerMapReduce.class);
[rat:report] 
[rat:report]   private IndexingFilters filters;
[rat:report]   private ScoringFilters scfilters;
[rat:report] 
[rat:report]   public void configure(JobConf job) {
[rat:report]     setConf(job);
[rat:report]     this.filters = new IndexingFilters(getConf());
[rat:report]     this.scfilters = new ScoringFilters(getConf());
[rat:report]   }
[rat:report] 
[rat:report]   public void map(Text key, Writable value,
[rat:report]       OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException {
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.RecordWriter;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.util.Progressable;
[rat:report] 
[rat:report] public class IndexerOutputFormat extends FileOutputFormat<Text, NutchDocument> {
[rat:report] 
[rat:report]   @Override
[rat:report]   public RecordWriter<Text, NutchDocument> getRecordWriter(FileSystem ignored,
[rat:report]       JobConf job, String name, Progressable progress) throws IOException {
[rat:report]     final NutchIndexWriter[] writers =
[rat:report]       NutchIndexWriterFactory.getNutchIndexWriters(job);
[rat:report] 
[rat:report]     for (final NutchIndexWriter writer : writers) {
[rat:report]       writer.open(job, name);
[rat:report]     }
[rat:report]     return new RecordWriter<Text, NutchDocument>() {
[rat:report] 
[rat:report]       public void close(Reporter reporter) throws IOException {
[rat:report]         for (final NutchIndexWriter writer : writers) {
[rat:report]           writer.close();
[rat:report]         }
[rat:report]       }
[rat:report] 
[rat:report]       public void write(Text key, NutchDocument doc) throws IOException {
[rat:report]         for (final NutchIndexWriter writer : writers) {
[rat:report]           writer.write(doc);
[rat:report]         }
[rat:report]       }
[rat:report]     };
[rat:report]   }
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchDocument.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Collection;
[rat:report] import java.util.HashMap;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Map;
[rat:report] import java.util.Map.Entry;
[rat:report] 
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.VersionMismatchException;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.nutch.metadata.Metadata;
[rat:report] 
[rat:report] /** A {@link NutchDocument} is the unit of indexing.*/
[rat:report] public class NutchDocument
[rat:report] implements Writable, Iterable<Entry<String, List<String>>> {
[rat:report] 
[rat:report]   public static final byte VERSION = 1;
[rat:report] 
[rat:report]   private Map<String, List<String>> fields;
[rat:report] 
[rat:report]   private Metadata documentMeta;
[rat:report] 
[rat:report]   private float score;
[rat:report] 
[rat:report]   public NutchDocument() {
[rat:report]     fields = new HashMap<String, List<String>>();
[rat:report]     documentMeta = new Metadata();
[rat:report]     score = 0.0f;
[rat:report]   }
[rat:report] 
[rat:report]   public void add(String name, String value) {
[rat:report]     List<String> fieldValues = fields.get(name);
[rat:report]     if (fieldValues == null) {
[rat:report]       fieldValues = new ArrayList<String>();
[rat:report]     }
[rat:report]     fieldValues.add(value);
[rat:report]     fields.put(name, fieldValues);
[rat:report]   }
[rat:report] 
[rat:report]   private void addFieldUnprotected(String name, String value) {
[rat:report]     fields.get(name).add(value);
[rat:report]   }
[rat:report] 
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] 
[rat:report] public interface NutchIndexWriter {
[rat:report]   public void open(JobConf job, String name) throws IOException;
[rat:report] 
[rat:report]   public void write(NutchDocument doc) throws IOException;
[rat:report] 
[rat:report]   public void close() throws IOException;
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] 
[rat:report] public class NutchIndexWriterFactory {
[rat:report]   @SuppressWarnings("unchecked")
[rat:report]   public static NutchIndexWriter[] getNutchIndexWriters(Configuration conf) {
[rat:report]     final String[] classes = conf.getStrings("indexer.writer.classes");
[rat:report]     final NutchIndexWriter[] writers = new NutchIndexWriter[classes.length];
[rat:report]     for (int i = 0; i < classes.length; i++) {
[rat:report]       final String clazz = classes[i];
[rat:report]       try {
[rat:report]         final Class<NutchIndexWriter> implClass =
[rat:report]           (Class<NutchIndexWriter>) Class.forName(clazz);
[rat:report]         writers[i] = implClass.newInstance();
[rat:report]       } catch (final Exception e) {
[rat:report]         throw new RuntimeException("Couldn't create " + clazz, e);
[rat:report]       }
[rat:report]     }
[rat:report]     return writers;
[rat:report]   }
[rat:report] 
[rat:report]   public static void addClassToConf(Configuration conf,
[rat:report]                                     Class<? extends NutchIndexWriter> clazz) {
[rat:report]     final String classes = conf.get("indexer.writer.classes");
[rat:report]     final String newClass = clazz.getName();
[rat:report] 
[rat:report]     if (classes == null) {
[rat:report]       conf.set("indexer.writer.classes", newClass);
[rat:report]     } else {
[rat:report]       conf.set("indexer.writer.classes", classes + "," + newClass);
[rat:report]     }
[rat:report] 
[rat:report]   }
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/AnchorFields.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Collections;
[rat:report] import java.util.Comparator;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileOutputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.scoring.webgraph.LinkDatum;
[rat:report] import org.apache.nutch.scoring.webgraph.Node;
[rat:report] import org.apache.nutch.scoring.webgraph.WebGraph;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] /**
[rat:report]  * Creates FieldWritable objects for inbound anchor text.   These FieldWritable
[rat:report]  * objects are then included in the input to the FieldIndexer to be converted
[rat:report]  * to Lucene Field objects and indexed.
[rat:report]  * 
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/BasicFields.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileOutputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.lucene.document.DateTools;
[rat:report] import org.apache.nutch.crawl.CrawlDatum;
[rat:report] import org.apache.nutch.metadata.Metadata;
[rat:report] import org.apache.nutch.metadata.Nutch;
[rat:report] import org.apache.nutch.parse.Parse;
[rat:report] import org.apache.nutch.parse.ParseData;
[rat:report] import org.apache.nutch.parse.ParseImpl;
[rat:report] import org.apache.nutch.parse.ParseText;
[rat:report] import org.apache.nutch.scoring.webgraph.LinkDatum;
[rat:report] import org.apache.nutch.scoring.webgraph.Node;
[rat:report] import org.apache.nutch.scoring.webgraph.WebGraph;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/CustomFields.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.io.InputStream;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Enumeration;
[rat:report] import java.util.HashMap;
[rat:report] import java.util.HashSet;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Map;
[rat:report] import java.util.Properties;
[rat:report] import java.util.Random;
[rat:report] import java.util.Set;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.LongWritable;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.TextInputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] /**
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldFilter.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.util.List;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configurable;
[rat:report] import org.apache.lucene.document.Document;
[rat:report] import org.apache.nutch.indexer.IndexingException;
[rat:report] import org.apache.nutch.plugin.Pluggable;
[rat:report] 
[rat:report] /**
[rat:report]  * Filter to manipulate FieldWritable objects for a given url during indexing.
[rat:report]  * 
[rat:report]  * Field filters are responsible for converting FieldWritable objects into 
[rat:report]  * lucene fields and adding those fields to the Lucene document.
[rat:report]  */
[rat:report] public interface FieldFilter
[rat:report]   extends Pluggable, Configurable {
[rat:report] 
[rat:report]   final static String X_POINT_ID = FieldFilter.class.getName();
[rat:report] 
[rat:report]   /**
[rat:report]    * Returns the document to which fields are being added or null if we are to
[rat:report]    * stop processing for this url and not add anything to the index.  All 
[rat:report]    * FieldWritable objects for a url are aggregated from databases passed into
[rat:report]    * the FieldIndexer and these fields are then passed into the Field filters.
[rat:report]    * 
[rat:report]    * It is therefore possible for fields to be added, removed, and changed 
[rat:report]    * before being indexed.
[rat:report]    * 
[rat:report]    * @param url The url to index.  
[rat:report]    * @param doc The lucene document
[rat:report]    * @param fields The list of FieldWritable objects representing fields for 
[rat:report]    * the index.
[rat:report]    * @return The lucene Document or null to stop processing and not index any
[rat:report]    * content for this url.
[rat:report]    * 
[rat:report]    * @throws IndexingException If an error occurs during indexing
[rat:report]    */
[rat:report]   public Document filter(String url, Document doc, List<FieldWritable> fields)
[rat:report]     throws IndexingException;
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldFilters.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.HashMap;
[rat:report] import java.util.List;
[rat:report] 
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.lucene.document.Document;
[rat:report] import org.apache.nutch.indexer.IndexingException;
[rat:report] import org.apache.nutch.plugin.Extension;
[rat:report] import org.apache.nutch.plugin.ExtensionPoint;
[rat:report] import org.apache.nutch.plugin.PluginRepository;
[rat:report] import org.apache.nutch.plugin.PluginRuntimeException;
[rat:report] import org.apache.nutch.util.ObjectCache;
[rat:report] 
[rat:report] /**
[rat:report]  * The FieldFilters class provides a standard way to collect, order, and run
[rat:report]  * all FieldFilter implementations that are active in the plugin system.
[rat:report]  */
[rat:report] public class FieldFilters {
[rat:report] 
[rat:report]   public static final String FIELD_FILTER_ORDER = "field.filter.order";
[rat:report] 
[rat:report]   public final static Log LOG = LogFactory.getLog(FieldFilters.class);
[rat:report] 
[rat:report]   private FieldFilter[] fieldFilters;
[rat:report] 
[rat:report]   /**
[rat:report]    * Configurable constructor.
[rat:report]    */
[rat:report]   public FieldFilters(Configuration conf) {
[rat:report] 
[rat:report]     // get the field filter order, the cache, and all field filters
[rat:report]     String order = conf.get(FIELD_FILTER_ORDER);
[rat:report]     ObjectCache objectCache = ObjectCache.get(conf);
[rat:report]     this.fieldFilters = (FieldFilter[])objectCache.getObject(FieldFilter.class.getName());
[rat:report]     
[rat:report]     if (this.fieldFilters == null) {
[rat:report] 
[rat:report]       String[] orderedFilters = null;
[rat:report]       if (order != null && !order.trim().equals("")) {
[rat:report]         orderedFilters = order.split("\\s+");
[rat:report]       }
[rat:report]       try {
[rat:report] 
[rat:report]         // get the field filter extension point
[rat:report]         ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
[rat:report]           FieldFilter.X_POINT_ID);
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldIndexer.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableComparable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.RecordWriter;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.util.Progressable;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.lucene.document.Document;
[rat:report] import org.apache.lucene.index.IndexWriter;
[rat:report] import org.apache.nutch.analysis.AnalyzerFactory;
[rat:report] import org.apache.nutch.analysis.NutchAnalyzer;
[rat:report] import org.apache.nutch.analysis.NutchDocumentAnalyzer;
[rat:report] import org.apache.nutch.indexer.IndexingException;
[rat:report] import org.apache.nutch.indexer.NutchSimilarity;
[rat:report] import org.apache.nutch.util.LogUtil;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldType.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] /**
[rat:report]  * The different types of fields. Different types of fields will be handled by
[rat:report]  * different FieldFilter implementations during indexing.
[rat:report]  */
[rat:report] public enum FieldType {
[rat:report]   
[rat:report]   CONTENT,
[rat:report]   BOOST,
[rat:report]   COMPUTATION,
[rat:report]   ACTION;
[rat:report]   
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldWritable.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] 
[rat:report] /** 
[rat:report]  * A class that holds a single field of content to be placed into an index.
[rat:report]  * 
[rat:report]  * This class has options for the type of content as well as for how the
[rat:report]  * field is to be indexed.
[rat:report]  */
[rat:report] public class FieldWritable
[rat:report]   implements Writable {
[rat:report] 
[rat:report]   private String name;
[rat:report]   private String value;
[rat:report]   private FieldType type = FieldType.CONTENT;
[rat:report]   private float boost;
[rat:report]   private boolean indexed = true;
[rat:report]   private boolean stored = false;
[rat:report]   private boolean tokenized = true;
[rat:report] 
[rat:report]   public FieldWritable() {
[rat:report] 
[rat:report]   }
[rat:report] 
[rat:report]   public FieldWritable(String name, String value, FieldType type, float boost) {
[rat:report]     this(name, value, type, boost, true, false, true);
[rat:report]   }
[rat:report] 
[rat:report]   public FieldWritable(String name, String value, FieldType type,
[rat:report]     boolean indexed, boolean stored, boolean tokenized) {
[rat:report]     this(name, value, type, 0.0f, indexed, stored, tokenized);
[rat:report]   }
[rat:report] 
[rat:report]   public FieldWritable(String name, String value, FieldType type, float boost,
[rat:report]     boolean indexed, boolean stored, boolean tokenized) {
[rat:report]     this.name = name;
[rat:report]     this.value = value;
[rat:report]     this.type = type;
[rat:report]     this.boost = boost;
[rat:report]     this.indexed = indexed;
[rat:report]     this.stored = stored;
[rat:report]     this.tokenized = tokenized;
[rat:report]   }
[rat:report] 
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/Fields.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] public interface Fields {
[rat:report] 
[rat:report]   // names of common fields
[rat:report]   public static final String ANCHOR = "anchor";
[rat:report]   public static final String SEGMENT = "segment";
[rat:report]   public static final String DIGEST = "digest";
[rat:report]   public static final String HOST = "host";
[rat:report]   public static final String SITE = "site";
[rat:report]   public static final String URL = "url";
[rat:report]   public static final String ORIG_URL = "orig";
[rat:report]   public static final String SEG_URL = "segurl";
[rat:report]   public static final String CONTENT = "content";
[rat:report]   public static final String TITLE = "title";
[rat:report]   public static final String CACHE = "cache";
[rat:report]   public static final String TSTAMP = "tstamp";
[rat:report]   public static final String BOOSTFACTOR = "boostfactor";
[rat:report]   
[rat:report]   // special fields for indexer
[rat:report]   public static final String BOOST = "boost";
[rat:report]   public static final String COMPUTATION = "computation";
[rat:report]   public static final String ACTION = "action";
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/field/FieldsWritable.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.field;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.List;
[rat:report] 
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] 
[rat:report] /**
[rat:report]  * A class that holds a grouping of FieldWritable objects.
[rat:report]  */
[rat:report] public class FieldsWritable
[rat:report]   implements Writable {
[rat:report] 
[rat:report]   private List<FieldWritable> fieldsList = new ArrayList<FieldWritable>();
[rat:report] 
[rat:report]   public FieldsWritable() {
[rat:report] 
[rat:report]   }
[rat:report]   
[rat:report]   public boolean hasField(String name) {
[rat:report]     for (FieldWritable field : fieldsList) {
[rat:report]       if (field.getName().equals(name)) {
[rat:report]         return true;
[rat:report]       }
[rat:report]     }
[rat:report]     return false;
[rat:report]   }
[rat:report]   
[rat:report]   public FieldWritable getField(String name) {
[rat:report]     for (FieldWritable field : fieldsList) {
[rat:report]       if (field.getName().equals(name)) {
[rat:report]         return field;
[rat:report]       }
[rat:report]     }
[rat:report]     return null;
[rat:report]   }
[rat:report]   
[rat:report]   public List<FieldWritable> getFields(String name) {
[rat:report]     List<FieldWritable> named = new ArrayList<FieldWritable>();
[rat:report]     for (FieldWritable field : fieldsList) {
[rat:report]       if (field.getName().equals(name)) {
[rat:report]         named.add(field);
[rat:report]       }
[rat:report]     }
[rat:report]     return named.size() > 0 ? named : null;
[rat:report]   }
[rat:report]   
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/lucene/LuceneConstants.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.lucene;
[rat:report] 
[rat:report] public interface LuceneConstants {
[rat:report]   public static final String LUCENE_PREFIX = "lucene.";
[rat:report] 
[rat:report]   public static final String FIELD_PREFIX = LUCENE_PREFIX + "field.";
[rat:report] 
[rat:report]   public static final String FIELD_STORE_PREFIX = FIELD_PREFIX + "store.";
[rat:report] 
[rat:report]   public static final String FIELD_INDEX_PREFIX = FIELD_PREFIX + "index.";
[rat:report] 
[rat:report]   public static final String FIELD_VECTOR_PREFIX = FIELD_PREFIX + "vector.";
[rat:report] 
[rat:report]   public static final String STORE_YES = "store.yes";
[rat:report] 
[rat:report]   public static final String STORE_NO = "store.no";
[rat:report] 
[rat:report]   public static final String STORE_COMPRESS = "store.compress";
[rat:report] 
[rat:report]   public static final String INDEX_NO = "index.no";
[rat:report] 
[rat:report]   public static final String INDEX_NO_NORMS = "index.no_norms";
[rat:report] 
[rat:report]   public static final String INDEX_TOKENIZED = "index.tokenized";
[rat:report] 
[rat:report]   public static final String INDEX_UNTOKENIZED = "index.untokenized";
[rat:report] 
[rat:report]   public static final String VECTOR_NO = "vector.no";
[rat:report] 
[rat:report]   public static final String VECTOR_POS = "vector.pos";
[rat:report] 
[rat:report]   public static final String VECTOR_OFFSET = "vector.offset";
[rat:report] 
[rat:report]   public static final String VECTOR_POS_OFFSET = "vector.pos_offset";
[rat:report] 
[rat:report]   public static final String VECTOR_YES = "vector.yes";
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/lucene/LuceneWriter.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.lucene;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.HashMap;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Map;
[rat:report] import java.util.Random;
[rat:report] import java.util.Map.Entry;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.lucene.document.Document;
[rat:report] import org.apache.lucene.document.Field;
[rat:report] import org.apache.lucene.index.IndexWriter;
[rat:report] import org.apache.nutch.analysis.AnalyzerFactory;
[rat:report] import org.apache.nutch.analysis.NutchAnalyzer;
[rat:report] import org.apache.nutch.analysis.NutchDocumentAnalyzer;
[rat:report] import org.apache.nutch.indexer.Indexer;
[rat:report] import org.apache.nutch.indexer.NutchDocument;
[rat:report] import org.apache.nutch.indexer.NutchIndexWriter;
[rat:report] import org.apache.nutch.indexer.NutchSimilarity;
[rat:report] import org.apache.nutch.metadata.Metadata;
[rat:report] import org.apache.nutch.util.LogUtil;
[rat:report] 
[rat:report] public class LuceneWriter implements NutchIndexWriter {
[rat:report] 
[rat:report]   public static enum STORE { YES, NO, COMPRESS }
[rat:report] 
[rat:report]   public static enum INDEX { NO, NO_NORMS, TOKENIZED, UNTOKENIZED }
[rat:report] 
[rat:report]   public static enum VECTOR { NO, OFFSET, POS, POS_OFFSET, YES }
[rat:report] 
[rat:report]   private IndexWriter writer;
[rat:report] 
[rat:report]   private AnalyzerFactory analyzerFactory;
[rat:report] 
[rat:report]   private Path perm;
[rat:report] 
[rat:report]   private Path temp;
[rat:report] 
[rat:report]   private FileSystem fs;
[rat:report] 
[rat:report]   private final Map<String, Field.Store> fieldStore;
[rat:report] 
[rat:report]   private final Map<String, Field.Index> fieldIndex;
[rat:report] 
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] Maintain Lucene full-text indexes.
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.solr;
[rat:report] 
[rat:report] public interface SolrConstants {
[rat:report]   public static final String SOLR_PREFIX = "solr.";
[rat:report] 
[rat:report]   public static final String SERVER_URL = SOLR_PREFIX + "server.url";
[rat:report] 
[rat:report]   public static final String COMMIT_SIZE = SOLR_PREFIX + "commit.size";
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.solr;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] 
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.indexer.IndexerMapReduce;
[rat:report] import org.apache.nutch.indexer.NutchIndexWriterFactory;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] public class SolrIndexer extends Configured implements Tool {
[rat:report] 
[rat:report]   public static Log LOG = LogFactory.getLog(SolrIndexer.class);
[rat:report] 
[rat:report]   public SolrIndexer() {
[rat:report]     super(null);
[rat:report]   }
[rat:report] 
[rat:report]   public SolrIndexer(Configuration conf) {
[rat:report]     super(conf);
[rat:report]   }
[rat:report] 
[rat:report]   private void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
[rat:report]       List<Path> segments) throws IOException {
[rat:report]     LOG.info("SolrIndexer: starting");
[rat:report] 
[rat:report]     final JobConf job = new NutchJob(getConf());
[rat:report]     job.setJobName("index-solr " + solrUrl);
[rat:report] 
[rat:report]     IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);
[rat:report] 
[rat:report]     job.set(SolrConstants.SERVER_URL, solrUrl);
[rat:report] 
[rat:report]     NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);
[rat:report] 
[rat:report]     job.setReduceSpeculativeExecution(false);
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.indexer.solr;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.List;
[rat:report] import java.util.Map.Entry;
[rat:report] 
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.nutch.indexer.NutchDocument;
[rat:report] import org.apache.nutch.indexer.NutchIndexWriter;
[rat:report] import org.apache.solr.client.solrj.SolrServer;
[rat:report] import org.apache.solr.client.solrj.SolrServerException;
[rat:report] import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
[rat:report] import org.apache.solr.common.SolrInputDocument;
[rat:report] 
[rat:report] public class SolrWriter implements NutchIndexWriter {
[rat:report] 
[rat:report]   private SolrServer solr;
[rat:report] 
[rat:report]   private final List<SolrInputDocument> inputDocs =
[rat:report]     new ArrayList<SolrInputDocument>();
[rat:report] 
[rat:report]   private int commitSize;
[rat:report] 
[rat:report]   public void open(JobConf job, String name)
[rat:report]   throws IOException {
[rat:report]     solr = new CommonsHttpSolrServer(job.get(SolrConstants.SERVER_URL));
[rat:report]     commitSize = job.getInt(SolrConstants.COMMIT_SIZE, 1000);
[rat:report]   }
[rat:report] 
[rat:report]   public void write(NutchDocument doc) throws IOException {
[rat:report]     final SolrInputDocument inputDoc = new SolrInputDocument();
[rat:report]     for(final Entry<String, List<String>> e : doc) {
[rat:report]       for (final String val : e.getValue()) {
[rat:report]         inputDoc.addField(e.getKey(), val);
[rat:report]       }
[rat:report]     }
[rat:report]     inputDoc.setDocumentBoost(doc.getScore());
[rat:report]     inputDocs.add(inputDoc);
[rat:report]     if (inputDocs.size() > commitSize) {
[rat:report]       try {
[rat:report]         solr.add(inputDocs);
[rat:report]       } catch (final SolrServerException e) {
[rat:report]         throw makeIOException(e);
[rat:report]       }
[rat:report]       inputDocs.clear();
[rat:report]     }
[rat:report]   }
[rat:report] 
[rat:report]   public void close() throws IOException {
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/metadata/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] A Multi-valued Metadata container, and set
[rat:report] of constant fields for Nutch Metadata.
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/plugin/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] The Nutch {@link org.apache.nutch.plugin.Pluggable Plugin} System.
[rat:report] <p>
[rat:report] <b>The Nutch Plugin System provides a way to extend nutch functionality</b>.
[rat:report] A large part of the functionality of Nutch is provided by plugins:
[rat:report] All of the parsing, indexing and searching that nutch does is actually
[rat:report] accomplished by various plugins.
[rat:report] </p><p>
[rat:report] In writing a plugin, you're actually providing one or more extensions of the
[rat:report] existing extension-points (<i>hooks</i>).
[rat:report] The core Nutch extension-points are themselves defined in a plugin,
[rat:report] the <code>nutch-extensionpoints</code> plugin.
[rat:report] Each extension-point defines an interface that must be implemented by the
[rat:report] extension. The core extension-points and extensions available in Nutch are
[rat:report] listed in the {@link org.apache.nutch.plugin.Pluggable} interface.
[rat:report] </p>
[rat:report] 
[rat:report] @see <a href="./doc-files/plugin.dtd">Nutch plugin manifest DTD</a>
[rat:report] 
[rat:report] @see <a href="http://wiki.apache.org/nutch/PluginCentral">
[rat:report]      Plugin Central
[rat:report]      </a>
[rat:report] @see <a href="http://wiki.apache.org/nutch/AboutPlugins">
[rat:report]      About Plugins
[rat:report]      </a>
[rat:report] @see <a href="http://wiki.apache.org/nutch/WhyNutchHasAPluginSystem">
[rat:report]      Why Nutch has a Plugin System?
[rat:report]      </a>
[rat:report] @see <a href="http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem">
[rat:report]      Which technical concepts are behind the nutch plugin system?
[rat:report]      </a>
[rat:report] @see <a href="http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading">
[rat:report]      What's the problem with Plugins and Class loading?
[rat:report]      </a>
[rat:report] @see <a href="http://wiki.apache.org/nutch/WritingPluginExample">
[rat:report]      Writing Plugin Example
[rat:report]      </a>
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] 
[rat:report] /**
[rat:report]  * A class for holding link information including the url, anchor text, a score,
[rat:report]  * the timestamp of the link and a link type.
[rat:report]  */
[rat:report] public class LinkDatum
[rat:report]   implements Writable {
[rat:report] 
[rat:report]   public final static byte INLINK = 1;
[rat:report]   public final static byte OUTLINK = 2;
[rat:report] 
[rat:report]   private String url = null;
[rat:report]   private String anchor = "";
[rat:report]   private float score = 0.0f;
[rat:report]   private long timestamp = 0L;
[rat:report]   private byte linkType = 0;
[rat:report] 
[rat:report]   /**
[rat:report]    * Default constructor, no url, timestamp, score, or link type.
[rat:report]    */
[rat:report]   public LinkDatum() {
[rat:report] 
[rat:report]   }
[rat:report] 
[rat:report]   /**
[rat:report]    * Creates a LinkDatum with a given url. Timestamp is set to current time.
[rat:report]    * 
[rat:report]    * @param url The link url.
[rat:report]    */
[rat:report]   public LinkDatum(String url) {
[rat:report]     this(url, "", System.currentTimeMillis());
[rat:report]   }
[rat:report] 
[rat:report]   /**
[rat:report]    * Creates a LinkDatum with a url and an anchor text. Timestamp is set to
[rat:report]    * current time.
[rat:report]    * 
[rat:report]    * @param url The link url.
[rat:report]    * @param anchor The link anchor text.
[rat:report]    */
[rat:report]   public LinkDatum(String url, String anchor) {
[rat:report]     this(url, anchor, System.currentTimeMillis());
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] import java.util.Set;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.MapFile;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.lib.HashPartitioner;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.scoring.webgraph.Loops.LoopSet;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] /**
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.BufferedReader;
[rat:report] import java.io.IOException;
[rat:report] import java.io.InputStreamReader;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.HashSet;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] import java.util.Set;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FSDataInputStream;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.LongWritable;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.TextOutputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.scoring.webgraph.Loops.LoopSet;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/LoopReader.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.MapFile;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.lib.HashPartitioner;
[rat:report] import org.apache.nutch.scoring.webgraph.Loops.LoopSet;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report] /**
[rat:report]  * The LoopReader tool prints the loopset information for a single url.
[rat:report]  */
[rat:report] public class LoopReader {
[rat:report] 
[rat:report]   private Configuration conf;
[rat:report]   private FileSystem fs;
[rat:report]   private MapFile.Reader[] loopReaders;
[rat:report] 
[rat:report]   /**
[rat:report]    * Prints loopset for a single url.  The loopset information will show any
[rat:report]    * outlink url that eventually forms a link cycle.
[rat:report]    * 
[rat:report]    * @param webGraphDb The WebGraph to check for loops
[rat:report]    * @param url The url to check.
[rat:report]    * 
[rat:report]    * @throws IOException If an error occurs while printing loopset information.
[rat:report]    */
[rat:report]   public void dumpUrl(Path webGraphDb, String url)
[rat:report]     throws IOException {
[rat:report] 
[rat:report]     // open the readers
[rat:report]     conf = NutchConfiguration.create();
[rat:report]     fs = FileSystem.get(conf);
[rat:report]     loopReaders = MapFileOutputFormat.getReaders(fs, new Path(webGraphDb,
[rat:report]       Loops.LOOPS_DIR), conf);
[rat:report] 
[rat:report]     // get the loopset for a given url, if any
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/Loops.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.HashSet;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.LinkedHashSet;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] import java.util.Set;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileOutputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] /**
[rat:report]  * The Loops job identifies cycles of loops inside of the web graph. This is
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/Node.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.DataOutput;
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.nutch.metadata.Metadata;
[rat:report] 
[rat:report] /**
[rat:report]  * A class which holds the number of inlinks and outlinks for a given url along
[rat:report]  * with an inlink score from a link analysis program and any metadata.  
[rat:report]  * 
[rat:report]  * The Node is the core unit of the NodeDb in the WebGraph.
[rat:report]  */
[rat:report] public class Node
[rat:report]   implements Writable {
[rat:report] 
[rat:report]   private int numInlinks = 0;
[rat:report]   private int numOutlinks = 0;
[rat:report]   private float inlinkScore = 1.0f;
[rat:report]   private Metadata metadata = new Metadata();
[rat:report] 
[rat:report]   public Node() {
[rat:report] 
[rat:report]   }
[rat:report] 
[rat:report]   public int getNumInlinks() {
[rat:report]     return numInlinks;
[rat:report]   }
[rat:report] 
[rat:report]   public void setNumInlinks(int numInlinks) {
[rat:report]     this.numInlinks = numInlinks;
[rat:report]   }
[rat:report] 
[rat:report]   public int getNumOutlinks() {
[rat:report]     return numOutlinks;
[rat:report]   }
[rat:report] 
[rat:report]   public void setNumOutlinks(int numOutlinks) {
[rat:report]     this.numOutlinks = numOutlinks;
[rat:report]   }
[rat:report] 
[rat:report]   public float getInlinkScore() {
[rat:report]     return inlinkScore;
[rat:report]   }
[rat:report] 
[rat:report]   public void setInlinkScore(float inlinkScore) {
[rat:report]     this.inlinkScore = inlinkScore;
[rat:report]   }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.Iterator;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.FloatWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.TextOutputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] /**
[rat:report]  * A tool that dumps out the top urls by number of inlinks, number of outlinks,
[rat:report]  * or by score, to a text file. One of the major uses of this tool is to check
[rat:report]  * the top scoring urls of a link analysis program such as LinkRank.
[rat:report]  * 
[rat:report]  * For number of inlinks or number of outlinks the WebGraph program will need to
[rat:report]  * have been run. For link analysis score a program such as LinkRank will need
[rat:report]  * to have been run which updates the NodeDb of the WebGraph.
[rat:report]  */
[rat:report] public class NodeDumper
[rat:report]   extends Configured
[rat:report]   implements Tool {
[rat:report] 
[rat:report]   public static final Log LOG = LogFactory.getLog(NodeDumper.class);
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/NodeReader.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.MapFile;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.lib.HashPartitioner;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report] /**
[rat:report]  * Reads and prints to system out information for a single node from the NodeDb 
[rat:report]  * in the WebGraph.
[rat:report]  */
[rat:report] public class NodeReader {
[rat:report] 
[rat:report]   private Configuration conf;
[rat:report]   private FileSystem fs;
[rat:report]   private MapFile.Reader[] nodeReaders;
[rat:report] 
[rat:report]   /**
[rat:report]    * Prints the content of the Node represented by the url to system out.
[rat:report]    * 
[rat:report]    * @param webGraphDb The webgraph from which to get the node.
[rat:report]    * @param url The url of the node.
[rat:report]    * 
[rat:report]    * @throws IOException If an error occurs while getting the node.
[rat:report]    */
[rat:report]   public void dumpUrl(Path webGraphDb, String url)
[rat:report]     throws IOException {
[rat:report] 
[rat:report]     conf = NutchConfiguration.create();
[rat:report]     fs = FileSystem.get(conf);
[rat:report]     nodeReaders = MapFileOutputFormat.getReaders(fs, new Path(webGraphDb,
[rat:report]       WebGraph.NODE_DIR), conf);
[rat:report] 
[rat:report]     // open the readers, get the node, print out the info, and close the readers
[rat:report]     Text key = new Text(url);
[rat:report]     Node node = new Node();
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.Random;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.ObjectWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.crawl.CrawlDatum;
[rat:report] import org.apache.nutch.crawl.CrawlDb;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] 
[rat:report] /**
[rat:report]  * Updates the score from the WebGraph node database into the crawl database.
[rat:report]  * Any score that is not in the node database is set to the clear score in the 
[rat:report]  * crawl database.
[rat:report]  */
[rat:report] public class ScoreUpdater
[rat:report]   extends Configured
[rat:report]   implements Tool, Mapper<Text, Writable, Text, ObjectWritable>,
[rat:report]   Reducer<Text, ObjectWritable, Text, CrawlDatum> {
[rat:report] 
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.scoring.webgraph;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.HashSet;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.LinkedHashMap;
[rat:report] import java.util.List;
[rat:report] import java.util.Map;
[rat:report] import java.util.Random;
[rat:report] import java.util.Set;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.Mapper;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.metadata.Nutch;
[rat:report] import org.apache.nutch.net.URLNormalizers;
[rat:report] import org.apache.nutch.parse.Outlink;
[rat:report] import org.apache.nutch.parse.ParseData;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.LockUtil;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] import org.apache.nutch.util.URLUtil;
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSearchBean.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.net.InetSocketAddress;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Arrays;
[rat:report] import java.util.Collections;
[rat:report] import java.util.Comparator;
[rat:report] import java.util.List;
[rat:report] import java.util.PriorityQueue;
[rat:report] import java.util.concurrent.Callable;
[rat:report] import java.util.concurrent.ExecutionException;
[rat:report] import java.util.concurrent.ExecutorService;
[rat:report] import java.util.concurrent.Executors;
[rat:report] import java.util.concurrent.Future;
[rat:report] import java.util.concurrent.ScheduledExecutorService;
[rat:report] import java.util.concurrent.TimeUnit;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.ipc.RPC;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] 
[rat:report] public class DistributedSearchBean implements SearchBean {
[rat:report] 
[rat:report]   private static final ExecutorService executor =
[rat:report]     Executors.newCachedThreadPool();
[rat:report] 
[rat:report]   private final ScheduledExecutorService pingService;
[rat:report] 
[rat:report]   private class SearchTask implements Callable<Hits> {
[rat:report]     private int id;
[rat:report] 
[rat:report]     private Query query;
[rat:report]     private int numHits;
[rat:report]     private String dedupField;
[rat:report]     private String sortField;
[rat:report]     private boolean reverse;
[rat:report] 
[rat:report]     public SearchTask(int id) {
[rat:report]       this.id = id;
[rat:report]     }
[rat:report] 
[rat:report]     public Hits call() throws Exception {
[rat:report]       if (!liveServers[id]) {
[rat:report]         return null;
[rat:report]       }
[rat:report]       return beans[id].search(query, numHits, dedupField, sortField, reverse);
[rat:report]     }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.net.InetSocketAddress;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Map;
[rat:report] import java.util.concurrent.Callable;
[rat:report] import java.util.concurrent.ConcurrentHashMap;
[rat:report] import java.util.concurrent.ConcurrentMap;
[rat:report] import java.util.concurrent.ExecutorService;
[rat:report] import java.util.concurrent.Executors;
[rat:report] import java.util.concurrent.Future;
[rat:report] import java.util.concurrent.ScheduledExecutorService;
[rat:report] import java.util.concurrent.TimeUnit;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.ipc.RPC;
[rat:report] import org.apache.nutch.parse.ParseData;
[rat:report] import org.apache.nutch.parse.ParseText;
[rat:report] 
[rat:report] public class DistributedSegmentBean implements SegmentBean {
[rat:report] 
[rat:report]   private static final ExecutorService executor =
[rat:report]     Executors.newCachedThreadPool();
[rat:report] 
[rat:report]   private final ScheduledExecutorService pingService;
[rat:report] 
[rat:report]   private class DistSummmaryTask implements Callable<Summary[]> {
[rat:report]     private int id;
[rat:report] 
[rat:report]     private HitDetails[] details;
[rat:report]     private Query query;
[rat:report] 
[rat:report]     public DistSummmaryTask(int id) {
[rat:report]       this.id = id;
[rat:report]     }
[rat:report] 
[rat:report]     public Summary[] call() throws Exception {
[rat:report]       if (details == null) {
[rat:report]         return null;
[rat:report]       }
[rat:report]       return beans[id].getSummary(details, query);
[rat:report]     }
[rat:report] 
[rat:report]     public void setSummaryArgs(HitDetails[] details, Query query) {
[rat:report]       this.details = details;
[rat:report]       this.query = query;
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RPCSearchBean.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher;
[rat:report] 
[rat:report] import org.apache.hadoop.ipc.VersionedProtocol;
[rat:report] 
[rat:report] public interface RPCSearchBean extends SearchBean, VersionedProtocol {
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/RPCSegmentBean.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher;
[rat:report] 
[rat:report] import org.apache.hadoop.ipc.VersionedProtocol;
[rat:report] 
[rat:report] public interface RPCSegmentBean extends SegmentBean, VersionedProtocol {
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SearchBean.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] 
[rat:report] public interface SearchBean extends Searcher, HitDetailer {
[rat:report]   public static final Log LOG = LogFactory.getLog(SearchBean.class);
[rat:report] 
[rat:report]   public boolean ping() throws IOException ;
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/SegmentBean.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] public interface SegmentBean extends HitContent, HitSummarizer {
[rat:report] 
[rat:report]   public String[] getSegmentNames() throws IOException;
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] Search API
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/RequestUtils.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher.response;
[rat:report] 
[rat:report] import javax.servlet.http.HttpServletRequest;
[rat:report] 
[rat:report] import org.apache.commons.lang.StringUtils;
[rat:report] 
[rat:report] /**
[rat:report]  * A set of utility methods for getting request parameters.
[rat:report]  */
[rat:report] public class RequestUtils {
[rat:report] 
[rat:report]   public static boolean parameterExists(HttpServletRequest request, String param) {
[rat:report]     String value = request.getParameter(param);
[rat:report]     return value != null;
[rat:report]   }
[rat:report] 
[rat:report]   public static Integer getIntegerParameter(HttpServletRequest request,
[rat:report]     String param) {
[rat:report]     if (parameterExists(request, param)) {
[rat:report]       String value = request.getParameter(param);
[rat:report]       if (StringUtils.isNotBlank(value) && StringUtils.isNumeric(value)) {
[rat:report]         return new Integer(value);
[rat:report]       }
[rat:report]     }
[rat:report]     return null;
[rat:report]   }
[rat:report] 
[rat:report]   public static Integer getIntegerParameter(HttpServletRequest request,
[rat:report]     String param, Integer def) {
[rat:report]     Integer value = getIntegerParameter(request, param);
[rat:report]     return (value == null) ? def : value;
[rat:report]   }
[rat:report] 
[rat:report]   public static String getStringParameter(HttpServletRequest request,
[rat:report]     String param) {
[rat:report]     if (parameterExists(request, param)) {
[rat:report]       return request.getParameter(param);
[rat:report]     }
[rat:report]     return null;
[rat:report]   }
[rat:report] 
[rat:report]   public static String getStringParameter(HttpServletRequest request,
[rat:report]     String param, String def) {
[rat:report]     String value = getStringParameter(request, param);
[rat:report]     return (value == null) ? def : value;
[rat:report]   }
[rat:report] 
[rat:report]   public static Boolean getBooleanParameter(HttpServletRequest request,
[rat:report]     String param) {
[rat:report]     if (parameterExists(request, param)) {
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/ResponseWriter.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher.response;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import javax.servlet.http.HttpServletRequest;
[rat:report] import javax.servlet.http.HttpServletResponse;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configurable;
[rat:report] import org.apache.nutch.plugin.Pluggable;
[rat:report] 
[rat:report] /**
[rat:report]  * Nutch extension point which allows writing search results in many different
[rat:report]  * output formats.
[rat:report]  */
[rat:report] public interface ResponseWriter
[rat:report]   extends Pluggable, Configurable {
[rat:report] 
[rat:report]   public final static String X_POINT_ID = ResponseWriter.class.getName();
[rat:report]   
[rat:report]   /**
[rat:report]    * Sets the returned content MIME type.  Populated through variables set in
[rat:report]    * the plugin.xml file of the ResponseWriter.  This allows easily changing
[rat:report]    * output content types, for example for JSON from text/plain during testing
[rat:report]    * and debugging to application/json in production.
[rat:report]    * 
[rat:report]    * @param contentType The MIME content type to set.
[rat:report]    */
[rat:report]   public void setContentType(String contentType);
[rat:report] 
[rat:report]   /**
[rat:report]    * Writes out the search results response to the HttpServletResponse.
[rat:report]    * 
[rat:report]    * @param results The SearchResults object containing hits and other info.
[rat:report]    * @param request The HttpServletRequest object.
[rat:report]    * @param response The HttpServletResponse object.
[rat:report]    * 
[rat:report]    * @throws IOException If an error occurs while writing out the response.
[rat:report]    */
[rat:report]   public void writeResponse(SearchResults results, HttpServletRequest request,
[rat:report]     HttpServletResponse response)
[rat:report]     throws IOException;
[rat:report] 
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/ResponseWriters.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher.response;
[rat:report] 
[rat:report] import java.util.HashMap;
[rat:report] import java.util.Map;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.nutch.plugin.Extension;
[rat:report] import org.apache.nutch.plugin.ExtensionPoint;
[rat:report] import org.apache.nutch.plugin.PluginRepository;
[rat:report] import org.apache.nutch.plugin.PluginRuntimeException;
[rat:report] import org.apache.nutch.util.ObjectCache;
[rat:report] 
[rat:report] /**
[rat:report]  * Utility class for getting all ResponseWriter implementations and for
[rat:report]  * returning the correct ResponseWriter for a given request type.
[rat:report]  */
[rat:report] public class ResponseWriters {
[rat:report] 
[rat:report]   private Map<String, ResponseWriter> responseWriters;
[rat:report] 
[rat:report]   /**
[rat:report]    * Constructor that configures the cache of ResponseWriter objects.
[rat:report]    * 
[rat:report]    * @param conf The Nutch configuration object.
[rat:report]    */
[rat:report]   public ResponseWriters(Configuration conf) {
[rat:report] 
[rat:report]     // get the cache and the cache key
[rat:report]     String cacheKey = ResponseWriter.class.getName();
[rat:report]     ObjectCache objectCache = ObjectCache.get(conf);
[rat:report]     this.responseWriters = (Map<String, ResponseWriter>)objectCache.getObject(cacheKey);
[rat:report] 
[rat:report]     // if already populated do nothing
[rat:report]     if (this.responseWriters == null) {
[rat:report] 
[rat:report]       try {
[rat:report] 
[rat:report]         // get the extension point and all ResponseWriter extensions
[rat:report]         ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
[rat:report]           ResponseWriter.X_POINT_ID);
[rat:report]         if (point == null) {
[rat:report]           throw new RuntimeException(ResponseWriter.X_POINT_ID + " not found.");
[rat:report]         }
[rat:report] 
[rat:report]         // populate content type on the ResponseWriter classes, each response
[rat:report]         // writer can handle more than one response type
[rat:report]         Extension[] extensions = point.getExtensions();
[rat:report]         Map<String, ResponseWriter> writers = new HashMap<String, ResponseWriter>();
[rat:report]         for (int i = 0; i < extensions.length; i++) {
[rat:report]           Extension extension = extensions[i];
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/SearchResults.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher.response;
[rat:report] 
[rat:report] import org.apache.nutch.searcher.Hit;
[rat:report] import org.apache.nutch.searcher.HitDetails;
[rat:report] import org.apache.nutch.searcher.Summary;
[rat:report] 
[rat:report] public class SearchResults {
[rat:report] 
[rat:report]   private String[] fields;
[rat:report]   private String responseType;
[rat:report]   private String query;
[rat:report]   private String lang;
[rat:report]   private String sort;
[rat:report]   private boolean reverse;
[rat:report]   private boolean withSummary = true;
[rat:report]   private int start;
[rat:report]   private int rows;
[rat:report]   private int end;
[rat:report]   private long totalHits;
[rat:report]   private Hit[] hits;
[rat:report]   private HitDetails[] details;
[rat:report]   private Summary[] summaries;
[rat:report] 
[rat:report]   public SearchResults() {
[rat:report] 
[rat:report]   }
[rat:report] 
[rat:report]   public String[] getFields() {
[rat:report]     return fields;
[rat:report]   }
[rat:report] 
[rat:report]   public void setFields(String[] fields) {
[rat:report]     this.fields = fields;
[rat:report]   }
[rat:report] 
[rat:report]   public boolean isWithSummary() {
[rat:report]     return withSummary;
[rat:report]   }
[rat:report] 
[rat:report]   public void setWithSummary(boolean withSummary) {
[rat:report]     this.withSummary = withSummary;
[rat:report]   }
[rat:report] 
[rat:report]   public String getResponseType() {
[rat:report]     return responseType;
[rat:report]   }
[rat:report] 
[rat:report]   public void setResponseType(String responseType) {
[rat:report]     this.responseType = responseType;
[rat:report]   }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/searcher/response/SearchServlet.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.searcher.response;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import javax.servlet.ServletConfig;
[rat:report] import javax.servlet.ServletException;
[rat:report] import javax.servlet.http.HttpServlet;
[rat:report] import javax.servlet.http.HttpServletRequest;
[rat:report] import javax.servlet.http.HttpServletResponse;
[rat:report] 
[rat:report] import org.apache.commons.lang.StringUtils;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.nutch.searcher.Hit;
[rat:report] import org.apache.nutch.searcher.HitDetails;
[rat:report] import org.apache.nutch.searcher.Hits;
[rat:report] import org.apache.nutch.searcher.NutchBean;
[rat:report] import org.apache.nutch.searcher.Query;
[rat:report] import org.apache.nutch.searcher.Summary;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report] /**
[rat:report]  * Servlet that allows returning search results in multiple different formats
[rat:report]  * through a ResponseWriter Nutch extension point.
[rat:report]  * 
[rat:report]  * @see org.apache.nutch.searcher.response.ResponseWriter
[rat:report]  */
[rat:report] public class SearchServlet
[rat:report]   extends HttpServlet {
[rat:report] 
[rat:report]   public static final Log LOG = LogFactory.getLog(SearchServlet.class);
[rat:report]   private NutchBean bean;
[rat:report]   private Configuration conf;
[rat:report]   private ResponseWriters writers;
[rat:report] 
[rat:report]   private String defaultRespType = "xml";
[rat:report]   private String defaultLang = null;
[rat:report]   private int defaultNumRows = 10;
[rat:report]   private String defaultDedupField = "site";
[rat:report]   private int defaultNumDupes = 1;
[rat:report] 
[rat:report]   public static final String RESPONSE_TYPE = "rt";
[rat:report]   public static final String QUERY = "query";
[rat:report]   public static final String LANG = "lang";
[rat:report]   public static final String START = "start";
[rat:report]   public static final String ROWS = "rows";
[rat:report]   public static final String SORT = "sort";
[rat:report]   public static final String REVERSE = "reverse";
[rat:report]   public static final String DEDUPE = "ddf";
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.segment;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.mapred.FileSplit;
[rat:report] import org.apache.hadoop.mapred.InputSplit;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.RecordReader;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.mapred.SequenceFileRecordReader;
[rat:report] import org.apache.nutch.protocol.Content;
[rat:report] 
[rat:report] /**
[rat:report]  * An input format that takes Nutch Content objects and converts them to text
[rat:report]  * while converting newline endings to spaces.  This format is useful for working
[rat:report]  * with Nutch content objects in Hadoop Streaming with other languages.
[rat:report]  */
[rat:report] public class ContentAsTextInputFormat
[rat:report]   extends SequenceFileInputFormat<Text, Text> {
[rat:report] 
[rat:report]   private static class ContentAsTextRecordReader
[rat:report]     implements RecordReader<Text, Text> {
[rat:report] 
[rat:report]     private final SequenceFileRecordReader<Text, Content> sequenceFileRecordReader;
[rat:report] 
[rat:report]     private Text innerKey;
[rat:report]     private Content innerValue;
[rat:report] 
[rat:report]     public ContentAsTextRecordReader(Configuration conf, FileSplit split)
[rat:report]       throws IOException {
[rat:report]       sequenceFileRecordReader = new SequenceFileRecordReader<Text, Content>(
[rat:report]         conf, split);
[rat:report]       innerKey = (Text)sequenceFileRecordReader.createKey();
[rat:report]       innerValue = (Content)sequenceFileRecordReader.createValue();
[rat:report]     }
[rat:report] 
[rat:report]     public Text createKey() {
[rat:report]       return new Text();
[rat:report]     }
[rat:report] 
[rat:report]     public Text createValue() {
[rat:report]       return new Text();
[rat:report]     }
[rat:report] 
[rat:report]     public synchronized boolean next(Text key, Text value)
[rat:report]       throws IOException {
[rat:report]       
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/ResolveUrls.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.tools;
[rat:report] 
[rat:report] import java.io.BufferedReader;
[rat:report] import java.io.File;
[rat:report] import java.io.FileReader;
[rat:report] import java.net.InetAddress;
[rat:report] import java.util.concurrent.ExecutorService;
[rat:report] import java.util.concurrent.Executors;
[rat:report] import java.util.concurrent.TimeUnit;
[rat:report] import java.util.concurrent.atomic.AtomicInteger;
[rat:report] import java.util.concurrent.atomic.AtomicLong;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.nutch.util.URLUtil;
[rat:report] 
[rat:report] /**
[rat:report]  * A simple tool that will spin up multiple threads to resolve urls to ip
[rat:report]  * addresses. This can be used to verify that pages that are failing due to
[rat:report]  * UnknownHostException during fetching are actually bad and are not failing due
[rat:report]  * to a dns problem in fetching.
[rat:report]  */
[rat:report] public class ResolveUrls {
[rat:report] 
[rat:report]   public static final Log LOG = LogFactory.getLog(ResolveUrls.class);
[rat:report] 
[rat:report]   private String urlsFile = null;
[rat:report]   private int numThreads = 100;
[rat:report]   private ExecutorService pool = null;
[rat:report]   private static AtomicInteger numTotal = new AtomicInteger(0);
[rat:report]   private static AtomicInteger numErrored = new AtomicInteger(0);
[rat:report]   private static AtomicInteger numResolved = new AtomicInteger(0);
[rat:report]   private static AtomicLong totalTime = new AtomicLong(0L);
[rat:report] 
[rat:report]   /**
[rat:report]    * A Thread which gets the ip address of a single host by name.
[rat:report]    */
[rat:report]   private static class ResolverThread
[rat:report]     extends Thread {
[rat:report] 
[rat:report]     private String url = null;
[rat:report] 
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/SearchLoadTester.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.tools;
[rat:report] 
[rat:report] import java.io.BufferedReader;
[rat:report] import java.io.File;
[rat:report] import java.io.FileReader;
[rat:report] import java.io.IOException;
[rat:report] import java.util.concurrent.ExecutorService;
[rat:report] import java.util.concurrent.Executors;
[rat:report] import java.util.concurrent.TimeUnit;
[rat:report] import java.util.concurrent.atomic.AtomicInteger;
[rat:report] import java.util.concurrent.atomic.AtomicLong;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.nutch.searcher.Hits;
[rat:report] import org.apache.nutch.searcher.NutchBean;
[rat:report] import org.apache.nutch.searcher.Query;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report] /**
[rat:report]  * <p>A simple tool to perform load testing on configured search servers.  A 
[rat:report]  * queries file can be specified with a list of different queries to run against
[rat:report]  * the search servers.  The number of threads used to perform concurrent
[rat:report]  * searches is also configurable.</p>
[rat:report]  * 
[rat:report]  * <p>This tool will output approximate times for running all queries in the 
[rat:report]  * queries file.  If configured it will also print out individual queries times
[rat:report]  * to the log.</p>
[rat:report]  */
[rat:report] public class SearchLoadTester {
[rat:report] 
[rat:report]   public static final Log LOG = LogFactory.getLog(SearchLoadTester.class);
[rat:report] 
[rat:report]   private String queriesFile = null;
[rat:report]   private int numThreads = 100;
[rat:report]   private boolean showTimes = false;
[rat:report]   private ExecutorService pool = null;
[rat:report]   private static AtomicInteger numTotal = new AtomicInteger(0);
[rat:report]   private static AtomicInteger numErrored = new AtomicInteger(0);
[rat:report]   private static AtomicInteger numResolved = new AtomicInteger(0);
[rat:report]   private static AtomicLong totalTime = new AtomicLong(0L);
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/tools/compat/ReprUrlFixer.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.tools.compat;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] import java.net.MalformedURLException;
[rat:report] import java.net.URL;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.Iterator;
[rat:report] import java.util.List;
[rat:report] import java.util.Random;
[rat:report] 
[rat:report] import org.apache.commons.cli.CommandLine;
[rat:report] import org.apache.commons.cli.CommandLineParser;
[rat:report] import org.apache.commons.cli.GnuParser;
[rat:report] import org.apache.commons.cli.HelpFormatter;
[rat:report] import org.apache.commons.cli.Option;
[rat:report] import org.apache.commons.cli.OptionBuilder;
[rat:report] import org.apache.commons.cli.Options;
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.conf.Configured;
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.MapWritable;
[rat:report] import org.apache.hadoop.io.Text;
[rat:report] import org.apache.hadoop.io.WritableUtils;
[rat:report] import org.apache.hadoop.mapred.FileInputFormat;
[rat:report] import org.apache.hadoop.mapred.FileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.JobClient;
[rat:report] import org.apache.hadoop.mapred.JobConf;
[rat:report] import org.apache.hadoop.mapred.MapFileOutputFormat;
[rat:report] import org.apache.hadoop.mapred.OutputCollector;
[rat:report] import org.apache.hadoop.mapred.Reducer;
[rat:report] import org.apache.hadoop.mapred.Reporter;
[rat:report] import org.apache.hadoop.mapred.SequenceFileInputFormat;
[rat:report] import org.apache.hadoop.util.StringUtils;
[rat:report] import org.apache.hadoop.util.Tool;
[rat:report] import org.apache.hadoop.util.ToolRunner;
[rat:report] import org.apache.nutch.crawl.CrawlDatum;
[rat:report] import org.apache.nutch.crawl.CrawlDb;
[rat:report] import org.apache.nutch.metadata.Nutch;
[rat:report] import org.apache.nutch.scoring.webgraph.Node;
[rat:report] import org.apache.nutch.util.FSUtils;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] import org.apache.nutch.util.NutchJob;
[rat:report] import org.apache.nutch.util.URLUtil;
[rat:report] 
[rat:report] /**
[rat:report]  * <p>
[rat:report]  * Significant changes were made to representative url logic used for redirects.
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/EncodingDetector.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.util;
[rat:report] 
[rat:report] import java.io.BufferedInputStream;
[rat:report] import java.io.ByteArrayOutputStream;
[rat:report] import java.io.FileInputStream;
[rat:report] import java.io.IOException;
[rat:report] import java.nio.charset.Charset;
[rat:report] import java.util.ArrayList;
[rat:report] import java.util.HashMap;
[rat:report] import java.util.HashSet;
[rat:report] import java.util.List;
[rat:report] 
[rat:report] import org.apache.commons.logging.Log;
[rat:report] import org.apache.commons.logging.LogFactory;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.nutch.metadata.Metadata;
[rat:report] import org.apache.nutch.net.protocols.Response;
[rat:report] import org.apache.nutch.protocol.Content;
[rat:report] import org.apache.nutch.util.LogUtil;
[rat:report] import org.apache.nutch.util.NutchConfiguration;
[rat:report] 
[rat:report] import com.ibm.icu.text.CharsetDetector;
[rat:report] import com.ibm.icu.text.CharsetMatch;
[rat:report] 
[rat:report] /**
[rat:report]  * A simple class for detecting character encodings.
[rat:report]  *
[rat:report]  * <p>
[rat:report]  * Broadly this encompasses two functions, which are distinctly separate:
[rat:report]  *
[rat:report]  * <ol>
[rat:report]  *  <li>Auto detecting a set of "clues" from input text.</li>
[rat:report]  *  <li>Taking a set of clues and making a "best guess" as to the
[rat:report]  *      "real" encoding.</li>
[rat:report]  * </ol>
[rat:report]  * </p>
[rat:report]  *
[rat:report]  * <p>
[rat:report]  * A caller will often have some extra information about what the
[rat:report]  * encoding might be (e.g. from the HTTP header or HTML meta-tags, often
[rat:report]  * wrong but still potentially useful clues). The types of clues may differ
[rat:report]  * from caller to caller. Thus a typical calling sequence is:
[rat:report]  * <ul>
[rat:report]  *    <li>Run step (1) to generate a set of auto-detected clues;</li>
[rat:report]  *    <li>Combine these clues with the caller-dependent "extra clues"
[rat:report]  *        available;</li>
[rat:report]  *    <li>Run step (2) to guess what the most probable answer is.</li>
[rat:report]  * </p>
[rat:report]  */
[rat:report] public class EncodingDetector {
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/FSUtils.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.util;
[rat:report] 
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.fs.FileSystem;
[rat:report] import org.apache.hadoop.fs.Path;
[rat:report] import org.apache.hadoop.io.MapFile;
[rat:report] import org.apache.hadoop.io.SequenceFile;
[rat:report] 
[rat:report] /**
[rat:report]  * Utility methods for common filesystem operations.
[rat:report]  */
[rat:report] public class FSUtils {
[rat:report] 
[rat:report]   /**
[rat:report]    * Replaces the current path with the new path and if set removes the old
[rat:report]    * path. If removeOld is set to false then the old path will be set to the
[rat:report]    * name current.old.
[rat:report]    * 
[rat:report]    * @param fs The FileSystem.
[rat:report]    * @param current The end path, the one being replaced.
[rat:report]    * @param replacement The path to replace with.
[rat:report]    * @param removeOld True if we are removing the current path.
[rat:report]    * 
[rat:report]    * @throws IOException If an error occurs during replacement.
[rat:report]    */
[rat:report]   public static void replace(FileSystem fs, Path current, Path replacement,
[rat:report]     boolean removeOld)
[rat:report]     throws IOException {
[rat:report] 
[rat:report]     // rename any current path to old
[rat:report]     Path old = new Path(current + ".old");
[rat:report]     if (fs.exists(current)) {
[rat:report]       fs.rename(current, old);
[rat:report]     }
[rat:report] 
[rat:report]     // rename the new path to current and remove the old path if needed
[rat:report]     fs.rename(replacement, current);
[rat:report]     if (fs.exists(old) && removeOld) {
[rat:report]       fs.delete(old, true);
[rat:report]     }
[rat:report]   }
[rat:report] 
[rat:report]   /**
[rat:report]    * Closes a group of SequenceFile readers.
[rat:report]    * 
[rat:report]    * @param readers The SequenceFile readers to close.
[rat:report]    * @throws IOException If an error occurs while closing a reader.
[rat:report]    */
[rat:report]   public static void closeReaders(SequenceFile.Reader[] readers)
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/GenericWritableConfigurable.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.util;
[rat:report] 
[rat:report] import java.io.DataInput;
[rat:report] import java.io.IOException;
[rat:report] 
[rat:report] import org.apache.hadoop.conf.Configurable;
[rat:report] import org.apache.hadoop.conf.Configuration;
[rat:report] import org.apache.hadoop.io.GenericWritable;
[rat:report] import org.apache.hadoop.io.Writable;
[rat:report] 
[rat:report] /** A generic Writable wrapper that can inject Configuration to {@link Configurable}s */ 
[rat:report] public abstract class GenericWritableConfigurable extends GenericWritable 
[rat:report]                                                   implements Configurable {
[rat:report] 
[rat:report]   private Configuration conf;
[rat:report]   
[rat:report]   public Configuration getConf() {
[rat:report]     return conf;
[rat:report]   }
[rat:report] 
[rat:report]   public void setConf(Configuration conf) {
[rat:report]     this.conf = conf;
[rat:report]   }
[rat:report]   
[rat:report]   @Override
[rat:report]   public void readFields(DataInput in) throws IOException {
[rat:report]     byte type = in.readByte();
[rat:report]     Class clazz = getTypes()[type];
[rat:report]     try {
[rat:report]       set((Writable) clazz.newInstance());
[rat:report]     } catch (Exception e) {
[rat:report]       e.printStackTrace();
[rat:report]       throw new IOException("Cannot initialize the class: " + clazz);
[rat:report]     }
[rat:report]     Writable w = get();
[rat:report]     if (w instanceof Configurable)
[rat:report]       ((Configurable)w).setConf(conf);
[rat:report]     w.readFields(in);
[rat:report]   }
[rat:report]   
[rat:report] }
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/NodeWalker.java
[rat:report]  =======================================================================
[rat:report] package org.apache.nutch.util;
[rat:report] 
[rat:report] import java.util.Stack;
[rat:report] 
[rat:report] import org.w3c.dom.Node;
[rat:report] import org.w3c.dom.NodeList;
[rat:report] 
[rat:report] /**
[rat:report]  * <p>A utility class that allows the walking of any DOM tree using a stack 
[rat:report]  * instead of recursion.  As the node tree is walked the next node is popped
[rat:report]  * off of the stack and all of its children are automatically added to the 
[rat:report]  * stack to be called in tree order.</p>
[rat:report]  * 
[rat:report]  * <p>Currently this class is not thread safe.  It is assumed that only one
[rat:report]  * thread will be accessing the <code>NodeWalker</code> at any given time.</p>
[rat:report]  */
[rat:report] public class NodeWalker {
[rat:report] 
[rat:report]   // the root node and the stack holding the nodes
[rat:report]   private Node currentNode;
[rat:report]   private NodeList currentChildren;
[rat:report]   private Stack<Node> nodes;
[rat:report]   
[rat:report]   /**
[rat:report]    * Starts the <code>Node</code> tree from the root node.
[rat:report]    * 
[rat:report]    * @param rootNode
[rat:report]    */
[rat:report]   public NodeWalker(Node rootNode) {
[rat:report] 
[rat:report]     nodes = new Stack<Node>();
[rat:report]     nodes.add(rootNode);
[rat:report]   }
[rat:report]   
[rat:report]   /**
[rat:report]    * <p>Returns the next <code>Node</code> on the stack and pushes all of its
[rat:report]    * children onto the stack, allowing us to walk the node tree without the
[rat:report]    * use of recursion.  If there are no more nodes on the stack then null is
[rat:report]    * returned.</p>
[rat:report]    * 
[rat:report]    * @return Node The next <code>Node</code> on the stack or null if there
[rat:report]    * isn't a next node.
[rat:report]    */
[rat:report]   public Node nextNode() {
[rat:report]     
[rat:report]     // if no next node return null
[rat:report]     if (!hasNext()) {
[rat:report]       return null;
[rat:report]     }
[rat:report]     
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/org/apache/nutch/util/domain/package.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <body>
[rat:report] <h2> org.apache.nutch.util.domain</h2>
[rat:report] 
[rat:report] <p>This package contains classes for domain analysis.</p>
[rat:report] 
[rat:report] For more information, please refer to the following URLs:
[rat:report] <ul>
[rat:report] <li><a href="http://en.wikipedia.org/wiki/DNS">http://en.wikipedia.org/wiki/DNS</a></li>
[rat:report] <li><a href="http://en.wikipedia.org/wiki/Top-level_domain">http://en.wikipedia.org/wiki/Top-level_domain</a></li>
[rat:report] <li><a href="http://wiki.mozilla.org/TLD_List">http://wiki.mozilla.org/TLD_List</a></li>
[rat:report] <li><a href="http://publicsuffix.org/">http://publicsuffix.org/</a></li>
[rat:report] </ul>
[rat:report] 
[rat:report] </body>
[rat:report] </html>
[rat:report] 
[rat:report]  =======================================================================
[rat:report]  ==/home/sam/workspace/nutch-trunk-eu/src/java/overview.html
[rat:report]  =======================================================================
[rat:report] <html>
[rat:report] <head>
[rat:report]    <title>Nutch</title>
[rat:report] </head>
[rat:report] <body>
[rat:report] Nutch is the open-source search engine.<p>
[rat:report] </body>
[rat:report] </html>
[rat:report] 

BUILD SUCCESSFUL
Total time: 1 second
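
For reference, each file dumped above starts directly with its package statement or markup, without a license header, which is why RAT lists it under unknown licenses. Below is a minimal sketch of that kind of check. It is not RAT's actual implementation: the class name HeaderCheck, the method hasAsfHeader, and the 20-line limit are hypothetical, chosen only for illustration. The marker string is the opening of the standard ASF source header.

```java
import java.util.Arrays;
import java.util.List;

public class HeaderCheck {

  // Opening words of the standard ASF source header.
  static final String ASF_MARKER =
      "Licensed to the Apache Software Foundation (ASF)";

  // Hypothetical limit on how far into the file to look (an assumption).
  static final int MAX_LINES = 20;

  /** Returns true if one of the first MAX_LINES lines contains the marker. */
  static boolean hasAsfHeader(List<String> lines) {
    int limit = Math.min(MAX_LINES, lines.size());
    for (int i = 0; i < limit; i++) {
      if (lines.get(i).contains(ASF_MARKER)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Head of a file as flagged in the report: no header at all.
    List<String> flagged = Arrays.asList(
        "package org.apache.nutch.util;",
        "",
        "import java.io.IOException;");
    // The same file once a header comment has been added.
    List<String> fixed = Arrays.asList(
        "/**",
        " * Licensed to the Apache Software Foundation (ASF) under one or more",
        " * contributor license agreements.",
        " */",
        "package org.apache.nutch.util;");
    System.out.println(hasAsfHeader(flagged));
    System.out.println(hasAsfHeader(fixed));
  }
}
```

A file passes this sketch only when the marker appears near its top, so adding the header comment above the package statement is enough to change the result.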

> Fix missing/wrong headers in source files
> -----------------------------------------
>
>                 Key: NUTCH-688
>                 URL: https://issues.apache.org/jira/browse/NUTCH-688
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Sami Siren
>            Assignee: Sami Siren
>            Priority: Blocker
>             Fix For: 1.0.0
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

