nutch-dev mailing list archives

From "Sagar Vibhute" <vibhutesa...@gmail.com>
Subject Re: Hits estimation?
Date Tue, 02 Oct 2007 10:52:58 GMT
Hi,

I believe you are looking for something entirely different. I was assuming
you only wanted the term counts for a completed crawl.
Anyway, here is the command I was talking about:

nutch org.apache.nutch.indexer.HighFreqTerms ./nutch_crawl/index

where 'nutch_crawl' is the directory in which the crawl results were stored
when I performed the crawl. I am including a sample result as well (below).

----------------------------------------------------------------------------------------
nutch org.apache.nutch.indexer.HighFreqTerms ./nutch_crawl/index

content:into 126
content:some 128
content:our 128
content:nutch 128
content:changes 128
content:for-the 128
content:list 129
url:nutch 129
content:has 130
content:last 130
content:information 130
content:help 131
content:mailing 132
content:under 132
content:content 133
content:html 133
content:license 135
content:java 136
content:source 136
content:one 137
content:open 137
content:faq 138
content:how 138
content:which 139
content:home 140
content:4 140
content:on-the 141
content:http 143
content:projects 145
content:version 146
content:project 146
content:using 147
content:3 147
content:foundation 155
content:the-apache 155
content:is-a 156
content:also 156
content:web 159
content:other 161
content:copyright 161
content:have 161
content:text 161
content:we 162
content:new 163
content:like 164
content:lists 164
content:see 168
content:will 169
content:if 171
content:not 172
content:in-the 172
content:page 173
content:wiki 177
content:org 177
content:to-the 178
content:your 178
content:1 179
content:get 187
content:2 189
content:an 192
content:can 194
content:software 195
content:about 195
content:all 199
content:search 200
content:s 200
content:as 202
content:or 203
content:2007 206
content:site 206
content:it 209
content:use 213
content:at 214
content:be 217
content:apache 221
content:that 224
content:more 225
content:from 227
content:of-the 227
content:you 229
content:are 234
content:0 238
content:with 240
content:on 245
content:by 248
host:apache 251
url:apache 252
content:this 256
content:in 275
content:is 277
content:for 287
content:of 291
content:and 297
content:a 300
host:org 300
content:to 300
url:org 300
content:the 315
url:http 358
----------------------------------------------------------------------------------------
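
If you want to post-process a listing like the one above, here is a minimal sketch in Python. It assumes every result line has the form "field:term count", as in the sample, and skips anything else (the echoed command, the separator lines). The function names parse_high_freq_terms and top_terms_by_field are made up for this illustration; they are not part of Nutch.

```python
from collections import defaultdict


def parse_high_freq_terms(output):
    """Parse lines of the form 'field:term count' into (field, term, count) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines, the echoed command, and separator lines:
        # a result line has exactly two tokens, 'field:term' and a number.
        if len(parts) != 2 or ":" not in parts[0] or not parts[1].isdigit():
            continue
        field, term = parts[0].split(":", 1)
        rows.append((field, term, int(parts[1])))
    return rows


def top_terms_by_field(rows, n=3):
    """Group parsed rows by field and keep the n highest counts per field."""
    by_field = defaultdict(list)
    for field, term, count in rows:
        by_field[field].append((term, count))
    return {
        field: sorted(terms, key=lambda tc: tc[1], reverse=True)[:n]
        for field, terms in by_field.items()
    }


# A few lines copied from the sample result above.
sample = """\
content:and 297
content:a 300
host:org 300
content:the 315
url:http 358
"""

rows = parse_high_freq_terms(sample)
print(top_terms_by_field(rows, n=2))
```

Grouping by field keeps the 'content', 'host' and 'url' counts apart, since the same term (for example 'org' or 'apache') appears under more than one field in the listing.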

- Sagar
