lucene-solr-user mailing list archives

From Nan Yu
Subject number of files indexed (re-formatted)
Date Wed, 18 Dec 2019 15:40:14 GMT
Sorry, I just found out that the mailing list only takes plain text, so my previous post looked
really messy. Here it is again, reformatted.

    I did a simple indexing of a directory that contains a lot of pdf, text, doc, zip, etc. files.
The content of the files has no structure. I would like to index them and later
search for "key words" within the files.

    After creating the core, I indexed the files in the directory using the following command: 

bin/post -p 8983 -m 10g -c myCore /DATA_FOLDER > solr_indexing.log

    The log file shows something like the following (the first and last few lines of the log file):

java -classpath /solr/solr-8.3.0/dist/solr-core-8.3.0.jar -Dauto=yes -Dport=8983 -Dm=15g -Dc=myCore
-Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool /DATA_FOLDER
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/myCore/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file Report.pdf (application/pdf) to [base]/extract
47256 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/myCore/update...
Time spent: 1:03:59.587

But when I look at the result in a browser, the core "Overview" page (http://localhost:8983/solr/#/myCore/core-overview)
shows:
Num Docs: 47648

For most of the indexed files, the id field is the full path of the file,
such as /DATA_FOLDER/20180321/Report.pdf

But for about 400 of the documents, the id looks like this: 232d7bd6-c586-4726-8d2b-bc9b1febcff4.

So my questions are:
(1) Why are the two numbers different (in the log file vs. in the overview)?
(2) For the ids that are not the full path of a file, how do I find out where they come from (the
original file)?
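
To make the gap behind both questions concrete, here is a minimal Python sketch that computes the
difference between the two counts and separates path-style ids from UUID-style ones. The helper
names (is_uuid, split_ids) and the short sample list are hypothetical and not part of Solr; only
the two counts and the two sample ids come from the output above. It assumes the ids have already
been exported from the core by some other means.

```python
import uuid

# Counts taken from the post: files reported by bin/post vs. Num Docs in the overview.
files_posted = 47256
num_docs = 47648
extra_docs = num_docs - files_posted  # documents not accounted for by the log


def is_uuid(value: str) -> bool:
    """Return True if value is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def split_ids(ids):
    """Split ids into (path-style, UUID-style, anything else)."""
    paths, uuids, other = [], [], []
    for doc_id in ids:
        if doc_id.startswith("/"):
            paths.append(doc_id)
        elif is_uuid(doc_id):
            uuids.append(doc_id)
        else:
            other.append(doc_id)
    return paths, uuids, other


# Hypothetical sample: one id of each kind mentioned in the post.
sample_ids = [
    "/DATA_FOLDER/20180321/Report.pdf",
    "232d7bd6-c586-4726-8d2b-bc9b1febcff4",
]
paths, uuids, other = split_ids(sample_ids)
print(extra_docs, len(paths), len(uuids), len(other))
```

The difference between the two counts is 392, which matches the "about 400" documents with
UUID-style ids.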

Thanks for your help!

PS: a few examples of query results for those strange ids:

        "bolt-small-online":["Test strip-north"],


