lucene-java-user mailing list archives

From Ilya Zavorin
Subject need to find locations of query hits in doc: works fine for regular text but not for phone numbers
Date Thu, 14 Jun 2012 02:52:19 GMT
Hello All,

I am using Lucene 3.4. I need to find the locations of query hits in a document. What I've implemented
works fine for textual queries but does not work for phone numbers.

Here's how I index my docs:

String oc = "Joe dialed 800-555-1212 but got a busy signal";
doc.add(new Field("contents", 

Now, here's how I find the locations. I run the query. If I get a hit, I split the query (in
case it's multi-word) into words and look up each of them using TermFreqVector like this:

//String qstr = "my multiword query";	// for queries like this it works fine...
String qstr = "800-555-1212";	// ...but not for ones like this
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

String[] subTerms = qstr.split("\\s+");	// phone string stays intact here

for (int i = 0; i < hits.length; i++) {
	int docId = hits[i].doc;
	Document doc = searcher.doc(docId);
	TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");  
	TermPositionVector tpvector = (TermPositionVector)tfvector;   
	for (String subTerm : subTerms) {
		String subq = subTerm.toLowerCase();
		int termidx = tfvector.indexOf(subq);  // get termidx = -1 here
		TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
		for (int j = 0; j < tvoffsetinfo.length; j++) {
			int offsetStart = tvoffsetinfo[j].getStartOffset();
			int offsetEnd = tvoffsetinfo[j].getEndOffset();
			// ...

For a query like "800-555-1212", tfvector.indexOf returns -1. What am I doing wrong? 
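
To illustrate the kind of mismatch that could produce the -1, here is a minimal sketch in plain Java with no Lucene calls. It assumes, without this message confirming it, that the analyzer used at index time breaks text on punctuation such as hyphens. The class QueryTermSplit and the method punctuationSplit are hypothetical stand-ins for illustration, not Lucene APIs. Under that assumption the indexed terms are "800", "555" and "1212", so a lookup of the whole string "800-555-1212" finds nothing, while the individual pieces are found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryTermSplit {

    // What the posted code does to the query: lowercase, split on whitespace only.
    static List<String> whitespaceSplit(String query) {
        return Arrays.asList(query.toLowerCase().trim().split("\\s+"));
    }

    // Hypothetical stand-in for an analyzer that breaks on punctuation.
    // ASSUMPTION: the real index-time analyzer behaves roughly like this.
    static List<String> punctuationSplit(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String content = "Joe dialed 800-555-1212 but got a busy signal";
        String qstr = "800-555-1212";

        // Terms as they would be stored under the assumption above.
        List<String> indexed = punctuationSplit(content);

        // Whitespace split keeps the phone string intact, so the lookup misses.
        for (String term : whitespaceSplit(qstr)) {
            System.out.println(term + " -> " + indexed.indexOf(term));
        }

        // Splitting the query the same way as the content finds every piece.
        for (String term : punctuationSplit(qstr)) {
            System.out.println(term + " -> " + indexed.indexOf(term));
        }
    }
}
```

If the assumption holds, the query string would need to be broken into terms by the same analyzer that was used for indexing, and any term whose lookup returns -1 would need to be skipped before asking for its offsets.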


Ilya Zavorin
