lucene-dev mailing list archives

From Martin Gainty <mgai...@hotmail.com>
Subject RE: Feedback of my Phd work in Lucene and Solr project
Date Thu, 10 Dec 2015 13:21:34 GMT



From: igor.wiese@gmail.com
Date: Wed, 9 Dec 2015 23:48:10 +0000
Subject: Feedback of my Phd work in Lucene and Solr project
To: dev@lucene.apache.org

Hi, Lucene and Solr Community. 
My name is Igor Wiese, a PhD student from Brazil. In my research I am investigating two important
questions: What makes two files change together? Can we predict when they are going to co-change
again? 
I've tried to investigate these questions on the Lucene and Solr projects. I collected data
from issue reports, discussions and commits, and used machine learning techniques to
build a prediction model.
I collected a total of 1382 commits in which a pair of files changed together and could correctly
predict 66% of those commits in the Lucene project. For the Solr project I collected a total of 111
commits in which a pair of files changed together and could correctly predict 47% of them.
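The counting step described above (commits in which a pair of files changed together) can be sketched roughly as follows. This is only an illustration: the class name `CoChangeCounter`, the `" + "` key format and the sample commits are assumptions made for the example, not the tooling used in the study.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: counts, for every pair of files, the number of
// commits in which both files were changed together.
class CoChangeCounter {

    // Each commit is modelled as the set of file paths it touched.
    static Map<String, Integer> countCoChanges(List<Set<String>> commits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> files : commits) {
            // Sort so that (a, b) and (b, a) map to the same key.
            List<String> sorted = new ArrayList<>(files);
            Collections.sort(sorted);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    String pair = sorted.get(i) + " + " + sorted.get(j);
                    counts.merge(pair, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the Lucene history.
        List<Set<String>> commits = List.of(
                Set.of("IndexWriter.java", "StandardDirectoryReader.java"),
                Set.of("IndexWriter.java"),
                Set.of("IndexWriter.java", "StandardDirectoryReader.java", "Other.java"));
        countCoChanges(commits)
                .forEach((pair, n) -> System.out.println(pair + ": " + n));
    }
}
```

Commits that touch only one file contribute nothing, which matches the distinction drawn later between co-change commits and commits where only the first file changed.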
These were the most useful pieces of information for predicting co-changes of files:
- number of lines of code added,
- number of lines of code removed,
- sum of the number of lines of code added, modified and removed,
- number of words used to describe and discuss the issues, and
- median value of closeness, a social network measure obtained from issue comments.
To illustrate, consider the following example from our analysis of the Lucene project. For release
4.7, the files "lucene/index/IndexWriter.java" and "lucene/index/StandardDirectoryReader.java"
changed together in 4 commits. In another 11 commits, only the first file changed, but not
the second. By collecting contextual information for each commit made to the first file in the previous
release, we were able to predict 3 commits in which both files changed together in release
4.7, with 0 false positives and one wrong prediction. For this pair of files, the
most important contextual information was the number of lines of code added in each commit,
the number of words used to describe and discuss the issues, the number of comments in each
issue and the social network metric (closeness) obtained from issue comments.
MG>if the pairing was 100% accurate then yes, a predictor for both files changing indicates a design
MG>issue is lurking, i.e.
MG>IndexWriter and StandardDirectoryReader "share functionality", which would suggest breaking
MG>the shared methods out into an interface and
MG>refactoring IndexWriter and StandardDirectoryReader to each implement that shared interface
MG>if attributes are to be shared then perhaps an abstract class should be created to contain
MG>those shared attributes and implement
MG>the shared methods
MG>refactoring IndexWriter and StandardDirectoryReader to extend the abstract class should
MG>force the implementor to override/reuse
MG>the shared attributes in the abstract base class?
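The refactoring suggested in the reply (a shared interface, plus an abstract base class holding shared attributes) might look roughly like the sketch below. Every name in it (`SegmentStateAccess`, `AbstractSegmentStateHolder`, `WriterSketch`, `ReaderSketch`, the `generation` field) is hypothetical and chosen only for illustration; none of it reflects the actual Lucene classes or what they really share.

```java
// Demo entry point, listed first so the file can be run directly.
class SharedBaseDemo {
    public static void main(String[] args) {
        WriterSketch writer = new WriterSketch();
        ReaderSketch reader = new ReaderSketch();
        writer.advanceGeneration();
        reader.advanceGeneration();
        System.out.println("writer flushes = " + writer.flushes);
        System.out.println("reader stale   = " + reader.stale);
    }
}

// Hypothetical contract for the functionality both classes share.
interface SegmentStateAccess {
    long currentGeneration();
    void advanceGeneration();
}

// Hypothetical abstract base class: owns the shared attribute and
// implements the shared methods once.
abstract class AbstractSegmentStateHolder implements SegmentStateAccess {
    protected long generation; // shared attribute

    @Override
    public long currentGeneration() {
        return generation;
    }

    @Override
    public void advanceGeneration() {
        generation++;
        onGenerationChanged(); // subclasses must say how they react
    }

    protected abstract void onGenerationChanged();
}

// Stand-in for a writer-like class; not the real IndexWriter.
class WriterSketch extends AbstractSegmentStateHolder {
    int flushes;

    @Override
    protected void onGenerationChanged() {
        flushes++;
    }
}

// Stand-in for a reader-like class; not the real StandardDirectoryReader.
class ReaderSketch extends AbstractSegmentStateHolder {
    boolean stale;

    @Override
    protected void onGenerationChanged() {
        stale = true;
    }
}
```

With this shape, a change to the shared logic lands in one place (the base class) instead of in two files that have to be edited together, which is the design concern the reply raises.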
- Do these results surprise you? Can you think of any explanation for the results?
- Do you think that our rate of prediction is good enough to be used for building tool support for
the software community?
MG>if the plugin can predict with 100% accuracy?
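To make "rate of prediction" concrete, the IndexWriter/StandardDirectoryReader example from the message can be scored with precision and recall. The sketch below assumes that the "one wrong prediction" is the one co-change commit that was missed (4 actual co-changes, 3 predicted), i.e. a false negative; that reading is an assumption, and the class name `PredictionScore` is hypothetical.

```java
import java.util.Locale;

// Hypothetical scoring helper for co-change predictions.
class PredictionScore {

    // Fraction of predicted co-changes that really were co-changes.
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Fraction of real co-changes that were predicted.
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    public static void main(String[] args) {
        int tp = 3; // co-change commits predicted correctly (from the message)
        int fp = 0; // false positives (from the message)
        int fn = 1; // ASSUMPTION: the "one wrong prediction" is a missed co-change
        System.out.println(String.format(Locale.ROOT,
                "precision=%.2f recall=%.2f", precision(tp, fp), recall(tp, fn)));
        // prints "precision=1.00 recall=0.75"
    }
}
```

Under that assumption the tool never raised a false alarm for this pair but missed one in four real co-changes, which is a different picture from the single "100% accuracy" figure asked about in the reply.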

- Do you have any suggestions on what can be done to improve the change recommendation?
MG>create the tool as a maven plugin so we can bind this functionality to one of the pre-compile phases,
MG>e.g. process-sources?
You can visit a webpage to inspect the results in detail:
Lucene Project: http://flosscoach.com/index.php/17-cochanges/73-lucene
Solr Project: http://flosscoach.com/index.php/17-cochanges/74-solr

All the best,
Igor Wiese
PhD Candidate
MG>Thank you, from the USA