How to run the BM25 similarity function using Lucene

Step 1: Download Lucene version 2.4 from here.

Step 2: Download the jar file from here and add it to your classpath. You can also download the Javadoc for this package from here.

Step 3: Import the package src/org/ninit/models/bm25 into Lucene's code.

Step 4: Open BM25TermScorer.java in the package src/org/ninit/models/bm25 and replace the method public float score() with the following:

public float score() throws IOException {
   // IDF refers to the inverse document frequency (idf(qi,d)) and
   // TF25 refers to the second factor in the definition of the BM25 scoring function
   float TF25;
   float num25;
   float den25;
   float length;
   float norm = Similarity.decodeNorm(this.norm[this.doc()]);
   length = 1 / (norm * norm);
   den25 = this.b * (length / this.av_length);
   den25 = 1 - this.b + den25;
   den25 = this.k1 * den25;
   den25 = this.termDocs.freq() + den25;
   num25 = this.k1 + 1;
   num25 = num25 * this.termDocs.freq();
   TF25 = num25 / den25;
   return TF25 * this.idf;
}
//end of score
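
To see what the method above computes without a Lucene index at hand, the same arithmetic can be sketched in plain Java. This is only an illustration: the class and method names (Bm25TermWeightSketch, lengthFromNorm, termWeight) are hypothetical and not part of the bm25 package, and the length recovery assumes the norm was written by Lucene's default length normalization, 1 / sqrt(number of terms), with no field boost. Because Lucene stores norms in a single byte, the recovered length in a real index is only approximate.

```java
/** Plain-Java sketch of the arithmetic in score(); no Lucene classes are used. */
class Bm25TermWeightSketch {

    /** Approximate field length from a decoded norm, assuming norm = 1 / sqrt(length). */
    static float lengthFromNorm(float norm) {
        return 1 / (norm * norm);
    }

    /** TF25 factor: ((k1 + 1) * freq) / (freq + k1 * (1 - b + b * length / avgLength)). */
    static float termWeight(float freq, float length, float avgLength, float k1, float b) {
        float den = freq + k1 * (1 - b + b * (length / avgLength));
        float num = (k1 + 1) * freq;
        return num / den;
    }

    public static void main(String[] args) {
        float k1 = 2f;
        float b = 0.75f;
        // Same term frequency, first in an average-length document, then in one twice as long
        System.out.println(termWeight(3, 100, 100, k1, b));
        System.out.println(termWeight(3, 200, 100, k1, b));
    }
}
```

The final score of a term is this factor multiplied by its idf, as in the return statement of score(). For the same term frequency, a document longer than average receives a smaller factor than a document of average length.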


Step 5: The following method computes the average document length (avgdl). The first parameter is an IndexReader opened on the index of the collection, and the second is a string naming the field to search. The method reads term frequency vectors, so a document indexed without term vectors for that field adds nothing to the sum. In Step 6 we call this method to set the average document length automatically.

public static float getAvgLength(IndexReader reader, String field) throws IOException {
   // long avoids overflow when summing term frequencies over a large collection
   long sum = 0;
   // document ids run up to maxDoc(); numDocs() counts only the live documents
   for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) {
         continue;
      }
      TermFreqVector tfv = reader.getTermFreqVector(i, field);
      if (tfv != null) {
         int[] tfs = tfv.getTermFrequencies();
         for (int j = 0; j < tfv.size(); j++) {
            sum = sum + tfs[j];
         }
      }
   }
   float avg = (float) sum / reader.numDocs();
   // System.out.println("average length = " + avg);
   return avg;
}
//end of method
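
The averaging logic can be checked on its own with a few hand-made frequency arrays. The sketch below is a plain-Java stand-in, not Lucene code: the class AvgLengthSketch and its method are hypothetical, each inner array plays the role of the result of getTermFrequencies() for one document, and a null entry plays the role of a document without a term vector for the field.

```java
/** Plain-Java sketch of the averaging in getAvgLength; no Lucene classes are used. */
class AvgLengthSketch {

    /**
     * Sums all term frequencies and divides by the number of documents.
     * Documents without frequencies (null) still count in the divisor,
     * mirroring the division by reader.numDocs().
     */
    static float averageLength(int[][] termFreqsPerDoc) {
        long sum = 0;
        for (int[] tfs : termFreqsPerDoc) {
            if (tfs != null) {
                for (int tf : tfs) {
                    sum += tf;
                }
            }
        }
        return (float) sum / termFreqsPerDoc.length;
    }

    public static void main(String[] args) {
        // Three documents: lengths 3 and 3, plus one without a term vector
        int[][] docs = { {2, 1}, {3}, null };
        System.out.println("average length = " + averageLength(docs));
    }
}
```

Note that documents without a term vector pull the average down, since they add nothing to the sum but are still counted.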


Step 6: Now you are ready to search using the BM25 score function.

First set the parameters b and k1, and the average document length computed in Step 5.
For example, with b = 0.75 and k1 = 2:

// the variable field is a string that holds the name of the searched field
BM25Parameters.setAverageLength(field, getAvgLength(reader, field));
BM25Parameters.setB(0.75f);
BM25Parameters.setK1(2f);
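
As a rough guide to choosing these values, the following plain-Java sketch shows what each parameter controls in the BM25 term factor. It is an illustration only: the class Bm25ParameterSketch and its method are hypothetical and do not call the bm25 package. With b = 0 the document length has no effect, and however often a term occurs, the factor stays below k1 + 1.

```java
/** Plain-Java sketch of how b and k1 shape the BM25 term factor. */
class Bm25ParameterSketch {

    /** ((k1 + 1) * freq) / (freq + k1 * (1 - b + b * length / avgLength)). */
    static float termWeight(float freq, float length, float avgLength, float k1, float b) {
        return ((k1 + 1) * freq) / (freq + k1 * (1 - b + b * (length / avgLength)));
    }

    public static void main(String[] args) {
        float avgLength = 100f;

        // b = 0 switches length normalization off: short and long documents score alike
        System.out.println(termWeight(3, 50, avgLength, 2f, 0f));
        System.out.println(termWeight(3, 500, avgLength, 2f, 0f));

        // k1 bounds the contribution of repeated terms: the factor approaches k1 + 1
        for (int freq : new int[] {1, 10, 100, 1000}) {
            System.out.println(freq + " -> " + termWeight(freq, 100, avgLength, 2f, 0.75f));
        }
    }
}
```

A larger b penalizes long documents more strongly, and a larger k1 lets repeated occurrences of a term keep adding to the score for longer before it levels off.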


You can try this out by creating a new Java file in your Lucene source tree, for instance lucene/demo/bm25Search.java.
The following example searches for the query "Cystic hydroma" and prints the top 10 results ranked by the BM25 score function.

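import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;
// BM25 classes from the package imported in Step 3
import org.ninit.models.bm25.BM25BooleanQuery;
import org.ninit.models.bm25.BM25Parameters;

public class bm25Search {

public static float getAvgLength(IndexReader reader, String field) throws IOException {
   // long avoids overflow when summing term frequencies over a large collection
   long sum = 0;
   // document ids run up to maxDoc(); numDocs() counts only the live documents
   for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) {
         continue;
      }
      TermFreqVector tfv = reader.getTermFreqVector(i, field);
      if (tfv != null) {
         int[] tfs = tfv.getTermFrequencies();
         for (int j = 0; j < tfv.size(); j++) {
            sum = sum + tfs[j];
         }
      }
   }
   float avg = (float) sum / reader.numDocs();
   // System.out.println("average length = " + avg);
   return avg;
}
//end of method


public static void main(String args[]) throws IOException, ParseException {
   String index = "index";
   String field = "Search_Field";
   // a failure to open the index propagates as an IOException
   IndexReader reader = IndexReader.open(index);
   Searcher searcher = new IndexSearcher(index);
   Analyzer analyzer = new StandardAnalyzer();
   // the second parameter calls getAvgLength, which computes the average length automatically
   BM25Parameters.setAverageLength(field, getAvgLength(reader, field));
   BM25Parameters.setB(0.75f);
   BM25Parameters.setK1(2f);
   BM25BooleanQuery query = new BM25BooleanQuery("Cystic hydroma", field, analyzer);

   System.out.println("Searching for: " + query.toString(field));
   TopDocs top = searcher.search(query, 10);
   ScoreDoc[] docs = top.scoreDocs;
   // fewer than 10 documents may match, so iterate over the hits actually returned
   for (int i = 0; i < docs.length; i++) {
      System.out.println("the document with id= " + docs[i].doc + " has score =" + docs[i].score);
   }
   searcher.close();
   reader.close();
}
//end of main
}
//end of class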




