How to run the Pivoted Document Length Normalization similarity function using Lucene?

Step 1: Download Lucene version 2.4 from here.

Step 2: The constructor of the TermScorer class has to be changed. The path for this file is lucene/search/TermScorer.java. The first change is to add an IndexReader argument to the constructor so that the documents in the collection can be accessed.

TermScorer(Weight weight, TermDocs td, IndexReader reader, Similarity similarity,
           byte[] norms) throws IOException {
   super(similarity);
   this.weight = weight;
   this.termDocs = td;
   this.norms = norms;
   this.weightValue = weight.getValue();
   this.reader = reader;
   for (int i = 0; i < SCORE_CACHE_SIZE; i++) {
      scoreCache[i] = getSimilarity().tf(i) * weightValue;
   }
}


Step 3: The TermScorer object is created from the TermQuery class. The path for this file is lucene/search/TermQuery.java. Find the method public Scorer scorer(IndexReader reader) and construct the TermScorer object according to the change in Step 2.

public Scorer scorer(IndexReader reader) throws IOException {
   TermDocs termDocs = reader.termDocs(term);
   if (termDocs == null)
      return null;

   return new TermScorer(this, termDocs, reader, similarity, reader.norms(term.field()));
}


Step 4: Add the following methods to the TermScorer class. (The Ud field that these methods use is sketched right after the list.)

  • public float getAvgFreq(). This method returns the average term frequency for each matched document.

    public float getAvgFreq() throws IOException {
       String field = "Search_Field";
       TermFreqVector tfv = this.reader.getTermFreqVector(this.doc(), field);
       if (tfv != null) {
          this.setUd(tfv.size());
          int[] tfs = tfv.getTermFrequencies();
          int sum = 0;
          for (int i = 0; i < tfv.size(); i++) {
             sum = sum + tfs[i];
          }
          float avgFreq = (float) sum / tfv.size();
          return avgFreq;
       } else {
          return 0f;
       }
    }

  • public float getLengthNorm(float slope, float pivot). This method returns the lengthNorm value for each matched document.

    public float getLengthNorm(float slope, float pivot) throws IOException {
       float den = (float) ((1 - slope) * pivot + (slope * this.getUd()));
       den = (float) Math.sqrt(den);
       float lengthNorm = 1 / den;
       return lengthNorm;
    }

  • public void setUd(int unique). This method sets the number of unique terms for each matched document.

    public void setUd(int unique) {
       this.Ud = unique;
    }

  • public int getUd(). This method returns the number of unique terms for each matched document.

    public int getUd() {
       return this.Ud;
    }
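
The setUd and getUd methods read and write a Ud field that, like the reader field from Step 2, is not part of the stock TermScorer, so it also has to be declared in the class. A minimal sketch:

private int Ud;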

Step 5: This method calculates and returns the pivot value. The first parameter is an IndexReader that gives access to the documents in the collection and the second parameter is a String that names the field to search. We will define and call this method in the last step, Step 7. We have to call it first in order to find the pivot value, which we then pass by hand as the second parameter of the public float getLengthNorm(float slope, float pivot) method.

public static float getPivot(IndexReader reader, String field) throws IOException {
   int sum = 0;
   for (int i = 0; i < reader.numDocs(); i++) {
      TermFreqVector tfv = reader.getTermFreqVector(i, field);
      if (tfv != null) {
         sum = sum + tfv.size();
      }
   }
   float pivot = (float) sum / reader.numDocs();
   //System.out.println("pivot = " + pivot);
   return pivot;
}
//end of method
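
For example, if the index held three documents whose Search_Field contained 40, 50 and 45 unique terms respectively, getPivot would return (40 + 50 + 45) / 3 = 45, which is the (hypothetical) pivot value used in Step 6.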


Step 6: In this step we have to change the method public float score() in the TermScorer class. In this example slope = 0.35 and pivot = 45 (this pivot value is the one we will calculate by calling public static float getPivot(IndexReader reader, String field) in Step 7).

public float score() throws IOException {
   /* The default code is commented out so that the changed code below is used instead.
   int f = freqs[pointer];
   float raw =                                     // compute tf(f)*weight
      f < SCORE_CACHE_SIZE                         // check cache
      ? scoreCache[f]                              // cache hit
      : getSimilarity().tf(f)*weightValue;         // cache miss
   return raw * Similarity.decodeNorm(norms[doc]); // normalize for field
   */

   // This is the changed code
   int f = freqs[pointer];
   float num = (float) (1 + Math.log10(f));
   float den = (float) (1 + Math.log10(this.getAvgFreq()));
   float pivot_tf = num / den;
   // the score cache is no longer used: both branches compute pivot_tf * weightValue
   float raw = f < SCORE_CACHE_SIZE ? pivot_tf * weightValue : pivot_tf * weightValue;
   return raw * this.getLengthNorm(0.35f, 45.0f);
}
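
Putting Steps 4 and 6 together, the score returned for a matched document is (1 + log10(f)) / (1 + log10(avgFreq)) multiplied by the query weight and by 1 / sqrt((1 - slope) * pivot + slope * Ud). The following stand-alone helper is only an illustrative sketch of that formula (it is not part of Lucene) and can be useful for checking scores by hand:

public static float pivotedScore(int f, float avgFreq, float weightValue,
                                 int uniqueTerms, float slope, float pivot) {
   // pivoted tf component: raw tf dampened by the document's average term frequency (Step 6)
   float pivotTf = (float) ((1 + Math.log10(f)) / (1 + Math.log10(avgFreq)));
   // pivoted length normalization, as in getLengthNorm from Step 4
   float lengthNorm = (float) (1 / Math.sqrt((1 - slope) * pivot + slope * uniqueTerms));
   return pivotTf * weightValue * lengthNorm;
}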


Step 7: You can use the next example by creating a new Java file in your Lucene code. For instance, you can create a file named "pivotSearch.java" at lucene/demo/pivotSearch.java.

First of all, we have to compute the pivot value that is passed to the public float getLengthNorm(float slope, float pivot) call in Step 6 before searching. We can use the following example, which prints the pivot value. After that, if pivot = 45.0 and slope = 0.35, we call the public float getLengthNorm(float slope, float pivot) method as shown in Step 6 (see the last line of its code).

import java.io.IOException;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.queryParser.ParseException;

public class pivotSearch {

public static float getPivot(IndexReader reader, String field) throws IOException {
   int sum = 0;
   for (int i = 0; i < reader.numDocs(); i++) {
      TermFreqVector tfv = reader.getTermFreqVector(i, field);
      if (tfv != null) {
         sum = sum + tfv.size();
      }
   }
   float pivot = (float) sum / reader.numDocs();
   System.out.println("pivot = " + pivot);
   return pivot;
}
//end of method

public static void main(String args[]) throws IOException, ParseException {
   String index = "index";
   IndexReader reader = null;
   try {
      reader = IndexReader.open(index);
   }
   catch (CorruptIndexException e1) { e1.printStackTrace(); }
   catch (IOException e1) { e1.printStackTrace(); }
   String field = "Search_Field";
   //calculate and print the pivot value
   getPivot(reader, field);
   reader.close();
}
//end of main

}
//end of class


After writing the following example and making all the changes in the previous steps, we can search for the query "Cystic hygroma" and print the top-10 results according to the Pivot score function.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

public class pivotSearch {

public static float getPivot(IndexReader reader, String field) throws IOException {
   int sum = 0;
   for (int i = 0; i < reader.numDocs(); i++) {
      TermFreqVector tfv = reader.getTermFreqVector(i, field);
      if (tfv != null) {
         sum = sum + tfv.size();
      }
   }
   float pivot = (float) sum / reader.numDocs();
   System.out.println("pivot = " + pivot);
   return pivot;
}
//end of method

public static void main(String args[]) throws IOException, ParseException {
   String index = "index";
   IndexReader reader = null;
   try {
      reader = IndexReader.open(index);
   }
   catch (CorruptIndexException e1) { e1.printStackTrace(); }
   catch (IOException e1) { e1.printStackTrace(); }
   String field = "Search_Field";
   //calculate and print the pivot value
   getPivot(reader, field);
   Searcher searcher = new IndexSearcher(reader);
   Analyzer analyzer = new StandardAnalyzer();
   QueryParser parser = new QueryParser(field, analyzer);
   Query query = parser.parse("Cystic hygroma");
   TopDocs tp = searcher.search(query, 10);
   ScoreDoc[] docs = tp.scoreDocs;
   for (int i = 0; i < docs.length; i++) {
      System.out.println("the document with id= " + docs[i].doc + " has score =" + docs[i].score);
   }
   reader.close();
}
//end of main

}
//end of class

