The BM25 similarity function


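The BM25 scoring function for a document d and a query q with terms q1, ..., qn is defined as:

\[
\mathrm{score}(d, q) = \sum_{i=1}^{n} \mathrm{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
\]
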
where
  1. f(qi,d) is the term frequency: the number of times the query term qi appears in the document d.

  2. | d | is the length of the document d in words (terms). In our implementation, | d | is computed as | d | = 1/(norm*norm), where norm is the length normalization factor used by Lucene's default similarity function.

  3. avgdl is the average document length over all the documents of the collection.

  4. k1 and b are free parameters, usually chosen as k1 = 2.0 and b = 0.75.

  5. idf(qi) is the inverse document frequency weight of the query term qi. It is computed as:

    \[
    \mathrm{idf}(q_i) = \log \frac{N - \mathrm{df}(q_i) + 0.5}{\mathrm{df}(q_i) + 0.5}
    \]

    where N is the total number of documents in the collection, and df(qi) is the number of documents containing the query term qi.
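
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the Lucene implementation: the function names (idf, length_from_norm, bm25_score) and the in-memory data layout are assumptions made for the example, and the idf uses the classic Okapi BM25 form with a natural logarithm.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency weight of a term (classic Okapi BM25 form)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def length_from_norm(norm):
    """Recover the document length |d| from a norm value: |d| = 1/(norm*norm)."""
    return 1.0 / (norm * norm)


def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=2.0, b=0.75):
    """Sum the per-term BM25 contributions of query_terms for one document.

    doc_terms is the tokenized document (a list of terms); doc_freqs maps a
    term to the number of documents in the collection that contain it.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # f(qi, d)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        denom = tf + k1 * (1.0 - b + b * doc_len / avgdl)
        score += idf(n_docs, doc_freqs[term]) * tf * (k1 + 1.0) / denom
    return score


# Hypothetical toy collection: 10 documents, "lucene" occurs in 2 of them.
example = bm25_score(
    query_terms=["lucene"],
    doc_terms=["lucene", "bm25", "lucene"],
    doc_freqs={"lucene": 2},
    n_docs=10,
    avgdl=3.0,
)
```

In this toy case the document length equals avgdl, so the length normalization term is 1 and the score reduces to 1.5 * ln(3.4), roughly 1.836. Note that with this idf, a term appearing in more than half of the documents receives a negative weight.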


For implementation details, see: How to embed the BM25 similarity function into Lucene

References

  1. Okapi BM25
  2. Joaquin Perez-Iglesias, Integrating the Probabilistic Model BM25: BM25F into Lucene, 2009.
