The BM25 similarity function

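Given a query Q containing the terms q1, ..., qn, the BM25 score of a document d is defined as:

    $$ \mathrm{score}(d, Q) = \sum_{i=1}^{n} \mathrm{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)} $$

where: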

  1. f(qi,d) is the term frequency, defined as the number of times the query term qi appears in the document d.

  2. |d| is the length of the document d in words (terms). In our implementation, |d| is computed as |d| = 1/(norm*norm), where norm is the normalization factor used by Lucene's default similarity function (which, absent boosts, is 1/sqrt(number of terms)).

  3. avgdl is the average document length over all the documents of the collection.

  4. k1 and b are free parameters, usually chosen as k1 = 2.0 and b = 0.75.

  5. idf(qi) is the inverse document frequency weight of the query term qi. It is computed by:

    $$ \mathrm{idf}(q_i) = \log \frac{N - df(q_i) + 0.5}{df(q_i) + 0.5} $$

    where N is the total number of documents in the collection, and df(qi) is the number of documents containing the query term qi.
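
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the Lucene implementation: it assumes the standard Okapi form of the weight, idf = log((N - df + 0.5) / (df + 0.5)), and it assumes the norm is exactly 1/sqrt(|d|) with no boosts. The function names (idf, length_from_norm, bm25_score) and the dictionary of document frequencies are hypothetical and chosen only for this example.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, assuming the standard Okapi form.

    Note: the value is negative when a term occurs in more than
    half of the documents.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def length_from_norm(norm):
    """Recover |d| from a length norm, assuming norm = 1 / sqrt(|d|)."""
    return 1.0 / (norm * norm)


def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=2.0, b=0.75):
    """Sum the per-term BM25 contributions of a query for one document.

    doc_terms is the tokenized document; doc_freqs maps each term to
    the number of documents in the collection that contain it.
    """
    doc_len = len(doc_terms)
    # Length normalization depends only on the document, not on the term.
    length_factor = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        score += idf(n_docs, doc_freqs[term]) * tf * (k1 + 1.0) / (tf + length_factor)
    return score


if __name__ == "__main__":
    doc = ["lucene", "search", "lucene"]
    print(bm25_score(["lucene"], doc, {"lucene": 1}, n_docs=10, avgdl=3.0))
```

With b = 0.75, a document longer than avgdl has its term frequencies discounted and a shorter one has them boosted; setting b = 0 disables length normalization entirely.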

For implementation details, see: How to embed the BM25 similarity function into Lucene


  1. Okapi BM25
  2. Joaquin Perez-Iglesias, Integrating the Probabilistic Model BM25: BM25F into Lucene, 2009.