The BM25 similarity function
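The BM25 scoring function for a document d and a query q with terms q1, ..., qn is defined as:

$$\mathrm{score}(d, q) = \sum_{i=1}^{n} \mathrm{idf}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$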
where
- f(qi,d) is the term frequency, defined as the number of times query term qi appears in the document d.
- |d| is the length of the document d in words (terms). In our implementation, |d| is computed as |d| = 1/(norm*norm), where norm is the score factor used by Lucene's default similarity function (which, when no boosts are applied, equals 1/sqrt(number of terms)).
- avgdl is the average document length over all the documents of the collection.
- k1 and b are free parameters, usually chosen as k1 = 2.0 and b = 0.75.
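- idf(qi) is the inverse document frequency weight of the query term qi. In the standard Okapi BM25 form it is computed by:

$$\mathrm{idf}(q_i) = \log \frac{N - \mathrm{df}(q_i) + 0.5}{\mathrm{df}(q_i) + 0.5}$$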
where N is the total number of documents in the collection, and df(qi) is the number of documents containing the query term qi.
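The definitions above can be illustrated with a minimal Python sketch. It is not the Lucene implementation described on this page; the function names (`idf`, `length_from_norm`, `bm25_score`) and the dictionary-based document frequencies are hypothetical and chosen only for illustration. It assumes the standard Okapi BM25 idf and the parameter values k1 = 2.0 and b = 0.75 mentioned above.

```python
import math
from collections import Counter


def idf(n_docs, doc_freq):
    """Okapi BM25 inverse document frequency (assumed standard form)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def length_from_norm(norm):
    """Recover the document length |d| from a norm of the form 1/sqrt(|d|)."""
    return 1.0 / (norm * norm)


def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=2.0, b=0.75):
    """Sum the per-term BM25 contributions of query_terms for one document.

    doc_freqs maps a term to the number of documents containing it
    (a hypothetical stand-in for index statistics).
    """
    tf = Counter(doc_terms)   # f(qi, d) for every term in the document
    dl = len(doc_terms)       # |d|, counted directly from the token list
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue          # a term absent from d contributes nothing
        length_factor = 1 - b + b * dl / avgdl
        score += idf(n_docs, doc_freqs.get(term, 0)) * f * (k1 + 1) / (f + k1 * length_factor)
    return score
```

For example, `bm25_score(["lucene"], ["lucene", "index"], {"lucene": 2}, n_docs=10, avgdl=2.0)` scores a two-term document whose length equals the collection average, so the length factor is 1 and the score reduces to idf("lucene"). Note that with this idf form a term occurring in more than half of the documents receives a negative weight.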
For implementation details, see: How to embed the BM25 similarity function into Lucene
References
- Okapi BM25
- Joaquin Perez-Iglesias, Integrating the Probabilistic Model BM25: BM25F into Lucene , 2009.