Definition

Bigram Jaccard Similarity

Bigram Jaccard similarity applies the Jaccard coefficient to the bigram (2-gram) representations of two texts. Given a text $T$, let $B(T)$ be the set of distinct ordered token bigrams obtained by sliding a window of length $2$ over its tokens; for tokens $(t_1,\dots,t_n)$, $B(T)=\{(t_i,t_{i+1})\mid 1\le i\le n-1\}$. For two texts $X$ and $Y$, the bigram Jaccard similarity is $J_{2}(X,Y)=\dfrac{|B(X)\cap B(Y)|}{|B(X)\cup B(Y)|}$, taking values in $[0,1]$. Compared with unigram (bag-of-words) Jaccard, the bigram variant rewards short-range word-order agreement: two texts must reuse the same adjacent-word pairs, not just the same vocabulary, to score highly. It is reported as a lexical-overlap baseline alongside TF-IDF cosine and ROUGE-L when comparing paired drafts.
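The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `bigrams` and `bigram_jaccard` are hypothetical, tokenization is assumed to be lowercasing plus whitespace splitting, and the value returned when both bigram sets are empty (each text has fewer than two tokens, so the ratio is undefined) is assumed to be 0.0.

```python
def bigrams(tokens):
    """Return B(T): the set of distinct ordered adjacent-token pairs."""
    # zip pairs each token with its successor; set() drops repeated bigrams.
    return set(zip(tokens, tokens[1:]))


def bigram_jaccard(x, y):
    """Bigram Jaccard similarity J_2(X, Y) of two strings."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    bx = bigrams(x.lower().split())
    by = bigrams(y.lower().split())
    union = bx | by
    if not union:
        # Assumption: the 0/0 case is reported as 0.0.
        return 0.0
    return len(bx & by) / len(union)


# Same vocabulary, reversed order: no shared adjacent pairs.
print(bigram_jaccard("dog bites man", "man bites dog"))  # 0.0

# One substituted word: 3 shared bigrams out of 7 distinct ones.
print(bigram_jaccard("the cat sat on the mat", "the cat lay on the mat"))
```

The first pair would score 1.0 under unigram Jaccard because the two vocabularies are identical, which shows the word-order sensitivity described above. In the second pair each text has 5 distinct bigrams, 3 of them shared, so the score is 3/7.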

Updated 2026-05-16

Tags

Science

Research Paper: Advanced Prompting
