Bigram
A bigram is a contiguous sequence of two items (typically words or characters) from a given text. Bigram models estimate the probability of a word based on the immediately preceding word.
Understanding Bigram
Bigrams are the simplest multi-word unit in natural language processing: two consecutive tokens extracted from a text sequence. They form the foundation of n-gram language modeling. A bigram model estimates the probability of each word by conditioning only on the single word that immediately precedes it, applying the Markov assumption that the next step depends only on the current one. For the sentence "the cat sat," the bigrams are "the cat" and "cat sat."
Despite their simplicity, bigrams capture useful local co-occurrence patterns for tasks such as text classification, spell checking, and basic language generation, and bigram frequency analysis helps identify common phrases and collocations in a corpus. However, bigram models cannot capture long-range dependencies in language, which is one reason modern systems have moved toward attention-based architectures such as transformers. Byte pair encoding, a related pair-based technique, also starts from character-level pairs but uses them for tokenization rather than probabilistic modeling.
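To make this concrete, here is a minimal sketch of a bigram model in Python. The whitespace tokenizer, the three-sentence corpus, and the function names are invented for illustration, not taken from any particular library.

```python
from collections import Counter, defaultdict

def extract_bigrams(tokens):
    """Return adjacent token pairs, e.g. ["the", "cat", "sat"] -> [("the", "cat"), ("cat", "sat")]."""
    return list(zip(tokens, tokens[1:]))

def train_bigram_model(sentences):
    """Estimate P(next | prev) from whitespace-tokenized sentences by maximum likelihood."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()          # toy whitespace tokenizer (an assumption)
        for prev, nxt in extract_bigrams(tokens):
            bigram_counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities:
    # P(next | prev) = count(prev, next) / count(prev, *)
    model = {}
    for prev, counter in bigram_counts.items():
        total = sum(counter.values())
        model[prev] = {nxt: count / total for nxt, count in counter.items()}
    return model

# Tiny illustrative corpus, made up for this sketch
corpus = ["the cat sat", "the cat slept", "the dog sat"]
model = train_bigram_model(corpus)
print(model["the"])   # {'cat': 0.666..., 'dog': 0.333...}
print(model["cat"])   # {'sat': 0.5, 'slept': 0.5}
```

In practice the raw counts are usually smoothed (for example, with add-one smoothing) so that bigrams never seen during training do not receive zero probability.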
Category
Natural Language Processing
Related Natural Language Processing Terms
Abstractive Summarization
Abstractive summarization generates new text that captures the key points of a longer document, rather than simply extracting existing sentences. It requires deep language understanding and generation capabilities.
Beam Search
Beam search is a decoding algorithm that explores multiple candidate sequences simultaneously, keeping only the top-k most promising at each step. It strikes a balance between greedy decoding and exhaustive search in text generation.
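As a rough illustration of that top-k pruning over a bigram model like the one sketched above: the toy probability table, beam width, and maximum length below are assumptions made up for this example.

```python
import math

# A tiny hand-written bigram model, P(next | prev), invented for this sketch.
model = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.5, "slept": 0.5},
    "dog": {"sat": 1.0},
}

def beam_search(model, start, beam_width=2, max_len=4):
    """Keep only the beam_width highest-scoring partial sequences at each step,
    scoring by cumulative log-probability under the bigram model."""
    beams = [([start], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for tokens, score in beams:
            continuations = model.get(tokens[-1], {})
            if not continuations:              # no known continuation: carry the sequence forward
                candidates.append((tokens, score))
                continue
            for nxt, prob in continuations.items():
                candidates.append((tokens + [nxt], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]        # prune to the top-k beams
    return beams

for tokens, score in beam_search(model, "the"):
    print(" ".join(tokens), round(score, 3))
```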
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google that reads text in both directions simultaneously. BERT revolutionized NLP by enabling deep bidirectional pre-training for language understanding tasks.
Byte Pair Encoding
Byte Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent pairs of characters or character sequences. BPE is widely used in modern language models to handle rare words and multilingual text.
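The iterative merge loop can be sketched in a few lines; the toy vocabulary and the number of merge steps below are invented for illustration and do not correspond to the tokenizer of any particular model.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols plus a frequency)."""
    pairs = Counter()
    for symbols, count in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols, count in words:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, count))
    return merged

# Toy word-frequency table, made up for this sketch
vocab = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = [(list(w), c) for w, c in vocab.items()]
for _ in range(3):                      # three merge steps, purely for illustration
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print("merged", pair)
```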
Corpus
A corpus is a large, structured collection of text documents used for training and evaluating natural language processing models. The quality and diversity of a training corpus significantly impacts model performance.
Extractive Summarization
Extractive summarization selects and combines the most important sentences directly from a source document to create a summary. It preserves the original wording but may lack the coherence of abstractive approaches.
Grounding
Grounding in AI refers to connecting a model's language understanding to real-world knowledge, data, or sensory experience. Grounded AI systems produce more factual and contextually relevant outputs.
Language Model
A language model is an AI system that learns the probability distribution of sequences of words in a language. Modern language models like GPT and Claude can generate text, answer questions, and perform complex reasoning.