Natural Language Processing

Token

A token is the basic unit of text that a language model processes. Depending on the tokenizer, a token can be a word, a subword, or a single character. GPT-4, for example, averages roughly 4 characters per token for English text.
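The 4-characters-per-token figure can be turned into a quick back-of-the-envelope estimate. The sketch below is only a heuristic: the function name `estimate_tokens` and the constant `CHARS_PER_TOKEN` are illustrative assumptions, and real counts depend on the model's actual tokenizer.

```python
import math

# Assumption: ~4 characters per token is a rule of thumb for English text,
# not an exact property of any tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate based only on character count."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


prompt = "Tokens are the basic units of text."
print(estimate_tokens(prompt))
```

For billing or context-window checks, the count reported by the model's own tokenizer should be preferred over this approximation.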

Understanding Tokens

A token may represent a complete word, a subword fragment, a single character, or a punctuation mark. Tokenization algorithms such as Byte Pair Encoding (BPE) break text into these units, and the resulting vocabulary defines what a model can read and generate. For example, the word "understanding" might be split into the tokens "under" and "standing," while common words like "the" remain single tokens.

The token count of a prompt directly affects processing cost, latency, and context window usage in commercial AI APIs. Because pricing is typically based on input and output token counts, understanding token economics is essential for managing costs when building applications with large language models. Different models use different tokenizers, so the same text can produce different token counts.
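To make the idea of Byte Pair Encoding concrete, here is a minimal character-level sketch: it repeatedly merges the most frequent adjacent pair of symbols in a tiny corpus, then applies the learned merges to new words. The function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the sample corpus are illustrative assumptions; production tokenizers typically work on bytes, handle whitespace and special tokens, and are trained on far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters; the Counter stores word frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into tokens by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only for illustration.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_bpe_merges(corpus, num_merges=6)
print(merges)
print(tokenize("lowest", merges))
```

Frequent character sequences end up as single tokens, while rarer words fall back to smaller fragments, which is why the same text can yield different token counts under different learned vocabularies.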

Related Natural Language Processing Terms

Abstractive Summarization

Abstractive summarization generates new text that captures the key points of a longer document, rather than simply extracting existing sentences. It requires deep language understanding and generation capabilities.

Beam Search

Beam search is a decoding algorithm that explores multiple candidate sequences simultaneously, keeping only the top-k most promising candidates at each step. In text generation, it strikes a balance between greedy decoding and exhaustive search.
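The following is a minimal sketch of beam search over a made-up next-token table, not a decoder for any real model. The names `beam_search`, `TOY_MODEL`, and `toy_next_log_probs`, along with the probabilities, are assumptions chosen so that a beam of width 1 (greedy decoding) and a beam of width 2 pick different sequences.

```python
import math


def beam_search(next_log_probs, start, beam_width, max_steps):
    """Return the best sequences found as (log_probability, tokens) pairs, best first."""
    beams = [(0.0, [start])]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            options = next_log_probs(seq)
            if not options:
                # No continuation available: carry the finished sequence forward.
                candidates.append((score, seq))
                continue
            for token, logp in options.items():
                candidates.append((score + logp, seq + [token]))
        # Keep only the top-k candidates by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams


# Hypothetical next-token probabilities, keyed by the last token of the sequence.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"C": 0.5, "D": 0.5},
    "B": {"E": 0.9, "F": 0.1},
}


def toy_next_log_probs(seq):
    """Look up log-probabilities for the next token given the last token."""
    probs = TOY_MODEL.get(seq[-1], {})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_next_log_probs, "<s>", beam_width=1, max_steps=2)
wide = beam_search(toy_next_log_probs, "<s>", beam_width=2, max_steps=2)
print(greedy[0][1])  # ['<s>', 'A', 'C']
print(wide[0][1])    # ['<s>', 'B', 'E']
```

With a width of 1 the search commits to "A" because it is the most likely first token, and ends with a sequence of probability 0.30. With a width of 2 it also keeps "B" alive and finds the sequence "B", "E" with probability 0.36, which greedy decoding misses.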

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google that reads text in both directions simultaneously. BERT revolutionized NLP by enabling deep bidirectional pre-training for language understanding tasks.

Tokenization

Bigram

Byte Pair Encoding

Corpus

Extractive Summarization

Grounding