What Are Tokens in LLMs?
Tokens are the fundamental units that large language models read and generate. A token is not the same as a word: it can be a whole word, a subword, a single character, or a punctuation mark. For example, the word "unhappiness" might be split into subword tokens such as "un" and "happiness", depending on the tokenizer.
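To make subword splitting concrete, here is a minimal sketch of how BPE learns merges: count adjacent symbol pairs across a corpus (weighted by word frequency) and repeatedly fuse the most frequent pair. The tiny corpus and merge count are made up for illustration; a real tokenizer learns tens of thousands of merges over a large corpus.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every occurrence of `pair` in a symbol tuple."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from {word: frequency}; words start as characters."""
    vocab = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges

# Toy corpus: "happ"-heavy words, so "h"+"a", "ha"+"p", ... merge first,
# and "unhappy" ends up segmented roughly as "un" + "happy".
corpus = {"unhappy": 5, "happiness": 3, "happy": 8}
print(learn_bpe(corpus, 6))
```

Because merges are driven by frequency, common fragments like "happy" become single tokens while rarer words stay split into several pieces.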
Most modern LLMs use Byte Pair Encoding (BPE) or SentencePiece tokenizers. A rough rule of thumb: 1 token ≈ 4 characters in English, or about 0.75 words.
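The 4-characters-per-token rule of thumb can be turned into a quick estimator. This is only a heuristic for English prose; an exact count requires running the model's actual tokenizer.

```python
def estimate_tokens(text):
    """Rough token estimate using the ~4 characters/token rule of thumb.
    Heuristic only: code, non-English text, and unusual strings can
    tokenize far less efficiently."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Tokens are the fundamental units that LLMs read."))
```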
Understanding tokenization matters for two practical reasons: API usage is billed per token, and your prompt plus the model's response must fit within the model's context window.
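The context-window constraint can be sketched as a simple budget check: the prompt's token count plus the room reserved for the response must not exceed the window. The 4096-token window below is just an example figure; real limits vary widely by model.

```python
def fits_context(prompt_tokens, max_response_tokens, context_window):
    """True if the prompt plus the reserved response budget fit the
    model's context window (window size is model-specific)."""
    return prompt_tokens + max_response_tokens <= context_window

# A 3000-token prompt with 1024 tokens reserved for the reply:
print(fits_context(3000, 1024, 4096))  # 4024 <= 4096
```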