Unused and under-trained tokens exist in large language models such as ChatGPT because tokenizer construction and model training are separate processes. Unused tokens are present in the model's vocabulary but were absent or insufficiently represented during training, while under-trained tokens may or may not exist in the vocabulary and were not represented in the training data at all. Both kinds of tokens can trigger undesirable behaviors in language models, such as hallucinations and inaccurate output.
Experiments using GPT-2 Small demonstrate the existence of unused tokens, including under-trained ones. For example, the model struggles to reproduce unused tokens even when given a straightforward instruction to repeat them. In one experiment, the model is asked to echo the token "ú" but fails to predict it, generating garbled text instead.
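As a rough illustration, this kind of probe can be reproduced in a few lines. The snippet below is a minimal sketch assuming the Hugging Face transformers library and GPT-2 Small; the prompt wording and the choice of "ú" as the probe token are illustrative, not the exact setup from the experiment.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Ask the model to echo a string containing the rarely used token "ú".
prompt = 'Please repeat the following string exactly: "ú"\nAnswer: "'
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: well-trained tokens are usually echoed back correctly,
# while unused/under-trained tokens tend to come out garbled.
output_ids = model.generate(
    **inputs, max_new_tokens=10, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```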
Another experiment generates sequences in which a single randomly chosen token is repeated many times and evaluates how well the model predicts the repetitions. The results show that the model performs poorly on unused tokens, assigning them significantly lower log probabilities than commonly used tokens.
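A sketch of this repetition test, under the same assumptions (Hugging Face transformers, GPT-2 Small), could look as follows; the probe token ID and number of repetitions are illustrative choices, not values from the original experiment.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def repeated_token_logprob(token_id: int, repeats: int = 10) -> float:
    """Average log-probability of predicting `token_id` again at each repeated position."""
    ids = torch.tensor([[token_id] * repeats])
    with torch.no_grad():
        logits = model(ids).logits                       # (1, repeats, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = ids[0, 1:]                                 # positions 1 .. repeats-1
    per_position = log_probs[0, :-1, :].gather(1, targets.unsqueeze(1)).squeeze(1)
    return per_position.mean().item()

# A frequent token such as " the" should score far higher (closer to 0)
# than a rarely seen token (the ID 200 below is purely illustrative).
common_id = tokenizer.encode(" the")[0]
print(repeated_token_logprob(common_id), repeated_token_logprob(200))
```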
Unlike unused tokens, under-trained tokens can still receive non-negligible probabilities from the model, even though they rarely appear in most contexts. Researchers have therefore proposed techniques to identify under-trained tokens automatically, including analyses of the model's output embeddings.
One approach computes the average embedding vector of a set of known unused tokens and measures the cosine distance from this mean to every token's output embedding. Tokens whose embeddings lie close to the unused-token mean are marked as candidate under-trained tokens.
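The sketch below shows one way this check might be implemented, assuming GPT-2 Small's tied output embeddings; the list of "known unused" token IDs and the distance threshold are illustrative assumptions that would need to be chosen per model.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
emb = model.lm_head.weight.detach()        # output (unembedding) matrix, shape (vocab, dim)

# Token IDs assumed, for illustration only, to correspond to unused tokens.
unused_ids = [177, 178, 179]
mean_unused = emb[unused_ids].mean(dim=0)

# Cosine distance from every token's output embedding to the unused-token mean.
cos_sim = torch.nn.functional.cosine_similarity(emb, mean_unused.unsqueeze(0), dim=-1)
cos_dist = 1.0 - cos_sim

threshold = 0.05                           # assumed cut-off; tune per model
candidates = torch.nonzero(cos_dist < threshold).squeeze(-1).tolist()
print(f"{len(candidates)} candidate under-trained tokens")
```

Tokens flagged this way are only candidates; they still need to be verified, for example with the repetition test described above.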
Other recent work, such as that of Watkins and Rumbelow and of Fell, proposes further methods for identifying under-trained tokens; these methods can help mitigate the effects of such tokens on language model outputs.