Token window - the limit on how much content can be processed at any given time.
This includes both the input data and the data generated by the model (the output).
Controlling tokens
When using the GPT API, we need to handle the token window on our own. We can use tiktoken to estimate the number of tokens used (tiktoken provides an approximate value, since the models are updated over time).
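A minimal sketch of such an estimate, assuming the tiktoken package is installed:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Estimate how many tokens `text` will use for the given model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a default encoding for models tiktoken does not know yet.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

print(count_tokens("Hello, token window!"))
```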
For gpt-3.5-turbo and gpt-4, we must also account for the structure of ChatML messages, which consumes additional tokens.
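As a rough illustration of that overhead, the sketch below follows the approximation from the OpenAI cookbook; the exact framing costs (3 tokens per message, 1 per name field, 3 to prime the reply) differ between model versions, so treat the result as an estimate rather than an exact count:

```python
import tiktoken

def count_chat_tokens(messages: list[dict], model: str = "gpt-3.5-turbo") -> int:
    """Approximate the prompt tokens of a chat completion request,
    including the ChatML framing around each message."""
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # ChatML framing overhead per message (cookbook value)
    tokens_per_name = 1     # extra token when a "name" field is present
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():  # values are strings: role, content, name
            total += len(encoding.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # every reply is primed with <|start|>assistant<|message|>
    return total
```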
Knowing the number of tokens before sending a request is not sufficient on its own, though. You may need to take action to control the number of tokens in the prompt, e.g. by:
- Using a model that supports more tokens (e.g. gpt-3.5-turbo-16k)
- Choosing different versions of the prompt or its parts
- Reducing the context:
  - cutting off earlier conversation messages (see the sketch after this list)
  - compressing the information in the current context (e.g. summarizing the conversation with the model up to this point)
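One possible sketch of the cut-off approach, reusing the count_chat_tokens helper from above; max_tokens is a hypothetical budget you pick for your application, not an API parameter:

```python
def trim_history(messages: list[dict], max_tokens: int,
                 model: str = "gpt-3.5-turbo") -> list[dict]:
    """Drop the oldest non-system messages until the prompt fits the budget."""
    trimmed = list(messages)
    while count_chat_tokens(trimmed, model) > max_tokens and len(trimmed) > 1:
        # Preserve the system prompt (index 0) and remove the oldest message after it.
        drop_index = 1 if trimmed[0]["role"] == "system" else 0
        del trimmed[drop_index]
    return trimmed
```

Note that this keeps the system prompt intact by design; if even the system prompt alone exceeds the budget, you would need one of the other strategies, such as summarization.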
INFO
Controlling the token window means finding the balance between providing meaningful information for the current conversation and keeping its volume in check.
Even though Claude 2 allows up to 100k tokens of context, filling it may be expensive, and an enormous context can be prone to noise.