Tokenization - converting text into a numerical representation.
The most popular strategy today is subword tokenization - encoding parts of words as numbers.
When generating responses, the model produces successive tokens (word fragments), answering the question: *Given the current text, what token continues it?*
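To make this concrete, here is a minimal sketch using the `tiktoken` library that OpenAI publishes for its models (the example sentence is arbitrary). It splits text into subword tokens and shows the numeric id behind each fragment:

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5 / GPT-4 family of models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization converts text into numbers."
token_ids = enc.encode(text)                      # text -> list of integer token ids
fragments = [enc.decode([t]) for t in token_ids]  # each id -> its word fragment

print(token_ids)
print(fragments)  # subword pieces; a long word like "Tokenization" splits into several
```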
The decision about which word fragments become tokens depends on:
- the tokenization algorithm (most commonly byte-pair encoding; see the sketch after this list)
- the training dataset
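The toy sketch below illustrates the byte-pair-encoding (BPE) idea under simplified assumptions: starting from individual characters, the most frequent adjacent pair in the dataset is repeatedly merged into a new vocabulary entry. The corpus and the number of merge steps are made up for illustration:

```python
from collections import Counter

corpus = ["low", "lower", "lowest", "newest"]  # hypothetical training data
words = [list(w) for w in corpus]              # start from character-level symbols

def most_frequent_pair(words):
    # Count every adjacent pair of symbols across the corpus.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace each occurrence of the pair with a single merged symbol.
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(3):  # three merge steps, for illustration
    pair = most_frequent_pair(words)
    print("merging", pair)
    words = merge(words, pair)

print(words)  # word fragments after three merges; "low" ends up as a single token
```

A different corpus would produce different merges, which is why both the algorithm and the dataset shape the final vocabulary.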
Tokenization produces a vocabulary (a component of the language model). However, a vocabulary alone is not enough: the model also needs to capture additional information, such as word meaning, which is done by converting each token into an embedding.
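A minimal sketch of that embedding step, assuming PyTorch; the sizes mirror GPT-2 (a vocabulary of 50,257 tokens, 768-dimensional vectors) and the token ids are hypothetical:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_257, 768  # GPT-2-like sizes, for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([3404, 2065])  # hypothetical token ids
vectors = embedding(token_ids)          # each id -> a learned vector
print(vectors.shape)                    # torch.Size([2, 768])
```

During training, these vectors are adjusted so that tokens used in similar contexts end up with similar embeddings.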
See also
- OpenAI Tokenizer
  - Successive tokens (parts of words) are highlighted in different colors
  - This is how GPT models see the text
- OpenAI Playground
  - for older models (e.g. `text-davinci-003`) the Playground informs us which tokens were taken into account when generating the answer