Hacker News
by montebicyclelo 1155 days ago
Hugging Face has good guides on tokenization and on training tokenizers. BPE (used by GPT, for example) and WordPiece (used by BERT, for example) are two commonly used methods: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
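
To give a rough feel for what BPE training does, here is a minimal pure-Python sketch. It is a toy illustration, not the Hugging Face implementation: the names `train_bpe` and `merge_pair` and the word counts are made up for the example, and it works on characters, whereas real GPT tokenizers run on bytes and handle pre-tokenization and special tokens. The idea is just to repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` fused into one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    word_freqs = Counter(corpus)
    # Start with every word split into single characters.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the pair that was counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            splits[word] = merge_pair(symbols, best)
    return merges, splits


# Toy corpus: made-up word counts, purely for illustration.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, splits = train_bpe(corpus, num_merges=3)
print(merges)
print(splits)
```

The learned merges, applied in order, are what the tokenizer later uses to split new text. WordPiece follows the same overall loop but picks which pair to merge with a different scoring rule rather than raw frequency.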