Nguyen-Khanh Vu 0f8ec705e2
Remove redundant word in docstring (#7)
Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2022-12-16 20:41:08 -06:00

tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
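For readers unfamiliar with byte pair encoding (BPE), the round-trip property shown above can be illustrated with a toy pure-Python sketch. This is not tiktoken's implementation or API: the function names `train_bpe`, `merge`, `encode` and `decode` are hypothetical, the training text is made up, and no attempt is made at speed. It only shows the idea that frequent adjacent byte pairs are merged into new token ids, and that the merges can be reversed exactly.

```python
from collections import Counter


def merge(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(data, num_merges):
    """Learn up to `num_merges` merge rules from raw bytes (toy example)."""
    tokens = list(data)  # start from the 256 single-byte tokens
    merges = {}          # (left_id, right_id) -> new_id, in learned order
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = new_id
        tokens = merge(tokens, best, new_id)
    return merges


def encode(text, merges):
    """Encode text by applying the merge rules in the order they were learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        tokens = merge(tokens, pair, new_id)
    return tokens


def decode(tokens, merges):
    """Invert `encode` by expanding each token id back into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


merges = train_bpe(b"hello world hello world", num_merges=10)
ids = encode("hello world", merges)
assert decode(ids, merges) == "hello world"
```

Because every merge rule maps one pair of ids to one new id, decoding always recovers the original bytes, which is why the round trip holds for any input text.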

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

The tokeniser API is documented in tiktoken/core.py.

Performance

tiktoken is 3-6x faster than a comparable open source tokeniser:

[Figure: tokeniser performance comparison]

Performance measured on 1GB of text using the GPT-2 tokeniser, using GPT2TokenizerFast from tokenizers==0.13.2 and transformers==4.24.0.
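A comparison of this kind could be approximated with a small timing harness like the one below. This is only a sketch and not the benchmark behind the figure above, whose exact methodology is not given here. The helper name `throughput_mb_per_s` is hypothetical, and `str.split` stands in for a real tokeniser so the example runs without extra dependencies.

```python
import time


def throughput_mb_per_s(encode, docs, repeats=3):
    """Return the best observed throughput of `encode` over `docs`, in MB/s.

    `encode` is any callable that takes a string. The best of `repeats`
    runs is used to reduce noise from other processes.
    """
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for d in docs:
            encode(d)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return total_bytes / max(best, 1e-9) / 1e6


# Stand-in tokeniser (whitespace splitting); an assumption for illustration.
docs = ["hello world"] * 10_000
print(f"{throughput_mb_per_s(str.split, docs):.1f} MB/s")
```

To compare real tokenisers, one would pass each tokeniser's encode function (for example `enc.encode` from the snippet near the top) with the same `docs`, and compare the resulting numbers on the same machine.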
