tiktoken/README.md
2023-03-17 13:46:32 -07:00

105 lines
3.3 KiB
Markdown

# ⏳ tiktoken
tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
OpenAI's models.
```python
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")
```
The open source version of `tiktoken` can be installed from PyPI:
```
pip install tiktoken
```
The tokeniser API is documented in `tiktoken/core.py`.
Example code using `tiktoken` can be found in the
[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
## Performance
`tiktoken` is between 3-6x faster than a comparable open source tokeniser:
![image](https://raw.githubusercontent.com/openai/tiktoken/main/perf.svg)
Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
`tokenizers==0.13.2`, `transformers==4.24.0` and `tiktoken==0.2.0`.
## Getting help
Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
If you work at OpenAI, make sure to check the internal documentation or feel free to contact
@shantanu.
## Extending tiktoken
You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
**Create your `Encoding` object exactly the way you want and simply pass it around.**
```python
cl100k_base = tiktoken.get_encoding("cl100k_base")
# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
# If you're changing the set of special tokens, make sure to use a different name
# It should be clear from the name what behaviour to expect.
name="cl100k_im",
pat_str=cl100k_base._pat_str,
mergeable_ranks=cl100k_base._mergeable_ranks,
special_tokens={
**cl100k_base._special_tokens,
"<|im_start|>": 100264,
"<|im_end|>": 100265,
}
)
```
**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
This is only useful if you need `tiktoken.get_encoding` to find your encoding, otherwise prefer
option 1.
To do this, you'll need to create a namespace package under `tiktoken_ext`.
Layout your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
```
my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py
```
`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
This is a dictionary from an encoding name to a function that takes no arguments and returns
arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
Your `setup.py` should look something like this:
```python
from setuptools import setup, find_namespace_packages
setup(
name="my_tiktoken_extension",
packages=find_namespace_packages(include=['tiktoken_ext*']),
install_requires=["tiktoken"],
...
)
```
Then simply `pip install ./my_tiktoken_extension` and you should be able to use your
custom encodings! Make sure **not** to use an editable install.