fastText Bulgarian Word Vectors (facebook/fasttext-bg-vectors)
Introduction
fastText is an open-source library for efficient learning of word representations and text classification. It is lightweight and runs on both standard hardware and mobile devices. Pre-trained word vectors are available for 157 languages, trained on Common Crawl and Wikipedia data.
Architecture
fastText trains these vectors using continuous bag-of-words (CBOW) with position-weights, character n-grams, a window of size 5, and 10 negative samples. Training is optimized for multicore CPUs, processing over a billion words in minutes. fastText can be used as a command-line tool, a C++ library, or through Python bindings, making it versatile across use cases.
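Character n-grams are what distinguish fastText from plain word2vec: each word is wrapped in boundary markers and split into overlapping character n-grams, and the word's vector is built from the vectors of those n-grams. The sketch below shows the decomposition in pure Python; the 3–6 n-gram length range is fastText's default and an assumption here, since the text above only says "character n-grams".

```python
def char_ngrams(word, minn=3, maxn=6):
    """Decompose a word into character n-grams, fastText-style.

    The word is wrapped in '<' and '>' boundary markers. The 3-6
    length range is fastText's default (an assumption here; the
    section above does not state the range).
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# With n fixed to 3, "where" yields the classic example from the
# fastText literature: boundary markers disambiguate "her" inside
# "where" from the standalone word "her" (which would be "<her>").
print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because unseen words still share n-grams with known words, fastText can produce vectors for out-of-vocabulary words, which is especially useful for morphologically rich languages like Bulgarian.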
Training
The training data for fastText comes from Common Crawl and Wikipedia. Tokenization is language-specific: for example, the Stanford word segmenter is used for Chinese and Mecab for Japanese, while other languages are tokenized according to their script. Vectors are trained in a 300-dimensional space.
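The tokenizer choice above can be pictured as a dispatch on language code. The sketch below is purely illustrative (the function name and structure are hypothetical, not fastText's actual pipeline, which calls out to external tools like the Stanford segmenter and Mecab); it shows only the whitespace fallback used for space-delimited scripts such as Bulgarian Cyrillic.

```python
def tokenize(text, lang):
    """Hypothetical sketch of language-aware tokenizer dispatch.

    The real fastText preprocessing invokes external segmenters for
    languages without whitespace word boundaries; those tools are not
    reproduced here.
    """
    if lang == "zh":
        raise NotImplementedError("use the Stanford word segmenter")
    if lang == "ja":
        raise NotImplementedError("use Mecab")
    # Fallback for space-delimited scripts, including Bulgarian
    return text.split()

print(tokenize("това е тест", "bg"))
# ['това', 'е', 'тест']
```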
Guide: Running Locally
- Install fastText: ensure Python and the `fasttext` package are installed.
- Download the pre-trained model: use `hf_hub_download` from the Hugging Face Hub to fetch the desired model.
- Load the model: use `fasttext.load_model(model_path)`.
- Run inference: use methods like `model.get_nearest_neighbors()` to perform tasks such as finding similar words.

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the Bulgarian vectors from the Hugging Face Hub and load them
model_path = hf_hub_download(repo_id="facebook/fasttext-bg-vectors", filename="model.bin")
model = fasttext.load_model(model_path)
```
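Nearest-neighbor lookup ranks the vocabulary by cosine similarity to the query word's vector. The self-contained sketch below reimplements that ranking with toy 3-dimensional vectors standing in for real 300-dimensional fastText embeddings (the words and values are illustrative only):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query_vec, vocab, k=2):
    """Rank vocabulary words by cosine similarity to query_vec."""
    scored = [(cosine(query_vec, vec), word) for word, vec in vocab.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Toy vectors: "котка" (cat) and "куче" (dog) point in a similar
# direction; "маса" (table) does not.
vocab = {
    "котка": [0.9, 0.1, 0.0],
    "куче":  [0.8, 0.2, 0.1],
    "маса":  [0.0, 0.1, 0.9],
}
print(nearest_neighbors([1.0, 0.0, 0.0], vocab, k=2))
```

The real `model.get_nearest_neighbors("дума")` returns the same shape of result, a list of `(similarity, word)` pairs, computed over the full vocabulary.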
For large-scale workloads, consider running on cloud platforms like AWS, Google Cloud, or Azure; note that fastText is CPU-optimized and does not require a GPU.
License
The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0. Users are free to share and adapt the work, provided appropriate credit is given and any modifications are distributed under the same license.