fasttext bg vectors

facebook

Introduction

fastText is an open-source library for efficient learning of word representations and text classification. It is lightweight: it runs on standard hardware, and models can be reduced in size to fit on mobile devices. Pre-trained word vectors are available for 157 languages, trained on Common Crawl and Wikipedia; this model provides the vectors for Bulgarian.

Architecture

These vectors were trained using CBOW with position-weights, with character n-grams of length 5, a window of size 5, and 10 negative samples. fastText is optimized for fast training on multicore CPUs and can process over a billion words in a few minutes. It can be used as a command-line tool, as a C++ library, or through Python bindings.
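To make the character n-gram idea concrete, here is a minimal sketch of how a word can be split into n-grams of length 5. The helper name char_ngrams is hypothetical and is not part of the fastText API. The sketch assumes the usual fastText convention of wrapping each word in < and > boundary markers, and it leaves out details of the real library, such as hashing n-grams into buckets and keeping the whole word as its own entry.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of `word` after adding boundary markers.

    Illustrative helper only; not a function provided by fastText.
    """
    # The '<' and '>' markers let a prefix or suffix be told apart from
    # the same characters appearing in the middle of a word.
    marked = f"<{word}>"
    # Slide a window of width n over the marked word. Words shorter
    # than n characters (markers included) produce an empty list.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("where"))  # ['<wher', 'where', 'here>']
print(char_ngrams("дума"))   # ['<дума', 'дума>']
```

Because a word's vector is built from its n-grams as well as the word itself, a model with subword information can still produce a vector for a word it never saw during training.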

Training

The vectors were trained on Common Crawl and Wikipedia. Tokenization is language-specific: for example, the Stanford word segmenter is used for Chinese and Mecab for Japanese, and the tokenizer for other languages is chosen by script type. The resulting vectors are 300-dimensional.

Guide: Running Locally

  1. Install fastText: Ensure Python is available and that the fasttext and huggingface_hub packages are installed.
  2. Download Pre-trained Model: Use hf_hub_download from the Hugging Face hub to download the desired model.
    import fasttext
    from huggingface_hub import hf_hub_download
    
    model_path = hf_hub_download(repo_id="facebook/fasttext-bg-vectors", filename="model.bin")
    model = fasttext.load_model(model_path)
    
  3. Load Model: The last line of the snippet above, fasttext.load_model(model_path), loads the downloaded model.
  4. Run Inference: Use methods such as model.get_nearest_neighbors() to find similar words, or model.get_word_vector() to retrieve the vector for a word.
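As a sketch of the inference step, the helper below wraps model.get_nearest_neighbors(), which returns (score, word) tuples ordered from most to least similar. The name top_neighbors and the min_score parameter are hypothetical additions for illustration. The function accepts any object with a get_nearest_neighbors method, so it can be tried out without downloading the model.

```python
def top_neighbors(model, word, k=5, min_score=0.0):
    """Return up to k (word, score) pairs for the nearest neighbors of `word`.

    `model` is assumed to be a loaded fastText model, for example the
    result of fasttext.load_model(model_path).
    """
    # fastText returns (score, word) tuples, best match first.
    neighbors = model.get_nearest_neighbors(word, k=k)
    # Flip each tuple to (word, score) and drop weak matches.
    return [(w, score) for score, w in neighbors if score >= min_score]
```

With the model loaded as in step 2, a call such as top_neighbors(model, "дума", k=5) would return the five words closest to "дума" together with their similarity scores.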

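fastText runs on the CPU and does not use GPUs, so a cloud GPU will not speed it up. The main requirement is enough memory to hold the model. If a local machine falls short, a cloud instance with more RAM and CPU cores on platforms like AWS, Google Cloud, or Azure is an option.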

License

The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0. Users are free to share and adapt the work, provided appropriate credit is given and any modifications are distributed under the same license.
