text2vec-word2vec-tencent-chinese

shibing624

Introduction

The TEXT2VEC-WORD2VEC-TENCENT-CHINESE model provides pre-trained word embeddings for 8 million Chinese words and phrases. These 200-dimensional vectors are suitable for a variety of downstream NLP tasks. The embeddings are derived from the Tencent AI Lab Embedding Corpus and cover a wide range of vocabulary, including both professional jargon and colloquial terms.

Architecture

The embeddings are constructed using a method called Directional Skip-Gram, which distinguishes left and right contexts for word embeddings. The corpus includes diverse data sources such as news articles, webpages, and novels. The vocabulary is built with contributions from resources like Wikipedia and Baidu Encyclopedia, along with techniques from the paper "Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches."
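The key difference from the standard skip-gram model is that each (target, context) pair also records whether the context word sits to the left or right of the target. A minimal sketch of generating such directional training pairs (the pair-generation logic here is illustrative, not Tencent's implementation):

```python
def directional_pairs(tokens, window=2):
    """Yield (target, context, direction) triples, where direction marks
    whether the context word lies left or right of the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue  # skip the target itself
            direction = "left" if j < i else "right"
            pairs.append((target, tokens[j], direction))
    return pairs

# Each pair now carries positional information the model can learn from
pairs = directional_pairs(["如何", "更换", "花呗", "绑定", "银行卡"], window=1)
```

In the Directional Skip-Gram objective, this direction label lets the model learn separate representations for left and right contexts instead of treating the window symmetrically.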

Training

Training incorporates a large dataset and a carefully designed algorithm to accurately capture the semantics of Chinese words and phrases. The training process retains stop words, numbers, and punctuation to accommodate specific use cases, requiring users to construct their own vocabulary lists when needed.
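Because stop words, numbers, and punctuation are retained in the released vectors, a common preprocessing step is to filter the embedding table down to your own vocabulary. A minimal sketch over a plain word-to-vector mapping (the stop-word list and vectors below are illustrative):

```python
STOP_WORDS = {"的", "了", "，", "。"}  # illustrative stop-word list

def filter_vocab(embeddings, stop_words=STOP_WORDS):
    """Drop stop words and punctuation from a word -> vector mapping."""
    return {w: v for w, v in embeddings.items() if w not in stop_words}

# Toy embedding table standing in for the full 8M-word corpus
emb = {"银行卡": [0.1, 0.2], "的": [0.0, 0.0], "花呗": [0.3, 0.1]}
filtered = filter_vocab(emb)  # keeps 银行卡 and 花呗, drops 的
```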

Guide: Running Locally

To use the Tencent word vectors for tasks such as synonym identification:

  1. Install the text2vec package:

    pip install text2vec
    
  2. Load the Model and Compute Embeddings:

    from text2vec import Word2Vec
    
    def compute_emb(model):
        # Sentences may mix Chinese and English; each one is mapped to a
        # single 200-dimensional embedding.
        sentences = [
            '卡', '银行卡', '如何更换花呗绑定银行卡',
            '花呗更改绑定银行卡', 'This framework generates embeddings for each input sentence',
            'Sentences are passed as a list of string.', 'The quick brown fox jumps over the lazy dog.',
            '敏捷的棕色狐狸跳过了懒狗',
        ]
        # normalize_embeddings=True returns unit-length vectors, so cosine
        # similarity between them reduces to a dot product.
        sentence_embeddings = model.encode(sentences, show_progress_bar=True, normalize_embeddings=True)
        for sentence, embedding in zip(sentences, sentence_embeddings):
            print("Sentence:", sentence)
            print("Embedding shape:", embedding.shape)
            print("Embedding head:", embedding[:10])
    
    if __name__ == "__main__":
        # Model weights are downloaded automatically on first use
        w2v_model = Word2Vec("w2v-light-tencent-chinese")
        compute_emb(w2v_model)
    
  3. Consider Using Cloud GPUs: For large-scale processing, consider using cloud GPU services like AWS, GCP, or Azure to enhance computational efficiency.
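
For the synonym-identification task mentioned above, similarity between two embeddings is usually measured with cosine similarity; with normalized embeddings this reduces to a dot product. A self-contained sketch with toy vectors (in practice you would pass the arrays returned by `model.encode`):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors: dot product over norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors standing in for 200-dimensional word embeddings
v1 = [0.6, 0.8]
v2 = [0.6, 0.8]   # identical direction -> similarity close to 1.0
v3 = [-0.8, 0.6]  # orthogonal direction -> similarity close to 0.0
print(cosine_sim(v1, v2))
print(cosine_sim(v1, v3))
```

Word pairs whose embeddings score near 1.0 are strong synonym candidates; scores near 0.0 indicate unrelated words.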

License

The model is licensed under the Apache-2.0 license, allowing for wide use and distribution.
