xlsearch cross lang search zh vs classicical cn

raynardj

Introduction

The Cross Language Search model lets users search classical Chinese literature with modern Chinese queries. It is particularly useful for readers who want to cite ancient texts but do not know the exact original wording. The use case parallels the Western practice of quoting Latin: the model opens up a large body of classical Chinese literature to queries written in the modern language.

Architecture

The model is a Sentence Transformers encoder that maps text to embeddings suited to cosine similarity search. A modern Chinese phrase is embedded as the query and compared against the embeddings of classical Chinese passages, so matches are found by semantic similarity rather than by shared wording.
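As a rough illustration of how a cosine similarity search ranks candidates, the sketch below scores a query vector against a small corpus and returns the indices of the closest matches. The helper names `cosine_similarity` and `rank_by_similarity` and the 3-dimensional toy vectors are assumptions made for this example; real embeddings would come from the encoder and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs, top_k=5):
    """Return indices of the top_k corpus vectors most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    # Sort corpus indices by score, highest similarity first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

# Toy vectors standing in for sentence embeddings (hypothetical values).
corpus_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]
top = rank_by_similarity(query_vec, corpus_vecs, top_k=2)
```

Because cosine similarity depends only on direction and not on vector length, two texts with similar meaning can score highly even when their embeddings differ in magnitude.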

Training

The model was developed on the premise that classical Chinese is distinct enough from modern Chinese to be treated as a separate language. To that end, translation models were trained between ancient and modern Chinese; these highlight the disparities between the two and facilitate cross-language search.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Installation:
    Install the necessary packages with the following commands:

    pip install -Uqq unpackai
    pip install -Uqq sentence-transformers
    
  2. Encoding Sentences:
    Use the SentenceTransformer to encode your list of sentences:

    from unpackai.interp import CosineSearch
    from sentence_transformers import SentenceTransformer
    import pandas as pd
    import numpy as np
    
    TAG = "raynardj/xlsearch-cross-lang-search-zh-vs-classicical-cn"
    encoder = SentenceTransformer(TAG)
    
    all_lines = ["句子1", "句子2", ...]
    vec = encoder.encode(all_lines, batch_size=32, show_progress_bar=True)
    
  3. Performing Searches:
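    Build a cosine similarity index over the encoded corpus, then use it to search for relevant sentences:

    # Index the corpus embeddings from step 2 (CosineSearch is imported there).
    cosine = CosineSearch(vec)

    def search(text):
        enc = encoder.encode(text)  # embed the modern Chinese query
        order = cosine(enc)  # ranking of corpus indices for this query
        # Keep the five best-matching sentences.
        sentence_df = pd.DataFrame({"sentence": np.array(all_lines)[order[:5]]})
        return sentence_df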
    
  4. Cloud GPUs:
    For enhanced performance, consider utilizing cloud GPU services like AWS, Google Cloud, or Azure to handle large datasets and improve processing speed.
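When the corpus is too large to encode comfortably in a single call, one option is to process it in chunks. The sketch below is only an illustration: `chunked` is a hypothetical helper, not part of unpackai or Sentence Transformers, and the chunk size is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder corpus; in practice this would be the list of classical Chinese lines.
lines = [f"line {i}" for i in range(10)]
chunks = list(chunked(lines, 4))

# With the encoder from step 2 loaded, each chunk could be encoded and stacked, e.g.:
# vec = np.vstack([encoder.encode(c, batch_size=32) for c in chunked(all_lines, 10_000)])
```

Encoding chunk by chunk also makes it straightforward to save intermediate results, so a long run on a rented GPU does not have to restart from the beginning after an interruption.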

License

No license is explicitly stated for the project. Users should check the project's GitHub repository or the Hugging Face model page for licensing information.
