xlsearch-cross-lang-search-zh-vs-classicical-cn: Cross-Language Search Between Modern and Classical Chinese

Introduction

The Cross Language Search model lets users search classical Chinese literature with queries written in modern Chinese. It is particularly useful for people who want to cite an ancient text but do not remember its exact wording. Quoting classical Chinese serves much the same purpose as quoting Latin in the Western tradition, and the model makes the vast body of classical Chinese literature searchable for that use.

Architecture

The model is built on Sentence Transformers and produces text embeddings suited to cosine-similarity search. A modern Chinese query and the classical Chinese passages are encoded by the same model, so passages can be ranked by how semantically close their embeddings are to the query's embedding.
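To make the similarity measure concrete, here is a minimal sketch of cosine similarity in plain Python. The vectors are hypothetical toy values, not output from the model, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 1.0, 0.0]      # shares no direction with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing sentence embeddings.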

Training

The model grew out of an effort to show that classical Chinese is distinct enough from modern Chinese to be treated as a separate language. Translation models were trained between classical and modern Chinese, which both highlighted the differences between the two and made cross-language search possible.

Guide: Running Locally

To run the model locally, follow these steps:

Installation:
Install the necessary packages with the following commands:
```
pip install -Uqq unpackai
pip install -Uqq sentence-transformers
```

Encoding Sentences:
Use the SentenceTransformer to encode your list of sentences:

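```python
from unpackai.interp import CosineSearch
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np

# Model identifier on the Hugging Face Hub
TAG = "raynardj/xlsearch-cross-lang-search-zh-vs-classicical-cn"
encoder = SentenceTransformer(TAG)

# Placeholder list: replace it (including the trailing ...) with your own sentences
all_lines = ["句子1", "句子2", ...]
# One embedding vector per sentence
vec = encoder.encode(all_lines, batch_size=32, show_progress_bar=True)
```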

Performing Searches:
Use the cosine similarity function to search for relevant sentences:

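```python
# Build the searcher over the corpus embeddings. This assumes CosineSearch
# takes the embedding matrix and, when called with a query vector, returns
# corpus indices ordered from most to least similar.
cosine = CosineSearch(vec)

def search(text):
    enc = encoder.encode(text)  # embed the query
    order = cosine(enc)         # corpus indices ranked by similarity
    # Keep the five best-matching sentences
    sentence_df = pd.DataFrame({"sentence": np.array(all_lines)[order[:5]]})
    return sentence_df
```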
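If you also want the similarity scores, or prefer not to depend on a search helper, the ranking step can be sketched in plain Python. The function `top_k` and the two-dimensional vectors below are hypothetical stand-ins for illustration; in practice the vectors would come from `encoder.encode`.

```python
import math

def top_k(query_vec, corpus_vecs, lines, k=5):
    """Return the k (line, score) pairs most similar to query_vec."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    # Score every corpus line against the query, then sort best-first.
    scored = [(line, cos(query_vec, vec)) for line, vec in zip(lines, corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional vectors standing in for encoder output.
lines = ["line A", "line B", "line C"]
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], corpus_vecs, lines, k=2)
for line, score in results:
    print(f"{score:.3f}  {line}")
```

This scans the whole corpus for each query, which is fine for small collections. A vectorized or indexed search is a better fit for large ones.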

Cloud GPUs:
For enhanced performance, consider utilizing cloud GPU services like AWS, Google Cloud, or Azure to handle large datasets and improve processing speed.
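When a corpus is too large to encode comfortably in one call, one option is to work through it in chunks and store each chunk's vectors before moving on. The sketch below shows only the chunking step. `iter_chunks` and the chunk size are hypothetical choices for illustration, and the encoding call is left as a comment.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of items with at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# Hypothetical corpus of ten lines, processed four at a time.
corpus = [f"line {i}" for i in range(10)]
chunks = list(iter_chunks(corpus, 4))
for chunk in chunks:
    # In a real run, each chunk would be passed to encoder.encode(...)
    # and the resulting vectors saved before the next chunk is read.
    print(len(chunk), chunk[0])
```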

License

No license is explicitly stated for the project. Check the project's GitHub repository or the Hugging Face model page for licensing information.
