xlsearch cross lang search zh vs classicical cn
raynardj
Introduction
The Cross Language Search model lets users search classical Chinese literature with modern Chinese queries. It is particularly useful for readers who want to cite ancient texts but do not know the exact classical wording. Quoting the classics in Chinese is comparable to the Western practice of quoting Latin, and the model makes a large body of classical Chinese literature searchable for that purpose.
Architecture
The model uses Sentence Transformers to produce text embeddings suited to cosine-similarity search. A modern Chinese query and the classical Chinese passages are encoded by the same model, so relevant passages can be found by semantic similarity rather than by exact wording.
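To make the cosine-similarity ranking concrete, here is a minimal NumPy sketch. The toy 3-dimensional vectors stand in for real sentence embeddings, and the function name `rank_by_cosine` is hypothetical; it is not part of Sentence Transformers or of this model.

```python
import numpy as np

def rank_by_cosine(query_vec, corpus_vecs):
    """Return corpus row indices sorted from most to least similar, plus the scores."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # cosine(a, b) = a.b / (|a| * |b|), computed for every corpus row at once
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)  # descending similarity
    return order, scores

# Toy embeddings (assumed values, for illustration only).
corpus_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 1.0, 0.0]
order, scores = rank_by_cosine(query_vec, corpus_vecs)
```

With real embeddings, `corpus_vecs` would be the array returned by encoding the classical Chinese passages and `query_vec` the encoding of the modern Chinese query.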
Training
The model was developed to demonstrate how classical Chinese can be considered distinct from modern Chinese, much like different languages. This was achieved by training translation models between ancient and modern Chinese, highlighting the disparities and facilitating cross-language searches.
Guide: Running Locally
To run the model locally, follow these steps:
- Installation:
Install the necessary packages with the following commands:
    pip install -Uqq unpackai
    pip install -Uqq sentence-transformers
- Encoding Sentences:
Use the SentenceTransformer to encode your list of sentences:
    from unpackai.interp import CosineSearch
    from sentence_transformers import SentenceTransformer
    import pandas as pd
    import numpy as np

    TAG = "raynardj/xlsearch-cross-lang-search-zh-vs-classicical-cn"
    encoder = SentenceTransformer(TAG)

    # all_lines holds every sentence you want to search over;
    # the entries below are placeholders ("sentence 1", "sentence 2"), replace them with your own corpus
    all_lines = ["句子1", "句子2", ...]
    vec = encoder.encode(all_lines, batch_size=32, show_progress_bar=True)
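- Performing Searches:
Build a cosine-similarity index over the encoded sentences and use it to find the closest matches:
    # index the corpus embeddings (CosineSearch is imported from unpackai.interp)
    cosine = CosineSearch(vec)

    def search(text):
        enc = encoder.encode(text)   # encode the search query
        order = cosine(enc)          # corpus indices ordered by similarity
        sentence_df = pd.DataFrame({"sentence": np.array(all_lines)[order[:5]]})
        return sentence_df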
- Cloud GPUs:
Encoding a large corpus is faster on a GPU. For large datasets, consider cloud GPU services such as AWS, Google Cloud, or Azure.
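The steps above can also be sketched end to end without `unpackai`, which may help when checking the pipeline on a small corpus. This is a stand-in under stated assumptions, not the author's implementation: `make_search` and `toy_encode` are hypothetical names, and `toy_encode` is a character-count placeholder for `encoder.encode` so that the example runs without downloading the model.

```python
import numpy as np

def make_search(encode, lines, top_k=5):
    """Build a search function over `lines` using any encoder callable.

    `encode` must map a list of strings to a 2-D array with one row per string,
    which is how SentenceTransformer.encode behaves for list input.
    """
    vecs = np.asarray(encode(lines), dtype=float)
    # Normalise once so each query reduces to a single matrix-vector product.
    vecs = vecs / np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12, None)

    def search(text):
        q = np.asarray(encode([text]), dtype=float)[0]
        q = q / max(np.linalg.norm(q), 1e-12)
        scores = vecs @ q                      # cosine similarity per line
        order = np.argsort(-scores)[:top_k]    # best matches first
        return [(lines[i], float(scores[i])) for i in order]

    return search

def toy_encode(texts):
    """Placeholder encoder (hypothetical): character counts over a fixed vocabulary."""
    vocab = "学习时之山水"
    return np.array([[t.count(ch) for ch in vocab] for t in texts], dtype=float)

lines = ["学而时习之", "山高水长", "水之山"]
search = make_search(toy_encode, lines, top_k=2)
results = search("学习")
```

To use the real model, one would presumably pass something like `lambda texts: encoder.encode(texts, batch_size=32)` in place of `toy_encode`, with `encoder` created as in the encoding step.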
License
No license is explicitly stated for this project. Check the project's GitHub repository or the Hugging Face model page for licensing information.