A Comprehensive Guide to Text Embedding Models: from text2vec and openai-ada-002 to m3e and bge


Preface

This is already the 31st technical article on large models I have written this year. If

  • Half a year ago, writing blogs was driven mostly by personal interest + reader demand
  • After our company established the LLM project team in Q3 2023, writing blogs became: personal interest + reader demand + project demand. It is truly fortunate to have all three come together in writing these posts.

Both I and our company are glad to discuss with everyone, through blogs, courses, internal training, and projects, how to apply advanced large-model technology to business scenarios across industries better and faster, empowering the practical business of thousands of companies.

This article originally belonged to section 1.2 of "Knowledge Base Q&A LangChain+LLM Secondary Development: Typical Issues and Improvement Solutions for Commercial Use" (the initial draft of that section came from bingo in the third project group of our LLM project team, which undertook the company's "Knowledge Base Q&A" project). In order to explain Text Embedding models more accurately and comprehensively, that part was extracted and continuously improved into this standalone article.

The goal is to make it as detailed as possible compared with other material currently available online.

Part 1 Ranking of Text Vector Representation Effectiveness: MTEB, C-MTEB

1.1 "MTEB: Massive Text Embedding Benchmark"

Determining which text embedding models perform better usually requires an evaluation benchmark for comparison. "MTEB: Massive Text Embedding Benchmark" is such a benchmark for evaluating text embedding models at scale.

  • Paper link: https://arxiv.org/abs/2210.07316
    MTEB spans 8 embedding task types, covering 58 datasets and 112 languages. By benchmarking 33 models on MTEB, the authors established the most comprehensive text embedding benchmark to date. They found that no single text embedding method dominates across all tasks, which suggests the field has not yet converged on a universal text embedding method that scales up to deliver state-of-the-art results on every embedding task.
  • Github link: https://github.com/embeddings-benchmark/mteb#leaderboard
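The leaderboard in the GitHub repository above is driven by the mteb Python package. As a rough usage sketch (assuming the mteb and sentence-transformers packages are installed; the task name and model below are only examples, and the API may differ slightly across versions), evaluating an embedding model on a single task looks roughly like this:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method works; m3e-base is used here as an example
model = SentenceTransformer("moka-ai/m3e-base")

# Evaluate on one classification task; pass more task names to broaden the benchmark
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/m3e-base")
print(results)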

1.2 Chinese Massive Text Embedding Benchmark: C-MTEB

From the Chinese Massive Text Embedding Benchmark (C-MTEB) leaderboard, you can see the latest rankings for the various Chinese text embedding tasks, with a separate ranking for each task scenario.

Task list includes:

  • Retrieval
  • STS
  • PairClassification
  • Classification
  • Reranking
  • Clustering

In the local knowledge base task, the main focus is retrieving similar local knowledge text segments from the vector database based on the embedding of the query. This scenario is therefore primarily a retrieval task, and the retrieval leaderboard is as follows:

The most effective models on the current retrieval leaderboard are the bge series (e.g. bge-large-zh), and m3e-base, the default in the langchain-chatchat project, also ranks relatively high.
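As a concrete illustration of this retrieval scenario, the following is a minimal sketch (not the langchain-chatchat implementation) that assumes sentence-transformers and faiss are installed and uses made-up document strings: the documents and the query are embedded, and the nearest knowledge segments are fetched by inner-product search over normalized vectors.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Embedding model for both documents and the query (m3e-base is just an example)
model = SentenceTransformer("moka-ai/m3e-base")

# Made-up local knowledge segments; in practice these come from your document store
docs = ["local knowledge segment one", "local knowledge segment two", "local knowledge segment three"]
doc_emb = model.encode(docs, normalize_embeddings=True)

# With normalized vectors, inner product equals cosine similarity
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(np.asarray(doc_emb, dtype="float32"))

query_emb = model.encode(["the user question"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_emb, dtype="float32"), 2)
print([docs[i] for i in ids[0]])  # the two most similar knowledge segments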

Part 2 text-embedding-ada-002

2.1 Model Introduction

text-embedding-ada-002 is an embedding model provided by OpenAI, but it requires payment to use the API. It has the following characteristics:

  • Unified Capabilities: OpenAI has combined five separate models (text similarity, text search-query, text search-document, code search-text, and code search-code) into a new model. In a range of different text search, sentence similarity, and code search benchmarks, this single representation outperforms previous embedding models.
  • Context: The context length is 8192, making it more convenient for processing long documents.
  • Embedding Size: Only 1536 dimensions, which is one-eighth of the davinci-001 embedding size, making the new embedding more cost-effective for processing vector databases.

2.2 Model Usage

The following is the code example provided in the official OpenAI documentation for text search

from openai.embeddings_utils import get_embedding, cosine_similarity

def search_reviews(df, product_description, n=3, pprint=True):
    # Embed the query text with text-embedding-ada-002
    embedding = get_embedding(product_description, model='text-embedding-ada-002')
    # Cosine similarity between the query embedding and each pre-computed review embedding
    df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
    # Keep the n most similar reviews
    res = df.sort_values('similarities', ascending=False).head(n)
    return res

res = search_reviews(df, 'delicious beans', n=3)
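Note that the snippet above assumes df already holds a pre-computed ada_embedding column. A sketch of how that column could be populated, following the same pattern (the 'combined' text column name is only illustrative):

# 'combined' is a hypothetical text column; replace it with your actual column name
df['ada_embedding'] = df['combined'].apply(
    lambda text: get_embedding(text, model='text-embedding-ada-002')
)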

Part 3 m3e Model

3.1 Introduction to m3e Model

M3E (Moka Massive Mixed Embedding; m3e-base model download address: https://huggingface.co/moka-ai/m3e-base, m3e GitHub address: GitHub - wangyingdong/m3e-base) has the following characteristics:

  • It is trained on sentence-pair datasets with contrastive learning using in-batch negative sampling (a minimal sketch of this loss follows this list). To make in-batch negative sampling effective, A100 GPUs were used to maximize the batch size, and 1 epoch was trained on a combined 22M+ sentence-pair dataset (covering Chinese encyclopedia, finance, medical, legal, news, academic and other domains)
  • It uses an instruction dataset: M3E was fine-tuned on 300K+ instructions, allowing it to follow instructions when encoding text. This work was mainly inspired by instructor-embedding
  • For the base model, M3E trains on the RoBERTa series, currently providing small and base versions. The langchain-chatchat project discussed in "Knowledge Base Q&A LangChain+LLM Secondary Development: Typical Issues and Improvement Solutions for Commercial Use" uses m3e-base by default
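As referenced in the first bullet, the idea of contrastive learning with in-batch negatives can be sketched as follows. This is a minimal PyTorch illustration of the loss, not the uniem training code; the temperature value is arbitrary.

import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb, passage_emb, temperature=0.05):
    # query_emb, passage_emb: (batch_size, dim) embeddings of paired sentences.
    # The i-th query and i-th passage form a positive pair; every other passage
    # in the batch acts as a negative ("in-batch negative sampling"), which is
    # why a larger batch (e.g. maximized on an A100) provides more negatives.
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    logits = query_emb @ passage_emb.T / temperature  # (batch_size, batch_size)
    labels = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)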

3.1.1 m3e vs. openai text-embedding-ada-002

The following are some major updates for m3e

  • 2023.06.08: added retrieval task evaluation results. On the T2Ranking 1W Chinese dataset, m3e-base reached 0.8004 on ndcg@10, surpassing openai-ada-002's 0.7786
    See the s2p ndcg@10 column in the figure below (s2p, i.e. sentence to passage, represents embedding capability between heterogeneous texts; applicable tasks: text retrieval, GPT memory modules, etc.)
  • 2023.06.07: added text classification evaluation results. On 6 text classification datasets, m3e-base reached 0.6157 on accuracy, surpassing openai-ada-002's 0.5956
    See the s2s ACC column in the figure below (s2s, i.e. sentence to sentence, represents embedding capability between homogeneous texts; applicable tasks: text similarity, duplicate question detection, text classification, etc.)

Furthermore, the m3e team suggests

  1. If the main usage scenario is Chinese with a small amount of English, it is recommended to use the m3e series of models
  2. For multilingual usage scenarios, and if data privacy is not a concern, the author's team recommends using openai text-embedding-ada-002
  3. For code retrieval scenarios, it is recommended to use openai text-embedding-ada-002
  4. For text retrieval scenarios, please use a model with text retrieval capability; text embedding models trained only on S2S data cannot handle text retrieval tasks.

3.2 m3e model fine-tuning

  • Fine-tuning script: m3e uses the uniem script for fine-tuning

    from datasets import load_dataset
    from uniem.finetuner import FineTuner

    # Load the STS-B subset of the shibing624/nli_zh sentence-pair dataset
    dataset = load_dataset('shibing624/nli_zh', 'STS-B')
    # Specify m3e-small as the model to fine-tune
    finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
    # Fine-tune for 3 epochs
    finetuner.run(epochs=3)

For now, the detailed tutorial will be placed in the 'Online Camp for Large-scale Project Development'; it will be added in subsequent updates of this article.

Part 4 bge Model

4.1 Introduction to bge Model

BGE is a Chinese-English semantic vector model released by the Beijing Academy of Artificial Intelligence (BAAI) (HF address: https://huggingface.co/BAAI/bge-large-zh, GitHub address: https://github.com/FlagOpen/FlagEmbedding/blob/master/README%5Fzh.md). Below are the technical highlights of BGE:

  1. Efficient pre-training and large-scale text fine-tuning
  2. The RetroMAE pre-training algorithm is applied on two large-scale corpora, further enhancing the model's semantic representation capability
  3. The discriminative power of the semantic vectors is enhanced through negative sampling and hard negative mining
  4. The Instruction Tuning strategy is borrowed to enhance general ability in multi-task scenarios
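To illustrate highlight 4 (instruction-style usage for retrieval), the following is a sketch based on the FlagEmbedding README: argument names may differ across versions, the example texts are made up, and the instruction string is the Chinese retrieval instruction published with the bge models.

from FlagEmbedding import FlagModel

# The retrieval instruction is prepended to queries only, reflecting the
# Instruction Tuning strategy mentioned in highlight 4
model = FlagModel('BAAI/bge-large-zh',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")

queries = ["What is a text embedding?"]
passages = ["A text embedding maps a sentence to a dense vector.",
            "BGE is a Chinese-English semantic vector model released by BAAI."]

q_emb = model.encode_queries(queries)   # the instruction is added automatically
p_emb = model.encode(passages)          # passages are encoded without the instruction
scores = q_emb @ p_emb.T                # inner-product similarity
print(scores)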

4.1.1 Pre-training Steps of RetroMAE

Current mainstream language models are pre-trained on token-level tasks such as MLM or Seq2Seq. These training objectives, however, make it hard for the model to produce high-quality sentence-level vectors, which limits the potential of language models in retrieval tasks. To address this drawback, there are currently two pre-training strategies for retrieval models.

  • The first type is self-contrastive learning, which is often limited by the quality of data augmentation and requires a very large number of negative samples.
  • The other type is based on auto-encoding, a self-reconstruction approach that is not affected by data augmentation or negative-sampling strategies. The performance of models based on this method hinges on two key factors: first, the reconstruction task must place sufficient demands on encoding quality; second, the training data needs to be fully utilized.

Based on this, researchers proposed RetroMAE (paper: https://arxiv.org/abs/2205.12035), which consists of two modules: a BERT-like encoder for generating sentence vectors, and a single-layer transformer decoder for reconstructing sentences, as shown in the figure below.

4.1.1.1 Encoding

Encoding means masking (Mask(EN)) a small portion of the tokens and then encoding the result with BERT to obtain the sentence embedding \mathbf{h}_{\tilde{X}}. The specific steps are as follows

  1. Given a sentence input X: Norwegian forest cat is a breed of domestic cat originating in northern Europe
  2. Randomly mask (Mask(EN)) a small portion of its tokens to obtain \tilde{X}_{enc}: [M] forest cat is a breed of [M] cat originating in [M] Europe
    A moderate mask ratio (15%~30%) is typically used here, so that most of the sentence's original information is preserved
  3. Then encode it with a BERT-like encoder \Phi_{enc}(\cdot) to obtain the corresponding sentence embedding \mathbf{h}_{\tilde{X}} (the final-layer hidden state at the [CLS] position is usually taken as the sentence embedding), as in the formula below; a minimal code sketch of this [CLS]-pooling step follows this list
    \mathbf{h}_{\tilde{X}} \leftarrow \Phi_{enc}\left(\tilde{X}_{enc}\right)
    As the paper puts it: "We apply a BERT like encoder with 12 layers and 768 hidden-dimensions, which helps to capture the in-depth semantics of the sentence. Following the common practice, we select the [CLS] token's final hidden state as the sentence embedding."
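As referenced in step 3, taking the final-layer [CLS] hidden state as the sentence embedding can be sketched with a generic BERT from the transformers library. This only illustrates the pooling step, not the RetroMAE training code.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # 12 layers, 768 hidden dims

text = "Norwegian forest cat is a breed of domestic cat originating in northern Europe"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# The [CLS] token sits at position 0; its final-layer hidden state is the sentence embedding
sentence_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, 768)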
4.1.1.2 Decoding

Decoding means masking (Mask(DE)) a large portion of the tokens and then, combining the masked input with the sentence embedding \mathbf{h}_{\tilde{X}}, letting the decoder reconstruct the original sentence.

Specifically, the following two parts are combined so that the decoder can reconstruct the original sentence from them:

  • The text input after Mask(DE), \tilde{X}_{dec}: [M] [M] cat is [M] [M] of domestic [M] [M] in northern [M]
    (a much more aggressive mask ratio than on the encoder side is used here, e.g. 50%~70%)
  • The sentence embedding generated by the encoder in the previous section
    The paper describes this step as: "The masked input is joined with the sentence embedding, based on which the original sentence is reconstructed by the decoder."

One detail is that the Mask(DE) input here carries position embeddings, i.e. the sentence embedding \mathbf{h}_{\tilde{X}} and the masked input with position information are combined into the following sequence
\mathbf{H}_{\tilde{X}_{dec}} \leftarrow\left[\mathbf{h}_{\tilde{X}}, \mathbf{e}_{x_{1}}+\mathbf{p}_{1}, \ldots, \mathbf{e}_{x_{N}}+\mathbf{p}_{N}\right]
where \mathbf{e}_{x_{i}} denotes the embedding of x_i, with an extra position embedding \mathbf{p}_{i} added on top of it


Next, the decoder \Phi_{dec} is learned by optimizing the following objective so as to reconstruct the original sentence X
\mathcal{L}_{dec}=\sum_{x_{i} \in \text{masked}} \mathrm{CE}\left(x_{i} \mid \Phi_{dec}\left(\mathbf{H}_{\tilde{X}_{dec}}\right)\right)
where CE is the cross-entropy loss
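A minimal PyTorch sketch of this masked cross-entropy follows; the tensor names are hypothetical, and this only illustrates the objective, not the RetroMAE code.

import torch.nn.functional as F

def decoder_reconstruction_loss(logits, targets, mask_de):
    # logits: (N, vocab_size) decoder predictions for each token position
    # targets: (N,) original token ids; mask_de: (N,) bool, True where Mask(DE) was applied
    # Cross-entropy is accumulated only over the masked positions, matching L_dec above
    return F.cross_entropy(logits[mask_de], targets[mask_de])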

Due to the extremely simple network structure and very aggressive mask ratio used in the decoder part, the decoding task becomes extremely challenging, forcing the encoder to generate high-quality sentence vectors in order to accurately reconstruct the original text.

4.1.1.3 Enhanced Decoding

The decoding strategy above has a flaw: the training signal comes only from the masked tokens, and each masked token is reconstructed against the same context. Researchers therefore proposed a new decoding method, enhanced decoding, which works as follows

  • (c) Finally, the post-attention output A, together with H1, is passed through an FFN (i.e. with a resnet-style residual connection) to reconstruct the original text; here the reconstruction target is not just the masked tokens but all tokens
    \mathcal{L}_{dec}=\sum_{x_{i} \in X} \mathrm{CE}\left(x_{i} \mid \mathbf{A}, \mathbf{H}_{1}\right)
    The final RetroMAE loss is the sum of two parts: the MLM loss on the encoder side and the self-reconstruction cross-entropy loss on the decoder side (written out below)
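Written out in the notation above, the total RetroMAE pre-training objective is the sum of the encoder-side MLM loss and the decoder-side reconstruction loss:

\mathcal{L}=\mathcal{L}_{enc}+\mathcal{L}_{dec}=\sum_{x_{i} \in \text{masked}\left(\tilde{X}_{enc}\right)} \mathrm{CE}\left(x_{i} \mid \Phi_{enc}\left(\tilde{X}_{enc}\right)\right)+\sum_{x_{i} \in X} \mathrm{CE}\left(x_{i} \mid \mathbf{A}, \mathbf{H}_{1}\right)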

Finally, to summarize the RetroMAE pre-training steps:

  1. (A) Encoding: the input is moderately masked and encoded as the sentence embedding (the green rectangle)
  2. (B) Decoding: the input is aggressively masked, and joined with the sentence embedding to reconstruct the masked tokens (the shadowed tokens).
  3. (C) Enhanced encoding: all input tokens are reconstructed based on the sentence embedding and the visible context in each row (defined in Eq. 7); the main diagonal positions are filled with −∞ (grey), and positions for the visible context are filled with 0 (blue).

4.2 Fine-tuning of bge Model

  • Fine-tuning script: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune
  • Data format (an illustrative example line is shown after this list)
{"query": str, "pos": List[str], "neg": List[str]}
  • Hard negative mining: hard negative sampling is a widely used method to improve the quality of sentence embeddings. Hard negatives can be mined with the following command
    python -m FlagEmbedding.baai_general_embedding.finetune.hn_mine \
      --model_name_or_path BAAI/bge-base-en-v1.5 \
      --input_file toy_finetune_data.jsonl \
      --output_file toy_finetune_data_minedHN.jsonl \
      --range_for_sampling 2-200 \
      --use_gpu_for_searching
  • Training: the following is a sketch based on the FlagEmbedding fine-tuning examples (exact flag names may differ across versions; refer to the fine-tuning script linked above)

    torchrun --nproc_per_node 1 \
      -m FlagEmbedding.baai_general_embedding.finetune.run \
      --output_dir ./bge_finetuned \
      --model_name_or_path BAAI/bge-base-en-v1.5 \
      --train_data toy_finetune_data_minedHN.jsonl \
      --learning_rate 1e-5 \
      --num_train_epochs 5 \
      --per_device_train_batch_size 4 \
      --query_max_len 64 \
      --passage_max_len 256 \
      --train_group_size 2
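For reference, a single line of fine-tuning data in the format above might look like this (the texts are made up purely for illustration):

{"query": "What is the capital of France?", "pos": ["Paris is the capital of France."], "neg": ["Berlin is the capital of Germany.", "France borders Spain and Italy."]}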

References and Recommended Reading

Summary
This article discusses the evaluation of text embedding models, including the Massive Text Embedding Benchmark (MTEB) and the Chinese Massive Text Embedding Benchmark (C-MTEB). MTEB evaluates 33 models across 8 semantic vector task types, while C-MTEB ranks models for various Chinese text embedding tasks. The article also introduces OpenAI's text-embedding-ada-002 model, which combines five separate models and has a context length of 8192. Additionally, the M3E model (Moka Massive Mixed Embedding) is described, which uses in-batch negative sampling for contrastive learning and is trained on over 22 million sentence pairs; it also incorporates an instruction dataset and is based on the RoBERTa series of models. Finally, the BGE model from BAAI is covered, including its RetroMAE pre-training and its fine-tuning workflow with hard negative mining.