Building an Efficient RAG System: Part 2 | Jiayuan (Forrest)

Content

How devv.ai builds an efficient RAG system, Part 2. This series of threads shares the experience of building the Retrieval Augmented Generation (RAG) system behind devv.ai, including practices from the production environment. This is the second article in the series, on the theme of "How to evaluate a RAG system". 🧵

In the previous part we covered what a RAG system is and its basic components; here is a quick review. A basic RAG system consists of the following 3 parts:
1. A language model
2. An external knowledge base
3. The external knowledge required in the current scenario
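To make the three parts concrete, here is a minimal sketch of how they fit together. This is illustrative only: the keyword-overlap retriever and the `llm_generate` stub are hypothetical stand-ins, not devv.ai's actual implementation.

```python
# Minimal sketch of a RAG pipeline; the retriever and the LLM call below are
# hypothetical stand-ins, not devv.ai's actual implementation.

def retrieve(query: str, knowledge_base: list[str], top_k: int = 3) -> list[str]:
    """Score documents by naive keyword overlap and return the top_k matches."""
    words = query.lower().split()
    scored = [(sum(w in doc.lower() for w in words), doc) for doc in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def llm_generate(prompt: str) -> str:
    """Stub for a language-model call; replace with a real API client."""
    return f"[model answer conditioned on a prompt of {len(prompt)} characters]"


def answer(query: str, knowledge_base: list[str]) -> str:
    """Retrieve the external knowledge relevant to the query and feed it to the model."""
    context = "\n".join(retrieve(query, knowledge_base))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```

The three components map directly onto the sketch: the model behind `llm_generate`, the `knowledge_base`, and the retrieved `context` needed for the current query.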

To optimize the whole system, you can break the problem down into optimizing each part. The difficulty with optimizing an LLM-based system, however, is that it is essentially a black box with no effective evaluation method. Without even a basic benchmark, any talk of improving the relevant metrics is empty talk.

So the first thing to do is establish an evaluation system for the whole RAG pipeline. A paper from Stanford does exactly this, evaluating the verifiability of generative search engines: "Evaluating Verifiability in Generative Search Engines".

Although the paper evaluates generative search engines, its approach applies just as well to RAG. Essentially, a generative search engine is a subset of RAG: a RAG system tailored to a specific domain of data.

The paper states that the prerequisite for a trustworthy generative search engine is verifiability. We all know that LLMs often make things up with complete confidence (hallucination), generating content that looks correct but is actually wrong. One advantage of RAG is that it gives the model reference material, reducing the probability of hallucination.

How much hallucination is reduced can be assessed through verifiability. An ideal RAG system should have:
- High citation recall: all generated content is fully supported by citations (external knowledge)
- High citation precision: every citation genuinely supports the generated content

In practice, neither indicator reaches 100%. According to the experimental results in the paper, existing generative search engines frequently produce unsupported statements and inaccurate citations: only 51.5% of generated statements are fully supported by citations, and only 74.5% of citations actually support the statements they are attached to. In simple terms, the generated content often does not match the external knowledge.

The paper evaluates 4 mainstream generative search engines:
- Bing Chat
- NeevaAI (acquired by Snowflake)
- Perplexity
- YouChat
The evaluation queries come from a variety of topics and fields.

The evaluation uses 4 indicators:
1. Fluency: whether the generated text is smooth and coherent
2. Perceived utility: whether the generated content is helpful
3. Citation recall: the proportion of generated content that is fully supported by citations
4. Citation precision: the proportion of citations that support the generated content
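For reference, one way to record these four indicators per evaluated response is a small data structure like the following; the field names are my own, not taken from the paper.

```python
# One possible record for a single evaluated response; field names are illustrative.
from dataclasses import dataclass


@dataclass
class ResponseEvaluation:
    query: str
    fluency: int                # 1-5 Likert rating
    perceived_utility: int      # 1-5 Likert rating
    citation_recall: float      # fraction of verification-worthy statements supported by citations
    citation_precision: float   # fraction of citations that support their statements
```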

Indicators 1 and 2 are baseline requirements: if they are not met, the whole RAG system is meaningless (no matter how accurate the citations are, it does not help if the output is not fluent or useful). A good RAG system should then score high on both citation recall and citation precision.

How is the evaluation framework implemented? It involves nothing beyond middle-school math, and the detailed procedure can be found in the original paper. The whole experiment relies on human evaluation.

1) Evaluating fluency and perceived utility. Evaluators rate their agreement with a statement about fluency (e.g. that the response is fluent and cohesive) on a five-point Likert scale, from Strongly Disagree to Strongly Agree. Likewise, for perceived utility they rate their agreement with the statement "The response is a helpful and informative answer to the query".
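A minimal way to turn those ratings into numeric scores, assuming the standard five-point mapping and a simple mean across evaluators (the aggregation choice is an assumption, not something specified here):

```python
# Map five-point Likert labels to scores and average them across evaluators;
# the 1-5 mapping is the standard one, the mean aggregation is an assumption.
LIKERT = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def mean_likert(ratings: list[str]) -> float:
    """Average a list of Likert labels collected from human evaluators."""
    return sum(LIKERT[r] for r in ratings) / len(ratings)
```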

2) Evaluating citation recall. Citation recall = (generated statements supported by citations) / (generated statements worth verifying). Calculating recall therefore requires: 1. identifying the parts of the generated content that are worth verifying, and 2. checking whether each verification-worthy statement is supported by its associated citations.

"Worth verifying" can simply be understood as the parts of the generated content that carry information. In practice, almost all generated content can be treated as worth verifying, so recall is approximately: recall ≈ generated content supported by citations / total generated content.
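As a sketch, if each generated statement is annotated with whether it is verification-worthy and whether it is supported by at least one citation (labels assumed to come from human or model annotation), recall can be computed like this:

```python
# Citation recall = statements supported by citations / statements worth verifying.
# The per-statement labels are assumed to come from human (or model) annotation.
def citation_recall(statements: list[dict]) -> float:
    """Each statement dict carries boolean keys 'verification_worthy' and 'supported'."""
    worthy = [s for s in statements if s["verification_worthy"]]
    if not worthy:
        return 0.0
    return sum(s["supported"] for s in worthy) / len(worthy)
```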

3) Evaluating citation precision. Citation precision is the proportion of generated citations that support their associated statements. If a system cited every web page on the internet for every statement it generated, citation recall would be high but citation precision would be very low, because most of those pages are irrelevant and do not support the generated content.
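Correspondingly, a sketch of citation precision under the same assumption of annotated labels:

```python
# Citation precision = citations that support their associated statement / all citations.
# The per-citation labels are again assumed to come from annotation.
def citation_precision(citations: list[dict]) -> float:
    """Each citation dict carries a boolean key 'supports_statement'."""
    if not citations:
        return 0.0
    return sum(c["supports_statement"] for c in citations) / len(citations)
```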

For example, when queried in Chinese, AI search engines such as Bing Chat often cite content from CSDN, Zhihu, and Baidu Knows. Citation recall is high, with a citation attached to almost every piece of generated content, but citation precision is low: most citations do not support the generated content or are of poor quality.

devv.ai has made many optimizations around citation precision, especially for multilingual scenarios. For questions asked in Chinese, its precision is significantly better than that of Perplexity, Bing Chat, Phind, and other products.

The detailed calculation of citation precision is not elaborated here; refer to the description in the paper.

Once we have citation recall and citation precision, we can combine them into a final metric, Citation F (their harmonic mean): F = 2 * precision * recall / (precision + recall). To achieve a high F, the system must have both high citation precision and high citation recall.
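The combination is the usual harmonic mean; as a one-line sketch:

```python
# Citation F: the harmonic mean of citation precision and citation recall.
def citation_f(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, precision 0.9 with recall 0.5 gives F ≈ 0.64, so neither metric can compensate for the other.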

That is the complete method for evaluating the verifiability of a RAG system. With this evaluation system in place, the evaluation set can be rerun after every RAG optimization to see how the indicators change, giving a macro-level view of whether the whole system is getting better or worse.

Here are some of devv.ai's practices when applying this system: 1) Evaluation set. The evaluation set should match the scenarios your RAG system serves; for example, devv.ai uses evaluation sets focused on programming and has added many multilingual ones.
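For illustration, an evaluation set can be as simple as a list of queries tagged with domain and language; the schema below is hypothetical, not devv.ai's actual format.

```python
# Hypothetical evaluation-set entries; the schema is illustrative,
# not devv.ai's actual format.
EVAL_SET = [
    {"query": "How do I read a file line by line in Rust?", "domain": "programming", "language": "en"},
    {"query": "如何在 Python 中解析 JSON?", "domain": "programming", "language": "zh"},
]
```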

2) Automated evaluation framework. The paper relies on human evaluation (34 evaluators took part). The drawbacks are: 1. it costs manpower and time, and 2. the sample size is small, which introduces some error.

So for industrial scenarios, we are building an automated evaluation framework. The core idea is:

1. Train an evaluation model based on Llama 2 to judge citation recall and citation precision
2. Build a large pool of evaluation sets, and automatically sample new evaluation sets from online data
3. Whenever a core RAG module changes, CI automatically runs the whole evaluation framework and produces data points and a report (a sketch of such a CI entry point follows below)
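A minimal sketch of what that CI entry point might look like, assuming the scoring model is wrapped behind a `judge` function (a hypothetical name, not devv.ai's actual code) that returns per-sample recall and precision:

```python
# Sketch of a CI evaluation run: score every sampled response with an evaluation
# model and aggregate a report. `judge` is a hypothetical wrapper around a
# fine-tuned scoring model (e.g. a Llama-2-based judge), not a real API.
import json
import statistics


def judge(query: str, response: str, citations: list[str]) -> dict:
    """Hypothetical evaluation-model call; replace the placeholder with a real model."""
    return {"recall": 0.0, "precision": 0.0}  # placeholder scores


def run_eval(samples: list[dict]) -> dict:
    results = [judge(s["query"], s["response"], s["citations"]) for s in samples]
    report = {
        "n_samples": len(results),
        "citation_recall": statistics.mean(r["recall"] for r in results),
        "citation_precision": statistics.mean(r["precision"] for r in results),
    }
    print(json.dumps(report, indent=2))  # CI can diff this report across commits
    return report
```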

This makes testing and iteration much more efficient. For example, for a prompt change you can run a quick A/B test, push each test group through the evaluation framework, and compare the results. The framework is still being developed and tested internally, and we may consider open-sourcing the evaluation model and framework code in the future. (It feels like this evaluation framework alone could be a startup of its own.)

Today's thread mainly covered the paper "Evaluating Verifiability in Generative Search Engines" and some of devv.ai's concrete practices in RAG evaluation. For reasons of space, many details were left out, such as how to evaluate fine-grained modules; those will be shared another time.

Finally, a word about devv.ai. devv.ai is a next-generation AI search engine built specifically for developers, aiming to replace the scenarios where developers currently turn to Google / StackOverflow / documentation, helping them work more efficiently and create more value. The product has been live for just over a month and already has tens of thousands of developers using it. 🚀

Any feedback and suggestions can be submitted in this repo. Feel free to share this thread with more people; you can repost it to other platforms, just remember to credit the source.

Summary
The article discusses the evaluation of a Retrieval Augmented Generation (RAG) system, emphasizing the importance of verifiability in generative search engines. It introduces the concept of verifiability and its significance in reducing hallucination in language models. The article outlines the evaluation criteria for RAG systems, including fluency, perceived utility, citation recall, and citation precision. It also presents the evaluation framework used in a Stanford paper, which involves human evaluation for fluency and perceived utility, and mathematical calculations for citation recall and precision. The article highlights the importance of automated evaluation frameworks for industrial applications and shares some practical insights from devv.ai, such as the selection of evaluation sets aligned with the RAG's domain and the development of an automated evaluation framework to address the limitations of human evaluation.