Building Efficient RAG Systems: A Deep Dive into devv.ai | Jiayuan (Forrest)

Content

How devv.ai builds an efficient RAG system 🔎 As promised, this series of threads shares the technology underlying devv.ai; this is the first installment. We have also opened a GitHub repo specifically for feedback and suggestions, and feedback is very welcome. 🧵 github.com/devv-ai/devv

RAG stands for Retrieval-Augmented Generation. It originated in a 2020 paper from Facebook: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (yes, you read that right, this technique dates back to 2020).

This paper addresses a very simple problem: how to let language models draw on external knowledge during generation. Normally, a pre-trained model's knowledge is stored in its parameters, so the model knows nothing beyond its training set (e.g., search data, industry-specific knowledge). The earlier approach was to re-fine-tune the pre-trained model every time new knowledge appeared.

This approach has two problems: 1. fine-tuning is required every time new knowledge is added, and 2. training the model is very expensive. The paper therefore proposes the RAG method: since a pre-trained model can already interpret new knowledge, we can simply supply whatever knowledge we want it to use through the prompt.

So the smallest RAG system consists of 3 parts: 1. a language model 2. the collection of external knowledge the model needs (stored as vectors) 3. a way to recall the external knowledge relevant to the current scenario.
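To make these three parts concrete, here is a minimal sketch in Python. It is only an illustration, not devv.ai's implementation; `embed` and `llm_complete` are placeholders for whichever embedding model and language model you choose.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: plug in your embedding model here (OpenAI, SBERT, ...)."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Placeholder: plug in your language model here (GPT-4, Llama 2, ...)."""
    raise NotImplementedError

def answer(query: str, knowledge: list[str], top_k: int = 3) -> str:
    # Part 2: the external knowledge collection, stored as vectors.
    vectors = np.stack([embed(chunk) for chunk in knowledge])
    # Part 3: recall the knowledge relevant to the current query
    # by cosine similarity.
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(knowledge[i] for i in np.argsort(sims)[::-1][:top_k])
    # Part 1: the language model generates an answer grounded in the
    # retrieved context, supplied through the prompt.
    return llm_complete(f"Context:\n{context}\n\nQuestion: {query}")
```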

LangChain and llama-index are essentially frameworks for building this kind of RAG system (including the agents layered on top of RAG). Once you understand the essence, there is no real need for an extra layer of abstraction; you can build the system around your own business requirements. For example, to keep performance high, we use a Go + Rust architecture that supports high-concurrency RAG requests.

To simplify the problem: whatever kind of RAG you are building, optimizing the system comes down to optimizing each of these 3 modules separately.

1) The language model. Why did a technique from a 2020 paper only take off this year? One main reason is that earlier base models were not capable enough: if the underlying model is not smart, even abundant external knowledge will not help, because the model cannot reason over it. The paper's own benchmarks do show an improvement, but not a particularly significant one.

1.1) The arrival of GPT-3 made RAG genuinely usable for the first time. The first wave of companies built on RAG + GPT-3 achieved very high valuations & ARR (Annual Recurring Revenue): Copy AI and Jasper. Both built RAG products for the marketing domain and briefly became star AI unicorns, though their valuations have fallen considerably now that the initial shine has worn off.

1.2) Since 2023, a large number of open-source and closed-source base models have emerged, and most of them can be used to build RAG systems. The most common setups are: - GPT-3.5/4 + RAG (closed-source solution) - Llama 2 / Mistral + RAG (open-source solution)

2) The external knowledge collection the model needs. By now everyone should understand embedding models and embedding-based recall. Essentially, an embedding turns data into vectors, and cosine similarity is then used to find the best-matching vectors: knowledge -> chunks -> vectors; user query -> vector.
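As an illustration of the knowledge -> chunks -> vectors step, here is a naive fixed-size chunker; `embed` is the same placeholder as in the earlier sketch, and real systems use smarter splitting, as point 3 below explains.

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping fixed-size chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def build_vectors(documents: list[str]):
    """knowledge -> chunks -> vectors: embed every chunk of every document."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]
```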

2.1) This module has two parts: 1. the embedding model 2. the database that stores the embedding vectors. For the former, most people use OpenAI's embedding model; for the latter there are many options, including Pinecone, Zilliz (from a Chinese team), the open-source Chroma, and pgvector built on top of a relational database.
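A sketch of module 2 using two common off-the-shelf choices, OpenAI's embedding API and the open-source Chroma as the vector store. The calls follow the public docs of both libraries as I understand them, and this is not what devv.ai runs in production.

```python
import chromadb
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
store = chromadb.Client().create_collection("docs")

def embed_texts(texts: list[str]) -> list[list[float]]:
    resp = oai.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [d.embedding for d in resp.data]

chunks = ["Rust's borrow checker enforces ownership at compile time.",
          "Goroutines are lightweight threads managed by the Go runtime."]
store.add(ids=[str(i) for i in range(len(chunks))],
          documents=chunks,
          embeddings=embed_texts(chunks))

# Recall: embed the user's query and fetch the nearest stored chunk.
hits = store.query(query_embeddings=embed_texts(["how does rust ownership work"]),
                   n_results=1)
print(hits["documents"])
```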

2.2) The companies building embedding databases have also attracted very high funding and valuations in this wave of AI hype. But thinking from first principles, the purpose of module 2 is simply to store a collection of external knowledge and recall it when needed. This step does not necessarily require an embedding model; traditional search matching (e.g., Elasticsearch) can be more effective in some scenarios.
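For comparison, here is what the traditional-search route can look like with the official Elasticsearch Python client; the index and field names are made up for illustration. Exact identifiers and error strings often match better under BM25 keyword scoring than under embedding similarity.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keyword (BM25) search over a hypothetical "dev_docs" index.
resp = es.search(index="dev_docs",
                 query={"match": {"body": "tokio::select! usage"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```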

2.3) devv.ai uses a combination of embeddings + a traditional relational DB + Elasticsearch, with heavy optimization for each scenario. One guiding idea: the more work you do when encoding the knowledge, the faster and more accurate retrieval becomes (do the work up front rather than at query time).
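The thread does not say how devv.ai merges vector hits with Elasticsearch hits, but one common technique for fusing two ranked result lists is reciprocal rank fusion (RRF), sketched here as an assumption rather than their actual method.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked high in any list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]    # from the embedding index
keyword_hits = ["doc1", "doc9", "doc3"]   # from Elasticsearch
print(rrf([vector_hits, keyword_hits]))   # doc1 and doc3 rise to the top
```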

2.4) We use Rust to build the complete knowledge index, covering: - GitHub code data - development documentation data - search engine data

3) Better recall of the external knowledge needed in the current scenario. Following the principle of doing the heavy lifting up front, we do a lot of processing on the raw knowledge data at encoding time: - program analysis of code - logical chunking of development documents - extraction of web page information & page-ranking optimization
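As one example of logical chunking, a development document written in markdown can be split at its headings so that each chunk is a self-contained section rather than an arbitrary character window. This is a simplified sketch, not devv.ai's actual chunker.

```python
import re

def chunk_by_heading(markdown: str) -> list[str]:
    """Split a markdown document at its headings, one chunk per section."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A new heading (e.g. "## Setup") closes the previous section.
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```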

3.1) With the above work done, the data we retrieve is already structured, so little post-processing is needed, and recall accuracy improves.

Perplexity, a search engine built on this kind of RAG system and founded in 2022, already receives tens of millions of visits every month, and LangChain has also been valued at several billion US dollars.

Whether you are building a general-purpose RAG or a domain-specific one, this is a field where it is easy to reach a mediocre result but very hard to score 90. There are no universal best practices for any step: embedding chunk size, whether to integrate a search engine, and so on all have to be determined experimentally against your actual business scenario. There are many related papers, but not every method they propose is useful.
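One simple way to run such experiments is to sweep the parameter and measure retrieval quality on your own queries. In this sketch, `build_index` and `eval_set` are hypothetical placeholders: the former wraps your chunker, embedding model, and vector store, and the latter pairs real user queries with the document that answers each one.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(1 for query, gold in eval_set if gold in retrieve(query, k))
    return hits / len(eval_set)

# Sweep candidate chunk sizes and keep the one that recalls best on your
# own data; build_index and eval_set are hypothetical placeholders.
for size in (256, 512, 1024):
    index = build_index(chunk_size=size)
    print(f"chunk_size={size}: recall@5={recall_at_k(eval_set, index.retrieve):.2f}")
```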

Today's thread is just a high-level, popular-science overview of some of the technology underlying devv.ai, without going too deep into the details. The goal is to encourage developers who want to enter this field to think from first principles and to demystify the technology.

Going forward, I will post one LLM-related technical-sharing thread every week. I used devv.ai heavily while writing this thread, and it helped a lot.

If you want to see more content like this, feel free to share this thread with more people. If anything above is inaccurate, please point it out in the comments. Detailed write-ups will be collected at github.com/devv-ai/devv; starring the repo is welcome (and more importantly, submitting feedback for devv.ai!)

Summary
RAG, or Retrieval Augmented Generation, is a system that allows language models to use external knowledge for generation. The system consists of three main components: the language model, the external knowledge set stored in vector form, and the specific external knowledge needed for the current scenario. The article discusses the development of RAG systems, highlighting the importance of optimizing each of these components. It also mentions the emergence of advanced language models like GPT-3, which has made RAG more feasible. The article emphasizes the significance of embedding models and databases for storing external knowledge and the need for efficient retrieval of relevant external knowledge for a given scenario. The article concludes by highlighting the complexity of achieving high performance in building RAG systems and encourages developers to think from first principles. It also mentions that detailed technical articles related to LLM will be shared in the future.