RAG (Retrieval-Augmented Generation)

August 19, 2024
Tech

RAG (Retrieval-Augmented Generation): A Technology to Enhance the Accuracy and Reliability of AI Responses

In a previous article, we explored various Prompt Engineering methods for obtaining desired responses from Large Language Models (LLMs). This time, I would like to introduce RAG (Retrieval-Augmented Generation), a natural language processing technique that complements LLMs by addressing limitations that are hard to overcome through prompts alone. RAG is one of the most promising technologies today, helping LLMs provide more accurate and useful answers. In this article, we will dive into the [Concept and Necessity] of RAG, its [Working Process], and the [Impact of RAG].

[Concept and Necessity] Why is RAG Needed?

LLMs possess powerful language comprehension and generation capabilities by pre-training on vast amounts of text data, but they also have some limitations. Notably, LLMs may struggle to provide up-to-date information or specific domain knowledge that was not included in their training data, which can lead to outdated or unreliable answers.

  • [Common Issues Due to LLM Limitations]
    • Occurrence of Hallucinations, where incorrect or fabricated information is provided.
    • Difficulty in accessing the latest information due to reliance on pre-trained data.
    • Lack of specialized knowledge in specific fields due to limited data.

Typically, LLMs provide answers based on the information they have learned.

  • For instance, imagine a customer support chatbot for an electronics company developed using an LLM. If a customer asks, "What are the updates in the latest firmware for the TV I recently purchased?", the LLM is likely unaware of the company’s latest firmware release and may provide an irrelevant answer. This problem arises because the model only relies on pre-trained data and does not consider the company’s product or internal information.

To solve this issue, various methods have been proposed, one of which is fine-tuning. Fine-tuning involves optimizing the LLM by training it further with specific domain data. This allows the LLM to acquire specialized knowledge and provide more accurate and expert answers. However, fine-tuning is time-consuming, costly, and may reduce the model’s generalization ability.

From left to right: Prompt Engineering, RAG, Fine-tuning, Pre-train – increasing complexity and cost.
source: [Data-centric MLOps and LLMOps] Databricks

To address these inefficiencies of fine-tuning, RAG (Retrieval-Augmented Generation) has garnered attention. RAG is a technique in which the AI retrieves necessary information from external, reliable knowledge bases (e.g., web documents, internal company documents) before generating a response. This allows the latest information and specific domain knowledge to be incorporated in real time without retraining the model, retaining the model's generalization and adaptability while producing accurate and reliable answers. RAG is therefore considered the most efficient approach to overcoming the limitations of LLMs while leveraging their strengths. (Parameter-efficient fine-tuning (PEFT) techniques such as LoRA can reduce training costs, but they still lag behind RAG in terms of cost-effectiveness.)

[Working Process] How Does RAG Operate?

The operation of RAG consists of three main stages:

  1. (Stage 1) Embedding and Vector DB Construction: External data that the LLM has not learned is collected from various sources. The collected data is embedded (converted into numerical representations) using an embedding model and stored in a vector database (Vector DB). This vector DB is utilized in the next stage by the Retriever to find information (documents) relevant to the user’s query.
    • Before embedding the data, it is often divided into small chunks, a process known as chunking, which significantly impacts the quality of RAG. If the chunks are too large, noise may be introduced, or processing time and costs may increase. Conversely, if the chunks are too small, contextual information may be lost, degrading RAG’s search quality.
source: Context Aware Chunking for Enhanced Retrieval Augmented Generation (Kevin Tam)

  2. (Stage 2) Information Retrieval: First, the user’s query is vectorized. Using various retrieval techniques, the system extracts the most relevant information from the external knowledge stored in the vector DB. The retrieved information is then provided to the LLM along with the user’s query.
    • The quality of document retrieval strongly influences the quality of the LLM’s response. Therefore, various retrieval algorithms (e.g., MMR, BM25, Multi-Query) and methods (e.g., keyword search, semantic search, hybrid search) have been developed and are widely used; a toy comparison of these scoring approaches appears after the library analogy below.
  3. (Stage 3) Augmentation and Response Generation: The retrieved information is used to augment the user’s query, allowing the LLM to generate a more accurate and contextually appropriate response. The LLM produces the final answer based on the user’s query and the retrieved information, providing a more reliable answer with accurate sourcing, as sketched in the code below.
The process involves RETRIEVING relevant external knowledge and AUGMENTING the query to GENERATE an answer.
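
To make the three stages concrete, below is a minimal, self-contained sketch in Python. It is an illustration rather than a production implementation: embed() is a toy hash-based stand-in for a real embedding model, call_llm() is a stub for an actual LLM API call, the in-memory VectorDB replaces a real vector database such as FAISS or Chroma, and the firmware document is made-up sample data.

```python
import math
from dataclasses import dataclass


# --- Stage 1: chunking, embedding, and a toy in-memory "vector DB" ---

def chunk(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split a document into overlapping word-level chunks (a deliberately simple strategy)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: an L2-normalized bag-of-words hash vector.
    A real pipeline would call an actual embedding model here."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class VectorDB:
    """Minimal in-memory stand-in for a real vector database (e.g., FAISS, Chroma, pgvector)."""
    chunks: list[str]
    vectors: list[list[float]]

    @classmethod
    def build(cls, documents: list[str]) -> "VectorDB":
        chunks = [c for doc in documents for c in chunk(doc)]
        return cls(chunks=chunks, vectors=[embed(c) for c in chunks])

    # --- Stage 2: retrieve the chunks most similar to the query ---
    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        q = embed(query)
        # Vectors are normalized, so the dot product equals cosine similarity.
        scores = [sum(a * b for a, b in zip(v, q)) for v in self.vectors]
        best = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        return [self.chunks[i] for i in best]


# --- Stage 3: augment the query with retrieved context, then generate ---

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; in practice this would be a chat-completion API request."""
    return f"[LLM answer grounded in a {len(prompt)}-character augmented prompt]"


def answer(query: str, db: VectorDB) -> str:
    context = "\n".join(db.retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)


if __name__ == "__main__":
    # Made-up internal document the base LLM has never seen.
    docs = ["Firmware 2.1 for the X-Series TV adds a new picture mode and fixes HDMI connection bugs."]
    db = VectorDB.build(docs)
    print(answer("What are the updates in the latest firmware for my TV?", db))
```

The chunk_size and overlap parameters reflect the chunking trade-off described in Stage 1: larger chunks add noise and cost, while smaller chunks lose contextual information.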

While the RAG process may seem complex, a simple analogy is doing an assignment with the help of a librarian: the librarian finds and recommends the right books in the library, and the student uses them to complete the report.

1. The library organizes and classifies various books systematically. (Vector DB Construction)
2. The librarian finds and recommends the books most relevant to the user’s query. (Information Retrieval)
3. The user combines the recommended books with their knowledge to complete the assignment. (Response Generation)
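
Stage 2 above mentioned keyword, semantic, and hybrid search. The toy functions below sketch the difference, under the same assumptions as the pipeline sketch above: a real system would typically use BM25 (or a similar ranking function) for the lexical score and a trained embedding model for the semantic score, but the idea of blending the two is the same.

```python
import math


def keyword_score(query: str, doc: str) -> float:
    """Lexical relevance: fraction of query terms that appear verbatim in the document.
    Production systems typically use BM25, which also weights terms by rarity and length."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / (len(q_terms) or 1)


def semantic_score(query_vec: list[float], doc_vec: list[float]) -> float:
    """Semantic relevance: cosine similarity between embedding vectors
    (e.g., vectors produced by the toy embed() in the pipeline sketch above)."""
    dot = sum(a * b for a, b in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(a * a for a in query_vec)) * math.sqrt(sum(b * b for b in doc_vec))
    return dot / norm if norm else 0.0


def hybrid_score(query: str, doc: str,
                 query_vec: list[float], doc_vec: list[float],
                 alpha: float = 0.5) -> float:
    """Hybrid search: a weighted blend of lexical and semantic relevance; alpha controls the mix."""
    return alpha * keyword_score(query, doc) + (1 - alpha) * semantic_score(query_vec, doc_vec)
```

Tuning alpha toward 1.0 favors exact keyword matches (useful for product codes or error messages), while tuning it toward 0.0 favors semantic similarity (useful for paraphrased questions).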

[Impact of RAG] What Are the Benefits of Using RAG?

1. Efficient LLM Optimization

RAG provides the most cost-effective way to add new information to an LLM. Previously, to teach an LLM specific domain knowledge, retraining was necessary. However, with RAG, this process becomes unnecessary. RAG systems allow LLMs to search external data in real-time to generate responses, significantly saving time and cost.

There are various methods to optimize LLMs; RAG falls into the category requiring little additional training but a high demand for external knowledge.
source: Retrieval-Augmented Generation for Large Language Models: A Survey (27 Mar 2024)

2. Ability to Provide Up-to-Date Information

The data an LLM has been trained on may become outdated over time, but RAG allows continuous reflection of the latest research, statistics, news, etc. For example, by connecting to sources that are frequently updated, such as social media feeds or news sites, RAG can always provide users with the most current information.

  • (Example) AI search engine services like Perplexity AI use RAG technology to search the web in real time in response to user queries. By generating responses based on the latest information, Perplexity provides highly timely answers. The image below shows an example where Perplexity explained a situation in which a sharp decline in the KOSPI (Korean stock market) index triggered a circuit breaker (left side), referencing the latest news articles and YouTube content (right side).

source: Perplexity search results

3. Increased User Trust

Through RAG, LLMs can provide accurate answers by sourcing information from reliable sources. The generated output can include citations or references to the sources, allowing users to directly access the original documents if they want further details. This plays a crucial role in enhancing user trust in AI.

  • (Example) Recently, many AI services have focused on providing trustworthy responses by incorporating source citation (a form of fact-checking) based on RAG. The image below is an example from Liner, a service I use frequently. When I asked about the term ‘Warranty Support’, Liner's agent searched various websites to generate a response (left side) and displayed all the sources it used (right side), as shown in the image.

source: Liner search results
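
One simple way to obtain such citations is to label every retrieved chunk with its source before the prompt is assembled, so the model can refer to those labels in its answer. The hypothetical helper below assumes the retriever returns each chunk as a dict with text and source fields; the field names and the instruction wording are illustrative, not a fixed API.

```python
def build_cited_prompt(question: str, retrieved: list[dict]) -> str:
    """Label each retrieved chunk with a numbered source so the model can cite it.
    `retrieved` is assumed to look like [{"text": ..., "source": ...}, ...]."""
    context = "\n".join(
        f"[{i}] {item['text']} (source: {item['source']})"
        for i, item in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using only the numbered sources below, "
        "and cite them as [1], [2], ... after each claim.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```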

4. Improved Control of LLMs for Developers

RAG allows developers to more effectively manage and improve LLMs. Developers can directly control and update the information sources the LLM references. Additionally, if an incorrect information source is referenced for a specific query, it can be easily corrected. This makes developing LLM-based applications much more manageable.
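
As a rough illustration of that control, the hypothetical helper below updates the toy VectorDB from the earlier sketch: a developer can drop chunks that came from a stale or incorrect source and index a corrected document in their place, without touching the model's weights.

```python
def replace_document(db: "VectorDB", stale_marker: str, corrected_text: str) -> None:
    """Swap a stale or incorrect source out of the toy index (VectorDB, chunk(), and
    embed() refer to the earlier sketch). Chunks containing `stale_marker` are dropped,
    and the corrected document is re-chunked and re-embedded; the LLM is never retrained."""
    kept = [(c, v) for c, v in zip(db.chunks, db.vectors) if stale_marker not in c]
    new_chunks = chunk(corrected_text)
    db.chunks = [c for c, _ in kept] + new_chunks
    db.vectors = [v for _, v in kept] + [embed(c) for c in new_chunks]
```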

Conclusion

RAG is currently one of the most actively researched technologies for maximizing the performance of LLMs. Ongoing research aims at higher-quality responses, and methods to improve the efficiency and accuracy of RAG (e.g., advanced embedding techniques, new retrieval algorithms) will continue to evolve.

RAG enhances AI’s information processing capabilities, opening up possibilities for AI to be utilized in more domains. This is expected to enable AI to provide more specialized and reliable information and value across various industries.

In future content, we will explore methods for evaluating RAG’s performance and the various retrieval technologies utilized in RAG. Discussions will also continue on how RAG’s advancements might impact the future of AI and what innovations it could enable. Please look forward to the next article.

References

  1. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang (2024). Retrieval-Augmented Generation for Large Language Models: A Survey.
  2. (Part 1) What is RAG? CLOVA Studio.
  3. RAG, the Latest Hot Topic Among Developers, #1.
  4. Understanding RAG in 10 Minutes.
  5. What is RAG (Retrieval-Augmented Generation)? A Technology to Complement LLMs' Shortcomings.
