In today's competitive business environment, automation of business processes, especially document processing workflows, has become critical for companies seeking to improve efficiency and reduce manual errors. Traditional methods often struggle to keep up with the volume and complexity of the tasks, while human-led processes are slow, error-prone, and may not always deliver consistent results.
Large Language Models (LLMs) like OpenAI GPT-4 have made significant strides in handling complex tasks involving human-like text generation. However, they often face challenges with domain-specific data. LLMs are usually trained on broad (publicly available) data, and while they can provide general answers, their responses might be inaccurate when it comes to specialized knowledge. They might also generate outputs that appear reasonable but are essentially hallucinations — plausible-sounding but false pieces of information. Moreover, companies often have vast amounts of domain-specific data tucked away in documents but lack the tools to utilize this information effectively.
Here's where Retrieval-Augmented Generation (RAG) steps in. RAG offers an exciting breakthrough, enabling the integration of domain-specific data in real time without the need for constant model retraining or fine-tuning. It stands as a more affordable, secure, and explainable alternative to general-purpose LLMs, drastically reducing the likelihood of hallucination.
In this blog post, we'll explore the application of RAG across various domains and scenarios, emphasizing its advantages. We'll also dive into the architecture of RAG, break it down into building blocks, and provide guidance on how to construct automated document processing workflows.
Join us as we navigate the transformative impact of Retrieval-Augmented Generation and Large Language Models on business process automation.
In the sphere of artificial intelligence, the RAG approach stands out as a powerful technique, combining the precision of information retrieval with the sophistication of text generation. This synergy leads to a suite of unique capabilities enabling RAG-powered applications to offer accurate, contextually relevant, and dynamic responses. Let's explore these functionalities and delve into their practical applications in various business use cases within the supply chain, finance, insurance, and retail domains.
In this subsection, we'll emphasize the transformative effect of RAG on the supply chain landscape and explore several business use cases, as depicted in Figure 1. We will discuss each of these and clarify the challenges that RAG addresses.
RAG's fact verification and compliance validation functionalities make it a valuable asset in the legal and compliance domains. When dealing with legal documents or regulatory requirements, RAG can cross-reference information from trusted sources, contributing to the creation of accurate legal documents. Its fact-checking ability ensures that the information presented aligns with legal standards, minimizing the risk of errors and enhancing overall compliance.
In the B2B sales process, responding to Requests for Proposals (RFPs) or Requests for Information (RFIs) can be time-consuming. Utilizing RAG, companies can auto-populate these forms by retrieving relevant product details, pricing, and past responses. RAG ensures consistency, accuracy, and speed in generating responses, streamlining the sales process, reducing manual efforts, and improving the chances of winning bids by promptly addressing client needs.
For optimal procurement decisions, accurate recommendations are key. Using RAG, organizations can analyze past purchasing patterns, vendor performance, and market trends to automatically generate tailored procurement recommendations. RAG's insights ensure better supplier choices, cost savings, and risk mitigation, guiding businesses toward strategic purchasing and fostering stronger vendor partnerships.
In today's complex supply chains, leveraging multifaceted data is paramount. RAG dives deep into a plethora of internal documents, including real-time inventory logs, past purchase orders, vendor correspondence, and shipment histories. Drawing from these diverse sources, RAG auto-generates intricate supply chain reports. These reports spotlight vital performance metrics, unveil potential bottlenecks, and suggest areas for refinement. Through RAG's automated reporting, businesses gain enriched insights, fostering agile decision-making, enhanced operational efficiencies, and bolstered supply chain robustness.
In this segment, we'll delve into the possible applications of RAG within retail, showcasing various business scenarios, as illustrated in Figure 2. We'll elaborate on each scenario, highlighting the specific challenges that RAG overcomes in retail.
Utilizing RAG, retailers can empower chatbots to fetch specific product details from vast databases in near real-time, improving the responsiveness and accuracy of answers to customer queries. This not only streamlines customer support but also ensures precise product information is relayed, leading to improved customer satisfaction and shopping experiences.
By delving deep into comments, reviews, and ratings, RAG synthesizes a holistic view of customer sentiments. It auto-generates detailed reports highlighting prevalent preferences and discernible pain points. With such insights at their fingertips, retailers can make informed adjustments to products, services, or strategies, ensuring a more attuned and enhanced shopping experience for their clientele.
In retail marketing, understanding campaign performance is extremely important. RAG offers a robust solution by diving deep into past campaign data, intertwining it with customer feedback and observed sales trends. By dissecting this information, RAG crafts detailed insights into which strategies truly resonate and which channels drive maximum engagement. Furthermore, it identifies potential gaps or areas of improvement in past campaigns. Retailers equipped with these insights are better positioned to refine their future marketing endeavors, ensuring they capture their audience's attention and foster lasting customer relationships.
RAG harnesses customer data, including past purchases and browsing history, to understand individual preferences. It dynamically generates tailored product suggestions aligned with user interests. As users interact in real time, recommendations adjust accordingly. The result? Enhanced user engagement, prolonged browsing sessions, and higher conversion rates, all achieved through RAG's data synthesis and near real-time response capabilities.
In this section, we'll describe how RAG can be utilized within the finance domain, detailing various business scenarios, as highlighted in Figure 3. For each scenario, we'll shed light on the specific problems that RAG resolves within the world of finance.
In the dynamic world of finance, clients frequently seek insights on investment tactics, market projections, and intricate financial products. To cater to this, financial institutions can employ RAG to automate the response mechanism, pulling accurate information from comprehensive financial databases. This ensures clients receive tailored advice grounded in up-to-date data and expert analyses. Consequently, the process not only elevates the client experience by offering rapid and precise answers but also optimizes the advisory function, making it more efficient and data-driven.
Insurance claims often involve sifting through extensive documentation and data. RAG can be utilized to quickly retrieve relevant policy details, claim histories, and regulatory guidelines when processing a claim. It can generate preliminary assessments, flag potential fraudulent activities based on historical patterns, or auto-populate forms with relevant details, streamlining the claim approval process and ensuring consistency and accuracy.
The financial world is filled with complex data, ranging from individual transactions to broad economic trends. Navigating this maze and presenting concise insights is extremely important for stakeholder comprehension and informed decision-making. RAG steps in as a transformative tool for this purpose. By accessing and analyzing vast data sets, RAG can distill complex financial narratives into coherent, digestible reports. Stakeholders, armed with these clear summaries, are better positioned to make strategic decisions. Through this automation, financial reporting becomes not only efficient but also consistently accurate and timely.
Portfolio management is a delicate dance of balancing risks and returns, influenced by many factors. RAG emerges as a vital tool in this realm. By delving into past transaction histories, gauging current market dynamics, and understanding individual investor risk appetites, RAG can craft optimized portfolio strategies. These strategies, rooted in comprehensive data analysis, offer tailored recommendations or necessary adjustments for investors. As a result, investment portfolios become more aligned with market opportunities and individual financial goals. With RAG's capabilities, both novice and seasoned investors gain a data-driven edge in wealth maximization.
Now that we’ve explored various industry applications, it’s time to see RAG in action. This case study delves into how RAG can be used for creating an application designed to automate the intricacies of document processing, focusing primarily on composing and filling out responses for an RFP.
Step 1: Uploading RFP and supplementary documents
As depicted in Figure 4, the workflow begins with a user uploading the RFP file and supplementary documents, which may include product details, company information, and so forth. These documents can vary in format, spanning from PDFs and Excel sheets to Word documents, plain text files, etc.
Step 2: Previewing RFP and posing queries
Upon uploading, users can preview the RFP and pose queries related to the content. Questions can range from seeking a summary of the RFP to more nuanced inquiries like understanding the primary concerns of the RFP.
Step 3: Receiving recommendations from RAG
The application, harnessing the power of RAG, offers insightful recommendations to enhance the likelihood of winning the RFP bid.
Step 4: Activating automatic answer generation
After preview and initial queries, users can opt to activate the automatic answer generation feature. This function identifies all the questions and requirements outlined in the RFP and crafts corresponding responses.
Step 5: Answer preview and modification
Once the answers are generated, users are presented with a preview of the question-answer pairs, granting them the autonomy to select and modify any answers they find unsatisfactory.
Step 6: Answer refinement phase
This leads to the answer refinement phase. Here, users can provide a prompt or directive on how they envision the answer, and the integrated RAG with LLM will regenerate the response accordingly. If satisfied with the refinement, users can save their selections and progress to the subsequent stages.
It's worth noting that while the workflow offers the efficiency of full automation, users retain the flexibility to assess and adjust intermediate outcomes, provide feedback via textual prompts, and repeat and revisit any phase as desired.
Step 7: Exporting the finalized RFP
The culminating step in this workflow is exporting the finalized RFP, filled with tailored answers and primed for submission to the RFP issuer.
This case study exemplifies how RAG can revitalize the traditional RFP process, eliminating inefficiencies like manual form filling, inconsistent responses, and inaccuracies, all while dramatically reducing response times. More details about this application can be found here.
You’ve no doubt realized the value and potential of RAG by now, and might be wondering how to get started with it in your organization. One of the key strengths of RAG is its modularity. Its foundational building blocks can be assembled in various configurations, allowing businesses to craft custom solutions suited to their specific needs. Before diving deep into the individual building blocks, let’s understand how these components can be cohesively integrated to form valuable RAG flows and use cases.
In the upcoming sections, we'll delve deeper into the architecture of RAG and its building blocks, providing insights into how each one functions and how they can be optimized for best results.
Understanding RAG architecture is key to fully harnessing its benefits and potential. The process consists of two primary components, the Retriever and the Generator, which together form a seamless flow of information processing. This process is displayed below:
The Retriever's role is to fetch relevant documents from the data store in response to a query. The Retriever can use different techniques for this retrieval, most notably Sparse Retrieval and Dense Retrieval.
Sparse Retrieval has traditionally been used for information retrieval and involves techniques like TF-IDF or Okapi BM25 to create high-dimensional sparse vector representations of documents. However, this approach often requires exact word matches between the query and the document, limiting its ability to handle synonyms and paraphrasing.
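To make the contrast concrete, here is a minimal sparse-retrieval sketch. It assumes the open-source rank_bm25 package and a toy set of documents, both of which are our illustrative choices rather than anything prescribed by the article.

```python
# Minimal BM25 sparse-retrieval sketch (rank_bm25 and the sample texts are illustrative assumptions).
from rank_bm25 import BM25Okapi

documents = [
    "Invoice processing workflow for supplier payments",
    "Purchase order approval and vendor onboarding policy",
    "Shipment tracking and delivery exception handling",
]

# BM25 scores documents by term overlap, so we tokenize with a simple whitespace split.
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "vendor onboarding policy".lower().split()
scores = bm25.get_scores(query_tokens)      # one relevance score per document
best = int(scores.argmax())
print(documents[best], scores)
```

Note that a query phrased as "supplier registration rules" would score poorly here despite meaning much the same thing, which is exactly the exact-match limitation described above.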
On the other hand, Dense Retrieval transforms both the query and the documents into dense lower-dimensional vector representations. These vector representations, often created using transformer models like BERT, RoBERTa, ELECTRA, or other similar models, capture the semantic meaning of the query and the documents, allowing for a more nuanced understanding of language and more accurate retrieval of relevant information.
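A corresponding dense-retrieval sketch, assuming the sentence-transformers library and a small general-purpose embedding model (one common choice for illustration; any BERT-family encoder would serve the same purpose):

```python
# Dense retrieval: embed the query and the documents, then rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Invoice processing workflow for supplier payments",
    "Purchase order approval and vendor onboarding policy",
    "Shipment tracking and delivery exception handling",
]

doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode("How do we register a new supplier?", convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity per document
best = int(scores.argmax())
print(documents[best])
```

Because the comparison happens in embedding space, the paraphrased query can still surface the vendor onboarding document even though it shares almost no words with it.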
Once the Retriever has fetched the relevant documents, the Generator comes into play. The Generator, often a model like GPT, Bard, PaLM 2, Claude, or an open-source LLM from Hugging Face, takes the query and the retrieved documents and generates a comprehensive response.
The foundational RAG process we discussed previously can be refined and orchestrated into a comprehensive automated workflow, culminating in a holistic business solution, as illustrated in Figure 6:
This orchestrated process can combine various RAG operations:
To seamlessly blend these operations into automated solutions, we need an orchestrator with the following components:
In this subsection, we will compile all the steps and building blocks of the RAG process flow, as shown in Figure 7:
In this section, we will delve into the best practices for implementing each building block in the RAG process flow, providing additional elaboration to ensure a more comprehensive understanding.
Text data, the raw material for LLMs, comes in many forms: unstructured plain text files (.txt), rich documents like PDF (.pdf) or Microsoft Word (.doc, .docx), data-focused formats such as Comma-Separated Values (.csv) or JavaScript Object Notation (.json), web content in Hypertext Markup Language (.html, .htm), documentation in Markdown (.md), and even programming code written in diverse languages (.py, .js, .java, .cpp, etc.). Preparing and loading these varied sources for use in an LLM often involves text extraction, parsing, cleaning, formatting, and conversion to plain text.
Among the tools that assist in this process, LangChain stands out. This popular framework is widely recognized in the field of LLM application development, and its ability to handle more than 80 different document types makes it an extremely versatile tool for data loading. LangChain's data loaders are comprehensive: Transform Loaders convert different document formats into a unified format that can be easily processed by LLMs, Public Dataset Loaders provide access to popular and widely used datasets, and Proprietary Dataset or Service Loaders enable integration with private, often company-specific, data sources or APIs.
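As a hedged sketch of this loading step (LangChain's module layout varies across versions, and the file names below are hypothetical), ingesting a mixed set of source documents might look like this:

```python
# Loading heterogeneous documents with LangChain loaders (file names are hypothetical).
from langchain.document_loaders import PyPDFLoader, TextLoader, CSVLoader

loaders = [
    PyPDFLoader("rfp.pdf"),
    TextLoader("company_profile.txt"),
    CSVLoader("product_catalog.csv"),
]

documents = []
for loader in loaders:
    # Each loader returns Document objects carrying page_content plus metadata (source, page, ...).
    documents.extend(loader.load())

print(len(documents), documents[0].metadata)
```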
The process of document splitting begins once the documents are successfully loaded, parsed, and converted into text. The core activity in this stage involves segmenting these texts into manageable chunks, a procedure also known as text splitting or chunking. Given the token limit imposed by many LLMs (like GPT-3's approximate limit of 2048 tokens) and the potential size of documents, text splitting becomes indispensable when handling extensive documents. The chosen method for text splitting primarily depends on the data's unique nature and requirements.
Although dividing these documents into smaller, manageable segments may seem straightforward, the process is filled with intricacies that can significantly impact subsequent steps. A naive approach is to use a fixed chunk size, but doing so can leave part of a sentence in one chunk and the rest in the next, as shown in Figure 8. When it comes to question answering, neither chunk then contains the right information, because the sentence has been split apart.
As such, it is crucial to consider semantics when dividing a document into chunks. Most document segmentation algorithms operate on the principle of chunk size and overlap. Below, in Figure 9, is a simplified diagram that depicts this principle. Chunk size, which can be measured by character count, word count, or token count, refers to each segment's length. Overlaps permit a portion of text to be shared between two adjacent chunks, operating like a sliding window. This strategy facilitates continuity and allows a piece of context to be present at the end of one chunk and at the beginning of the next, ensuring the preservation of semantic context.
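As a minimal sketch of this principle, continuing from the loading sketch above and using LangChain's RecursiveCharacterTextSplitter (the chunk_size and chunk_overlap values are purely illustrative, not recommendations):

```python
# Chunking with a sliding-window overlap; the parameter values are illustrative.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum characters per chunk
    chunk_overlap=50,    # portion of text shared by adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer natural boundaries before hard cuts
)

chunks = splitter.split_documents(documents)  # `documents` from the loading sketch above
print(len(chunks), repr(chunks[0].page_content[:80]))
```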
Fixed-size chunking with overlap is a straightforward approach that is often favored due to its simplicity and computational efficiency. Besides fixed-size chunking, there are more sophisticated 'content-aware' chunking techniques:
When deciding on chunk size, if the common chunking approaches do not fit your use case, a few pointers can guide you toward choosing an optimal chunk size:
In conclusion, there's no one-size-fits-all solution to document splitting, and what works for one use case may not work for another. This section should help provide an intuitive understanding of how to approach document chunking for your specific application.
Following the document splitting process, the text chunks undergo a transformation into vector representations that can be easily compared for semantic similarity. This 'embedding' encodes each chunk in such a way that similar chunks cluster together in vector space.
Vector embeddings constitute an integral part of modern machine learning (ML) models. They involve mapping data from complex, unstructured forms like text or images to points in a mathematical space, often of lower dimensionality. This mathematical space, or vector space, enables efficient calculations, and crucially, the spatial relationships in this space can capture meaningful characteristics of the original data. For instance, in the case of text data, embeddings capture semantic information. Text that conveys similar meaning, even if worded differently, will map to close points in the embedding space.
To illustrate, the sentences "The cat chases the mouse" and "The feline pursues the rodent" might have different surface forms, but their semantic content is quite similar. A well-trained text embedding model would map these sentences to proximate points in the embedding space.
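A small sketch makes this concrete. Using the sentence-transformers library again (our illustrative choice, not one prescribed by the article), the two paraphrased sentences land much closer together in embedding space than an unrelated one:

```python
# Semantically similar sentences map to nearby points; unrelated ones do not.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("The cat chases the mouse", convert_to_tensor=True)
b = model.encode("The feline pursues the rodent", convert_to_tensor=True)
c = model.encode("The stock market closed higher today", convert_to_tensor=True)

print(float(util.cos_sim(a, b)))  # relatively high: same meaning, different wording
print(float(util.cos_sim(a, c)))  # noticeably lower: unrelated topic
```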
The visualization of text embeddings can provide intuitive insight into how this works. In a two- or three-dimensional representation of the embedding space, similar words or sentences cluster together, indicating their semantic proximity. For example, embeddings of 'dog', 'cat', 'pet' might be closer to each other than to the embedding of 'car', as depicted in Figure 12.
Producing these embeddings involves sophisticated ML models. Initially, models like Word2Vec and GloVe made strides by learning word-level embeddings that captured many useful semantic relationships. These models essentially treated words in isolation, learning from their co-occurrence statistics in large text corpora.
The current state-of-the-art has moved towards transformer-based models like BERT, RoBERTa, ELECTRA, T5, GPT, and their variants, which generate context-aware embeddings. Unlike previous models, these transformers take into account the whole sentence context when producing the embedding for a word or sentence. This context-awareness allows for a much richer capture of semantic information and ambiguity resolution.
For instance, the word 'bank' in "I sat on the bank of the river", and "I deposited money in the bank" has different meanings, which would be captured by different embeddings in a transformer-based model. Such transformer-based models are central to the latest advances in NLP, including RAG. In RAG, transformer-based models are utilized to retrieve relevant information from a large corpus of documents (the 'retrieval' part) and use it to generate detailed responses (the 'generation' part). The high-quality embeddings produced by transformer models are essential to this process, both for retrieving semantically relevant documents and for generating coherent and context-appropriate responses.
After documents are segmented into semantically meaningful chunks and subsequently converted into vector space, the resulting embeddings are stored in a vector store. Vector stores are specialized search databases designed to enable vector searches and to handle storage and certain facets of vector management. Essentially, a vector store is a database that allows straightforward lookups for similar vectors. Efficient execution of a RAG model requires an effective vector store or index to house transformed document chunks and their associated IDs. The choice of a vector store depends on numerous variables, such as data scale and computational resources. Some noteworthy vector stores are:
By selecting the correct text embedding technique and vector store, it's possible to establish an efficient and effective system for indexing document chunks. Such a system enables the quick retrieval of the most relevant chunks for any query, which is a vital step in RAG.
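As a hedged sketch of this indexing step, reusing the chunks from the splitting sketch (the Chroma plus OpenAI embeddings pairing and the persist_directory path are illustrative choices, not requirements):

```python
# Index the chunks from the splitting step in a local vector store.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()          # requires OPENAI_API_KEY in the environment
vectorstore = Chroma.from_documents(
    documents=chunks,                    # chunks produced earlier
    embedding=embeddings,
    persist_directory="./rag_index",     # hypothetical local path
)

# The index can now answer "which chunks are closest to this query?" lookups.
results = vectorstore.similarity_search("payment terms for late deliveries", k=3)
```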
In the next section, we explore the process of managing incoming queries and retrieving the most relevant chunks from the index.
The retrieval process is an integral part of any information retrieval system, such as one used for document searching or question answering. The retrieval process starts when a query is received and transformed into a vector representation using the same embedding model used for document indexing. This results in a semantically meaningful representation of the user's question, which can subsequently be compared with the chunk vectors of documents stored in the index, also known as the vector store.
The primary objective of retrieval is to return relevant chunks of documents that correspond to the received query. The specific definition of relevance depends on the type of retriever being used. The retriever does not need to store documents; its sole purpose is to retrieve the IDs of relevant document chunks, thereby narrowing down the search space to chunks likely to contain relevant information.
Different types of search mechanisms can be employed by a retriever. For instance, a ‘similarity search’ identifies documents similar to the query based on cosine similarity. Another search type, the maximum marginal relevance (MMR), is useful if the vector store supports it. This search method ensures the retrieval of documents that are not only relevant to the query but also diverse, thereby eliminating redundancy and enhancing diversity in the retrieved results. In contrast, the ‘similarity search’ mechanism only takes semantic similarity into account.
RAG also utilizes a similarity score threshold retrieval method. This method sets a similarity score threshold and only returns documents with a score exceeding that threshold. During the search for similar documents, it is common to specify the top 'k' documents to retrieve using the 'k' parameter.
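Putting these retrieval modes together, a hedged sketch against the vector store built earlier might look as follows (the queries, k values, and score threshold are illustrative):

```python
# Plain similarity search: top-k chunks ranked by cosine similarity.
docs = vectorstore.similarity_search("late delivery penalties", k=4)

# Maximum marginal relevance: relevant *and* diverse chunks, if the store supports it.
docs = vectorstore.max_marginal_relevance_search("late delivery penalties", k=4, fetch_k=20)

# Score-threshold retrieval: only return chunks above a minimum similarity score.
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.8, "k": 4},
)
docs = retriever.get_relevant_documents("late delivery penalties")
```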
There's another type of retrieval known as self-query or LLM-aided retriever. This type of retrieval becomes particularly beneficial when dealing with questions that are not solely about the content that we want to look up semantically but also include some mention of metadata for filtering. The LLM can effectively split the query into search and filter terms. Most vector stores can facilitate a metadata filter to help filter records based on specific metadata. In essence, LLM-aided retrieval combines the power of pre-trained language models with conventional retrieval methods, enhancing the accuracy and relevance of document retrieval.
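A hedged sketch of an LLM-aided (self-query) retriever in LangChain follows; the metadata fields ("source", "year"), the content description, and the model choice are hypothetical examples rather than anything specified in the article:

```python
# Self-query retrieval: the LLM turns the question into a semantic query plus a metadata filter.
from langchain.chat_models import ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(name="source", description="Document the chunk comes from", type="string"),
    AttributeInfo(name="year", description="Year the document was issued", type="integer"),
]

llm = ChatOpenAI(temperature=0)
self_query_retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    "Procurement and supply chain documents",   # description of the indexed content
    metadata_field_info,
)

# "issued in 2022" becomes a metadata filter; the rest is matched semantically.
docs = self_query_retriever.get_relevant_documents("vendor contracts issued in 2022")
```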
One more significant retrieval method incorporates compression, which aims to reduce the size of indexed documents or embeddings, thereby improving storage efficiency and retrieval speed. This process involves a compression LLM examining each retrieved document and extracting only the content most relevant to the query before it is passed to the final LLM. Though this technique requires more LLM calls, it helps focus the final answer on the most crucial aspects, a trade-off worth considering. Compression is particularly valuable when dealing with large document collections. The choice of compression method depends on a variety of factors, including the specific retrieval system, the size of the document collection, available storage resources, and the preferred balance between storage efficiency and retrieval speed.
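One way to realize this pattern is LangChain's contextual compression retriever, sketched below under the same assumptions as the earlier snippets:

```python
# Compression retrieval: an LLM extracts only the query-relevant parts of each chunk.
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(),
)

# Each returned document now contains only the passages relevant to the query.
compressed_docs = compression_retriever.get_relevant_documents("warranty terms for returned goods")
```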
It is also worth noting that other retrieval methods exist that don't involve a vector database, instead using more traditional NLP techniques, such as Support Vector Machines (SVM) and Term Frequency-Inverse Document Frequency (TF-IDF). However, these methods are not commonly used for RAG.
In the final phase, the document chunks identified as relevant are used alongside the user query to generate a context and prompt for the LLM. This prompt (Figure 15), which is essentially a carefully constructed question or statement, guides the LLM in generating a response that is both relevant and insightful.
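As a hedged illustration of what such a prompt can look like (the template wording is ours, not a fixed standard, and it reuses the chunks returned by the retriever sketches above):

```python
# Assemble a prompt from the retrieved chunks and the user's question.
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If the answer is not in the context, say you don't know; do not make one up.

Context:
{context}

Question: {question}
Helpful answer:"""

rag_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

context = "\n\n".join(doc.page_content for doc in docs)   # chunks returned by the retriever
prompt_text = rag_prompt.format(
    context=context,
    question="What are the warranty terms for returned goods?",
)
```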
By default, we funnel all the chunks into the same context window within a single LLM call. In LangChain, this approach is known as the “Stuff” method, and it is the simplest form of question-answering: a prompt is processed, and an answer is returned immediately based on the LLM's understanding. The “Stuff” method does not involve any intermediate steps or complex algorithms, making it ideal for straightforward questions that demand direct answers. However, a limitation emerges when dealing with an extensive volume of documents, as it may become impractical to accommodate all of them within the context window, potentially resulting in a lack of depth when confronted with complex queries. Nevertheless, there are a few methods to get around the issue of short context windows, such as “Map-reduce”, “Refine”, and “Map-rerank”.
The Map-reduce method, inspired by the widely embraced parallel processing paradigm, works by initially sending each document separately to the language model to obtain individual answers. These individual responses are then combined into a final response through a final call to the LLM. Although this method entails more interactions with the language model, it has the distinct advantage of processing an arbitrary number of document chunks. It proves particularly effective for complex queries, as it enables simultaneous processing of different aspects of a question, thus generating more comprehensive responses. However, the method is not without its drawbacks. It tends to be slower and, in certain cases, may yield suboptimal results: because each intermediate response is based on a single document chunk in isolation, an individual chunk may not contain enough information to yield a clear answer. If relevant information is dispersed across two or more chunks, the necessary context might be lacking, leading to potential inconsistencies in the final answer.
The Refine method follows an iterative approach. It refines the answer by iteratively updating the prompt with relevant information. It is particularly useful in dynamic and evolving contexts, where the first answer may not be the best or most accurate.
The Map-rerank method is a sophisticated method that ranks the retrieved documents based on their relevance to the query. This method is ideal for scenarios where multiple plausible answers exist, and there is a need to prioritize them based on their relevance or quality.
Each mentioned method has its own advantages and can be chosen based on the desired level of abstraction for question-answering. In summary, the different chain types of question-answering provide flexibility and customization options for retrieving and distilling answers from documents. They can be used to improve the accuracy and relevance of the answers provided by the language model.
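In LangChain these strategies map onto the chain_type argument of a retrieval QA chain; a hedged sketch, reusing the vector store from the earlier snippets (the model choice and query are illustrative), is shown below:

```python
# The chain_type string selects the question-answering strategy discussed above.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",               # or "map_reduce", "refine", "map_rerank"
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,     # keep source chunks for explainability
)

result = qa_chain({"query": "What are the key requirements listed in the RFP?"})
print(result["result"])
```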
The successful interplay of all these steps in the RAG flow can lead to a highly effective system for automating document processing and generating insightful responses to a wide variety of queries.
The adoption of business process automation has notably increased, largely attributed to its ability to boost efficiency, minimize mistakes, and free up human resources for more strategic roles. However, the effective incorporation and utilization of domain-specific data still pose considerable challenges. RAG and LLMs provide an efficient solution to these challenges, offering several key benefits:
Unlike traditional models that require constant retraining and fine-tuning to incorporate new data, RAG facilitates real-time data integration. As soon as a new document enters the system, it becomes available as a part of the knowledge base, ready to be utilized for future queries.
The retraining of large models with new data can be computationally expensive and time-consuming. RAG circumvents this issue by indexing and retrieving relevant information from the document store, significantly reducing both computational costs and time.
In conventional LLMs, sensitive data often has to be included in the training phase to generate accurate responses. In contrast, RAG keeps sensitive data in the document store, never exposed directly to the model, enhancing the security of the data. Additionally, access restrictions to documents can be applied in real-time, ensuring restricted documents aren't available to everyone, something that fine-tuning approaches lack.
One of the key advantages of RAG is its explainability. Each generated response can be traced back to the source documents from which the information was retrieved, providing transparency and accountability — critical factors in a business context.
Hallucination, or the generation of plausible but false information, is a common issue with general-purpose and fine-tuned LLMs that have no clear distinction between “general” and “specific” knowledge. RAG significantly reduces the likelihood of hallucination as it relies on the actual documents in the system to generate responses.
LLMs usually have a context size limitation, with most allowing around 4,000 tokens per request, which makes it challenging to supply large amounts of data directly. RAG circumvents this issue: retrieving only the most similar documents ensures that just the relevant ones are sent, allowing the model to draw on a virtually unlimited knowledge base.
Ready to harness the power of RAG and LLMs in your organization? Get in touch with us to start the discovery and POC phases now.