Introduction

In a world overflowing with information, finding the right answers quickly is crucial. Imagine having a virtual assistant that not only understands your questions but also provides responses with a touch of intelligence. That’s where the magic of Langchain and OpenAI comes in!

In this blog, we’ll embark on a journey to create a RAG (Retrieval-Augmented Generation) Question and Answer system. Don’t worry if the terms sound complex; we’re here to break it down into simple steps. LangChain, a framework for building applications on top of language models, teams up with OpenAI’s advanced models to make your Q&A dreams a reality.

Whether you’re a coding enthusiast or a curious mind eager to explore the world of AI, this guide will help you understand the basics and take your first steps toward crafting your very own intelligent Q&A system. Let’s dive in and turn your questions into conversations with the help of Langchain and OpenAI!

We’ll be using ChatGPT (specifically, the gpt-3.5-turbo model) as our Large Language Model (LLM) to add a conversational touch to our Q&A system. Therefore, our first step is to obtain an OpenAI API key by following the steps below:

  • Go to the API key section by clicking on this link: https://platform.openai.com/api-keys
  • Next, click on “Create new secret key,” and a pop-up window will appear. Give a name to your key and click the “Create secret key” button.

This action will generate a new secret key in your account, which we can later use in the code.
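
If you’d rather not paste the key directly into your code (as we do in Step 2), one optional alternative is to read it interactively with Python’s built-in getpass module, so the key never appears in your notebook. This is just a convenience sketch; the rest of the tutorial works either way.

# Optional: read the API key interactively instead of hard-coding it
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass("Paste your OpenAI API key: ")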

Steps

Step 1:

The first step is to install all the required libraries for our project, as described in the following command:

!pip install langchain openai chromadb pypdf tiktoken

Step 2:

Next, we initialize the embeddings and the Large Language Model (LLM). Use the following code snippet to set up the embeddings and load the ChatGPT model:

# Load the required libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain


import os

# Set the OpenAI key as an environment variable
os.environ['OPENAI_API_KEY'] = 'Your_openai_key'

# Load the embedding model and the LLM
embeddings_model = OpenAIEmbeddings()
llm = ChatOpenAI(model_name="gpt-3.5-turbo", max_tokens=200)

Make sure to replace “Your_openai_key” with the OpenAI API key you generated earlier.

Step 3:

Here, we are going to perform Q&A over a PDF file. You just need to provide either a web link where your PDF is hosted or the local path on your computer where the PDF is stored.

For this blog, we’ve chosen a research paper in PDF format for our Question and Answer (QnA) tasks. The paper, titled “DEEP LEARNING APPLICATIONS AND CHALLENGES IN BIG DATA ANALYTICS,” can be accessed through the link below. 

https://journalofbigdata.springeropen.com/articles/10.1186/s40537-014-0007-7

Simply download the PDF, place it in your current working directory, and provide its path to the variable named “pdf_link.”

Once we have successfully extracted data from the PDF, the next step is to transform the data into smaller chunks using the “RecursiveCharacterTextSplitter” from Langchain. This tool takes the PDF data and divides it into smaller, manageable chunks, helping us overcome the token limitation of the LLM models.

pdf_link = "s40537-014-0007-7.pdf"
loader = PyPDFLoader(pdf_link, extract_images=False)
pages = loader.load_and_split()

# Split data into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,       # maximum number of characters per chunk
    chunk_overlap=20,      # characters of overlap between consecutive chunks
    length_function=len,
    add_start_index=True,
)
chunks = text_splitter.split_documents(pages)
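
Optionally, you can sanity-check the split before moving on. The short snippet below (reusing the “chunks” variable from above) prints how many chunks were produced and previews the beginning of the first one.

# Optional: inspect the chunking result
print("Number of chunks:", len(chunks))
print(chunks[0].page_content[:300])  # preview the first 300 characters of the first chunk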

Step 4:

As we progress further, our next step involves creating embeddings for the chunks and storing them in a vector database. We are using Chroma as the vector database for this process. In this method, you’ll need to provide the chunks you want to embed, specify the embedding model used to create the embeddings, and indicate the directory where you want to persist the database for future use.

# Store data into database
db = Chroma.from_documents(chunks, embedding=embeddings_model, persist_directory="test_index")
db.persist()

Step 5:

Once the information is securely stored in the database, there’s no need to repeat the preceding steps every time. We can load the pre-existing database using the provided code snippet.

Following this, we’ll initialize the retriever, which is responsible for fetching the chunks from the database most likely to contain the answer to the user’s question. In this context, the “search_kwargs” parameter, with “k” set to 3, ensures retrieval of the top 3 most relevant chunks from the database.

Subsequently, we’ll load a Q&A chain that employs the LLM to generate a response. We use the “stuff” chain type, which simply stuffs (concatenates) the retrieved chunks into the prompt alongside the question.

# Load the database
vectordb = Chroma(persist_directory="test_index", embedding_function=embeddings_model)

# Load the retriever
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
chain = load_qa_chain(llm, chain_type="stuff")
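
Before wiring everything into a helper function, you can optionally check what the retriever returns for a sample query. The snippet below is just a quick sanity check using the “retriever” object defined above; the sample question is arbitrary.

# Optional: see which chunks the retriever fetches for a sample query
sample_docs = retriever.get_relevant_documents("What are the characteristics of Big Data?")
for i, doc in enumerate(sample_docs):
    print(f"--- Chunk {i + 1} (page {doc.metadata.get('page')}) ---")
    print(doc.page_content[:200])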

Step 6:

Moving on to the next stage, we create a helper function to generate responses. This function takes the user’s question as input. Within the function, the question is passed to the retriever, which internally matches the question’s embedding against the stored documents in the database and retrieves the most relevant chunks. Subsequently, these chunks, along with the original question, are passed to the Q&A chain, which generates the answer.

# A utility function for answer generation
def ask(question):
    context = retriever.get_relevant_documents(question)
    answer = chain({"input_documents": context, "question": question}, return_only_outputs=True)['output_text']
    return answer

Step 7:

Now, we are all set to perform Question and Answer (Q&A) on the PDF data. To pose a question, you need to use the following lines of code in your script:

# Take the user input and call the function to generate output
user_question = input("User: ")
answer = ask(user_question)
print("Answer:", answer)

Test results

Below are some test examples showcasing how our Q&A system handles various questions and generates responses.

Q1: Which are the 2 high focuses of data science?

Q2: What is feature engineering?

Q3: What are the 2 main focuses of the paper?

Q4: List down the 4 Vs of Big Data characteristics.

Q5: What is the full form of SIFT?
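
If you’d like to reproduce these tests yourself, a simple loop like the one below (a convenience sketch reusing the ask function from Step 6) runs all five questions in one go; the exact answers you get will depend on the model and the retrieved chunks.

# Run the sample questions through the ask() helper from Step 6
test_questions = [
    "Which are the 2 high focuses of data science?",
    "What is feature engineering?",
    "What are the 2 main focuses of the paper?",
    "List down the 4 Vs of Big Data characteristics.",
    "What is the full form of SIFT?",
]

for q in test_questions:
    print("Q:", q)
    print("A:", ask(q))
    print()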

Conclusion

In conclusion, our journey through building a RAG Q&A system using Langchain and OpenAI has unveiled the seamless fusion of advanced language models and intelligent data processing. From installing the necessary libraries to conducting Q&A on PDF data, we’ve navigated through the essential steps.

This exploration empowers both coding enthusiasts and curious minds to enhance their projects with dynamic Q&A capabilities. By leveraging Langchain and OpenAI, we’ve embraced a future where questions meet intelligent responses, unlocking the potential for more interactive and engaging applications. As you embark on your own projects, remember that the fusion of language models and data processing holds the key to transforming mere queries into meaningful conversations.
