This blog explains Retrieval Augmented Generation (RAG) and gives you the tools and knowledge to build your own RAG app using Mistral AI and Langchain.

Imagine you need an assistant that can answer questions about specific events or some other narrow topic. A language model on its own only knows what it saw during training, so it needs additional information to give better answers. That is what we explore here: supplying the assistant with the relevant context so that its responses are more accurate. We can achieve this with the RAG concept.

Retrieval Augmented Generation (RAG) is an innovative method in the world of artificial intelligence that merges two powerful elements: finding information and creating language-based responses. Essentially, it helps AI systems to look for information outside their existing knowledge and use it to give better, more detailed answers. Imagine it as teaching a computer not just to respond, but also to find and use information from various sources to make its responses smarter and more fitting to the question asked.
Here, our focus is on developing an assistant that answers questions about a PDF that we provide.
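
To make the idea concrete, here is a minimal sketch of the retrieve-then-prompt flow in plain Python. It is a toy illustration, not the pipeline built in this blog: the `retrieve` and `build_prompt` helpers are hypothetical, and word overlap stands in for real embeddings.

```python
def retrieve(question, documents, k=1):
    """Toy retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Put the retrieved text in front of the question so the model can use it."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


documents = [
    "Restricted Boltzmann Machines have one visible layer and one hidden layer.",
    "Chroma stores text chunks together with their embeddings.",
]
question = "How many layers do Restricted Boltzmann Machines have?"

top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real RAG app, the prompt built this way is what gets sent to the language model.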


Step 1

Let’s start with the implementation of the code. First, install all the Python libraries that the code requires.

!pip install chromadb
!pip install langchain
!pip install pypdf
!pip install sentencepiece
!pip install -q -U bitsandbytes
!pip install -q -U git+
!pip install -q -U git+
!pip install -q -U git+
!pip install git+
!pip install git+
!pip install --upgrade git+
!pip install -U sentence-transformers

Step 2

Now, we’ll initialize the models for both text embedding and response generation. For the embedding, the default model, “sentence-transformers/all-mpnet-base-v2”, will be used. To generate a response, we’ll use the “Mistral-7B-Instruct-v0.2” model, specifically a quantized version of it: we need a configuration that loads the model in 4 bits, as in FP4 quantization. Loading the model requires a CUDA GPU. We used a Colab Pro account equipped with a T4 GPU, which has 16 GB of memory, so please check that your GPU offers comparable memory.
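
To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the model weights. The parameter count of roughly 7.2 billion is an approximation, and the figures cover the weights only; activations, the KV cache, and the embedding model need additional memory.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a "7B" model has roughly 7.2 billion parameters.
NUM_PARAMS = 7.2e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
fp4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {fp4_gb:.1f} GB")
```

Under these assumptions, the 4-bit weights take about a quarter of the memory of the 16-bit weights.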

In the code below, we import the required libraries, define the quantization configuration, and then load the quantized model using a HuggingFace pipeline.

# load required library
import os
import torch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

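# 4-bit (FP4) quantization, as described above.
# Assumption: these arguments are inferred from the text, not taken from the original snippet.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
)

# Embedding model (defaults to sentence-transformers/all-mpnet-base-v2), run on the GPU
model_kwargs = {'device': 'cuda'}
embeddings = HuggingFaceEmbeddings(model_kwargs=model_kwargs)

# Tokenizer and quantized Mistral model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map='auto', quantization_config=quantization_config)

# Wrap the model in a text-generation pipeline that Langchain can call
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=150)
llm = HuggingFacePipeline(pipeline=pipe)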

Step 3

Next, we read the PDF data using the PyPDFLoader from Langchain. You need to provide either the web link where your PDF is hosted or the local path on your system where the PDF is located.

For this blog, we have used a research paper in PDF format to perform Question and Answer (QnA) tasks. The research paper, titled “Deep Learning Applications and Challenges in Big Data Analytics,” is available online. You can download the PDF, place it in your current working directory, and assign its path to the variable named “pdf_link” in the code below.

Once the PDF data is loaded, we split it into chunks using the “RecursiveCharacterTextSplitter” from Langchain. This tool takes the data and divides it into manageable chunks for further processing.

# Load the PDF file
pdf_link = "<YOUR PDF PATH/LINK>"
loader = PyPDFLoader(pdf_link, extract_images=False)
pages = loader.load_and_split()

# Split data into chunks
text_splitter = RecursiveCharacterTextSplitter(
   chunk_size = 4000,
   chunk_overlap = 20,
   length_function = len,
   add_start_index = True,
)
chunks = text_splitter.split_documents(pages)
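
To illustrate what “chunk_size” and “chunk_overlap” control, here is a simplified fixed-window splitter in plain Python. It is only a sketch: the hypothetical `split_with_overlap` helper cuts at fixed character positions, whereas RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and lines.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary is less likely to lose its context.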

Step 4

Once we’ve successfully ingested and transformed the data, the next step is to store it in the Chroma database. For this, we provide the data chunks we want to store, along with the embedding model. The system then internally creates embeddings of the text in the chunks and stores them in the database. The database directory in this example is named “test_index”, but feel free to change it according to your preferences.

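# Store data into database
# Assumed call (not shown in the original snippet): embed the chunks and
# persist them to the directory that Step 5 loads.
vectordb = Chroma.from_documents(chunks, embedding=embeddings, persist_directory="test_index")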

Step 5

Once the data is stored in the database, there’s no need to repeat the previous steps each time. You can simply load the persisted database as outlined in the following lines of code.

After that, we’ll initialize the retriever, which is in charge of retrieving the most appropriate chunk from the database that may contain the answer to the user’s question. “search_kwargs” with “k” set to 3 in this context means it will retrieve the top three most relevant chunks from the database.
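
As a rough picture of what top-k retrieval does, the sketch below ranks a few stored vectors against a query vector and keeps the best three. The two-dimensional vectors and the `top_k` helper are made up for illustration, and cosine similarity is an assumption here; the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Hypothetical (id, embedding) pairs standing in for stored chunks
stored = [
    ("chunk_a", [1.0, 0.0]),
    ("chunk_b", [0.9, 0.1]),
    ("chunk_c", [0.0, 1.0]),
    ("chunk_d", [0.7, 0.7]),
]
query = [1.0, 0.0]

result = top_k(query, stored, k=3)
print(result)  # ['chunk_a', 'chunk_b', 'chunk_d']
```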

You can change the “qna_prompt_template” prompt to suit your requirements.

Next, we’ll load a QnA chain, passing it the LLM that generates the response and specifying the type of the chain.

# Load the database
vectordb = Chroma(persist_directory="test_index", embedding_function = embeddings)

# Load the retriever
retriever = vectordb.as_retriever(search_kwargs = {"k" : 3})
qna_prompt_template="""### [INST] Instruction: You will be provided with questions and related data. Your task is to find the answers to the questions using the given data. If the data doesn't contain the answer to the question, then you must return 'Not enough information.'

{context}

### Question: {question} [/INST]"""

PROMPT = PromptTemplate(
   template=qna_prompt_template, input_variables=["context", "question"]
)
chain = load_qa_chain(llm, chain_type="stuff", prompt=PROMPT)
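
The “stuff” chain type places all retrieved chunks into a single prompt. The sketch below shows the idea with plain string formatting; the shortened template, the `stuff_prompt` helper, and the blank-line separator are assumptions for illustration, not Langchain’s internals.

```python
# Shortened stand-in for the prompt template, with the same two placeholders
TEMPLATE = """### Data:
{context}

### Question: {question}"""


def stuff_prompt(template, chunks, question):
    """Join every retrieved chunk into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


chunks = ["RBMs have one visible layer.", "RBMs have one hidden layer."]
prompt = stuff_prompt(TEMPLATE, chunks, "What is an RBM?")
print(prompt)
```

Because every chunk lands in one prompt, the number and size of retrieved chunks must fit within the model’s context window.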

Step 6

In the following step, we define a helper function that generates a response. The function takes the user’s question as input and passes it to the retriever. The retriever matches the embedding of the question against the documents stored in the database and returns the most relevant chunks. These chunks, along with the question, are then passed to the QnA chain, which generates the answer.

# A utility function for answer generation
def ask(question):
   context = retriever.get_relevant_documents(question)

   answer = (chain({"input_documents": context, "question": question}, return_only_outputs=True))['output_text']
   return answer

Step 7

Now, with everything in place, we are ready to conduct Question and Answer (QnA) on the PDF data. To ask a question, simply add the following lines of code to your script:

# Take the user input and call the function to generate output
user_question = input("User: ")
answer = ask(user_question)
print("Answer:", answer)

Test Results

Here are a few test examples of how our QnA system handles different questions and provides responses.

Q: What is the most important skill for a computer scientist?
Answer: Based on the context provided, the most important skill for a computer scientist in the context of Big Data Analytics and Machine Learning would be the ability to develop and apply advanced algorithms, particularly in the area of feature engineering and deep learning, to extract meaningful abstract representations from large volumes of unsupervised data. This skill is crucial for automating the process of data representation extraction and enabling the application of more conventional discriminative models to large datasets with relatively fewer supervised/labeled data points. Additionally, the ability to work with high-dimensional data, distributed data sources, and scalable algorithms is essential in the field of Big Data Analytics.

Q: The name of the variable we pass as an argument is unrelated to the name of the parameter. (True/False)
Answer: In the context provided, the variable being passed as an argument to the algorithms such as TF-IDF or Deep Learning models is not explicitly stated to have the same name as the parameter. Therefore, based on the information given, the statement is true.

Q: Give me the name of the transformation algorithms
Answer: The text mentions several transformation algorithms, including Principal Component Analysis (PCA), Histogram of Oriented Gradients (HOG), and Scale Invariant Feature Transform (SIFT). However, it’s important to note that these algorithms are not used in the deep architecture of deep learning algorithms, as they are linear transformations and cannot represent the complex non-linear transformations that deep learning algorithms learn. Instead, deep learning algorithms use non-linear transformations in their layers to extract underlying explanatory factors in the data and construct more abstract and complicated representations of the data.

Q: What is the RBMs?
Answer: The RBMs (Restricted Boltzmann Machines) are a type of neural network used in unsupervised learning, specifically in constructing Deep Belief Networks. They consist of one visible layer and one hidden layer, with no interaction between units of the same layer and connections solely between units from different layers. The Contrastive Divergence algorithm is used to train the Boltzmann machine.


In this blog, we developed a Question and Answer (QnA) system with the Langchain framework and Mistral AI’s Mistral 7B model. We used the model to extract answers from PDF content, making the information in the document easier to access. The system has demonstrated what this approach can do, but the field of AI is constantly evolving, leaving room for improvements and advancements.

