Our case study, Question Answering System in Python using BERT NLP, and our BERT-based question answering demo, developed in Python + Flask, became hugely popular, drawing hundreds of visitors per day. We received many appreciative emails praising our QnA demo, along with a number of people asking how we created it. To this day, we keep getting requests on how to develop such a QnA system using the BERT pre-trained model open-sourced by Google.

To start with, the README file in the official BERT GitHub repository provides a good amount of information about how to fine-tune the model on SQuAD 2.0, but we could see that developers were still facing issues. So we decided to publish a step-by-step tutorial on fine-tuning the BERT pre-trained model and generating answers from a given paragraph and question on Colab.

In this tutorial, we are not going to cover how to create a web-based interface using Python + Flask. We’ll just cover the fine-tuning and inference on Colab. You can create your own interface using Flask or Django. If you want exactly the same demo as ours, we can provide it for a nominal charge. For more information, please refer to Buy Question n Answering Demo using BERT in Python + Flask.

Overview

In this tutorial we will see how to perform a fine-tuning task on SQuAD using Google Colab. For that we will use the BERT models through the Hugging Face transformers library, which includes:
1) Code for the BERT model architecture.
2) Pre-trained models for both the uncased (lowercase) and cased versions of BERT-Base and BERT-Large.

You can also refer to or copy our Colab file to follow the steps. Make sure you have transformers installed in your Colab environment.

If transformers is not installed, you can install it using the command below.

				
!pip install transformers
				
			

Steps to perform BERT Fine-tuning on Google Colab

1) Import Some libraries.

Here we use the pandas Python library to read and write CSV files.

				
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
				
			

2) Download the SQuAD2.0 Dataset

For the Question Answering task, we will be using SQuAD2.0 Dataset.

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD2.0 combines the 100,000+ questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. You can download the dataset from the SQuAD site: https://rajpurkar.github.io/SQuAD-explorer/

				
SQuAD = pd.read_json('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
SQuAD.head()
				
			

Output:

  version  data
0    v2.0  {'title': 'Beyoncé', 'paragraphs': [{'qas': [{…
1    v2.0  {'title': 'Frédéric_Chopin', 'paragraphs': [{'…
2    v2.0  {'title': 'Sino-Tibetan_relations_during_the_M…
3    v2.0  {'title': 'IPod', 'paragraphs': [{'qas': [{'qu…
4    v2.0  {'title': 'The_Legend_of_Zelda:_Twilight_Princ…

Data Cleaning

We will be dealing with the “data” column, so let’s just delete the “version” column.

				
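#we only need the "data" column, so drop "version"
del SQuAD["version"]

cols = ["text","question","answer"]
 
comp_list = []
for index, row in SQuAD.iterrows():
   for i in range(len(row['data']['paragraphs'])):
       for j in (row['data']['paragraphs'][i]['qas']):
           #one candidate row: context, question, first answer text
           temp_list = []
           temp_list.append((row["data"]["paragraphs"][i]["context"]))
           temp_list.append(j['question'])
           if j["answers"]:
               temp_list.append(j["answers"][0]["text"])
           else:
               #unanswerable question: store an empty answer
               temp_list.append("")
       #appended once per paragraph, so only the last question of each paragraph is kept
       comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols)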
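
The nested layout that the loop above walks through is easier to see on a tiny hand-made record. The sketch below is only an illustration: the toy_article content and the flatten_article helper are assumptions made for this example and are not part of the SQuAD tooling. Unlike the loop above, it keeps one row per question instead of one row per paragraph.

```python
# Hand-made record that mimics one entry of the SQuAD "data" column.
# The content is invented for illustration only.
toy_article = {
    "title": "Toy_Article",
    "paragraphs": [
        {
            "context": "The auction was held in New York.",
            "qas": [
                {"question": "Where was the auction held?",
                 "answers": [{"text": "New York", "answer_start": 24}]},
                {"question": "Who won the auction?",
                 "answers": []},  # unanswerable, as allowed in SQuAD2.0
            ],
        }
    ],
}


def flatten_article(article):
    """Return one (text, question, answer) row per question (hypothetical helper)."""
    rows = []
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            # Unanswerable questions have an empty "answers" list.
            answer = qa["answers"][0]["text"] if qa["answers"] else ""
            rows.append((paragraph["context"], qa["question"], answer))
    return rows


rows = flatten_article(toy_article)
for text, question, answer in rows:
    print(repr(question), "->", repr(answer))
```

The resulting list of tuples can be passed to pd.DataFrame with the same three column names if you prefer to keep every question.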
				
			

Data Loading from Local CSV File

				
new_df.to_csv("SQuAD_data.csv", index=False)

data = pd.read_csv("SQuAD_data.csv")
data.head()
				
			

Output:

   text                                             question                                           answer
0  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b…  What was the name of Beyoncé's first solo album?   Dangerously in Love
1  Following the disbandment of Destiny's Child i…  What is the name of Beyoncé's alter-ego?           Sasha Fierce
2  A self-described "modern-day feminist", Beyonc…  What magazine named Beyoncé as the most powerful…  Forbes
3  Beyoncé Giselle Knowles was born in Houston, T…  Beyoncé was raised in what religion?               Methodist
4  Beyoncé attended St. Mary's Elementary School …  What choir did Beyoncé sing in for two years?      St. John's United Methodist Church

3) Download the BERT Pre-trained Model

BERT Pretrained Model List :

  • BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Uncased (Orig, not recommended): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Google has released BERT-Base and BERT-Large models, each in uncased and cased versions. Uncased means that the text is converted to lowercase and accent markers are stripped before WordPiece tokenization, e.g., John Smith becomes john smith. Cased means that the true case and accent markers are preserved.
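
To make the uncased behaviour concrete, here is a minimal sketch of that kind of normalization using only the Python standard library. The function name uncased_normalize is ours, and this is only an approximation of the preprocessing step: the real tokenizer also splits punctuation and applies WordPiece afterwards.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase the text and strip accent marks (illustrative helper)."""
    text = text.lower()
    # NFD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))
print(uncased_normalize("Beyoncé"))
```

A cased model skips this step, which is why it can tell "Apple" from "apple" but needs a larger share of its vocabulary for capitalized variants.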

When using a cased model with the original BERT training scripts, make sure to pass --do_lower_case=False at the time of training.

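You can download any model of your choice. We have used the BERT-Large, Uncased (Whole Word Masking) model, in a checkpoint that, as its name indicates, has already been fine-tuned on SQuAD.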

				
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
				
			

Asking a Question:

Let’s randomly select a question number.

				
random_num = np.random.randint(0,len(data))
 
question = data["question"][random_num]
text = data["text"][random_num]
				
			

Let’s tokenize the question and text as a pair.

				
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))
				
			

Let’s see how many tokens this question and text pair have.

The input has a total of 427 tokens.
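
BERT accepts at most 512 tokens per input, so 427 tokens fit, but a longer passage would not. One common workaround, which is not part of this tutorial's code, is to split the text into overlapping windows and query each window separately. The sketch below works on plain lists of token IDs; the helper name split_into_windows, the overlap value, and the toy IDs are assumptions for illustration (101 and 102 stand in for [CLS] and [SEP]).

```python
def split_into_windows(question_ids, text_ids, max_len=512, overlap=128,
                       cls_id=101, sep_id=102):
    """Build [CLS] question [SEP] chunk [SEP] inputs that each fit in max_len."""
    # Room left for text after [CLS], the question, and two [SEP] tokens.
    room = max_len - len(question_ids) - 3
    if room <= 0:
        raise ValueError("question is too long for max_len")
    step = room - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the room left for text")

    windows = []
    start = 0
    while True:
        chunk = text_ids[start:start + room]
        windows.append([cls_id] + question_ids + [sep_id] + chunk + [sep_id])
        if start + room >= len(text_ids):
            break  # this window reached the end of the text
        start += step
    return windows


# Toy data: a 10-token question and a 1000-token text.
question_ids = [7] * 10
text_ids = list(range(1000, 2000))

windows = split_into_windows(question_ids, text_ids)
print(len(windows), [len(w) for w in windows])
```

Each window can then be fed to the model on its own, and the answer with the best score across windows kept. The overlap reduces the chance that an answer is cut in half at a window boundary.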

To look at what our tokenizer is doing, let’s just print out the tokens and their IDs.

				
tokens = tokenizer.convert_ids_to_tokens(input_ids)
 
#token_id avoids shadowing the built-in id()
for token, token_id in zip(tokens, input_ids):
   print('{:8}{:8,}'.format(token, token_id))
				
			

BERT processes the tokenized inputs in a particular way. The printed tokens include the special tokens [CLS] and [SEP]. The [CLS] token stands for classification; its output is used as the sentence-level representation in classification tasks. The [SEP] token separates the two sections of the input. Two [SEP] tokens appear in the printed tokens, one after the question and the other after the text.

In addition to “Token Embeddings,” BERT also employs “Segment Embeddings” and “Position Embeddings” internally. Segment embeddings let BERT distinguish between the question and the text: tokens from sentence 1 get segment ID 0, and tokens from sentence 2 get segment ID 1. Position embeddings encode where each token sits in the sequence. All three embeddings are summed and fed to the input layer.
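
As a rough picture of how the three embeddings combine, here is a toy sketch in plain Python. Everything in it is made up for illustration: the 4-dimensional vectors are random, the random_table helper is hypothetical, and the real model looks the vectors up in learned tables and applies layer normalization after the sum.

```python
import random

random.seed(0)
HIDDEN = 4  # toy size; BERT-Large uses 1024

tokens = ["[CLS]", "where", "?", "[SEP]", "new", "york", "[SEP]"]
segment_ids = [0, 0, 0, 0, 1, 1, 1]       # question = 0, text = 1
position_ids = list(range(len(tokens)))   # 0, 1, 2, ...


def random_table(keys):
    """Toy embedding table: one random vector per key."""
    return {key: [random.uniform(-1, 1) for _ in range(HIDDEN)] for key in keys}


token_emb = random_table(sorted(set(tokens)))
segment_emb = random_table([0, 1])
position_emb = random_table(position_ids)

# The input vector for each token is the element-wise sum of its three embeddings.
inputs = [
    [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
    for tok, seg, pos in zip(tokens, segment_ids, position_ids)
]

print(len(inputs), "tokens x", len(inputs[0]), "dimensions")
# The two [SEP] tokens share a token embedding but end up with different inputs,
# because their segment and position embeddings differ.
print(inputs[3] != inputs[6])
```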

The segment IDs can be produced by the Transformers library itself using PreTrainedTokenizer.encode_plus(). However, we can also build our own. For each token, we only need to specify a 0 or a 1.

				
#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print(sep_idx)
 
#number of tokens in segment A - question
num_seg_a = sep_idx+1
print(num_seg_a)
 
#number of tokens in segment B - text
num_seg_b = len(input_ids) - num_seg_a
print(num_seg_b)
 
segment_ids = [0]*num_seg_a + [1]*num_seg_b
print(segment_ids)
 
assert len(segment_ids) == len(input_ids)
				
			

Let’s now feed this to our model.

				
#token input_ids to represent the input
#token segment_ids to differentiate our segments - text and question
output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
#print(output.start_logits, output.end_logits)
				
			

We now look at the most probable start and end tokens, and provide an answer only if the end token comes at or after the start token.

				
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
#print(answer_start, answer_end)

if answer_end >= answer_start:
   answer = " ".join(tokens[answer_start:answer_end+1])
else:
   #assign a fallback so the final print does not fail on an undefined answer
   answer = "I am unable to find the answer to this question. Can you please ask another question?"
  
print("Text:\n{}".format(text.capitalize()))
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))
				
			

Here is our question and its answer.

				
					Question:
Where was the auction held?

Answer:
[sep].
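
Picking the start and end positions independently, as in the step above, can give an end that comes before the start, or land on a special token as in the output above. One common alternative, which this tutorial's code does not use, is to score every valid (start, end) pair and keep the best one. The sketch below runs on made-up logits; the helper name best_span and the max_answer_len limit are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) with the highest start+end score and start <= end."""
    best = None
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Made-up logits where the independent argmax gives an invalid span.
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.2, 6.0, 0.1, 4.0, 0.3]

naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)
print("independent argmax:", naive_start, naive_end)
print("best valid span:", best_span(start_logits, end_logits)[:2])
```

With real model output, the two lists would come from output.start_logits[0].tolist() and output.end_logits[0].tolist(). Restricting the search to positions inside the text segment would additionally keep [CLS] and [SEP] out of the answer.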
				
			

BERT uses wordpiece tokenization, in which rare words are divided into subwords. Wordpiece tokenization uses ## to mark split tokens. As an illustration, since “Karin” is a widely used term, wordpiece does not break it. Being a rare word, “Karingu” was divided into “Karin” and “##gu”. The ## prefix on “gu” denotes that it is the continuation of a split word.

The purpose of wordpiece tokenization is to keep the vocabulary small in order to make training more effective. Take the words run, running, and runner. Without wordpiece tokenization, the model must store and learn the meaning of each word independently. With it, each of the three words can be separated into “run” and the relevant “##SUFFIX” (if any suffix is present: “run”, “##ning”, or “##ner”). The model then learns the context of “run”, and the rest of the meaning is carried by the suffix, which it learns from other words with the same suffix.
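
The splitting itself can be sketched as a greedy longest-match-first search over the vocabulary. The version below is a simplified illustration: the function name wordpiece and the five-entry toy_vocab are invented for this example, whereas the real BERT vocabulary has about 30,000 entries and may well store common words such as "running" as single tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a split word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"run", "##ning", "##ner", "karin", "##gu"}

for word in ["run", "running", "runner", "karingu", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```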

That’s interesting. The following straightforward code can be used to rebuild these words.

				
answer = tokens[answer_start]
 
for i in range(answer_start+1, answer_end+1):
   if tokens[i][0:2] == "##":
       answer += tokens[i][2:]
   else:
       answer += " " + tokens[i]
				
			

The above answer will now become: Agnes karingu
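
The same merging step can also be written as a small standalone helper that works on any list of tokens, which makes it easy to check without loading the model. The name join_wordpieces and the sample token lists are ours, chosen to match the example above.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # glue the piece onto the previous word
        else:
            words.append(token)      # start of a new word
    return " ".join(words)


print(join_wordpieces(["agnes", "karin", "##gu"]))
print(join_wordpieces(["hard", "rock", "cafe"]))
```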

Let us now turn this question-answering process into a function for ease.

				
def question_answer(question, text):
  
   #tokenize question and text in ids as a pair
   input_ids = tokenizer.encode(question, text)
  
   #string version of tokenized ids
   tokens = tokenizer.convert_ids_to_tokens(input_ids)
  
   #segment IDs
   #first occurrence of [SEP] token
   sep_idx = input_ids.index(tokenizer.sep_token_id)
 
   #number of tokens in segment A - question
   num_seg_a = sep_idx+1
 
   #number of tokens in segment B - text
   num_seg_b = len(input_ids) - num_seg_a
  
   #list of 0s and 1s
   segment_ids = [0]*num_seg_a + [1]*num_seg_b
  
   assert len(segment_ids) == len(input_ids)
  
   #model output using input_ids and segment_ids
   output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
  
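   #reconstructing the answer
   answer_start = torch.argmax(output.start_logits)
   answer_end = torch.argmax(output.end_logits)
 
   #default reply, used when the predicted end token comes before the start token
   answer = "Unable to find the answer to your question."
 
   if answer_end >= answer_start:
       answer = tokens[answer_start]
       for i in range(answer_start+1, answer_end+1):
           if tokens[i][0:2] == "##":
               #merge a wordpiece continuation into the previous token
               answer += tokens[i][2:]
           else:
               answer += " " + tokens[i]
              
   if answer.startswith("[CLS]"):
       answer = "Unable to find the answer to your question."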
  
#     print("Text:\n{}".format(text.capitalize()))
#     print("\nQuestion:\n{}".format(question.capitalize()))
   print("\nAnswer:\n{}".format(answer.capitalize()))
				
			

Let’s test this function out by using a text and question from our dataset.

				
					text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""
question = "Where was the Auction held?"
 
question_answer(question, text)
				
			

Output:

Predicted answer:

Hard rock cafe in new york ' s times square

Original answer:

Hard Rock Cafe

Not bad at all. In fact, our BERT model gave a more detailed response.
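
To put a number on how close the prediction is to the original answer, SQuAD-style evaluation uses exact match and a token-level F1 score. The sketch below follows the spirit of that metric but is not the official evaluation script; the helper names normalize, exact_match, and f1_score are ours.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Empty answers (unanswerable questions) only match each other.
        return float(pred_tokens == true_tokens)
    # Count tokens shared by both answers, respecting multiplicity.
    common = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


predicted = "Hard rock cafe in new york ' s times square"
original = "Hard Rock Cafe"
print("exact match:", exact_match(predicted, original))
print("F1:", round(f1_score(predicted, original), 2))
```

Here the prediction contains all three words of the original answer (full recall) but adds six more, so the F1 score lands at 0.5 while exact match is False.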

Here is a small loop to test how well BERT understands context. It simply wraps the question-answering function in a loop so you can play around with the model.

				
text = input("Please enter your text: \n")
question = input("\nPlease enter your question: \n")
 
while True:
   question_answer(question, text)
  
   flag = True
   flag_N = False
  
   while flag:
       response = input("\nDo you want to ask another question based on this text (Y/N)? ")
       #use the first letter, ignoring case; an empty reply just asks again
       choice = response.strip()[:1].upper()
       if choice == "Y":
           question = input("\nPlease enter your question: \n")
           flag = False
       elif choice == "N":
           print("\nBye!")
           flag = False
           flag_N = True
          
   if flag_N == True:
       break
				
			

And, the result!

				
					Please enter your text: 
New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium.

Please enter your question: 
when was auction held?

Answer:
Saturday

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
where was auction held?

Answer:
Hard rock cafe in new york ' s times square

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
How much profit earned from the auction?

Answer:
$ 2 million

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
Who bought the glove?

Answer:
Hoffman ma

Do you want to ask another question based on this text (Y/N)? N

Bye!
				
			

To make it easier for you, we have already created a Colab file which you can copy to your Google Drive and execute. You can access the Colab file at: Question Answering System using BERT + SQuAD on Colab.

Feel free to post your doubts or questions in the comments. We would be glad to help you.

If you are looking for Chatbot Development or Natural Language Processing services, do contact us or send your requirements to letstalk@pragnakalp.com. We would be happy to offer our expert services.

Thanks!