December 16, 2019

Both of our demos, Question Answering System In Python Using BERT and Closed-Domain Chatbot Using BERT In Python, can now be purchased.

Our case study, Question Answering System in Python using BERT NLP, and our BERT-based Question and Answering system demo, developed in Python + Flask, became hugely popular, garnering hundreds of visitors per day. We received many appreciative emails praising our QnA demo. We also heard from a number of people asking how we created it. To this day, we keep getting requests on how to develop such a QnA system using the BERT pre-trained model open-sourced by Google.

To start with, the README file in the official BERT GitHub repository provides a good amount of information about how to fine-tune the model on SQuAD 2.0, but we could see that developers were still facing issues. So, we decided to publish a step-by-step tutorial on fine-tuning the BERT pre-trained model and inferring answers from a given paragraph and questions on Colab.

In this tutorial, we are not going to cover how to create a web-based interface using Python + Flask. We’ll just cover the fine-tuning and inference on Colab. You can create your own interface using Flask or Django. If you want the exact same demo as ours, we provide it for a nominal charge. For more information, please refer to Buy Question n Answering Demo using BERT in Python + Flask.

Overview

In this tutorial, we will see how to perform a fine-tuning task on SQuAD using Google Colab. For that, we will use the BERT GitHub repository, which includes:
1) Hugging Face transformers code for the BERT model architecture.
2) Pre-trained models for both the uncased (lowercase) and cased versions of BERT-Base and BERT-Large.

You can also refer to or copy our Colab file to follow the steps. Make sure you have transformers installed in your Colab.

If transformers is not installed, you can install it using the command below.

				
					!pip install transformers
				
			

Steps to perform BERT Fine-tuning on Google Colab

1) Import the libraries.

Here we use the pandas Python library to read and write CSV files.

				
					import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
				
			

2) Download the SQuAD2.0 Dataset

For the Question Answering task, we will be using SQuAD2.0 Dataset.

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD2.0 combines the 100,000+ questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. You can download the dataset from the SQuAD site: https://rajpurkar.github.io/SQuAD-explorer/
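
To make the nested layout of the file easier to picture before loading it, here is a minimal sketch that uses a tiny hand-written record in the SQuAD 2.0 layout. The passage, questions, IDs and the helper name flatten_squad are made up for illustration; only the key names (data, paragraphs, context, qas, question, answers, is_impossible) follow the real file.

```python
# Hand-written sample in the SQuAD 2.0 layout (contents are illustrative only).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The auction took place at the Hard Rock Cafe.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where did the auction take place?",
                            "is_impossible": False,
                            "answers": [{"text": "Hard Rock Cafe", "answer_start": 30}],
                        },
                        {
                            "id": "q2",
                            "question": "Who won the auction?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def flatten_squad(dataset):
    """Return one (context, question, answer) row per question."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions have an empty "answers" list.
                answer = qa["answers"][0]["text"] if qa["answers"] else ""
                rows.append((paragraph["context"], qa["question"], answer))
    return rows


rows = flatten_squad(sample)
for context, question, answer in rows:
    print(repr(question), "->", repr(answer))
```

Each answer is a span of its context, which is why the file stores answer_start (a character offset) next to the answer text.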

				
					SQuAD = pd.read_json('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
SQuAD.head()
				
			

Output:

  | version | data
0 | v2.0 | {'title': 'Beyoncé', 'paragraphs': [{'qas': [{…
1 | v2.0 | {'title': 'Frédéric_Chopin', 'paragraphs': [{'…
2 | v2.0 | {'title': 'Sino-Tibetan_relations_during_the_M…
3 | v2.0 | {'title': 'IPod', 'paragraphs': [{'qas': [{'qu…
4 | v2.0 | {'title': 'The_Legend_of_Zelda:_Twilight_Princ…

Data Cleaning

We will be dealing with the “data” column, so let’s just delete the “version” column.

				
					del SQuAD["version"]

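cols = ["text","question","answer"]

# Flatten the nested JSON into rows of [context, question, answer].
comp_list = []
for index, row in SQuAD.iterrows():
    for i in range(len(row['data']['paragraphs'])):
        for j in (row['data']['paragraphs'][i]['qas']):
            temp_list = []
            temp_list.append((row["data"]["paragraphs"][i]["context"]))
            temp_list.append(j['question'])
            # Unanswerable questions have an empty "answers" list.
            if j["answers"]:
                temp_list.append(j["answers"][0]["text"])
            else:
                temp_list.append("")
        # Note: at this indentation the append runs once per paragraph, so only
        # the last question of each paragraph is kept. Indent it one level
        # deeper to keep every question.
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols)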
				
			

Data Loading from Local CSV File

				
					new_df.to_csv("SQuAD_data.csv", index=False)

data = pd.read_csv("SQuAD_data.csv")
data.head()
				
			

Output:

  | text | question | answer
0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b… | What was the name of Beyoncé’s first solo album? | Dangerously in Love
1 | Following the disbandment of Destiny’s Child i… | What is the name of Beyoncé’s alter-ego? | Sasha Fierce
2 | A self-described “modern-day feminist”, Beyonc… | What magazine named Beyoncé as the most powerful… | Forbes
3 | Beyoncé Giselle Knowles was born in Houston, T… | Beyoncé was raised in what religion? | Methodist
4 | Beyoncé attended St. Mary’s Elementary School … | What choir did Beyoncé sing in for two years? | St. John’s United Methodist Church

3) Download the BERT Pretrained Model

BERT Pretrained Model List:

  • BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Uncased (Orig, not recommended): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Google has released BERT-Base and BERT-Large models, each in uncased and cased versions. Uncased means that the text is converted to lowercase before WordPiece tokenization, e.g., John Smith becomes john smith, and any accent markers are stripped. Cased means that the true case and accent markers are preserved.

When using a cased model, make sure to pass --do_lower_case=False at the time of training.
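
As a rough illustration of what "uncased" means in practice, the sketch below lowercases text and strips accent marks using only the standard library. The function name uncased_normalize is ours, not part of BERT or transformers, and the real tokenizer does more than this (for example punctuation splitting and whitespace cleanup).

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase text and strip accent marks, roughly as an uncased model expects."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), e.g. the accent in "é".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))
print(uncased_normalize("Frédéric Chopin"))
```

A cased model skips this step, so "John" and "john" stay distinct tokens.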

You can download any model of your choice. We have used the BERT-Large, Uncased (Whole Word Masking) model, in the version that has already been fine-tuned on SQuAD.

				
					model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
				
			

Asking a Question:

Let’s randomly select a question number.

				
					random_num = np.random.randint(0,len(data))
 
question = data["question"][random_num]
text = data["text"][random_num]
				
			

Let’s tokenize the question and text as a pair.

				
					input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))
				
			

Let’s see how many tokens this question and text pair have.

The input has a total of 427 tokens.

To look at what our tokenizer is doing, let’s just print out the tokens and their IDs.

				
					tokens = tokenizer.convert_ids_to_tokens(input_ids)
 
for token, token_id in zip(tokens, input_ids):
   print('{:8}{:8,}'.format(token, token_id))
				
			

BERT processes the tokenized inputs in a particular way. The printed tokens include the special tokens [CLS] and [SEP]. The [CLS] token stands for classification; its output represents the whole input in sentence-level classification tasks. The [SEP] token separates the two sections of the input. There are two [SEP] tokens here, one after the question and the other after the text.
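
The layout described here can be pictured with a few toy tokens. This is only a sketch: build_pair is a hypothetical helper, and the token lists are hand-written rather than produced by the real tokenizer.

```python
def build_pair(question_tokens, text_tokens):
    """Lay out a question/text pair as [CLS] question [SEP] text [SEP]."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]


pair = build_pair(
    ["where", "was", "the", "auction", "held", "?"],
    ["the", "auction", "was", "held", "in", "new", "york", "."],
)
for position, token in enumerate(pair):
    print(position, token)
```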

In addition to “Token Embeddings,” BERT also employs “Segment Embeddings” and “Position Embeddings” internally. Segment embeddings let BERT distinguish between the question and the text: tokens from sentence 1 get segment ID 0, and tokens from sentence 2 get segment ID 1. Position embeddings encode where each word sits in the sequence. All three embeddings are summed and given to the input layer.
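
To show how the three kinds of embeddings come together, here is a toy sketch in plain Python. The tables are filled with random numbers and use a hidden size of 4 purely for illustration; in the real model they are learned, and the sizes and names used here (HIDDEN, random_table, the sample tokens) are our own assumptions.

```python
import random

rng = random.Random(0)
HIDDEN = 4  # toy size for illustration; BERT-Large uses 1024


def random_table(rows):
    """Create a toy embedding table with `rows` vectors of length HIDDEN."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]


tokens = ["[CLS]", "who", "?", "[SEP]", "hoffman", "ma", "[SEP]"]
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

token_table = random_table(len(vocab))      # one vector per vocabulary entry
segment_table = random_table(2)             # one vector each for segment 0 and 1
position_table = random_table(len(tokens))  # one vector per position

# Everything up to and including the first [SEP] is segment 0, the rest is segment 1.
first_sep = tokens.index("[SEP]")
segment_ids = [0 if i <= first_sep else 1 for i in range(len(tokens))]

input_vectors = []
for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
    summed = [
        t + s + p
        for t, s, p in zip(
            token_table[vocab[token]],
            segment_table[segment],
            position_table[position],
        )
    ]
    input_vectors.append(summed)

print(segment_ids)
print(len(input_vectors), len(input_vectors[0]))
```

The two [SEP] tokens share the same token vector but end up with different input vectors, because their segment and position vectors differ.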

The Transformers library can produce the segment IDs on its own through PreTrainedTokenizer.encode_plus(). However, we can also build them ourselves: for each token, we only need to specify a 0 or a 1.

				
					#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print(sep_idx)
 
#number of tokens in segment A - question
num_seg_a = sep_idx+1
print(num_seg_a)
 
#number of tokens in segment B - text
num_seg_b = len(input_ids) - num_seg_a
print(num_seg_b)
 
segment_ids = [0]*num_seg_a + [1]*num_seg_b
print(segment_ids)
 
assert len(segment_ids) == len(input_ids)
				
			

Let’s now feed this to our model.

				
					#token input_ids to represent the input
#token segment_ids to differentiate our segments - text and question
output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
#print(output.start_logits, output.end_logits)
				
			

We now look at the most probable start and end tokens, and provide an answer only if the end token does not come before the start token.

				
					#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
#print(answer_start, answer_end)

if answer_end >= answer_start:
   answer = " ".join(tokens[answer_start:answer_end+1])
else:
   #fall back to a message so that answer is always defined
   answer = "I am unable to find the answer to this question. Can you please ask another question?"
  
print("Text:\n{}".format(text.capitalize()))
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))
				
			
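Taking the two argmax values independently can produce an end position that falls before the start position. One common alternative, sketched below and not part of the code above, is to search for the pair of positions with the highest combined score among pairs where the end does not precede the start. The function name best_span, the max_answer_len limit and the sample logits are assumptions made for this illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end logit with start <= end."""
    best = None
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Toy logits where the independent argmax values would give start=2, end=1.
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 6.0, 0.2, 4.0, 0.1]
print(best_span(start_logits, end_logits))
```

With the model output from this tutorial, the inputs would be plain lists such as output.start_logits[0].tolist() and output.end_logits[0].tolist().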

Here is our question and its answer.

				
					Question:
Where was the auction held?

Answer:
[sep].
				
			

BERT uses WordPiece tokenization, in which rare words are split into subwords. WordPiece marks a continuation piece with a leading ##. For example, since “Karin” is a common word, WordPiece does not split it. “Karingu”, being a rare word, is split into “Karin” and “##gu”. The ## in front of “gu” denotes that it is the second piece of a split word.

The purpose of WordPiece tokenization is to keep the vocabulary small and so make training more effective. Take the words run, running, and runner. Without WordPiece tokenization, the model must store and learn the meaning of each word independently. With it, each of the three words can be split into “run” and the relevant “##SUFFIX”, if any (for example “run”, “##ning”, or “##ner”). The model then learns the context of “run” once, and the rest of the meaning is carried by the suffix, which it learns from other words with the same suffix.
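
The splitting described above can be sketched as a greedy longest-match-first search over a vocabulary. The tiny vocabulary below is invented for this example, so the splits it produces are illustrative and will not necessarily match what the real BERT vocabulary gives for the same words; wordpiece_tokenize is our own helper name.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"karin", "##gu", "run", "##ning", "##ner"}
for word in ["karin", "karingu", "running", "runner", "run"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```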

That’s interesting. The following straightforward code can be used to rebuild these words.

				
					answer = tokens[answer_start]
 
for i in range(answer_start+1, answer_end+1):
   if tokens[i][0:2] == "##":
       answer += tokens[i][2:]
   else:
       answer += " " + tokens[i]
				
			

The above answer will now become: Agnes karingu

Let us now turn this question-answering process into a function for ease.

				
					def question_answer(question, text):
  
   #tokenize question and text in ids as a pair
   input_ids = tokenizer.encode(question, text)
  
   #string version of tokenized ids
   tokens = tokenizer.convert_ids_to_tokens(input_ids)
  
   #segment IDs
   #first occurrence of [SEP] token
   sep_idx = input_ids.index(tokenizer.sep_token_id)
 
   #number of tokens in segment A - question
   num_seg_a = sep_idx+1
 
   #number of tokens in segment B - text
   num_seg_b = len(input_ids) - num_seg_a
  
   #list of 0s and 1s
   segment_ids = [0]*num_seg_a + [1]*num_seg_b
  
   assert len(segment_ids) == len(input_ids)
  
   #model output using input_ids and segment_ids
   output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
  
   #reconstructing the answer
   answer_start = torch.argmax(output.start_logits)
   answer_end = torch.argmax(output.end_logits)
 
   #default reply, used when no valid span is found
   answer = "Unable to find the answer to your question."
 
   if answer_end >= answer_start:
       answer = tokens[answer_start]
       for i in range(answer_start+1, answer_end+1):
           if tokens[i][0:2] == "##":
               #rejoin a split wordpiece with the previous token
               answer += tokens[i][2:]
           else:
               answer += " " + tokens[i]
              
   if answer.startswith("[CLS]"):
       answer = "Unable to find the answer to your question."
  
#     print("Text:\n{}".format(text.capitalize()))
#     print("\nQuestion:\n{}".format(question.capitalize()))
   print("\nAnswer:\n{}".format(answer.capitalize()))
				
			

Let’s test this function out by using a text and question from our dataset.

				
					text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""
question = "Where was the Auction held?"
 
question_answer(question, text)
				
			

Output:

Predicted answer:

Hard rock cafe in new york ' s times square

Original answer:

Hard Rock Cafe

Not bad at all. In fact, our BERT model gave a more detailed response.
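
To put a number on "not bad", the comparison can be scored the way SQuAD answers are usually scored, with exact match and token-level F1 after light normalization. The sketch below is our own simplified version of that idea, not the official evaluation script, and the helper names are chosen for this example.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction, truth):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


predicted = "Hard rock cafe in new york ' s times square"
original = "Hard Rock Cafe"
print(exact_match(predicted, original))
print(round(f1_score(predicted, original), 2))
```

Here the prediction contains every reference token (full recall) but adds six extra ones, so the longer answer is penalized on precision even though a reader may find it more informative.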

Here is a small loop to test how well BERT understands context. I just wrapped the question-answering process in a loop to play around with the model.

				
					text = input("Please enter your text: \n")
question = input("\nPlease enter your question: \n")

while True:
    question_answer(question, text)

    # keep asking until we get a clear yes or no
    response = ""
    while response not in ("Y", "N"):
        reply = input("\nDo you want to ask another question based on this text (Y/N)? ")
        # take the first non-blank character so an empty reply does not crash
        response = reply.strip()[:1].upper()

    if response == "N":
        print("\nBye!")
        break

    question = input("\nPlease enter your question: \n")
				
			

And, the result!

				
					Please enter your text: 
New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium.

Please enter your question: 
when was auction held?

Answer:
Saturday

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
where was auction held?

Answer:
Hard rock cafe in new york ' s times square

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
How much profit earned from the auction?

Answer:
$ 2 million

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
Who bought the glove?

Answer:
Hoffman ma

Do you want to ask another question based on this text (Y/N)? N

Bye!
				
			

To make it easier for you, we have already created a Colab file which you can copy to your Google Drive and run. You can access the Colab file at: Question Answering System using BERT + SQuAD on Colab.

Feel free to post your doubts or questions in the comments. We would be glad to help you.

If you are looking for Chatbot Development or Natural Language Processing services, contact us or send your requirements to letstalk@pragnakalp.com. We would be happy to offer our expert services.

27 Comments

  • Pragnakalp Techlabs

    Without creating a bucket, are we able to run SQuAD or not?
    And you didn’t use any TensorFlow or PyTorch transformers. How is that possible?

    • Pragnakalp Techlabs

      Hello Uma,
      We have to create the bucket on Google Storage as the code doesn’t work with local files.
      And we are using tensorflow. You will find that in requirements.txt
      tensorflow >= 1.11.0
      So, tensorflow is being installed and used to run the script.

      • Pragnakalp Techlabs

        Hi , which exact version of Tensorflow are you using? I am using
        class AdamWeightDecayOptimizer(tf.train.Optimizer):
        AttributeError: module ‘tensorflow._api.v2.train’ has no attribute ‘Optimizer’

        • Pragnakalp Techlabs

          Hello Matteo,
          We use tensorflow version 1.14. Please use that.
          The error you have mentioned will be solved if you use 1.14. Give it a try and let us know please.

  • Pragnakalp Techlabs

    And how did you assign the ID numbers in the newly created JSON file?
    How do we know which ID number we have to give?

    • Pragnakalp Techlabs

      Hi Uma,
      We assigned the ID numbers randomly. You can use any ID of your choice, but each question must have a unique ID.

  • Pragnakalp Techlabs

    How did you generate the value for id in the testing file ?

    • Pragnakalp Techlabs

      Hi Sonam,
      We assigned the ID numbers randomly. You can use any ID of your choice, but each question must have a unique ID.

  • Pragnakalp Techlabs

    Hi, thanks for providing the code snippet, it is very helpful. I had one doubt: where do we get the output/answers for the questions used in the test file? I tried to check in the output/ folder but was not able to find the output.

    • Pragnakalp Techlabs

      Hello Yogesh,
      We are glad that you found our code snippet useful!

      Regarding your question, the output will be generated in the directory you passed in the --output_dir parameter. A directory named “output” will be created in the BERT folder of the Colab file system, and it contains three files:
      (i) eval.tf_record
      (ii) nbest_predictions.json
      (iii) predictions.json
      You need to check the predictions.json file for your answers.

      Please let us know if you need further assistance.

  • Pragnakalp Techlabs

    Hi, thank you for this article it was extremely helpful.

    For a QA system with a UI, I assume the backend runs the run_squad.py file programmatically. I was wondering how the hyperparameters are passed along with the command to run the file.

    • Pragnakalp Techlabs

      Thanks Sameer. We are glad that you found it useful.

      To run a Python file with parameters, you can use os.system(command). It lets you run the Python file as a command and pass the parameters along with it.

      Hope that answers your question!

      • Pragnakalp Techlabs

        I have another question.

        Is the run_squad.py command for running the model on the input.json file, as shown in the article, only for SQuAD version 2.0, or can the same command be used for SQuAD v1.1 (with a different checkpoint file, of course)? If not, could you tell me the command for v1.1?

        Thank you so much for your help.

        • Pragnakalp Techlabs

          Hi Sameer,
          You can use the same command for inference on SQuAD v1.1 too.

  • Pragnakalp Techlabs

    Hi. Thank you for your article.

    After training on SQuAD, we get 3 checkpoints with the following extensions: .ckpt-7299.data, .ckpt.index and .ckpt.meta. I wanted to ask which ckpt file to use while running “run_squad.py” for our input file.

    Thank you!

    • Pragnakalp Techlabs

      Hi Hemant,
      While running run_squad.py you will need all 3 files. Pass “model.ckpt-7299” in the “run_squad.py” command, and the script will use all 3 ckpt-7299 files automatically.

      Hope this answers your question. Please let us know if you have any other questions.

  • Pragnakalp Techlabs

    Hi, thank you so much for this incredible tutorial! I’m confused as to how to find my tpu_name — I’ve set up the GCS bucket, but haven’t figured out where to find the tpu_name address?

    • Pragnakalp Techlabs

      Hello C,
      You should get the TPU address in step “5) Set up your TPU environment”. Can you please try it and check?
      If you still don’t get the TPU address then take a screenshot of step 5 and post here please.

  • Pragnakalp Techlabs

    hi thanks for the blog..its very useful..
    Can you please guide me how to train squad1.1 for bert.
    I am getting error as KeyError: ‘is_impossible’..

    • Pragnakalp Techlabs

      Hi Anuj,
      We are glad that you found it useful. To train on SQuAD 1.1, you need to change the version_2_with_negative hyper-parameter from True to False, i.e. version_2_with_negative: False.

  • Pragnakalp Techlabs

    Great article! Thank you for sharing. It would be great if this ran on TF 2 as 1.14 seems very outdated at this time.

    • Pragnakalp Techlabs

      Hello Jean,

      If you want to run on TF 2, you can do that by replacing all the outdated method names in the BERT repo with their new equivalents.
      Do let us know how it works for you if you give it a try.

  • Pragnakalp Techlabs

    Hey! Great work! I couldn’t figure it out from the BERT repo but this was perfect! Can you tell me how much time it’ll take to run and what F1 score does this code achieve? Thank you!

    • Pragnakalp Techlabs

      Hi Bhargav,
      You may require about 2 hours on TPU. And we haven’t tried the F1 score but according to BERT official repo for BERT-Large, Uncased (Original) Model F1 score is 91.0.
      Hope that helps!

  • Pragnakalp Techlabs

    Hey! If I use Squad 2, I’m getting a warning “Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or …” and the training is stuck.
    In Squad 1.1, it’s stuck with this example “I0512 06:26:55.105924 140589080323968 run_squad.py:451] start_position: 53
    INFO:tensorflow:end_position: 54
    I0512 06:26:55.106061 140589080323968 run_squad.py:452] end_position: 54
    INFO:tensorflow:answer: february 1848
    I0512 06:26:55.106162 140589080323968 run_squad.py:454] answer: february 1848” The training is again stuck. Can you help me on this?

    • Pragnakalp Techlabs

      Hello Dharun,
      It looks like the TPU is not being allocated in your Colab file. You can try resetting the runtime (from the menu: RUNTIME -> Factory Reset Runtime) or use Colab with another Google account.

  • Pragnakalp Techlabs

    How can I do this in a closed domain way ? Like i would like to get answers from my own document for the questions asked.
