Question Answering System in Python using BERT NLP

Chatbots, Python Development, Machine Learning, Natural Language Processing (NLP)

What is a Question Answering system?

A Question Answering (QnA) model is one of the most basic systems in Natural Language Processing. In QnA, a machine-learning-based system generates answers to the questions posed as input, drawing them from a knowledge base or from text paragraphs. Various machine learning methods can be used to build Question Answering systems.


The objective is to create a Question Answering machine learning system that takes a comprehension passage and questions as input, processes the passage, and prepares answers from it.

Natural Language Processing lets us achieve this objective: NLP helps the system identify and understand the meaning of sentences in their proper context.

Implementation or usage of a QnA model in industry/projects

  • Develop a common-sense reasoning model that mimics human reasoning.
  • Prepare FAQs from a knowledge base, product manual or documentation.
  • Create smart chatbots that can answer FAQs for different industries such as Healthcare, Travel, Agriculture, Education, Manufacturing, Online commerce, etc.


With the massive growth of the web, we have a large amount of text data, but only some of it is annotated. A task in a field like Natural Language Processing needs a lot of annotated data for supervised learning, or unannotated data for unsupervised learning. Because annotated data is scarce, many researchers prefer unsupervised learning, and they have developed techniques for training general-purpose language representation models on the enormous amount of unannotated text on the web (known as pre-training). BERT is one such pre-trained model, developed by Google. It can be fine-tuned on new data to create NLP systems for question answering, text generation, text classification, text summarization and sentiment analysis. As BERT is trained on a huge amount of data, it makes the process of language modeling easier. The main benefit of using the pre-trained BERT model is a substantial accuracy improvement compared to training on these datasets from scratch.

BERT builds upon recent work in pre-training contextual representations. It is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. BERT represents each word using both its left and its right context. It is conceptually simple and empirically powerful. BERT improves on previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP, and it supports domain adaptation. The BERT paper establishes that, with a proper language-model training method, a Transformer (self-attention) based encoder can potentially be used as an alternative to previous language models.


An RNN (theoretically) gives infinite left context (the words to the left of the target word). But what we would like is to use both the left and the right context to see how well the word fits within the sentence.

An RNN is a network architecture, used for tasks such as translation, that processes language sequentially. This sequential nature makes it difficult to fully exploit parallel processing units like TPUs. RNNs suffer from vanishing and exploding gradient problems, and they have short-term memory: they are not good at remembering their inputs over a long period.

A Transformer network, by contrast, applies a self-attention mechanism that scans through every word and assigns attention scores (weights) to the words. Compared to recurrent neural network architectures, Transformers train more efficiently and capture long-distance dependencies better.
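The self-attention idea can be sketched in a few lines of plain Python. This is a deliberately simplified version: queries, keys and values are the input vectors themselves, whereas a real Transformer first applies learned projection matrices and uses several attention heads. The name `self_attention` and the toy embeddings are illustrative assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(x: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product self-attention without learned projections.

    Returns (weights, outputs): weights[i][j] is how much word i attends
    to word j, and outputs[i] is the weighted average of all word vectors.
    """
    d = len(x[0])
    weights = [softmax([dot(q, k) / math.sqrt(d) for k in x]) for q in x]
    outputs = [[sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
               for row in weights]
    return weights, outputs


# Toy 2-dimensional "embeddings" for a three-word sentence (made up).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` is computed independently of the others, which is why all positions can be processed in parallel, unlike the step-by-step computation of an RNN.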


The use of LSTM models restricts the prediction ability to a short range. BERT, in contrast, uses a “masked language model” (MLM) objective: some input tokens are masked and the model has to predict them. The MLM objective lets the representation combine both the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer.
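The masking step of the MLM objective can be sketched as follows. The BERT paper selects about 15% of the tokens; of those, 80% are replaced with [MASK], 10% with a random token and 10% are left unchanged. The sketch simplifies this by deciding independently per token and using whitespace tokens instead of WordPiece. The function name `mask_tokens` and the tiny vocabulary are illustrative assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: Sequence[str], vocab: Sequence[str],
                mask_prob: float = 0.15,
                seed: int = 0) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked_tokens, labels) for one masked-language-model example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


sentence = "the man went to the store to buy a gallon of milk".split()
toy_vocab = ["dog", "tree", "run", "blue"]  # made-up vocabulary
masked, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Because the model sees the whole sentence except the masked positions, it can use words on both sides of a gap when predicting the missing token.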


When applying fine-tuning based approaches to token-level tasks such as SQuAD question answering, it is crucial to incorporate context from both directions. OpenAI GPT, however, uses a left-to-right architecture in which every token can only attend to previous tokens in the self-attention layers of the Transformer.
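The difference between the two architectures can be pictured as an attention mask, where entry [i][j] is 1 if token i may attend to token j. The sketch below is only an illustration of that idea; the helper name `attention_mask` is an assumption and not taken from either model's code.

```python
from typing import List


def attention_mask(n: int, bidirectional: bool) -> List[List[int]]:
    """Build an n x n mask: 1 means token i may attend to token j."""
    return [[1 if bidirectional or j <= i else 0 for j in range(n)]
            for i in range(n)]


print("left-to-right (GPT-style):")
for row in attention_mask(4, bidirectional=False):
    print(row)

print("bidirectional (BERT-style):")
for row in attention_mask(4, bidirectional=True):
    print(row)
```

In the left-to-right mask, the first token sees only itself, so an answer span near the start of a passage cannot draw on the question or on the text that follows it.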

  • GPT uses a sentence separator ([SEP]) and a classifier token ([CLS]) which are only introduced at fine-tuning time.
  • BERT learns the [SEP] (separator token), [CLS] (classifier token) and sentence embeddings during pre-training.
  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments. BERT chooses a task-specific fine-tuning learning rate that performs best on the development set.
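For question answering, the special tokens and sentence embeddings listed above come together when the question and the passage are packed into one input sequence. The sketch below shows that layout under simplifying assumptions: tokens are split on whitespace rather than with WordPiece, and the function name `pack_inputs` is hypothetical.

```python
from typing import List, Tuple


def pack_inputs(question_tokens: List[str],
                context_tokens: List[str]) -> Tuple[List[str], List[int]]:
    """Pack a question and a passage as [CLS] question [SEP] passage [SEP].

    segment_ids mark which sentence each token belongs to: 0 for the
    question part (including [CLS] and the first [SEP]), 1 for the passage.
    """
    tokens = (["[CLS]"] + question_tokens + ["[SEP]"]
              + context_tokens + ["[SEP]"])
    segment_ids = ([0] * (len(question_tokens) + 2)
                   + [1] * (len(context_tokens) + 1))
    return tokens, segment_ids


question = "who developed bert ?".split()
passage = "bert was developed by google .".split()
tokens, segment_ids = pack_inputs(question, passage)
for tok, seg in zip(tokens, segment_ids):
    print(seg, tok)
```

The segment ids select which sentence embedding is added to each token, so the model can tell question tokens from passage tokens.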


  • Because we have a lot of training data, training is difficult even on a GPU, so we used Google’s TPU for the fine-tuning task.
  • Inference took a very long time. We therefore tuned the hyperparameters so that the system is accurate and returns results in acceptable time: we kept a log for each hyperparameter setting and chose the best-performing combination.
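A hyperparameter log of this kind could be kept with a small sweep helper like the one below. This is a generic sketch, not the procedure actually used here: the function names `sweep` and `pick_best`, the latency budget and the idea of passing in an `evaluate` callable are all assumptions.

```python
import itertools
import time
from typing import Callable, Dict, List, Optional


def sweep(grid: Dict[str, list],
          evaluate: Callable[[dict], float]) -> List[dict]:
    """Evaluate every combination in the grid and log score and run time."""
    log = []
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        started = time.perf_counter()
        score = evaluate(config)  # e.g. accuracy on a development set
        seconds = time.perf_counter() - started
        log.append({"config": config, "score": score, "seconds": seconds})
    return log


def pick_best(log: List[dict], max_seconds: float) -> Optional[dict]:
    """Return the highest-scoring entry within the time budget, if any."""
    within_budget = [entry for entry in log if entry["seconds"] <= max_seconds]
    if not within_budget:
        return None
    return max(within_budget, key=lambda entry: entry["score"])
```

A caller would pass a grid such as `{"max_seq_length": [128, 384], "batch_size": [8, 16]}` (hypothetical names and values) together with a function that runs inference for one configuration and returns its score.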


Our version of QnA using BERT can be tested at the BERT NLP QnA Demo using Python.

(As the system is hosted on a low-end server, it currently takes around 50 seconds to process the sample comprehension and prepare answers from it. With more resources, the process can be completed in less time.)

Future Roadmap/improvement plan

  • Train the model further on CoQA to improve accuracy
  • Improve the inference speed to make the system production ready
  • Multi-language support (for example, Hindi or Gujarati comprehensions should also work)
  • Investigate the linguistic phenomena that may or may not be captured by this system.
  • Add voice assistant support.


Cover Image by geralt on Pixabay