In our previous case study on BERT-based QnA, Question Answering System in Python using BERT NLP, developing a chatbot using BERT was listed on the roadmap, and here we are, inching closer to one of our milestones: reducing the inference time. The QnA demo was taking approximately 23–25 seconds per answer, which we wanted to bring down to less than 3 seconds.

Approaches we tried

1. We analysed that the pretrained BERT-Large model is 1.2 GB, and after fine-tuning it on our dataset the new fine-tuned checkpoint grew to ~4.0 GB. We realised that loading this huge model was taking too much time, so we started by optimising the model itself. One of the approaches we tried was excluding all the Adam optimizer variables from the checkpoint, which reduced its size from 4.0 GB to 1.3 GB (a sketch of this is shown below). However, reducing the model size made no difference to the inference time; it was the same as before.
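
Below is a rough sketch of how the Adam optimizer slots can be stripped out of a TensorFlow 1.x BERT checkpoint. The checkpoint paths and the strip_adam_vars helper are placeholders for illustration, not our actual files.

```python
# Sketch, assuming TensorFlow 1.x (the original BERT codebase): copy every
# variable except the Adam slots into a new, smaller checkpoint.
import tensorflow as tf

def strip_adam_vars(ckpt_in, ckpt_out):
    tf.reset_default_graph()
    kept_vars = []
    for name, _ in tf.train.list_variables(ckpt_in):
        # Adam keeps two extra slots (adam_m, adam_v) per trainable weight;
        # they are only needed for training, not for inference.
        if "adam_m" in name or "adam_v" in name:
            continue
        value = tf.train.load_variable(ckpt_in, name)
        kept_vars.append(tf.Variable(value, name=name))

    saver = tf.train.Saver(kept_vars)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, ckpt_out)

strip_adam_vars("finetuned/model.ckpt", "finetuned/model_slim.ckpt")
```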

2. After that we tried TensorFlow Serving, i.e. exporting the model as a .pb file and serving it through TensorFlow Serving, which is what is generally used in production. This also made no difference to the inference time. We then picked up another approach from the BERT GitHub issues: bert-as-service, which uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations (a minimal client snippet is shown below). That did not give us the desired result either.
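
For reference, this is roughly how the bert-as-service client is used once the server is running (started separately, e.g. with bert-serving-start -model_dir /path/to/bert -num_worker=1); the model path and the example sentence are placeholders.

```python
# Minimal bert-as-service client sketch: it returns a fixed-length vector per
# sentence, i.e. sentence encodings rather than extracted answer spans.
from bert_serving.client import BertClient

bc = BertClient()  # connects to the ZeroMQ server on localhost by default
vectors = bc.encode(["How can we reduce BERT inference time?"])
print(vectors.shape)  # one fixed-length embedding per sentence, e.g. (1, 768) for BERT-Base
```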

3. We also tried NVIDIA's Real-Time Natural Language Understanding with BERT Using TensorRT (https://devblogs.nvidia.com/nlu-with-tensorrt-bert/), but implementing it requires substantial hardware, and we wanted to reach our target with minimal resources, so we dropped it.

4. We then moved to DistilBERT from Hugging Face: a small, fast, cheap and light Transformer model based on the BERT architecture, with 40% fewer parameters than bert-base-uncased. DistilBERT is trained using knowledge distillation, a technique for compressing a large model (the teacher) into a smaller model (the student). By distilling BERT, we obtain a smaller Transformer that bears a lot of similarities to the original BERT model while being lighter, smaller and faster to run. DistilBERT already has fine-tuned question-answering models available, and using one of them (see the snippet below) cost us a little accuracy but brought the inference time down to approximately 6 seconds. However, DistilBERT does not currently support multiple languages, and our roadmap was not only to reduce the inference time but also to make the Question & Answering system work in various languages, so we needed to find another way.
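
As an illustration, such a ready-made checkpoint can be loaded through the Transformers question-answering pipeline; the model name distilbert-base-uncased-distilled-squad and the sample question and context below are examples, not necessarily the exact setup we used.

```python
# Sketch: DistilBERT fine-tuned on SQuAD, served through the HF pipeline API.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-uncased-distilled-squad")

result = qa(question="How is DistilBERT trained?",
            context="DistilBERT is trained using knowledge distillation, a "
                    "technique to compress a large teacher model into a "
                    "smaller student model.")
print(result["answer"], result["score"])
```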

5. We studied the Transformers library from Hugging Face and realised that we can load the model once up front and then only run inference per request, which keeps the loading time out of the response time; this GitHub repo by Kamal Raj helped us set that up. First we developed a system in which one provides a paragraph and a single question, and it answers the question in less than 1 second. We then configured it to accept up to 5 questions and observed that the response time grows as the number of questions increases. We also tested paragraphs of different lengths and observed that the time grows with the paragraph length as well. That overhead can be absorbed by using a GPU: even with a low-configuration GPU we still get answers in less than a second. A sketch of the load-once pattern is given below.
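
A minimal sketch of the load-once, infer-many pattern is given below, assuming PyTorch and the Transformers library; the SQuAD-fine-tuned checkpoint name and the answer() helper are illustrative rather than the exact model we deploy.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Done once at service start-up, so model loading never counts against
# per-request latency.
MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def answer(question, paragraph):
    # Per request: only tokenisation plus a single forward pass.
    inputs = tokenizer(question, paragraph, return_tensors="pt",
                       truncation=True, max_length=384).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax()) + 1
    return tokenizer.decode(inputs["input_ids"][0][start:end],
                            skip_special_tokens=True)

print(answer("How fast is the answer returned?",
             "After preloading the model, each question is answered in "
             "under one second on a modest GPU."))
```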

Further Roadmap

  • It currently works only on paragraphs up to about 1,000 characters; longer paragraphs take longer to process. We are working on increasing the supported paragraph length to 1 million characters.
  • It is not fully conversational: it answers only from the supplied paragraph. We want to make it conversational so that it is not a simple QnA system but a proper conversational AI bot.

Demo

The BERT Chatbot Demo is available here.

The demo is set up on a server with very minimal resources, so it still takes 3-4 seconds. On a local system, the response comes back in less than 2 seconds.

On a system with an RTX 2080 GPU, we achieved an inference time of less than 2 seconds on a 10,000-character paragraph in English and less than 1 second on a 10,000-character paragraph in other languages.