March 12, 2020

Continuing to explore the capabilities of Google’s pre-trained model BERT (github), we look at how well it finds entities in a sentence.

What is NER?

In any text, some terms are more informative and unique in context than others. Named Entity Recognition (NER), also known as entity extraction or entity chunking, is the process in which an algorithm extracts real-world noun entities from text and classifies them into predefined categories such as person, place, time, and organization.

Importance of NER in NLP

Natural Language Processing includes various tasks such as Machine Translation, Question Answering, Sentiment Analysis, and Part-of-speech (POS) Tagging, all aimed at better understanding and processing of language. NER is also one of these NLP tasks. It is a subtask of Information Extraction (IE). Many blogs, articles, and other long pieces of content are posted on websites, web portals, and social media every day. NER is the right tool for finding the people, organizations, places, times, and similar information mentioned in an article, and for extracting and categorizing the key information in long descriptions. NER can also be used in NLP tasks such as text summarization, information retrieval, question answering, semantic parsing, and coreference resolution.

What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a general-purpose language model trained on a large dataset. The pre-trained model can be fine-tuned for different tasks such as sentiment analysis, question answering, and sentence classification. BERT is a state-of-the-art method for transfer learning in NLP.

For our demo, we used the BERT-base uncased model, as provided by HuggingFace, as the base model. It has 110M parameters, 12 layers, a hidden size of 768, and 12 attention heads.
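As a rough sanity check on the 110M figure, the parameter count can be reconstructed from the architecture. The sketch below takes the 12 layers and hidden size of 768 from the text; the vocabulary size, maximum sequence length, number of segment types, and feed-forward size are not stated in this post and are assumptions based on the published BERT-base uncased configuration.

```python
# Rough parameter count for a BERT-base encoder.
# 12 layers and a hidden size of 768 come from the text above;
# the remaining sizes are assumptions (published bert-base-uncased config).
VOCAB_SIZE = 30522         # assumption: WordPiece vocabulary size
MAX_POSITIONS = 512        # assumption: maximum sequence length
TOKEN_TYPES = 2            # assumption: segment A / segment B
HIDDEN = 768
LAYERS = 12
INTERMEDIATE = 4 * HIDDEN  # assumption: feed-forward size (3072)


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


# Word, position and segment embeddings, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TOKEN_TYPES) * HIDDEN + layer_norm(HIDDEN)

per_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value and attention output
    + layer_norm(HIDDEN)
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward projection
    + layer_norm(HIDDEN)
)

pooler = linear(HIDDEN, HIDDEN)

total_params = embeddings + LAYERS * per_layer + pooler
print(f"{total_params:,} parameters (~{total_params / 1e6:.1f}M)")
```

The number of attention heads does not appear in the calculation because the heads only split the 768-dimensional projections into 12 slices; they do not add parameters.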

Datasets for NER

There are many datasets for fine-tuning a supervised BERT model. The most basic one is CONLL 2003, which concentrates on four types of named entities: persons, locations, organizations, and miscellaneous entities. CONLL 2003 follows the BIO schema, and each line contains four columns separated by a single space: the word, its part-of-speech tag, its syntactic chunk tag, and its named entity tag.

BIO (Beginning, Inside, Outside) is a common format for tagging sentence tokens for NER. The B- prefix marks the first token of an entity chunk, the I- prefix marks a token inside a chunk, and the O tag marks a token that does not belong to any entity.

Let’s take an example. For the input “Joseph Wu Is Chairman of Taiwan’s Mainland Affairs Council”, the entities would be:

Joseph B-PER
Wu I-PER
Is O
Chairman O
of O
Taiwan B-LOC
’s O
Mainland B-ORG
Affairs I-ORG
Council I-ORG
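To turn BIO tags like these back into whole entities, consecutive B- and I- tokens of the same type have to be grouped together. The following is a minimal sketch of that grouping, not the code used in our system; the function name `bio_to_entities` is made up for illustration.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type

        if not continues and current_words:
            # The running entity ends here.
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None

        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new entity; a stray I- tag is treated like B-.
            current_words, current_type = [token], entity_type
        elif continues:
            current_words.append(token)

    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


tokens = ["Joseph", "Wu", "Is", "Chairman", "of", "Taiwan", "'s",
          "Mainland", "Affairs", "Council"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O",
        "B-ORG", "I-ORG", "I-ORG"]

entities = bio_to_entities(tokens, tags)
for text, entity_type in entities:
    print(entity_type, text)
```

For the example sentence this yields three entities: a person (Joseph Wu), a location (Taiwan), and an organization (Mainland Affairs Council).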

We converted this dataset into one with only two columns: the word and its named entity tag.
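The conversion from four columns to two could look roughly like the sketch below. It assumes blank lines separate sentences and that `-DOCSTART-` lines mark document boundaries, as in the CONLL 2003 files; the sample lines and the function name `conll_to_word_tag` are made up for illustration.

```python
def conll_to_word_tag(lines):
    """Reduce four-column CoNLL-style lines to (word, NER tag) pairs per sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines separate sentences; document markers carry no tokens.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Keep the first column (word) and the last column (NER tag).
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the four-column layout (word, POS tag, chunk tag, NER tag).
sample = """-DOCSTART- -X- -X- O

Joseph NNP B-NP B-PER
Wu NNP I-NP I-PER
spoke VBD B-VP O

Taiwan NNP B-NP B-LOC
""".splitlines()

two_column = conll_to_word_tag(sample)
for sentence in two_column:
    for word, tag in sentence:
        print(word, tag)
    print()
```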

The CONLL 2003 dataset has only 4 entity types. To increase the number of entity categories, we merged it with 4 other datasets: OntoNotes 5.0, GMB (Groningen Meaning Bank), NAACL 2019, and WNUT 2017.

CONLL 2003

Entities: Miscellaneous, Person, Location, Organization.

OntoNotes 5.0

Entities: Organization, Art Work, Numbers in word, Numbers, Quantity, Person, Location, Geopolitical Entity, Time, Date, Facility, Event, Law, Nationalities or religious or political groups, Language, Currency, Percentage, Product.

GMB (Groningen Meaning Bank)

Entities: Natural Phenomenon, Person, Geographical, Organization, Art Work, Event, Time, Geopolitical.

NAACL 2019

Entities: Organization, Person, Location, Geopolitical, Facility, Vehicles.

WNUT 2017

Entities: Location, Person, Product, Groups, Corporations, Creative.

These datasets all had different formats, so we merged and converted them into a single format. We did not use all five datasets in full; instead, we selected a part of each based on the number of entities, to generate a more balanced dataset.
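Merging corpora with different label names mostly comes down to translating every tag into one shared label set. The sketch below illustrates the idea only: the mapping table, the dataset keys, and the function names are hypothetical, and the real mapping depends on how each corpus names its entity types.

```python
# Hypothetical mapping from dataset-specific entity names to one shared set.
# The entries are illustrative, not the mapping used for our merged dataset.
LABEL_MAP = {
    "conll2003": {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"},
    "gmb": {"per": "PERSON", "geo": "LOCATION", "org": "ORGANIZATION", "tim": "TIME"},
}


def unify_tag(dataset, tag):
    """Translate one BIO tag into the shared scheme; unmapped types become O."""
    if tag == "O":
        return "O"
    prefix, _, entity_type = tag.partition("-")
    target = LABEL_MAP[dataset].get(entity_type)
    return f"{prefix}-{target}" if target else "O"


def merge(datasets):
    """Combine {dataset name: sentences} into one list of (word, tag) sentences."""
    merged = []
    for name, sentences in datasets.items():
        for sentence in sentences:
            merged.append([(word, unify_tag(name, tag)) for word, tag in sentence])
    return merged


# Made-up toy sentences standing in for two differently labelled corpora.
merged = merge({
    "conll2003": [[("Joseph", "B-PER"), ("Wu", "I-PER")]],
    "gmb": [[("London", "B-geo"), ("rain", "B-nat")]],
})
for sentence in merged:
    print(sentence)
```

In this sketch, entity types without an entry in the table are dropped to O. Whether to drop such types or keep them as separate classes is a design decision that directly changes the final number of entity classes.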

The final merged dataset contains more than 40K sentences and has a total of 17 entities with 45 tags (as per the BIO schema).

Fine-tuning

BERT is a powerful NLP model, but using it for NER without fine-tuning it on an NER dataset won’t give good results.

So, once the dataset was ready, we fine-tuned the BERT model.

We used our merged dataset to fine-tune the model to detect entities and classify them into 22 entity classes.
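One practical detail when fine-tuning BERT for NER is that its tokenizer splits words into sub-tokens, while the dataset has one tag per word. A common approach, not necessarily the one used in our training code, is to label only the first sub-token of each word and mask the rest so they are ignored by the loss. In the sketch below, `toy_wordpiece` is a made-up stand-in for a real WordPiece tokenizer, and -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def toy_wordpiece(word):
    """Hypothetical stand-in for a WordPiece tokenizer: splits every 4 characters."""
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(words, tags, label2id, tokenize=toy_wordpiece):
    """Expand word-level tags to sub-token level, labelling only first pieces."""
    tokens, label_ids = [], []
    for word, tag in zip(words, tags):
        pieces = tokenize(word)
        tokens.extend(pieces)
        # The first piece carries the word's label; the others are masked.
        label_ids.append(label2id[tag])
        label_ids.extend([IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


label2id = {"O": 0, "B-PER": 1, "I-PER": 2}
tokens, label_ids = align_labels(
    ["Joseph", "Wu", "spoke"], ["B-PER", "I-PER", "O"], label2id
)
for token, label_id in zip(tokens, label_ids):
    print(token, label_id)
```

At prediction time the same rule is applied in reverse: the tag predicted for the first sub-token is taken as the tag of the whole word.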

In the evaluation of the fine-tuned model, we got an accuracy of 93.11%.
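An accuracy figure for a tagger can be computed in several ways. The sketch below shows the simplest one, token-level accuracy, on made-up tags; it is an assumption that a comparable measure underlies the figure above.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Made-up gold and predicted tags for a single four-token sentence.
gold = [["B-PER", "I-PER", "O", "O"]]
predicted = [["B-PER", "O", "O", "O"]]

accuracy = token_accuracy(gold, predicted)
print(f"Token accuracy: {accuracy:.2%}")
```

Because most tokens in ordinary text carry the O tag, token-level accuracy tends to look high even when some entities are missed, which is why entity-level precision, recall, and F1 are often reported alongside it.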

Demo

If you are eager to know how the NER system works and how accurate our trained model’s results are, have a look at our demo:

BERT Based Named Entity Recognition Demo

To test the demo, provide a sentence in the Input text section and hit the submit button. In a few seconds, you will see results containing the words and their entities.

The fine-tuned model used in our demo can find the entities below:

  • Person
  • Facility
  • Location
  • Organization
  • Work Of Art
  • Event
  • Date
  • Time
  • Nationality / Religious / Political group
  • Law Terms
  • Product
  • Percentage
  • Currency
  • Language
  • Quantity
  • Ordinal Number
  • Cardinal Number

We would love to get your feedback on our demo. Do check out our BERT-based Named Entity Recognition demo and let us know what you think in the comment section below.

Make your own NER using BERT + CONLL

We have created a Colab file that you can use to easily make your own NER system:

BERT Based NER on Colab

It includes training and fine-tuning of BERT on the CONLL dataset using the transformers library by HuggingFace.

Further Roadmap

We believe in the philosophy that “there is always scope for improvement!”

This is the initial version of the NER system we have created using BERT, and we have already planned many improvements to it:

  • Add as many entities as possible to categorize the entities more specifically.
  • Find or prepare a good dataset in other languages, then fine-tune a model for those languages.
  • Fine-tune the model on domain-specific datasets such as medical, political, and educational text.

Purchase BERT Based NER

If you liked our demo and want to set up the same on your own server, then you can purchase it.

The basic version with 4 entities can easily be created using the Colab file shared above, so if that is all you need, there is no need to purchase. If you want more entities, you can buy our model, which is fine-tuned on 5 datasets.

Find more details on Buy BERT based Named Entity Recognition (NER) fine-tuned model and PyTorch based Python + Flask code.

Acknowledgment

We are thankful to Google Research for releasing BERT, HuggingFace for open-sourcing the PyTorch transformers library, and Kamalraj for his fantastic work on BERT-NER.

If you are looking for a custom BERT-based NER system, contact us or send an email to letstalk@pragnakalp.com to avail of our Natural Language Processing services.

