What is NER?
In any text, some terms are more informative and unique in context than others. Named Entity Recognition (NER), also known as entity extraction/chunking, is the process by which an algorithm extracts real-world noun entities from text data and classifies them into predefined categories like person, place, time, organization, etc.
Importance of NER in NLP
Natural Language Processing includes various tasks like Machine Translation, Question Answering, Sentiment Analysis, Part-of-Speech (POS) Tagging, etc. for better understanding and processing of language. NER is one of these NLP tasks: a sub-task of Information Extraction (IE). Many blogs, articles, and other long pieces of content are posted on websites, web portals, and social media every day. NER is the right tool to find the people, organizations, places, times, and other information mentioned in an article, extract the key facts from long descriptions, and categorize them. NER is also used in NLP tasks such as text summarization, information retrieval, question answering, semantic parsing, and coreference resolution.
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a general-purpose language model trained on a large corpus. The pre-trained model can be fine-tuned and used for different tasks such as sentiment analysis, question answering, sentence classification, and others. BERT is a state-of-the-art method for transfer learning in NLP.
For our demo, we have used the BERT-base uncased model trained by HuggingFace as the base model, with 110M parameters, 12 layers, 768 hidden units, and 12 attention heads.
Datasets for NER
There are many datasets for fine-tuning a supervised BERT model. The most basic is CONLL 2003, which covers four types of named entities: persons, locations, organizations, and miscellaneous entities. CONLL 2003 follows the BIO schema and contains four columns separated by a single space.
BIO (Beginning, Inside, Outside) is a common tagging format for labelling sentence tokens for NER. The B- prefix marks a token at the beginning of a chunk, the I- prefix marks tokens inside a chunk, and the O tag marks tokens that belong to no entity.
Let’s take an example: for the input “Joseph Wu Is Chairman of Taiwan’s Mainland Affairs Council”, the entities would be:
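Since the original tag table is not reproduced here, the following is an illustrative BIO tagging of that sentence, assuming CoNLL-style PER/ORG/LOC labels (the exact tags depend on the tag set used):

```python
# Illustrative BIO tagging of the example sentence.
# PER/LOC/ORG labels are assumptions based on a CoNLL-style tag set.
tagged = [
    ("Joseph", "B-PER"), ("Wu", "I-PER"),
    ("Is", "O"), ("Chairman", "O"), ("of", "O"),
    ("Taiwan's", "B-LOC"),
    ("Mainland", "B-ORG"), ("Affairs", "I-ORG"), ("Council", "I-ORG"),
]
```

Note how multi-word entities ("Joseph Wu", "Mainland Affairs Council") start with a B- tag and continue with I- tags, while non-entity tokens get O.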
We converted this dataset into one containing only two columns: the word and its named-entity tag.
The CONLL 2003 dataset has only 4 entities. To increase the number of entity categories, we merged 4 other datasets: OntoNotes 5.0, GMB (Groningen Meaning Bank), NAACL 2019, and WNUT 2017.
CONLL 2003
Entities: Miscellaneous, Person, Location, Organization.
OntoNotes 5.0
Entities: Organization, Art Work, Numbers in word, Numbers, Quantity, Person, Location, Geopolitical Entity, Time, Date, Facility, Event, Law, Nationalities or religious or political groups, Language, Currency, Percentage, Product.
GMB (Groningen Meaning Bank)
Entities: Natural Phenomenon, Person, Geographical, Organization, Art Work, Event, Time, Geopolitical.
NAACL 2019
Entities: Organization, Person, Location, Geopolitical, Facility, Vehicles.
WNUT 2017
Entities: Location, Person, Product, Groups, Corporations, Creative.
All these datasets had different formats, so we merged them and converted them into a single format. We did not use the whole of all five datasets; instead, we selected parts of them based on the number of entities, to generate an unbiased dataset.
The final merged dataset, with more than 40K sentences, has a total of 17 entities with 45 tags (as per the BIO schema).
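Merging datasets with different label inventories means mapping each source-specific tag onto one shared label set. The mapping below is illustrative only, not the exact one used for the merged dataset:

```python
# Sketch: unify tags from different datasets into one shared label set.
# These mappings are illustrative examples, not the actual mapping used.
TAG_MAP = {
    "B-geo": "B-LOC",           # GMB "geographical" -> location
    "I-geo": "I-LOC",
    "B-corporation": "B-ORG",   # WNUT "corporation" -> organization
    "I-corporation": "I-ORG",
}

def unify(tags):
    # tags not in the map are assumed to already use the shared label set
    return [TAG_MAP.get(t, t) for t in tags]

print(unify(["B-geo", "I-geo", "O"]))  # ['B-LOC', 'I-LOC', 'O']
```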
BERT is a powerful NLP model, but using it for NER without fine-tuning on an NER dataset won't give good results.
So, once the dataset was ready, we fine-tuned the BERT model.
We used the merged dataset we generated to fine-tune the model to detect entities and classify them into 22 entity classes.
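One fiddly step in fine-tuning BERT for NER is that its WordPiece tokenizer splits words into sub-tokens, while the tags are per word. A common approach (and a sketch of what a fine-tuning script has to do; the helper name is ours) is to give only a word's first sub-token its tag and mask the rest with -100, which transformers' loss function ignores:

```python
# Sketch: align word-level BIO tags to BERT sub-word tokens.
# word_ids mimics the output of
# tokenizer(..., is_split_into_words=True).word_ids()
def align_labels(word_ids, word_labels, ignore_index=-100):
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                # special tokens like [CLS]/[SEP]
            aligned.append(ignore_index)
        elif wid != prev:              # first sub-token of a word keeps the tag
            aligned.append(word_labels[wid])
        else:                          # continuation sub-tokens are masked
            aligned.append(ignore_index)
        prev = wid
    return aligned

# "Joseph Wu" -> [CLS] jo ##seph wu [SEP]
print(align_labels([None, 0, 0, 1, None], ["B-PER", "I-PER"]))
# [-100, 'B-PER', -100, 'I-PER', -100]
```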
In the evaluation of the fine-tuned model, we got an accuracy of 93.11%.
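Assuming the 93.11% figure is token-level tag accuracy (the simplest NER metric; entity-level precision/recall/F1 are stricter alternatives), it amounts to this computation:

```python
# Token-level tag accuracy: the fraction of tokens whose predicted
# tag matches the gold tag. Example tags below are made up.
def tag_accuracy(gold, pred):
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

gold = ["B-PER", "I-PER", "O", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "B-ORG"]
print(tag_accuracy(gold, pred))  # 0.8
```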
If you are eager to know how the NER system works and how accurate our trained model's results are, have a look at our demo:
To test the demo, provide a sentence in the Input text section and hit the Submit button. In a few seconds, you will have results containing the words and their entities.
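Turning per-token BIO tags into the word/entity pairs shown in the demo output requires merging each B-/I- run back into one span. A minimal sketch of that post-processing step (the function name and example tags are ours):

```python
# Sketch: merge BIO-tagged tokens back into (text, entity) spans.
def bio_to_spans(tokens, tags):
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                spans.append(current)
            current = (tok, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + tok, current[1])  # continue entity
        else:                             # O tag or inconsistent I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

tokens = ["Joseph", "Wu", "is", "chairman", "of", "Taiwan"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))  # [('Joseph Wu', 'PER'), ('Taiwan', 'LOC')]
```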
The fine-tuned model used in our demo is capable of finding the entities below:
- Work Of Art
- Nationality / Religious / Political group
- Law Terms
- Ordinal Number
- Cardinal Number
We would love to get your feedback on our demo. Do check out our demo of the BERT-based Named Entity Recognition system and let us know in the comment section below.
Make your own NER using BERT + CONLL
We have created a Colab notebook with which you can easily build your own NER system:
It covers training and fine-tuning BERT on the CONLL dataset using the transformers library by HuggingFace.
We believe in the philosophy that there is always scope for improvement! This is the initial version of the NER system we have created using BERT, and we have already planned many improvements:
- Add as many entities as possible, to categorize entities in a more specific manner.
- Find or prepare a good dataset in another language, then fine-tune a model for that language.
- Fine-tune the model for domain-specific datasets like medical, political, education, etc.
Purchase BERT Based NER
If you liked our demo and want to set up the same on your own server, then you can purchase it.
The basic version with 4 entities can be created easily using the Colab file we have shared above, so if that is all you need, there is no need to purchase. If you want more entities, you can buy our model, which is fine-tuned on 5 datasets.