Hello folks! We are glad to introduce another blog on NER (Named Entity Recognition). After the successful implementation of a model that recognises 22 regular entity types, which you can find here – BERT Based Named Entity Recognition (NER) – we have now tried to implement a domain-specific NER system. It reduces the manual labour of building domain-specific dictionaries.
Why NER in the biomedical domain?
Nowadays, research in the biomedical field is expanding and its literature is growing rapidly. With the progress in NLP (Natural Language Processing) and deep learning, we decided to develop an NER system that recognises and classifies biomedical substances in text. This Bio-NER system can be used in areas such as question-answering systems, summarization systems and many other areas of domain-dependent NLP research.
BioBERT is a model pre-trained on biomedical datasets. For pre-training, the weights of the regular BERT model were taken and then further trained on medical corpora (PubMed abstracts and PMC full-text articles). This domain-specific pre-trained model can be fine-tuned for many tasks, such as NER (Named Entity Recognition), RE (Relation Extraction) and QA (Question Answering). Analyses have shown that the fine-tuned BioBERT model outperforms a fine-tuned BERT model on biomedical domain-specific NLP tasks.
Datasets for Bio-NER
In our previous NER model, we used multiple datasets to increase the number of entity types. We apply the same idea here to increase both the entity types and the dataset size for better results.
Hence, we searched for different biomedical domain-specific NER datasets and found several good ones: AnatEM, BC2GM, BC4CHEMD, BC5CDR, BIONLP09, BIONLP11ID, BIONLP13CG, BIONLP13PC, CRAFT and NCBI. These datasets are available in two formats, the BIOES schema and the BIO schema; we used the BIO schema for our task. In an earlier NER blog we already discussed what the BIO schema is, which you can find here.
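As a quick refresher, in the BIO schema every token gets a B-&lt;type&gt; tag (beginning of an entity), an I-&lt;type&gt; tag (inside an entity) or O (outside any entity). A small illustrative example (the sentence and its tags are made up for demonstration):

```python
# Example of the BIO schema: B-<type> marks the first token of an entity,
# I-<type> marks subsequent tokens of the same entity, O marks non-entities.
tokens = ["Aspirin", "inhibits", "cyclooxygenase", "-", "1", "activity", "."]
labels = ["B-Chemical", "O", "B-Gene", "I-Gene", "I-Gene", "O", "O"]

# One token-label pair per line; sentences are separated by a blank line.
for token, label in zip(tokens, labels):
    print(f"{token} {label}")
```

Here "Aspirin" is a single-token Chemical entity, while "cyclooxygenase - 1" is a multi-token Gene entity spanning one B- tag and two I- tags.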
Let’s go through all the datasets to get an idea of which entities we used from each one to increase the entity types in the Bio-NER model.
- AnatEM (Anatomical Entity Mention) Corpus:
- It contains sentences and entity annotations related to anatomy. We used this dataset for the entity ‘Anatomy’.
- BC2GM (BioCreative II Gene Mention) Corpus:
- This dataset contains information related to genetic terms. From this dataset we used the entity ‘Gene’.
- BC4CHEMD (BioCreative IV Chemical and Drug) Corpus:
- As the name suggests, this dataset contains information related to chemicals and drugs. We used the ‘Chemical’ entity from this dataset.
- BC5CDR (BioCreative V Chemical Disease Relation) Corpus:
- This dataset contains two kinds of information: one for chemicals and the other for chemical–disease relations. From this dataset we got two entities, ‘Chemical’ and ‘Disease’.
- BIONLP09 (Bio-Medical Natural Language Processing) Corpus:
- The BIONLP09 data preparation was part of the GENIA event corpus. From this corpus, we used the ‘Protein’ annotated entity for the model.
- BIONLP11ID (BioNLP 2011 Infectious Diseases) Corpus:
- The BioNLP 2011 shared task focused on data for entity-rich tasks such as Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI). We used the ‘Chemical’, ‘Protein’ and ‘Organism’ entities from this dataset.
- BIONLP13 (BioNLP 2013) Corpus:
- The BioNLP 2013 shared task focused on constructing six datasets; we used two of them for our task: Cancer Genetics (CG) and Pathway Curation (PC).
- BioNLP13CG Corpus:
- From this corpus, we have used ‘Anatomy’, ‘Gene’, ‘Chemical’, ‘Organism’, ‘Cancer’, ‘Organ’, ‘Cell’, ‘Tissue’ and ‘Pathology’ entities for the model.
- BioNLP13PC Corpus:
- The entities we used from this dataset are ‘Gene’, ‘Chemical’, ‘Cell’ and ‘Complex’.
- CRAFT (Colorado Richly Annotated Full Text) Corpus:
- CRAFT covers almost all the major biomedical terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, Entrez Gene database entries, and the three sub-ontologies of the Gene Ontology. For our project, we used the ‘Gene’, ‘Chemical’, ‘Protein’, ‘Organism’, ‘Cell’ and ‘Taxon’ entities.
- NCBI (National Center for Biotechnology Information) Disease corpus:
- As the name suggests, NCBI built this corpus around diseases, hence we used this dataset for the entity ‘Disease’.
We used the same fine-tuning method as for our previous model, which requires all the data in the CoNLL format. However, not all of the selected datasets are available in that format, so we merged them all and converted the result into CoNLL format. The final merged dataset contains more than 69K sentences and covers 13 entity types with 27 tags (as per the BIO schema).
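The merging step can be sketched roughly as follows. The tag-remapping table below is an assumption for illustration, since each corpus names its tags differently and they must be renamed onto one shared entity set:

```python
# Illustrative sketch: merge several BIO-tagged corpora into one
# CoNLL-style file, remapping each corpus's tag names onto our shared
# entity set. The TAG_MAP entries here are hypothetical examples.
TAG_MAP = {
    "B-GENE": "B-Gene", "I-GENE": "I-Gene",
    "B-CHEM": "B-Chemical", "I-CHEM": "I-Chemical",
}

def merge_conll(input_paths, output_path):
    """Concatenate token/tag files, one token-tag pair per line,
    blank lines marking sentence boundaries."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:                 # sentence boundary
                        out.write("\n")
                        continue
                    parts = line.split()
                    token, tag = parts[0], parts[-1]
                    out.write(f"{token} {TAG_MAP.get(tag, tag)}\n")
            out.write("\n")                      # boundary between corpora
```

This is only a sketch of the idea; the real corpora also need per-dataset quirks handled (comment lines, different column counts, and so on).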
In our dataset there are 13 entity types:
Fine-tuning the BioBERT model
As discussed earlier, to fulfil the NER task we fine-tuned the pre-trained BioBERT model, which is trained on biomedical data. For the fine-tuning, we used the merged dataset explained above. The fine-tuned model can identify biomedical terms and classify them into the 13 entity types.
For the fine-tuning, we used Hugging Face’s NER method on our dataset. This method is implemented in PyTorch, so we needed the pre-trained model in PyTorch format; BioBERT, however, is pre-trained using TensorFlow, which gives us a .ckpt file. To use it with Hugging Face’s PyTorch code, we had to convert it to a .bin file.
So we converted the TensorFlow checkpoint into a PyTorch model using the script given here.
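Under the hood, Hugging Face’s NER fine-tuning wraps BERT with a token-classification head. A minimal sketch of what that looks like, using a tiny randomly initialised config instead of the converted BioBERT checkpoint (in practice you would load the converted weights with `BertForTokenClassification.from_pretrained(...)`):

```python
import torch
from transformers import BertConfig, BertForTokenClassification

# Tiny randomly initialised config standing in for the converted BioBERT
# checkpoint. num_labels=27 matches our 13 entity types in BIO schema
# (a B- and an I- tag per type) plus the O tag.
config = BertConfig(
    vocab_size=1000, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64, num_labels=27,
)
model = BertForTokenClassification(config)

# One dummy batch: token ids and per-token label ids.
input_ids = torch.randint(0, 1000, (2, 16))
labels = torch.randint(0, 27, (2, 16))

outputs = model(input_ids=input_ids, labels=labels)
# outputs.loss is the cross-entropy the fine-tuning loop minimises;
# outputs.logits has shape (batch, seq_len, num_labels).
```

The actual training loop (batching the CoNLL data, optimiser, learning-rate schedule) is handled by the Hugging Face NER example script we used.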
Now it’s time to check the predictions of this fine-tuned model. For prediction, we took help from Kamal Raj’s repository.
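Once the model emits one BIO tag per token, adjacent B-/I- tags still have to be grouped into entity spans before being shown to the user. A hypothetical helper illustrating that post-processing step:

```python
def group_entities(tokens, tags):
    """Collapse per-token BIO tags into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                              # close the previous span
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)                    # continue the open span
        else:                                        # O tag or type mismatch
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

print(group_entities(
    ["Aspirin", "reduces", "heart", "attack", "risk"],
    ["B-Chemical", "O", "B-Disease", "I-Disease", "O"],
))
# → [('Aspirin', 'Chemical'), ('heart attack', 'Disease')]
```

The example sentence and tags here are made up; the repository we used does this grouping as part of its prediction pipeline.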
To satisfy your eagerness to see how the model works and what results it gives, we have developed a demo page:
You can try the demo by entering biomedical sentences; it will return the biomedical entities it finds, along with the entity types on which we fine-tuned the pre-trained BioBERT model. We are keen to hear from you in the comment section below about the results and your suggestions.
This is a beta version and we are still working on improvements.
We are working on enlarging the dataset, in a proper format, to reach state-of-the-art results for the biomedical domain-specific NER system.
We also plan to add more entity types for the same domain.