Long time no see! I took a bit of a break and have finally come back.
I recently started working on training a Korean version of GPT-3, which got me into NLP. Knowing how BERT works is not strictly necessary for that, but since I have mainly been implementing downstream tasks and BERT is often compared to GPT-3, I figured I should at least understand how it works.
BERT stands for Bidirectional Encoder Representations from Transformers.
BERT basically uses only the encoder part of the Transformer.
Now, let’s dive into how BERT actually works and how it was trained (its training methodology).
Inputs and Outputs
BERT takes two sentences as input. Additionally, we add [CLS] and [SEP] tokens, as shown in Figure 2.
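To make this concrete, here is a minimal sketch using the Hugging Face transformers tokenizer (the checkpoint name bert-base-uncased and the example sentences are just my choices, not from the paper):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing two sentences: the tokenizer inserts [CLS] and [SEP] for us.
encoding = tokenizer("The cat sat on the mat.", "It looked comfortable.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'looked', ..., '[SEP]']

# token_type_ids marks which sentence (segment) each token belongs to.
print(encoding["token_type_ids"])  # 0s for sentence A, 1s for sentence B
```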
BERT outputs an NSP (Next Sentence Prediction) label and Mask LM predictions (two of them in Figure 2, one for each masked position).
NSP tells you whether the two given sentences are consecutive.
The Mask LM outputs are for the MLM (Masked Language Model) task: they predict which token should be at each masked location.
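As a rough sketch, the NSP head is essentially a binary classifier sitting on the [CLS] output vector (the real model runs [CLS] through a small pooling layer first, which I am skipping here):

```python
import torch
import torch.nn as nn

hidden = 768                      # BERT-base hidden size
nsp_head = nn.Linear(hidden, 2)   # two classes: IsNext / NotNext

cls_hidden = torch.randn(1, hidden)    # stand-in for the [CLS] output vector
is_next_logits = nsp_head(cls_hidden)  # shape: (1, 2)
```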
Figure 3 shows a more detailed input representation of BERT.
On top of the token embeddings, it adds segment embeddings and position embeddings.
BTW, segment embeddings are also trainable parameters.
Since we need two types of segment embeddings (one for sentence A, one for sentence B), in the case of BERT-base (768-dimensional embeddings) that amounts to 768 * 2 = 1536 trainable parameters.
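Here is a minimal PyTorch sketch of how the three embeddings get combined (the vocabulary size and maximum length match bert-base-uncased, but the toy ids below are made up):

```python
import torch
import torch.nn as nn

hidden = 768
token_emb = nn.Embedding(30522, hidden)   # one vector per vocabulary token
segment_emb = nn.Embedding(2, hidden)     # sentence A vs. sentence B
position_emb = nn.Embedding(512, hidden)  # one vector per position

input_ids = torch.tensor([[101, 7592, 102, 2088, 102]])  # toy token ids
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])            # A = 0, B = 1
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# The three embeddings are simply summed element-wise.
embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)  # torch.Size([1, 5, 768])

# The segment embedding table really is just 768 * 2 = 1536 parameters.
print(sum(p.numel() for p in segment_emb.parameters()))  # 1536
```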
Pre-training BERT
Task 1: Masked Language Model (MLM)
O_1 to O_n are just output embeddings (the final hidden states). On top of those, we add a fully connected layer that actually predicts a token.
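Something like this, as a simplified PyTorch sketch (the real MLM head adds an extra transform layer and ties its weights to the input embeddings, which I am leaving out):

```python
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30522
mlm_head = nn.Linear(hidden, vocab_size)  # maps each hidden state to vocab logits

final_hidden = torch.randn(1, 5, hidden)  # stand-in for O_1..O_n
logits = mlm_head(final_hidden)           # shape: (1, 5, vocab_size)
predicted_ids = logits.argmax(dim=-1)     # most likely token at each position
```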
Since 15% of the tokens in each sequence are replaced with a [MASK] token, BERT has to guess which token would most likely have been there.
This is self-supervision: the labels come from the input text itself, so no manual annotation is needed.
The authors of BERT mention that a mismatch can occur between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning.
The solution is shown in Figure 6: of the 15% of tokens chosen for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are kept unchanged.
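If I were to sketch that strategy in plain Python, it would look roughly like this (the toy vocabulary is my assumption; real implementations sample from the full token-id vocabulary):

```python
import random

# Toy vocabulary for the random-replacement case (assumption, not from the paper).
VOCAB = ["the", "cat", "sat", "on", "mat"]

def mask_tokens(tokens, mask_prob=0.15):
    """Rough sketch of the Figure 6 masking strategy."""
    out, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:  # choose ~15% of tokens for prediction
            labels.append(tok)           # the model must recover the original
            r = random.random()
            if r < 0.8:
                out.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(random.choice(VOCAB))  # 10%: random token
            else:
                out.append(tok)                   # 10%: keep unchanged
        else:
            out.append(tok)
            labels.append(None)          # no MLM loss at this position
    return out, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
```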
In my opinion, I am not sure why this is even required. Don’t we already replace a token with [MASK] only with some probability anyway?