Review: BERT

Joonsu Oh
3 min read · Dec 4, 2021
  • Long time no see! I took a bit of a break and am finally back.
  • I recently started working on training a Korean version of GPT-3, which got me into NLP. Strictly speaking, it is not necessary to know how BERT works, but since I have mainly been implementing downstream tasks and BERT is so often compared to GPT-3, I figured I should at least understand how it works.
Figure 1
  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • BERT basically uses only the encoder part of a Transformer.
  • Now, let’s dive into how BERT actually works and how it was trained (its training methodology).

Inputs and Outputs

Figure 2
  • BERT takes two sentences as input. Additionally, we add [CLS] and [SEP] tokens as shown in Figure 2 (see the tokenizer sketch after this list).
  • During pre-training, BERT produces an NSP (Next Sentence Prediction) output and Mask LM outputs (Figure 2 shows two of them).
  • NSP tells you whether or not the two given sequences are consecutive.
  • The Mask LM outputs are for the MLM (Masked Language Model) task: they predict which token should be at each masked location.
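To make the input format concrete, here is a minimal sketch using the Hugging Face transformers library with the bert-base-uncased checkpoint (my choice of tooling, not something the figures prescribe); the two example sentences are made up:

```python
# Sketch: how two sentences become one BERT input with [CLS]/[SEP] tokens
# and segment ids. Assumes `pip install transformers`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."

encoding = tokenizer(sentence_a, sentence_b)

# Something like: ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# Segment ids: 0 for [CLS] + sentence A + first [SEP], 1 for sentence B + second [SEP]
print(encoding["token_type_ids"])
```

The token_type_ids are what select which of the two segment embeddings gets added at each position (more on that below).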
Figure 3
  • Figure 3 shows BERT’s input representation in more detail.
  • On top of the token embeddings, it adds segment embeddings and position embeddings (all three are summed element-wise).
Figure 4
  • BTW, segment embeddings are also trainable parameters.
  • Since we need two segment embeddings (one for sentence A, one for sentence B), for BERT-base (768-dimensional embeddings) that amounts to 768*2 = 1,536 trainable parameters. See the sketch below.
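Here is a rough sketch of the input embedding layer, assuming BERT-base sizes (30,522 WordPiece tokens, 2 segments, 512 positions, 768 hidden dimensions); the module and argument names are mine, not from any reference implementation:

```python
# Sketch: BERT-style input embeddings = token + segment + position, then LayerNorm + dropout.
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, hidden=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)      # 30522 * 768 params
        self.segment_emb = nn.Embedding(num_segments, hidden)  # 2 * 768 = 1536 params
        self.position_emb = nn.Embedding(max_len, hidden)      # 512 * 768 params (learned)
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = (self.token_emb(input_ids)
             + self.segment_emb(segment_ids)
             + self.position_emb(positions))  # element-wise sum of the three embeddings
        return self.dropout(self.norm(x))
```

Note that BERT’s position embeddings are learned like the other two, rather than the fixed sinusoidal encodings of the original Transformer.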

Pre-training BERT

Task 1: Masked Language Model (MLM)

Figure 5
  • O_1 to O_n are just the output embeddings (the encoder’s final hidden states). On top of those, we add a fully connected layer that actually predicts a token; a sketch follows this list.
  • Since 15% of the tokens in each sequence are replaced with the [MASK] token, BERT has to guess which token would most likely have been there.
  • This is self-supervision: the labels come from the text itself, so no manual annotation is needed.
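A minimal sketch of that prediction layer, under the same BERT-base sizes as before (this is my simplified version; the paper’s actual head adds an extra transform and ties its weights to the token embeddings):

```python
# Sketch: a fully connected layer on top of the encoder outputs O_1..O_n
# that scores every vocabulary token at every position.
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30522
mlm_head = nn.Linear(hidden, vocab_size)

encoder_outputs = torch.randn(2, 16, hidden)  # placeholder for O_1..O_n, shape (batch, seq_len, hidden)
logits = mlm_head(encoder_outputs)            # (2, 16, 30522): one score per vocab token per position

# In training, the cross-entropy loss is computed only at the masked positions.
```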
Figure 6
  • The authors of BERT mention that a mismatch can occur between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning.
  • Their solution, the 80/10/10 rule, is shown in Figure 6; I sketch it in code below.
  • In my opinion, I am not sure why this is even required. Don’t we already replace a token with [MASK] only with some probability?
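For reference, here is the 80/10/10 rule written out as a small helper. The function and constant names are mine; 103 is the [MASK] id in the bert-base-uncased vocabulary:

```python
# Sketch of the 80/10/10 masking rule: of the 15% of positions selected,
# 80% become [MASK], 10% become a random token, 10% are left unchanged.
import random

MASK_ID = 103       # [MASK] in the bert-base-uncased vocab
VOCAB_SIZE = 30522

def mask_tokens(input_ids, select_prob=0.15):
    labels = [-100] * len(input_ids)          # -100 = ignored by PyTorch's cross-entropy loss
    masked = list(input_ids)
    for i, tok in enumerate(input_ids):
        if random.random() < select_prob:     # pick roughly 15% of positions
            labels[i] = tok                   # the model must predict the original token here
            r = random.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                masked[i] = MASK_ID
            elif r < 0.9:                     # 10%: replace with a random token
                masked[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return masked, labels
```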

Task 2: Next Sentence Prediction (NSP)

Figure 7: NSP
  • In Figure 7, “C” is the final hidden state corresponding to the [CLS] token; the NSP prediction is made on top of it, as sketched below.
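A minimal sketch of the NSP head, assuming C comes out of the encoder with shape (batch, 768); the variable names and the label convention here are mine:

```python
# Sketch: NSP is just binary classification on top of C (the [CLS] hidden state).
import torch
import torch.nn as nn

hidden = 768
nsp_head = nn.Linear(hidden, 2)     # two classes: "is the next sentence" vs. "is not"

c = torch.randn(4, hidden)          # placeholder for C from a batch of 4 sentence pairs
logits = nsp_head(c)                # (4, 2)
prediction = logits.argmax(dim=-1)  # 0 = IsNext, 1 = NotNext (my convention in this sketch)
```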
