In the paper, the researchers detail a new technique called Masked LM (MLM) that makes bidirectional training possible in models where it previously was not. They developed a process that handles the words in a sentence by relating them to one another, rather than processing them one at a time as earlier approaches did.
How does BERT work?
BERT uses Transformer, an attention mechanism that learns the contextual relationships between words (or subwords) in a text. In its original form, Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since the goal of BERT is to generate a language model, only the encoder mechanism is needed.
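As a minimal sketch of this encoder-only setup, the snippet below loads a pretrained BERT encoder through the Hugging Face transformers library and the bert-base-uncased checkpoint (both are illustrative choices made for this example, not something the paper prescribes) and shows that the model simply maps token IDs to one contextual vector per token, with no decoder involved.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and the
# `bert-base-uncased` checkpoint (illustrative choices, not from the paper).
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")  # encoder stack only

inputs = tokenizer("The child came home from school", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per (sub)word token; 768 dimensions for BERT-base.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 8, 768])
```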
Unlike directional models, which read the input text sequentially (from left to right or right to left), the Transformer encoder reads the entire sequence of words at once. It is therefore considered bidirectional, although it would be more accurate to say that it is non-directional.
This feature allows the model to learn the context of a word based on everything around it (left and right of the word).
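A quick way to see this is to compare the vector the encoder produces for the same word in two different sentences: because the encoder attends to the words on both sides, the two vectors differ. The sketch below reuses the Hugging Face setup from above; the helper vector_for is hypothetical, introduced only for this illustration.

```python
# Sketch: the same surface word gets a different contextual vector in
# different contexts, because the encoder looks at both sides of it.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def vector_for(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word` inside `sentence` (hypothetical helper)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = vector_for("He sat on the bank of the river.", "bank")
v2 = vector_for("She opened an account at the bank.", "bank")
print(float(torch.cosine_similarity(v1, v2, dim=0)))  # noticeably below 1.0
```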
When training language models, it is difficult to define a prediction goal. Many models predict the next word in a sequence (e.g., "The child came home from ___"), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two training strategies: Masked LM (MLM) and Next Sentence Prediction (NSP).
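As an illustration of the MLM idea, the sketch below asks a pretrained BERT to fill in the blanked word from the sentence above, using the context on both sides of it; the fill-mask pipeline is a Hugging Face convenience chosen for this example, not part of the paper.

```python
# Hedged sketch of Masked LM: predict a hidden word from the words on both
# sides of it, using the Hugging Face `fill-mask` pipeline (illustrative choice).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The child came home from [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
# Likely completions include words such as "school" or "work".
```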
The detailed workings of Transformer are described in a paper by Google.