From Word2Vec to BERT and beyond, this is the underlying logic!
"Language modeling can be seen as the ideal source task as it captures many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations and sentiment" - Ruder et al. in the ULMFiT paper
Language modelling is unsupervised, or rather self-supervised, since we already know what the next word in the corpus is.
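A minimal sketch of why this counts as self-supervision (the toy corpus below is an assumption): the training targets are simply the next words of the corpus itself, so no manual labels are required.

```python
# Language-modelling labels come "for free" from the corpus itself.
corpus = "language modeling is the ideal source task".split()  # toy corpus (assumption)

# Each training pair is (previous words -> next word); no human annotation is needed.
pairs = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]

for context, target in pairs:
    print(context, "->", target)
```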
Fine-Tuning
BERT fine-tuned for Text Classification
Source of image: an article on ResearchGate | Link
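A hedged sketch of the recipe the figure illustrates, using the Hugging Face transformers library (the model checkpoint, toy data and hyper-parameters are illustrative assumptions, not taken from the linked article):

```python
# Sketch: fine-tune a pre-trained BERT with a small classification head on toy data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["what a great movie!", "utterly boring."]   # toy labelled data (assumption)
labels = torch.tensor([1, 0])                        # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                   # a few gradient steps for illustration
    outputs = model(**batch, labels=labels)          # classification head sits on top of [CLS]
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print(preds)                                         # predicted class ids
```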
- Evolution of RNN Architecture till LSTM
- Seq2Seq Models - A higher form of LMs
- The ImageNet moment in NLP; advent of the LSTM-based models ULMFiT and ELMo
- LSTM Seq2Seq Model with Attention
- Transformer - A Seq2Seq Model with Attention
- The Advent of BERT and similar Transformers
- What has been the trend of recent Pre-trained Transformer-based LMs?
- What direction should future Pre-trained Transformer-based LMs go?
Source: NLP Course on Coursera by National Research University Higher School of Economics
Pre-trained Language Models
Sources:
Pre-trained Language Models
- Models such as the Multi-layer Perceptron, Support Vector Machines and Logistic Regression did not perform well on sequence modelling tasks (e.g., text sequence → sentiment classification)
- Why? They lack a memory element; no information is retained across time steps
- RNNs attempted to redress this shortcoming by introducing loops within the network, thus allowing the retention of information (a minimal sketch of this loop follows below)
Source: https://colah.io/
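A minimal NumPy sketch of that loop (dimensions and weights are toy assumptions): the hidden state is fed back into the cell at every time step, which is how information from earlier inputs is retained.

```python
# The same cell is applied at every time step, carrying a hidden state forward.
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the loop)

def rnn_step(x_t, h_prev):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

sequence = rng.normal(size=(5, input_size))    # toy sequence of 5 time steps
h = np.zeros(hidden_size)                      # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h)                       # information is carried forward through h
print(h)                                       # final state summarises the whole sequence
```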
Advantage of a vanilla RNN:
- Better than traditional ML algorithms at retaining information
Limitations of a vanilla RNN:
- RNNs fail to model long-term dependencies.
- Information is often **"forgotten"** after the unit activations are multiplied several times by small numbers
- Vanishing gradient and exploding gradient problems (see the toy example below)
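A toy numeric illustration of the "multiplied several times by small numbers" point (the 0.9 per-step factor is an arbitrary assumption):

```python
# Back-propagating through many time steps multiplies the gradient by small factors.
factor = 0.9                      # assumed per-step derivative magnitude
for steps in (10, 50, 100):
    print(steps, factor ** steps)  # 0.9**100 ≈ 2.7e-5 — the signal has all but vanished
```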
Long Short-Term Memory (LSTM):
- A special type of RNN architecture
- Designed to retain information for an extended number of timesteps
Advantages of an LSTM:
- Better equipped to handle long-range dependencies
- More resistant to the vanishing gradient problem than vanilla RNNs
Limitations of LSTM:
- Added gates lead to more computation, so LSTMs tend to be slower
- Difficult to train
- Transfer learning never really worked
- Very long gradient paths
- An LSTM over a 100-word document has gradient paths like a 100-layer network (see the sketch below)
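A small PyTorch sketch of the trade-off described above (layer sizes and the toy input are assumptions): the gated cell state lets information survive across a 100-step sequence, but those 100 steps still have to be processed one after another, which is also how long the gradient path becomes.

```python
# An LSTM carries an explicit cell state c_t alongside the hidden state h_t;
# its gates decide what to keep, forget and output at each step.
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 100, 8)     # a "100-word document" as 100 time steps (toy data)
h0 = torch.zeros(1, 1, 16)     # initial hidden state
c0 = torch.zeros(1, 1, 16)     # initial cell state (the long-term memory track)

outputs, (h_n, c_n) = lstm(x, (h0, c0))
print(outputs.shape)           # torch.Size([1, 100, 16]) — one output per time step
# The 100 steps are still processed sequentially, which is why LSTMs train slowly
# and gradients must travel through a 100-step-deep path.
```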
Pre-trained Language Models
Source: https://indicodata.ai/blog/sequence-modeling-neuralnets-part1/
Transformer:
Transformer vs LSTM
The BERT Mountain by Chris McCormick:
Source: www.mccormickml.com
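The "Transformer vs LSTM" comparison above boils down to self-attention replacing recurrence. A hedged PyTorch sketch of that contrast (layer sizes and the toy input are assumptions): a single encoder layer lets every position attend to every other position in parallel, instead of passing information step by step through a recurrent chain.

```python
# A Transformer encoder layer applies self-attention to the whole sequence at once.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)

x = torch.randn(1, 100, 16)    # the same "100-word document" as 100 token vectors
y = layer(x)                   # all 100 positions are processed in parallel
print(y.shape)                 # torch.Size([1, 100, 16])
```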
Question to ponder:
- How much is the contribution from each?
- Better model architectures are needed to capture long-range information
- More efficient models are needed, with explainability in mind as well
The ImageNet moment in NLP was ushered in by the first pre-Transformer-era transfer learning models, ULMFiT and ELMo.
Example notebooks for Text Classification
Application
Jay Alammar's Post: DistilBERT for Feature Extraction + Logistic Regression for classification | Link (a minimal sketch of this recipe follows below)
Jay Alammar's Post: BERT Fine-tuned for Classification | Picture_Link | HuggingFace Example Fine-tuning Notebook
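A minimal sketch of the feature-extraction recipe from the first post (model checkpoint, toy data and the choice of the first token's embedding are assumptions in the spirit of the post, not a copy of it):

```python
# Use a frozen DistilBERT as a feature extractor, then train a simple classifier.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

texts = ["a delightful film", "a tedious mess", "simply wonderful", "painfully dull"]
labels = [1, 0, 1, 0]                                 # toy sentiment labels (assumption)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():                                 # no fine-tuning: features stay frozen
    hidden = model(**batch).last_hidden_state         # (batch, seq_len, 768)

features = hidden[:, 0, :].numpy()                    # embedding of the first ([CLS]) token
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))                          # sanity check on the training texts
```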
References:
- Evolution of Transfer Learning in Natural Language Processing | Link
- Pre-trained Models for NLP | Link
- The State of Transfer Learning in NLP | Article by Sebastian Ruder | Link
- NLP's ImageNet Moment has Arrived | Article by Sebastian Ruder | Link
- Recent Advances in Language Models | Article by Sebastian Ruder | Link
- Sequence Modeling with Neural Networks
- LSTM is Dead. Long Live Transformers | YouTube video by Leo Dirac | Presentation of the same title
- The Future of NLP | Video and slides by Thomas Wolf, HuggingFace Co-founder | YouTube Video