Prerequisites

Before diving into GenAI, it is important to have a solid foundation. Only with a sturdy base will you be able to build on your knowledge of AI. Skipping these fundamentals can lead to confusion and hinder your progress. Here are some prerequisites you should have.

Level 2 Topics

Sentence Embedding

Sentence embedding is a technique in natural language processing where sentences are mapped into vectors of real numbers, essentially creating a mathematical representation of the sentence. This facilitates tasks such as semantic similarity measurement, text classification, and other language-related machine learning applications.
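As a minimal sketch (not how production embedding models work), the idea can be illustrated by averaging toy word vectors into a single sentence vector and comparing sentences with cosine similarity; the vocabulary and dimensionality below are made up purely for illustration:

```python
import numpy as np

# Toy word vectors; in practice these come from a trained model.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in
         "the cat sat on mat a dog lay rug".split()}

def embed(sentence):
    """Mean-pool word vectors into one sentence vector."""
    vectors = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1, s2 = embed("the cat sat on the mat"), embed("a dog lay on the rug")
print(cosine(s1, s2))  # similarity score in [-1, 1]
```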

Self-attention

Self-Attention is a mechanism in machine learning models that allows them to focus on relevant parts of the input for making predictions. It assigns different importance (attention) to different parts of the input, enabling the model to make better context-aware decisions.
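A small NumPy sketch of scaled dot-product self-attention may help make this concrete; the projection matrices Wq, Wk, Wv and all sizes are arbitrary illustrative choices:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise relevance between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: attention weights per position
    return weights @ V                                 # each output is a weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))                           # 4 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (4, 16)
```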

Residual layer

In deep learning, a residual layer is a technique used in architectures like convolutional neural networks (CNNs) to address vanishing gradients. It allows the network to learn the identity function (simply outputting the input unchanged) alongside more complex transformations. This helps with training very deep networks by making it easier for them to learn and improve upon the information passed through previous layers.
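A minimal PyTorch sketch of a residual block, where the output is the input plus a learned transformation; the layer sizes here are arbitrary:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block only has to learn the *change* to its input."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)   # the skip connection keeps gradients flowing

x = torch.randn(2, 64)
print(ResidualBlock(64)(x).shape)  # torch.Size([2, 64])
```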

Feed Forward NN

A feedforward neural network is a basic architecture where information flows in one direction, from input to output, through hidden layers. It’s a building block in many models, including transformers. Transformers rely on a specific type of feedforward network called a Multi-Layer Perceptron (MLP) within their encoder and decoder blocks. This MLP helps process the information extracted by the transformer’s self-attention mechanism, allowing the model to learn complex relationships within the data.
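A rough PyTorch sketch of the position-wise feed-forward block used inside transformer layers; the expansion factor and activation below are common choices, not the only ones:

```python
import torch
from torch import nn

class TransformerFFN(nn.Module):
    """Position-wise feed-forward block: expand, apply a nonlinearity, project back."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),                      # many modern models use GELU; the original paper used ReLU
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        return self.net(x)                  # applied independently at every position

print(TransformerFFN()(torch.randn(1, 10, 512)).shape)  # torch.Size([1, 10, 512])
```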

Positional Embedding

Positional Embedding is a technique used in Natural Language Processing that assigns each word or token in a sequence a vector representation based on its position, enabling the model to understand the order of words in a sequence. This is crucial for tasks such as translation or sentence completion, where the positioning of words impacts the meaning.
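One common (but not the only) scheme is the fixed sinusoidal encoding from the original Transformer paper; learned positional embeddings are a frequent alternative. A NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                          # even dimensions
    enc[:, 1::2] = np.cos(angles)                          # odd dimensions
    return enc

# Each row is added to the token embedding at that position.
print(sinusoidal_positions(seq_len=50, d_model=64).shape)  # (50, 64)
```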

In/out Embedding Tie

In/out Embedding Tie refers to a method in machine learning where the same weight matrix is shared between the input token-embedding layer and the output (softmax) projection layer of a model, essentially “tying” the two together. This technique reduces the number of parameters in the model and can improve its efficiency and performance.
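A minimal PyTorch sketch of weight tying, with a toy vocabulary size and hidden size; the point is only that the output projection reuses the embedding matrix instead of learning a separate one:

```python
import torch
from torch import nn

class TiedLM(nn.Module):
    """Minimal sketch of input/output embedding tying."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size, bias=False)
        self.out.weight = self.embed.weight   # share the same parameter tensor

    def forward(self, token_ids):
        h = self.embed(token_ids)             # a real model would transform h here
        return self.out(h)                    # logits over the vocabulary

model = TiedLM()
print(sum(p.numel() for p in model.parameters()))  # the shared matrix is counted once, not twice
```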

Multihead attention

Multihead attention is a mechanism in transformer models that allows the model to focus on different positions of the input sequence simultaneously, thereby capturing various aspects of the information. It does so by splitting the input data into multiple ‘heads’, each of which independently learns different types of attention, before being recombined to produce the final output.
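PyTorch ships a multi-head attention layer out of the box; a small self-attention example, where queries, keys, and values all come from the same sequence (all sizes are illustrative):

```python
import torch
from torch import nn

d_model, num_heads = 64, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out, weights = attn(x, x, x)      # each head attends independently, results are recombined
print(out.shape, weights.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 10, 10])
```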

Transformer

The Transformer is a neural network architecture that uses self-attention to model dependencies in sequence-to-sequence tasks. It revolutionized NLP thanks to its parallelization capabilities and effectiveness, making it the foundation for many NLP models.

Transformer-D

Transformer-D is a decoder-only variation of the transformer architecture introduced in the paper “Generating Wikipedia by Summarizing Long Sequences.” It is designed to handle long sequences of text, addressing the challenge of managing long-range dependencies and maintaining the flow of information across lengthy inputs, which is critical for tasks like document summarization and long-form text generation. By adapting the standard transformer to this setting, Transformer-D produces coherent, high-quality summaries from long input sequences.

Tokenization

Tokenization is a process in natural language processing (NLP) that splits a large amount of text into smaller parts called tokens. These tokens help in understanding the context or developing a model to analyze the text.
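A naive word-level tokenizer can be written in a few lines; real systems instead use trained sub-word tokenizers, as described next:

```python
import re

def simple_tokenize(text):
    """Naive word-level tokenization: split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(simple_tokenize("Tokenization splits text into tokens."))
# ['tokenization', 'splits', 'text', 'into', 'tokens', '.']
```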

Sub-word tokens

Sub-word tokens are smaller units derived from whole words, often used in natural language processing to handle unknown or rare words and improve model performance. They can range from single characters to common word stems or suffixes, allowing a model to generalize from known tokens to unseen ones.

WordPiece

WordPiece is a subword tokenization algorithm used in natural language processing, which breaks words into smaller units, allowing the model to handle rare or unseen words more effectively by understanding their components. It helps in reducing the size of the vocabulary and improving the computational efficiency of language models.
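A sketch of greedy longest-match-first splitting in the spirit of WordPiece tokenization; the tiny vocabulary is made up, and the “##” continuation marker follows the usual convention. This illustrates how a trained vocabulary is applied, not how the vocabulary itself is learned:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first sub-word splitting (illustrative vocabulary)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]          # no piece matched this part of the word
        start = end
    return tokens

vocab = {"un", "##break", "##able", "break", "##ing"}
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
```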

Output Decoding

In language models, output decoding is the process of turning the model’s predicted probability distributions over tokens into an actual output sequence. The choice of decoding strategy (such as greedy search or beam search, described below) trades off output quality, diversity, and computational cost.

Greedy Search

Greedy Search is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. It is used in optimization problems where the goal is to make the most optimal decision at each step, reducing the problem’s complexity.
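A toy sketch of greedy decoding; the stand-in “model” below is just a deterministic function chosen for illustration:

```python
import numpy as np

def greedy_decode(step_fn, start_token, max_len=5):
    """At every step, pick the single highest-probability next token."""
    seq = [start_token]
    for _ in range(max_len):
        probs = step_fn(seq)                 # probabilities over the vocabulary
        seq.append(int(np.argmax(probs)))    # locally optimal choice
    return seq

def toy_model(seq):
    """Toy 'model': always prefers the token after the last one (10-token vocabulary)."""
    return np.eye(10)[(seq[-1] + 1) % 10]

print(greedy_decode(toy_model, start_token=0))  # [0, 1, 2, 3, 4, 5]
```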

Beam Search

Beam Search is an algorithm used in many natural language processing tasks that reduces the complexity of searching through all possible sequences by only keeping track of the most promising sequences, referred to as “beams”. This approach helps to optimize computational efficiency without significantly sacrificing the quality of results.
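A simplified beam search sketch (no end-of-sequence handling or length normalization), again using a made-up next-token distribution:

```python
import numpy as np

def beam_search(step_fn, start_token, beam_width=3, max_len=4):
    """Keep the `beam_width` highest-scoring partial sequences at each step,
    scored by summed log-probabilities."""
    beams = [([start_token], 0.0)]                     # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = np.log(step_fn(seq) + 1e-12)   # next-token log-probs for this beam
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        # Prune back down to the best `beam_width` hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

def toy_model(seq):
    """Toy distribution: slightly prefers token (last + 1) but spreads mass around."""
    probs = np.full(10, 0.05)
    probs[(seq[-1] + 1) % 10] = 0.55
    return probs

print(beam_search(toy_model, start_token=0))  # e.g. [0, 1, 2, 3, 4]
```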

GELU

GELU, or Gaussian Error Linear Unit, is an activation function widely used in transformer-based neural networks. It weights each input by the probability that a standard Gaussian variable falls below it (GELU(x) = x·Φ(x)), so negative inputs are suppressed smoothly rather than cut off abruptly as with ReLU. This smoothness tends to help optimization in deep networks and can mitigate the vanishing gradient problem.
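The exact definition and a commonly used tanh approximation, as a NumPy sketch:

```python
import numpy as np
from math import erf, sqrt, pi

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh approximation used in many implementations."""
    return 0.5 * x * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(gelu(x))       # negative inputs are softly, not sharply, suppressed
print(gelu_tanh(x))  # values closely track the exact definition
```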

Label Smoothing

Label Smoothing is a regularization technique in machine learning that prevents the model from becoming too confident about the class labels by assigning a small amount of the total probability to all other labels. This technique helps to mitigate overfitting and improves generalization of the model by making the training process less sensitive to the exact values of the labels.
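A sketch of how a one-hot target is softened; the epsilon value and number of classes below are illustrative:

```python
import numpy as np

def smooth_labels(num_classes, true_class, epsilon=0.1):
    """Replace a one-hot target with a softened distribution:
    (1 - epsilon) on the true class, epsilon spread uniformly over all classes."""
    target = np.full(num_classes, epsilon / num_classes)
    target[true_class] += 1.0 - epsilon
    return target

print(smooth_labels(num_classes=5, true_class=2))
# [0.02 0.02 0.92 0.02 0.02]
```

PyTorch also exposes this directly through the label_smoothing argument of nn.CrossEntropyLoss.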

Pre-training

Pre-training is a foundational technique in machine learning, particularly in the development of language models. It involves training a model on a large dataset before fine-tuning it on a specific task. During pre-training, a model learns general patterns, structures, and features from the data, which can include text, images, or other types of information. This process enables the model to acquire a broad understanding of the domain, making it more effective and efficient when subsequently trained on smaller, task-specific datasets. Pre-training is crucial for achieving state-of-the-art performance in various applications, such as natural language processing, computer vision, and speech recognition, by leveraging the knowledge gained from vast amounts of data.

Fine-tuning

Fine-tuning is a transfer learning technique where a pre-trained model is further trained on a specific task or domain with task-specific data. It involves updating the parameters of the pre-trained model using a smaller dataset, allowing the model to adapt to the nuances of the target task while leveraging the knowledge learned from the pre-training phase.

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking language model introduced by Google AI. It employs a deep bidirectional transformer architecture, allowing it to capture context from both directions in a text sequence. BERT is pretrained using two main objectives: Masked Language Modeling (MLM), where random words in a sentence are masked and predicted, and Next Sentence Prediction (NSP), where the model learns to understand the relationship between paired sentences. This pretraining approach enables BERT to excel in various natural language understanding tasks, including question answering, sentiment analysis, and named entity recognition, setting new benchmarks across multiple NLP tasks.

Seg Embedding

Segment Embedding (Seg Embedding) is a component used in the BERT (Bidirectional Encoder Representations from Transformers) model to differentiate between different segments (or sentences) within a single input sequence. In BERT’s pretraining, each input consists of pairs of sentences, which are combined into a single sequence. Segment embeddings are used to indicate whether each token belongs to the first sentence (segment A) or the second sentence (segment B). This distinction is crucial for tasks like Next Sentence Prediction (NSP), where understanding the relationship between two sentences is essential. Segment embeddings, along with token and positional embeddings, enable BERT to effectively learn contextual representations and relationships between sentences during pretraining.
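A minimal sketch of how BERT-style inputs sum token, segment, and position embeddings; the example token IDs and sizes below are illustrative:

```python
import torch
from torch import nn

vocab_size, max_len, d_model = 30522, 512, 768
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)
seg_emb = nn.Embedding(2, d_model)            # segment A = 0, segment B = 1

token_ids   = torch.tensor([[101, 2023, 2003, 102, 2008, 2028, 102]])  # [CLS] ... [SEP] ... [SEP]
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1,    1]])    # which sentence each token belongs to
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

# The three embeddings are summed element-wise to form the model input.
x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 7, 768])
```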

RoBERTa

RoBERTa (Robustly optimized BERT approach) is a transformer-based model for natural language understanding, introduced by Facebook AI in 2019. It builds on BERT (Bidirectional Encoder Representations from Transformers) by training on more data and for longer periods with larger mini-batches. RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pretraining and uses dynamic masking rather than static masking. It achieves state-of-the-art performance on a range of NLP tasks, demonstrating the importance of extensive pretraining and fine-tuning for improving model robustness and accuracy.

SpanBERT

SpanBERT is a variant of the BERT model specifically designed for improving span-based tasks in natural language processing (NLP). Introduced by researchers from Facebook AI in 2019, SpanBERT enhances BERT’s capabilities by modifying the pretraining objectives to better capture and predict contiguous spans of text. It replaces BERT’s Next Sentence Prediction (NSP) task with span boundary objectives, which involve predicting the entire span from its boundaries and reconstructing masked spans. SpanBERT achieves state-of-the-art performance on tasks like question answering and coreference resolution, highlighting its effectiveness in understanding and predicting text spans.

UniLM

UniLM (Unified Language Model) is a versatile transformer-based model developed by Microsoft Research. It is designed to handle various natural language processing (NLP) tasks, including natural language understanding, generation, and translation, within a single framework. UniLM achieves this by leveraging a unified pretraining approach that combines the objectives of different tasks, such as sequence-to-sequence and masked language modeling. This allows the model to generate coherent text sequences and understand context effectively. UniLM has demonstrated state-of-the-art performance on a wide range of NLP benchmarks, showcasing its flexibility and effectiveness in handling diverse language tasks.

MASS

MASS (Masked Sequence to Sequence Pre-training for Language Generation) is a language model developed by Microsoft Research. It introduces a novel pretraining method specifically designed for sequence-to-sequence language generation tasks. In MASS, a portion of the input sequence is masked, and the model is trained to predict the missing part, effectively learning to generate coherent sequences. This approach enhances the model’s ability to handle various generation tasks, such as machine translation, text summarization, and text generation. MASS has shown significant improvements in performance over existing methods, demonstrating the effectiveness of masked sequence-to-sequence pretraining for language generation.

BART

BART (Bidirectional and Auto-Regressive Transformers) is a language model developed by Facebook AI, designed for natural language understanding and generation. It utilizes a denoising sequence-to-sequence pre-training approach where the model is trained to reconstruct the original text from a corrupted input. This method involves various types of noise, such as token masking, deletion, and shuffling. BART combines the strengths of bidirectional (BERT-like) and autoregressive (GPT-like) models, making it highly effective for tasks like text generation, translation, and summarization. BART achieves state-of-the-art performance on multiple NLP benchmarks, showcasing its robustness and versatility in handling complex language tasks.

QA

Question answering (QA) is a natural language processing task where a model is tasked with providing relevant answers to questions posed in natural language. It involves understanding the question, retrieving relevant information from a given context or knowledge base, and generating an accurate response.

RACE

The RACE dataset is a large-scale English reading comprehension dataset collected from English exams for Chinese middle and high school students. Each passage is paired with multiple-choice questions, making it a challenging benchmark for evaluating machine reading comprehension models.

StoryCloze

The StoryCloze Test is a dataset for evaluating story understanding and script learning, where models are given the beginning of a story and must choose the correct ending from two options. It assesses the ability of models to generate coherent and contextually appropriate narrative continuations.

DPR for QA

Dense Passage Retrieval (DPR) is a technique used in open-domain question answering systems to efficiently retrieve relevant passages from a large corpus of documents. It involves encoding passages and questions into dense representations and using similarity search methods to identify relevant passages.

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a benchmark dataset for question answering tasks, consisting of questions posed on Wikipedia articles where the answer is a segment of text from the corresponding passage. It is widely used for training and evaluating question answering models.

Catastrophic forgetting

Catastrophic forgetting refers to the phenomenon where a machine learning model forgets previously learned information when trained on new data. It is a challenge in lifelong or continual learning settings where models need to adapt to new tasks while retaining knowledge from previous tasks.

NLI

Natural Language Inference (NLI) is the task of determining the logical relationship between two pieces of text, typically referred to as the premise and the hypothesis. It involves classifying whether the hypothesis can be inferred from the premise, often categorized into entailment, contradiction, or neutral.

SciTail

The SciTail dataset is a textual entailment dataset created from multiple-choice science exams and web sentences, designed to evaluate natural language inference models. It consists of premise-hypothesis pairs labeled as either entailment or neutral.

QNLI

The Question Natural Language Inference (QNLI) dataset is derived from the Stanford Question Answering Dataset (SQuAD), where the task is to determine if a given sentence contains the answer to a question. It is used to train and evaluate models on question-answer entailment.

MNLI

The Multi-Genre Natural Language Inference (MNLI) dataset consists of sentence pairs from a variety of sources, labeled for entailment, contradiction, or neutral relationships. It is designed to evaluate model performance across different genres of text.

SNLI

The Stanford Natural Language Inference (SNLI) dataset is a large collection of sentence pairs manually labeled for balanced classification with entailment, contradiction, and neutral categories. It is widely used for training and evaluating natural language understanding systems.

RTE

The RTE-5 dataset is part of the Fifth PASCAL Recognizing Textual Entailment Challenge, containing pairs of text and hypothesis labeled as entailment or non-entailment. This dataset is used to evaluate models’ abilities to determine if one sentence logically follows from another.

Semantic Similarity

Semantic similarity refers to the degree of relatedness or similarity between two pieces of text based on their meaning. It involves quantifying the similarity of words, phrases, sentences, or documents using various techniques such as word embeddings, semantic models, or similarity metrics.

QQP

The Quora Question Pairs (QQP) dataset contains pairs of questions from Quora labeled to indicate if they have the same intent or are semantically equivalent. It is used to train models for identifying duplicate questions.

STS-B

The Semantic Textual Similarity Benchmark (STS-B) dataset, drawn from the SemEval STS shared tasks, contains sentence pairs annotated with similarity scores on a scale from 0 to 5. This dataset is used to train and evaluate models on their ability to predict the degree of semantic similarity between sentences.

MSRP

The Microsoft Research Paraphrase (MSRP) dataset consists of pairs of sentences, each annotated to indicate whether they are semantically equivalent (paraphrases) or not. It is used to train and test models for paraphrase detection and semantic similarity tasks.

Text Classification

Text classification is the task of categorizing text documents into predefined classes or categories based on their content. It is a fundamental problem in natural language processing and involves training models to automatically assign labels to text data, such as sentiment analysis, topic classification, or spam detection.

CoLA

The Corpus of Linguistic Acceptability (CoLA) consists of English sentences annotated for grammatical acceptability, based on the judgments of expert linguists. It is used to evaluate models on their ability to distinguish between grammatically correct and incorrect sentences.

SST

The Stanford Sentiment Treebank (SST) dataset includes movie reviews annotated with sentiment labels ranging from very negative to very positive. It is commonly used for training and evaluating sentiment analysis models.

GLUE

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks, including sentiment analysis, textual entailment, and question answering. It serves as a standard for evaluating and comparing the performance of language models.

Span representation

Span representation refers to the way a portion of text is encoded or represented in a machine learning model, often used in tasks like question answering where the model needs to identify spans of text that answer a given question.

Machine comprehension

Machine comprehension involves training models to understand and answer questions about a passage of text. It’s a subfield of natural language processing focused on teaching machines to extract information from written sources.

Pointer Net

Pointer Networks are a type of neural network architecture designed to handle variable-sized inputs and outputs by learning to point to specific elements in a sequence. They are commonly used in tasks like sequence-to-sequence learning and combinatorial optimization.

DrQA

DrQA (Document Reader Question Answering) is an open-domain question answering system developed by researchers at Facebook AI Research and Stanford. It uses a combination of information retrieval and machine reading techniques to find relevant documents from a large corpus and extract answers to questions from them.

ORQA

ORQA (Open-Retrieval Question Answering) is an open-domain question answering model developed by Google. Rather than relying on a fixed retrieval pipeline, it jointly learns a passage retriever and a reader, retrieving relevant evidence from a large text corpus and extracting contextually appropriate answers from it.

Wikipedia for QA

Wikipedia for QA refers to the use of Wikipedia as the knowledge source for open-domain question answering, as in the paper “Reading Wikipedia to Answer Open-Domain Questions.” Given a question, models must retrieve relevant Wikipedia passages and extract the answer from that unstructured text, rather than relying on a curated knowledge base.

XLNet

XLNet is a pre-trained language model developed by researchers at Carnegie Mellon University and Google Brain, based on the Transformer architecture, that achieves state-of-the-art results on various natural language processing tasks. It employs permutation-based training to capture bidirectional context and overcome limitations of masked-language-model pre-training approaches like BERT.

REALM

REALM (Retrieval-Augmented Language Model) is a language model developed by Google that integrates a dense retrieval mechanism to enhance its understanding of natural language. It leverages retrievers to retrieve relevant information from a large-scale knowledge base during inference, improving its performance on various NLP tasks.

Context affects LM

This topic examines how the context supplied to a language model affects the accuracy of its factual predictions. Variations in the surrounding or retrieved text can substantially change a model’s answers, highlighting the central role of context in making language model outputs more reliable.

RAG

Retrieval-Augmented Generation (RAG) is a framework developed by Facebook AI for knowledge-intensive tasks such as question answering, combining retrieval-based and generation-based approaches. It retrieves relevant passages from a large corpus and generates answers conditioned on both the input question and the retrieved passages.

GPT-1

GPT-1, introduced by OpenAI, pioneered the Generative Pretraining (GPT) approach for text generation. Trained on a large text corpus to predict the next word, it demonstrated strong text generation capabilities and paved the way for further development of large language models.