UniquifyAI recruits, trains, and deploys talented GenAI resources around the world.
No other technology field has witnessed the same level of explosive growth in the last 3 years as LLM technology.
The brain, the textbook, and the mentor: ChatGPT technology consists of three types: AI/ML (Brain), Data (Textbook), and Others (Mentor).
Name: GPT-1
Date: 2018
Source: OpenAI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Pioneered the Generative Pretraining (GPT) approach for text generation. Trained on a large text corpus, it can predict the next word, demonstrating text generation capabilities. This approach paved the way for further development in large language models.
Name: Pre-training
Date: 0
Source: nan
Talent Level: L2
Tech Type: AI/ML
Tech Source: Academic
Summary: Pre-training is a foundational technique in machine learning, particularly in the development of language models. It involves training a model on a large dataset before fine-tuning it on a specific task. During pre-training, a model learns general patterns, structures, and features from the data, which can include text, images, or other types of information. This process enables the model to acquire a broad understanding of the domain, making it more effective and efficient when subsequently trained on smaller, task-specific datasets. Pre-training is crucial for achieving state-of-the-art performance in various applications, such as natural language processing, computer vision, and speech recognition, by leveraging the knowledge gained from vast amounts of data.
Name: RoBERTa
Date: 2019
Source: UWash, FAIR
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: RoBERTa (Robustly optimized BERT approach) is a transformer-based model for natural language understanding, introduced by Facebook AI in 2019. It builds on BERT (Bidirectional Encoder Representations from Transformers) by training on more data and for longer periods with larger mini-batches. RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pretraining and uses dynamic masking rather than static masking. It achieves state-of-the-art performance on a range of NLP tasks, demonstrating the importance of extensive pretraining and fine-tuning for improving model robustness and accuracy.
Name: SpanBERT
Date: 2019
Source: Facebook, UW
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: SpanBERT is a variant of the BERT model specifically designed for improving span-based tasks in natural language processing (NLP). Introduced by researchers from Facebook AI in 2019, SpanBERT enhances BERT's capabilities by modifying the pretraining objectives to better capture and predict contiguous spans of text. It replaces BERT's Next Sentence Prediction (NSP) task with span boundary objectives, which involve predicting the entire span from its boundaries and reconstructing masked spans. SpanBERT achieves state-of-the-art performance on tasks like question answering and coreference resolution, highlighting its effectiveness in understanding and predicting text spans.
Name: UniLM
Date: 2019
Source: Microsoft
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: UniLM (Unified Language Model) is a versatile transformer-based model developed by Microsoft Research. It is designed to handle various natural language processing (NLP) tasks, including natural language understanding, generation, and translation, within a single framework. UniLM achieves this by leveraging a unified pretraining approach that combines the objectives of different tasks, such as sequence-to-sequence and masked language modeling. This allows the model to generate coherent text sequences and understand context effectively. UniLM has demonstrated state-of-the-art performance on a wide range of NLP benchmarks, showcasing its flexibility and effectiveness in handling diverse language tasks.
Name: MASS
Date: 2019
Source: Microsoft
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: MASS (Masked Sequence to Sequence Pre-training for Language Generation) is a language model developed by Microsoft Research. It introduces a novel pretraining method specifically designed for sequence-to-sequence language generation tasks. In MASS, a portion of the input sequence is masked, and the model is trained to predict the missing part, effectively learning to generate coherent sequences. This approach enhances the model's ability to handle various generation tasks, such as machine translation, text summarization, and text generation. MASS has shown significant improvements in performance over existing methods, demonstrating the effectiveness of masked sequence-to-sequence pretraining for language generation.
Name: BART
Date: 2019
Source: Facebook
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: BART (Bidirectional and Auto-Regressive Transformers) is a language model developed by Facebook AI, designed for natural language understanding and generation. It utilizes a denoising sequence-to-sequence pre-training approach where the model is trained to reconstruct the original text from a corrupted input. This method involves various types of noise, such as token masking, deletion, and shuffling. BART combines the strengths of bidirectional (BERT-like) and autoregressive (GPT-like) models, making it highly effective for tasks like text generation, translation, and summarization. BART achieves state-of-the-art performance on multiple NLP benchmarks, showcasing its robustness and versatility in handling complex language tasks.
Name: BERT
Date: 2018
Source: Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking language model introduced by Google AI. It employs a deep bidirectional transformer architecture, allowing it to capture context from both directions in a text sequence. BERT is pretrained using two main objectives: Masked Language Modeling (MLM), where random words in a sentence are masked and predicted, and Next Sentence Prediction (NSP), where the model learns to understand the relationship between paired sentences. This pretraining approach enables BERT to excel in various natural language understanding tasks, including question answering, sentiment analysis, and named entity recognition, setting new benchmarks across multiple NLP tasks.
Name: Seg Embedding
Date: 2018
Source: Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Segment Embedding (Seg Embedding) is a component used in the BERT (Bidirectional Encoder Representations from Transformers) model to differentiate between different segments (or sentences) within a single input sequence. In BERT's pretraining, each input consists of pairs of sentences, which are combined into a single sequence. Segment embeddings are used to indicate whether each token belongs to the first sentence (segment A) or the second sentence (segment B). This distinction is crucial for tasks like Next Sentence Prediction (NSP), where understanding the relationship between two sentences is essential. Segment embeddings, along with token and positional embeddings, enable BERT to effectively learn contextual representations and relationships between sentences during pretraining.
Name: Cloze Test
Date: 1953
Source: Taylor
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: The Cloze Test is a linguistic assessment method used to evaluate language comprehension and proficiency. In this test, participants are presented with a passage of text in which certain words are removed and replaced with blanks. The participants must then fill in the blanks with the appropriate words based on the context of the remaining text. The Cloze Test is commonly used in language education and psycholinguistics to measure reading ability, vocabulary knowledge, and overall language understanding. It serves as a diagnostic tool to identify areas of strength and weakness in a learner's linguistic abilities.
Name: T5
Date: 2019
Source: Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: T5 (Text-To-Text Transfer Transformer) is a transformer-based language model introduced by Google Research, designed to handle various natural language processing tasks using a unified text-to-text framework. Presented in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," T5 converts all tasks, such as translation, summarization, and classification, into a text-to-text format. This approach simplifies the model architecture and leverages transfer learning more effectively. T5 is pretrained on a large, diverse dataset using a denoising objective, where corrupted input sequences are reconstructed. It achieves state-of-the-art performance across multiple benchmarks, demonstrating the effectiveness of the unified text-to-text approach for various NLP tasks.
Name: Transformer-D
Date: 2018
Source: Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Transformer-D is a variation of the transformer architecture introduced in the paper "Generating Wikipedia by Summarizing Long Sequences." It is designed to handle long sequences of text by incorporating specialized mechanisms that improve the model's ability to generate coherent and contextually accurate summaries. Transformer-D addresses the challenge of managing long dependencies and maintaining the flow of information across lengthy inputs, which is critical for tasks like document summarization and long-form text generation. By enhancing the standard transformer model, Transformer-D achieves superior performance in summarizing extensive texts, demonstrating its capability to produce high-quality summaries from long input sequences.
Name: Transformer
Date: 2017
Source: Google AI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: The Transformer architecture is a novel approach that uses self-attention for dependency modeling in sequence-to-sequence tasks. It revolutionized NLP through its parallelization capabilities and effectiveness, making it a foundational model for many NLP tasks.
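To make the mechanism concrete, below is a minimal NumPy sketch of the scaled dot-product self-attention at the heart of the Transformer; the toy dimensions, random weights, and helper names are assumptions for illustration, not part of the original entry.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
        return weights @ V                               # weighted sum of values

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))                          # 5 tokens, 8-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
    print(out.shape)                                     # (5, 8): one context-aware vector per token

Because every token attends to every other token in a single matrix product, the computation parallelizes across the sequence, which is the property the summary credits for the architecture's impact.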
Name: Feed Forward NN
Date: 2017
Source: Google AI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: A feedforward neural network is a basic architecture where information flows in one direction, from input to output, through hidden layers. It's a building block in many models, including transformers. Transformers rely on a specific type of feedforward network called a Multi-Layer Perceptron (MLP) within their encoder and decoder blocks. This MLP helps process the information extracted by the transformer's self-attention mechanism, allowing the model to learn complex relationships within the data.
Name: Residual layer
Date: 2017
Source: Google AI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: In deep learning, a residual layer is a technique used in architectures like convolutional neural networks (CNNs) to address vanishing gradients. It allows the network to learn the identity function (simply outputting the input unchanged) alongside more complex transformations. This helps with training very deep networks by making it easier for them to learn and improve upon the information passed through previous layers.
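A minimal sketch of the idea, assuming a small two-layer transformation as the residual function:

    import numpy as np

    def f(x, W1, W2):
        # placeholder transformation (stands in for conv or attention sub-layers)
        return np.maximum(0, x @ W1) @ W2

    def residual_block(x, W1, W2):
        # the skip connection adds the input back, so the block only needs to
        # learn a correction on top of the identity function
        return x + f(x, W1, W2)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 16))
    W1, W2 = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
    print(residual_block(x, W1, W2).shape)  # (4, 16), same shape as the input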
Name: ResNet
Date: 2015
Source: Microsoft
Talent Level: L1
Tech Type: AI/ML
Tech Source: Industry
Summary: Resnet, short for Residual Neural Network, is a type of Convolutional Neural Network (CNN) architecture specifically designed to overcome challenges in training very deep networks. Traditional CNNs suffer from vanishing gradients, where information gets lost as it passes through many layers. Resnet tackles this by introducing skip connections. These connections bypass some layers and add the original input directly to the output of those layers. This allows the network to learn more complex transformations while also preserving the ability to learn the identity function (simply outputting the input unchanged). This approach enables Resnets to achieve superior performance on various computer vision tasks compared to traditional CNNs, especially when dealing with very deep architectures.
Name: Layer Norm
Date: 2016
Source: UToronto
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Layer normalization is a technique in machine learning that stabilizes the training process of neural networks. It works by normalizing the activations (outputs) of each neuron within a hidden layer, independently across different training examples. This helps address a phenomenon called internal covariate shift, where the distribution of activations can change throughout training, hindering learning. Layer normalization improves gradient flow, allowing for faster training and better generalization performance of the model.
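A minimal NumPy sketch of the computation (the learnable gamma and beta are left as scalars for simplicity):

    import numpy as np

    def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        # normalize each example independently across its features (last axis)
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
    print(layer_norm(x))  # each row now has roughly zero mean and unit variance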
Name: Batch Normalization
Date: 2015
Source: Google
Talent Level: L1
Tech Type: AI/ML
Tech Source: Industry
Summary: Batch normalization is a technique used during training in deep neural networks. It addresses a problem called internal covariate shift, where the distribution of data changes between layers as the network learns. Batch normalization normalizes the activations of each layer, making the training process faster and more stable. This allows the network to learn from a wider range of weight initializations and helps prevent overfitting.
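The training-time computation, sketched in NumPy to contrast with layer normalization above: batch norm averages over the batch axis rather than the feature axis. (At inference, running statistics collected during training are used instead; that bookkeeping is omitted here.)

    import numpy as np

    def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        # normalize each feature across the batch (axis 0)
        mean = x.mean(axis=0, keepdims=True)
        var = x.var(axis=0, keepdims=True)
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
    print(batch_norm(x))  # each column now has zero mean and unit variance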
Name: CNN
Date: 1980
Source: Fukushima
Talent Level: L1
Tech Type: AI/ML
Tech Source: Industry
Summary: CNN, or Convolutional Neural Network, is a type of deep learning model excelling at tasks involving grids of data, like images. It extracts features through convolutional layers that slide over the input, identifying patterns and edges. Pooling layers then summarize this information. Fully-connected layers at the end use these features for classification or other tasks. CNNs are particularly effective in computer vision applications like image recognition and object detection.
Name: AE
Date: 1991
Source: MIT
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Autoencoders are unsupervised learners that compress data into a lower-dimensional space and then try to recreate the original data from that compressed version. This process helps them learn efficient representations of the data, useful for dimensionality reduction or anomaly detection.
Name: GAN
Date: 2014
Source: Montreal U
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Generative Adversarial Networks (GANs) are a type of machine learning system that uses two competing neural networks to create new data. One network, the generator, tries to produce realistic data based on a training set, while the other network, the discriminator, tries to identify if the data is real or generated. This competition pushes both networks to improve, resulting in the generator creating increasingly realistic new data, like images, music, or even text.
Name: RNN
Date: 1993
Source: TU München
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Recurrent Neural Networks (RNNs) are a type of deep learning model that excels at handling sequential data like text or speech. Unlike traditional neural networks, RNNs can remember past inputs thanks to a hidden state. This allows them to analyze sequences and make predictions based on the context, making them ideal for tasks like machine translation, speech recognition, and caption generation.
Name: LSTM
Date: 1995
Source: TU München
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: LSTMs, or Long Short-Term Memory networks, are a special kind of Recurrent Neural Network (RNN) designed to overcome a weakness in RNNs. LSTMs have a built-in memory and control mechanisms that allow them to learn from long sequences of data. This makes them particularly good at tasks where understanding long-term context is important, like machine translation, speech recognition, and handwriting analysis.
Name: BiRNN
Date: 1997
Source: IEEE
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: BiRNN, short for Bidirectional Recurrent Neural Network, is a type of RNN that tackles sequences from both directions. Unlike a standard RNN that only looks at the past, a BiRNN considers both past and future elements in a sequence. This extra context allows BiRNNs to better understand the entire sequence, making them useful for tasks like sentiment analysis, speech recognition, and machine translation where surrounding information is crucial.
Name: VGG16
Date: 2014
Source: UOxford
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: VGG16 is a deep learning model, specifically a Convolutional Neural Network (CNN), known for its accuracy in image recognition and classification. It analyzes images through a series of stacked layers, progressively extracting higher-level features. VGG16 is famous for its depth (16 layers) and is often used as a pre-trained model to jumpstart training on new image recognition tasks.
Name: FaceNet
Date: 2015
Source: Google
Talent Level: L1
Tech Type: AI/ML
Tech Source: Industry
Summary: Facenet is a deep learning system for face recognition. It takes a person's face image and creates a unique 128-dimensional code, like a fingerprint for their face. This code captures key facial features and distances between them. Faces from the same person will have similar codes, even under variations like lighting or pose. Facenet is a powerful tool for building face recognition applications.
Name: VGGFace
Date: 2017
Source: UOxford
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: VGGFace refers to a family of pre-trained models specifically designed for face recognition tasks in machine learning. Based on the VGG16 architecture, these models are trained on massive datasets of labeled faces. VGGFace excels at extracting facial features and can be used for various tasks like face detection, recognition, and verification. It's often used as a starting point for fine-tuning on specific face recognition problems.
Name: YOLO
Date: 2015
Source: UW
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: YOLO, standing for "You Only Look Once," is a machine learning algorithm for real-time object detection. Unlike some methods, YOLO uses a single neural network to efficiently analyze the entire image at once. It divides the image into a grid and predicts bounding boxes and class probabilities for objects within each grid cell. This makes YOLO super fast for applications like self-driving cars or video surveillance where real-time processing is crucial. However, it can be less accurate than some other object detectors.
Name: CTC
Date: 2006
Source: TU München
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: CTC, or Connectionist Temporal Classification, is a technique used in machine learning specifically for sequence recognition tasks, often involving audio or text data. Unlike other methods, CTC doesn't require perfect alignment between the input sequence (like speech) and the output (like text). It considers all possible alignments and picks the most likely one, making it robust for dealing with variations in speech speed or pronunciation. CTC is commonly used in speech recognition systems.
Name: ReLU, Sigmoid
Date: 0
Source: nan
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: ReLU (Rectified Linear Unit) and sigmoid are activation functions used in machine learning's artificial neurons. They determine how the neuron processes information. ReLU acts like a switch, only firing if the input is positive. This makes it faster to compute and avoids vanishing gradients in deep networks. Sigmoid squishes values between 0 and 1, but can suffer from vanishing gradients and may not be ideal for all tasks.
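Both functions are one-liners; a small sketch for illustration:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)              # passes positives, zeroes negatives

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))      # squashes any input into (0, 1)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x))                           # [0.  0.  0.  0.5 2. ]
    print(sigmoid(x))                        # values strictly between 0 and 1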
Name: MLP
Date: 1958
Source: Frank Rosenblatt
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: In machine learning, an MLP, or Multilayer Perceptron, is a fundamental type of artificial neural network. It consists of layers of interconnected nodes, inspired by the human brain. Unlike simpler models, MLPs have multiple "hidden layers" between the input and output layers. These layers allow MLPs to learn complex patterns in data that aren't easily separated with straight lines. This makes them useful for tasks like image recognition, spam filtering, and even playing games.
Name: Perceptron
Date: 1959
Source: Frank Rosenblatt
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: A perceptron is the building block of neural networks in machine learning. It's a simple algorithm for binary classification tasks. Imagine a single neuron receiving inputs, like features of an image. The perceptron assigns weights to these inputs and sums them. If the sum is above a certain threshold, it outputs a 1, otherwise a 0. Perceptrons learn by adjusting weights to improve their classification accuracy. While limited to binary problems, they are a foundational concept for understanding more complex neural networks.
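The learning rule fits in a few lines; here is a sketch that trains a perceptron on the logical AND function (a linearly separable toy problem chosen for illustration):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
    y = np.array([0, 0, 0, 1])                       # AND labels

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(10):                              # a few passes over the data
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0        # threshold activation
            error = target - pred
            w += lr * error * xi                     # the perceptron learning rule
            b += lr * error

    print([1 if xi @ w + b > 0 else 0 for xi in X])  # [0, 0, 0, 1]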
Name: Supervised learning
Date: 0
Source: nan
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Supervised learning is a machine learning technique where algorithms learn from labeled data. This data acts like a training manual, with clear examples of inputs and their corresponding desired outputs. By analyzing these pairs, the algorithm learns the relationship between them and can then predict outputs for new, unseen data. Imagine a student learning shapes from a teacher. Supervised learning works similarly, allowing machines to learn and make predictions based on labeled examples.
Name: Unsupervised learning
Date: 0
Source: nan
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: In unsupervised machine learning, algorithms work with data that lacks predefined labels or categories. Unlike supervised learning where the data comes with clear instructions, unsupervised learning algorithms are tasked with finding hidden patterns and structures within the data itself. They achieve this by analyzing the data to group similar elements together, identify hidden categories, or find anomalies. This allows them to uncover interesting insights and prepare the data for further analysis or even generate entirely new data.
Name: Reinforcement learning
Date: 0
Source: nan
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: In reinforcement learning, an agent interacts with an environment by taking actions and receiving rewards. Through trial and error, the agent learns which actions lead to the most reward, constantly refining its strategy. This method mimics how humans learn by experimentation and is useful for tasks where the best course of action isn't explicitly defined.
Name: Regression, classification
Date: 0
Source: nan
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Regression and classification are two fundamental tasks in machine learning. Regression algorithms predict continuous values, like housing prices or weather forecasts. Classification algorithms, on the other hand, predict discrete categories, such as whether an email is spam or an image contains a cat. Essentially, regression predicts "what" while classification predicts "which".
Name: Machine learning
Date: 1959
Source: Arthur Samuel
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Machine learning is a type of artificial intelligence where computers learn from data without being explicitly programmed. By analyzing data, they can identify patterns and make predictions on new data, becoming increasingly accurate over time.
Name: Linear Algebra
Date: 0
Source: nan
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Linear algebra is the branch of mathematics focused on vectors, matrices, and linear equations. It provides the foundation for many machine learning algorithms, allowing us to represent data, perform transformations, and solve complex problems efficiently.
Name: Data, Label
Date: 0
Source: nan
Talent Level: L1
Tech Type: Data
Tech Source: Academic
Summary: In machine learning, data is the raw material, like images, text, or numbers. It holds the information the model needs to learn from. Labels are added information that tells the model what the data represents. Think of data as ingredients and labels as instructions in a recipe; both are crucial for the model to learn and make predictions.
Name: Vector math
Date: 0
Source: nan
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Vector math extends regular arithmetic to objects with both magnitude (size) and direction. Imagine arrows representing forces or velocities. Vector addition considers both the length and direction of the arrows to find a resultant arrow. This allows us to analyze and manipulate quantities with direction, crucial in physics, engineering, and many machine learning applications.
Name: Word Embedding
Date: 2013
Source: Google
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Word Embedding is a machine learning technique that maps words or phrases from a vocabulary into vectors of real numbers, allowing similar words to have similar numerical representations. This is crucial in natural language processing tasks, as it helps algorithms understand semantic and syntactic similarities between words.
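A toy sketch of the key property, using hand-made 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions; these values are assumptions for illustration):

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    emb = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.7, 0.2]),
        "apple": np.array([0.1, 0.2, 0.9]),
    }

    print(cosine(emb["king"], emb["queen"]))  # high: semantically related words
    print(cosine(emb["king"], emb["apple"]))  # low: unrelated words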
Name: Positional Embedding
Date: 2017
Source: Google AI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Positional Embedding is a technique used in Natural Language Processing that assigns each word or token in a sequence a vector representation based on its position, enabling the model to understand the order of words in a sequence. This is crucial for tasks such as translation or sentence completion, where the positioning of words impacts the meaning.
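One widely used variant is the sinusoidal encoding from the original Transformer paper; a compact NumPy sketch:

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    pe = sinusoidal_positions(seq_len=50, d_model=16)
    print(pe.shape)  # (50, 16): one position vector added to each token embedding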
Name: In/out Embedding Tie
Date: 2017
Source: Google AI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: In/out Embedding Tie refers to a method in machine learning where the same weights are shared between the input-to-hidden and hidden-to-output layers of a model, essentially "tying" these two layers together. This technique can reduce the number of parameters in the model, potentially improving the efficiency and performance of the model.
Name: Multihead attention
Date: 2017
Source: Google AI
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Multihead attention is a mechanism in transformer models that allows the model to focus on different positions of the input sequence simultaneously, thereby capturing various aspects of the information. It does so by splitting the input data into multiple 'heads', each of which independently learns different types of attention, before being recombined to produce the final output.
Name: Sentence embedding
Date: 2017
Source: IBM
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Sentence embedding is a technique in natural language processing where sentences are mapped into vectors of real numbers, essentially creating a mathematical representation of the sentence. This facilitates tasks such as semantic similarity measurement, text classification, and other language-related machine learning applications.
Name: Self-Attention
Date: 2018
Source: IBM
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Self-Attention is a mechanism in machine learning models that allows them to focus on relevant parts of the input for making predictions. It assigns different importance (attention) to different parts of the input, enabling the model to make better context-aware decisions.
Name: Attention
Date: 2014
Source: JacobsU, UMontreal
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Attention is a mechanism in machine learning models that allows them to focus on specific aspects of complex inputs, improving the accuracy of results. It's often used in natural language processing to help models understand context, remember previous information, and produce more accurate translations or responses.
Name: Bigram/N-gram LM
Date: 2009
Source: Brno U
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: A Bigram/N-gram Language Model (LM) is a type of statistical language model used in natural language processing that predicts the probability of a word given the previous 'N-1' words in a sentence. It's called a 'bigram' when N=2, and 'N-gram' for N>2, representing sequences of words or letters to anticipate and better understand context.
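For the bigram case, the probabilities are simple ratios of counts; a sketch on a toy corpus (smoothing for unseen pairs is omitted):

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p_next(word, prev):
        # P(word | prev) = count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p_next("cat", "the"))  # 2/3: "the" is followed by "cat" in 2 of its 3 uses
    print(p_next("mat", "the"))  # 1/3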
Name: Neural Probabilistic LM
Date: 2003
Source: Montreal U
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Neural Probabilistic Language Model (NPLM) is a type of language model that leverages neural networks to predict the next word in a sequence based on the words that precede it. It uses the context of the sentence (previous words) to form a high-dimensional representation, which it then uses to compute the probability distribution of the next word.
Name: ADAM Optimizer
Date: 2014
Source: Toronto U
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: The ADAM (Adaptive Moment Estimation) Optimizer is a machine learning algorithm that calculates individual adaptive learning rates for different parameters, combining the advantages of two other extensions of stochastic gradient descent: AdaGrad and RMSProp. It's particularly effective in settings where data and/or resources are sparse and computation is expensive.
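The update rule in full, sketched in NumPy and applied to minimizing f(w) = w^2 (the toy objective and hyperparameters are illustrative):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad          # first moment: gradient mean
        v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: gradient variance
        m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps) # per-parameter adaptive step
        return w, m, v

    w, m, v = 5.0, 0.0, 0.0
    for t in range(1, 2001):
        w, m, v = adam_step(w, 2 * w, m, v, t)      # gradient of w^2 is 2w
    print(round(w, 3))                              # close to 0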
Name: GELU
Date: 2016
Source: Berkeley
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: GELU, or Gaussian Error Linear Unit, is a type of activation function used in artificial neural networks that helps decide whether and how much a neuron should be activated. It is known for its efficiency, as it allows for fast and accurate training, and its ability to mitigate the vanishing gradient problem, which can slow down or prevent learning.
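GELU is defined as GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF; in practice a tanh approximation is common. A sketch:

    import numpy as np

    def gelu(x):
        # widely used tanh approximation of x * Phi(x)
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    print(gelu(x))  # smooth near zero, approaches ReLU for large |x|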
Name: Label Smoothing
Date: 2015
Source: Google
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: Label Smoothing is a regularization technique in machine learning that prevents the model from becoming too confident about the class labels by assigning a small amount of the total probability to all other labels. This technique helps to mitigate overfitting and improves generalization of the model by making the training process less sensitive to the exact values of the labels.
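The transformation of the targets is a one-liner; a sketch with smoothing factor 0.1:

    import numpy as np

    def smooth_labels(one_hot, eps=0.1):
        # move eps of the probability mass from the true class to all classes
        n_classes = one_hot.shape[-1]
        return one_hot * (1.0 - eps) + eps / n_classes

    y = np.array([0.0, 0.0, 1.0, 0.0])   # hard one-hot target
    print(smooth_labels(y))              # [0.025 0.025 0.925 0.025], still sums to 1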
Name: RMSProp
Date: 2012
Source: Toronto U
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm for neural networks, designed to attenuate the aggressively decreasing learning rate in conventional gradient descent methods. It adjusts the learning rate by dividing it by an exponentially decaying average of squared gradients, providing an individual learning rate for each parameter, effectively resolving the issue of diminishing learning rates in deep learning.
Name: AdaGrad
Date: 2011
Source: Berkeley
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: AdaGrad (Adaptive Gradient Algorithm) is an optimization algorithm in machine learning that adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features and larger updates for parameters associated with infrequently occurring features. It is particularly useful in scenarios where data is sparse and the learning rate needs to be adaptive.
Name: Momentum
Date: 1986
Source: Rumelhart, Hinton
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Momentum is a technique used in optimization algorithms, such as Gradient Descent, to speed up learning and avoid local minima by adding a fraction of the direction of the previous step to a current step. This way, the algorithm accumulates the gradient of the past steps to determine the direction to go, somewhat similar to a ball rolling downhill.
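The update keeps a running velocity; a small sketch minimizing f(w) = w^2 (learning rate and momentum coefficient are illustrative):

    # gradient descent with momentum on f(w) = w^2
    w, velocity = 5.0, 0.0
    lr, mu = 0.01, 0.9

    for _ in range(200):
        grad = 2 * w
        velocity = mu * velocity - lr * grad  # accumulate past gradient directions
        w = w + velocity                      # the "ball" keeps some of its speed
    print(round(w, 4))                        # close to 0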
Name: SGD
Date: 0
Source: nan
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Stochastic Gradient Descent (SGD) is an iterative algorithm used in machine learning and deep learning to find the optimal parameters that minimize a function, often a loss function. Unlike the standard Gradient Descent that uses all data points to compute the gradient, SGD randomly selects a batch of data points per iteration, significantly speeding up the process and reducing computational load.
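A sketch fitting a one-parameter linear model with random minibatches (the synthetic data and hyperparameters are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=1000)
    Y = 3.0 * X + 0.1 * rng.normal(size=1000)      # noisy targets around y = 3x

    w, lr, batch = 0.0, 0.05, 32
    for _ in range(300):
        idx = rng.integers(0, len(X), size=batch)  # random minibatch, not the full set
        xb, yb = X[idx], Y[idx]
        grad = 2 * np.mean((w * xb - yb) * xb)     # gradient of mean squared error
        w -= lr * grad
    print(round(w, 2))                             # approximately 3.0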
Name: Backpropagation
Date: 1982
Source: Werbos
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Backpropagation is a machine learning algorithm used in neural networks to adjust the weights and biases in response to errors, effectively "learning" from the mistakes. It does so by propagating the error backwards through the network, hence the name, and then using gradient descent to iteratively fine-tune the network parameters until the model's predictions are as accurate as possible.
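A hand-written sketch on a one-hidden-layer network learning XOR, showing the backward pass explicitly (network size, learning rate, and iteration count are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR labels

    W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
    lr = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)                      # forward pass
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)           # error at the output layer
        d_h = (d_out @ W2.T) * h * (1 - h)            # chain rule back to hidden layer
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

    pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    print(pred.round(2).ravel())                      # should approach [0, 1, 1, 0]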
Name: Regularization
Date: 0
Source: nan
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Regularization is a technique used in machine learning models to prevent overfitting by adding a penalty term to the loss function, which in turn reduces the complexity of the model. It helps to maintain a balance between bias and variance, ensuring that the model generalizes well on unseen data.
Name: Dropout
Date: 2014
Source: Toronto U
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Dropout is a regularization technique used in neural networks to prevent overfitting by randomly dropping out, or deactivating, a proportion of neurons during training. This forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
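A sketch of "inverted" dropout, where activations are rescaled at training time so inference needs no change:

    import numpy as np

    def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
        if not training:
            return x                        # at inference, dropout is a no-op
        mask = rng.random(x.shape) >= p     # drop each unit with probability p
        return x * mask / (1.0 - p)         # rescale so the expected value is unchanged

    h = np.ones((2, 6))
    print(dropout(h))                       # about half the units zeroed, survivors scaled by 2
    print(dropout(h, training=False))       # unchanged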
Name: Overfitting
Date: 0
Source: nan
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Overfitting is a modeling error in machine learning when a model is excessively complex and performs well on training data but poorly on new, unseen data. It happens when the model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
Name: Underfitting
Date: 0
Source: nan
Talent Level: L1
Tech Type: AI/ML
Tech Source: Academic
Summary: Underfitting is a concept in machine learning where a model fails to capture the underlying pattern of the data, usually due to its oversimplicity or lack of sufficient training data. This leads to poor performance on both training and test data, as the model lacks the complexity necessary to understand and predict accurate outcomes.
Name: Calculus
Date: 0
Source: Newton
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Calculus is a branch of mathematics that studies continuous change, primarily through the concepts of differentiation and integration. Differentiation measures the rate of change in a function, while integration accumulates the quantities produced by a function.
Name: Tokenization
Date: 0
Source: nan
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: Tokenization is a process in natural language processing (NLP) that splits a large amount of text into smaller parts called tokens. These tokens help in understanding the context or developing a model to analyze the text.
Name: sub-word tokens
Date: 2015
Source: UEdinburgh
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: Sub-word tokens are smaller units derived from whole words, often used in natural language processing to handle unknown or rare words and improve model performance. They can range from single characters to common word stems or suffixes, allowing a model to generalize from known tokens to unseen ones.
Name: WordPiece
Date: 2012
Source: Google
Talent Level: L2
Tech Type: Others
Tech Source: Industry
Summary: WordPiece is a subword tokenization algorithm used in natural language processing, which breaks words into smaller units, allowing the model to handle rare or unseen words more effectively by understanding their components. It helps in reducing the size of the vocabulary and improving the computational efficiency of language models.
Name: BPE
Date: 1994
Source: Philip Gage
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Byte Pair Encoding (BPE) is a data compression technique that replaces the most common pair of bytes in a dataset with a single byte not previously used. In natural language processing, BPE is used to split words into subwords to allow the model to handle rare and unseen words better.
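A compact sketch of the merge loop on a classic toy vocabulary (real implementations track word boundaries more carefully; this simplified version is for illustration):

    from collections import Counter

    def most_frequent_pair(words):
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return max(pairs, key=pairs.get)

    def merge_pair(words, pair):
        old, new = " ".join(pair), "".join(pair)
        return {word.replace(old, new): freq for word, freq in words.items()}

    # word frequencies, with words pre-split into characters
    words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
    for _ in range(5):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        print("merged", pair, "->", "".join(pair))
    print(words)  # frequent substrings such as "est" are now single tokens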
Name: Output Decoding
Date: 0
Source: nan
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: Output decoding is the process of turning a model's predicted probability distributions into a concrete output sequence. In language models, decoding strategies such as greedy search and beam search (described below) trade off computation against the quality of the generated text.
Name: Greedy Search
Date: 0
Source: nan
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: Greedy Search is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. It is used in optimization problems where the goal is to make the most optimal decision at each step, reducing the problem's complexity.
Name: Beam Search
Date: 1977
Source: nan
Talent Level: L2
Tech Type: Others
Tech Source: Academic
Summary: Beam Search is an algorithm used in many natural language processing tasks that reduces the complexity of searching through all possible sequences by only keeping track of the most promising sequences, referred to as "beams". This approach helps to optimize computational efficiency without significantly sacrificing the quality of results.
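A sketch over a hand-made next-token table (the probabilities are invented for illustration). Note that greedy search would commit to "the" (0.6) and end at probability 0.3, while a beam of width 2 keeps "a" alive and finds the higher-probability sequence "a cat" (0.36):

    import math

    # toy next-token distributions, assumed for illustration
    probs = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.5, "dog": 0.5},
        "a":   {"cat": 0.9, "dog": 0.1},
        "cat": {"</s>": 1.0},
        "dog": {"</s>": 1.0},
    }

    def beam_search(beam_width=2, steps=3):
        beams = [(["<s>"], 0.0)]                       # (sequence, log-probability)
        for _ in range(steps):
            candidates = []
            for seq, score in beams:
                for tok, p in probs[seq[-1]].items():
                    candidates.append((seq + [tok], score + math.log(p)))
            # keep only the highest-scoring sequences instead of all of them
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    for seq, score in beam_search():
        print(" ".join(seq), round(math.exp(score), 3))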
Name: Fine-tuning
Date: 0
Source: nan
Talent Level: L2
Tech Type: AI/ML
Tech Source: Academic
Summary: Fine-tuning is a transfer learning technique where a pre-trained model is further trained on a specific task or domain with task-specific data. It involves updating the parameters of the pre-trained model using a smaller dataset, allowing the model to adapt to the nuances of the target task while leveraging the knowledge learned from the pre-training phase.
Name: QA
Date: 0
Source: nan
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Question answering (QA) is a natural language processing task where a model is tasked with providing relevant answers to questions posed in natural language. It involves understanding the question, retrieving relevant information from a given context or knowledge base, and generating an accurate response.
Name: RACE
Date: 2017
Source: CMU
Talent Level: L2
Tech Type: AI/ML
Tech Source: Academic
Summary: The RACE dataset is a large-scale English language reading comprehension dataset, designed for evaluating machine reading comprehension models. It contains diverse texts and questions from various sources, including fictional and non-fictional texts, along with multiple-choice answers.
Name: StoryCloze
Date: 2016
Source: URochester, USNA, VT, Microsoft
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: The StoryCloze Test is a dataset for evaluating story understanding and script learning, where models are given the beginning of a story and must choose the correct ending from two options. It assesses the ability of models to generate coherent and contextually appropriate narrative continuations.
Name: DPR for QA
Date: 2020
Source: FAIR, UWash, PrincetonU
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: Dense Passage Retrieval (DPR) is a technique used in open-domain question answering systems to efficiently retrieve relevant passages from a large corpus of documents. It involves encoding passages and questions into dense representations and using similarity search methods to identify relevant passages.
Name: Context affects LM
Date: 2020
Source: FAIR
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: This study examines how contextual cues affect the factual predictions of language models. It shows that variations in the surrounding text can substantially change a model's output, highlighting the role of well-chosen context in improving the reliability of language model predictions.
Name: RAG
Date: 2020
Source: FAIR, UWash, PrincetonU
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: Retrieval-Augmented Generation (RAG) is a framework developed by Facebook for question answering tasks, which combines retrieval-based and generation-based approaches. It retrieves relevant passages from a large corpus and generates answers conditioned on both the input question and retrieved passages.
Name: REALM
Date: 2020
Source: Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: REALM (Retrieval-Augmented Language Model) is a language model developed by Google that integrates a dense retrieval mechanism to enhance its understanding of natural language. It leverages retrievers to retrieve relevant information from a large-scale knowledge base during inference, improving its performance on various NLP tasks.
Name: XLNet
Date: 2019
Source: CMU, Google Brain
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: XLNet is a pre-trained language model developed by CMU and Google Brain, based on the Transformer architecture, that achieves state-of-the-art results on various natural language processing tasks. It employs permutation-based training to capture bidirectional context and overcome limitations of traditional pre-training techniques like BERT.
Name: ORCA
Date: 2019
Source: Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: ORCA (Open Retrieval Chatbot) is a chatbot framework developed by Google that integrates retrieval-based methods for generating responses in conversation. It utilizes passage retrieval techniques to find relevant information from a large knowledge base and generate contextually appropriate responses.
Name: Wikipedia for QA
Date: 2017
Source: Stanford, FAIR
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: Wikipedia for Question Answering is a dataset developed by Stanford University for training and evaluating question answering models. It consists of questions paired with relevant passages from Wikipedia, allowing models to learn to extract answers from structured text.
Name: DrQA
Date: 2017
Source: Stanford, FAIR
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: DrQA (Document Reader Question Answering) is a question answering system developed by Stanford University. It uses a combination of information retrieval and machine reading techniques to find relevant documents from a large corpus and extract answers to questions from them.
Name: SQuAD
Date: 2016
Source: Stanford
Talent Level: L2
Tech Type: AI/ML
Tech Source: Academic
Summary: The Stanford Question Answering Dataset (SQuAD) is a benchmark dataset for question answering tasks, consisting of questions posed on Wikipedia articles where the answer is a segment of text from the corresponding passage. It is widely used for training and evaluating question answering models.
Name: Span representation
Date: 2016
Source: UWash, Tel-AvivU, Google
Talent Level: L2
Tech Type: AI/ML
Tech Source: Industry
Summary: Span representation refers to the way a portion of text is encoded or represented in a machine learning model, often used in tasks like question answering where the model needs to identify spans of text that answer a given question.
Name: Machine comprehension
Date: 2016
Source: SingaporeU
Talent Level: L2
Tech Type: AI/ML
Tech Source: Academic
Summary: Machine comprehension involves training models to understand and answer questions about a passage of text. It's a subfield of natural language processing focused on teaching machines to extract information from written sources.
Name: Pointer Net
Date: 2015
Source: Google Brain, Berkeley
Talent Level: L2
Tech Type: AI/ML
Tech Source: Collaborative
Summary: Pointer Networks are a type of neural network architecture designed to handle variable-sized inputs and outputs by learning to point to specific elements in a sequence. They are commonly used in tasks like sequence-to-sequence learning and combinatorial optimization.
Name: Max Inner Product Search
Date: 2012
Source: nan
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Maximum Inner Product Search (MIPS) is a search algorithm used to find the vectors in a database that have the highest inner product with a query vector. It's often employed in applications like recommendation systems and information retrieval.
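The exact, brute-force version is a single matrix product; production systems replace the full scan with approximate indexes. A sketch with random vectors standing in for learned embeddings:

    import numpy as np

    rng = np.random.default_rng(0)
    database = rng.normal(size=(10000, 64))   # e.g., item embeddings
    query = rng.normal(size=64)               # e.g., a user or question embedding

    scores = database @ query                 # inner product with every item
    top5 = np.argsort(scores)[::-1][:5]       # indices of the five largest scores
    print(top5, scores[top5])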
Name: TREC Conf
Date: 1992
Source: nan
Talent Level: L1
Tech Type: Data
Tech Source: Industry
Summary: The TREC Conference (Text REtrieval Conference) is an annual event organized by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense. It aims to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. Since its inception in 1992, TREC has facilitated advancements in search technologies through standardized benchmarking and shared tasks. Participants are given datasets and tasked with developing and evaluating systems to retrieve relevant information. The conference covers various tracks, including web search, question answering, and more specialized areas like legal and biomedical text retrieval. TREC's collaborative environment has been instrumental in driving progress and innovation in the field of information retrieval.
Name: Info Retrieval
Date: 1945
Source: Bush
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: Information Retrieval (IR) is the process of obtaining relevant information from large repositories, such as databases or the internet, based on user queries. It encompasses a wide range of tasks including search engine development, document retrieval, and data mining. IR systems use algorithms to match user queries with indexed documents, ranking them by relevance using techniques like keyword matching, semantic analysis, and machine learning. Applications of IR extend to various domains, including web search engines, digital libraries, and enterprise search solutions. The field continually evolves, integrating advancements in natural language processing and artificial intelligence to improve the accuracy and efficiency of information retrieval processes.
Name: Baseball
Date: 1961
Source: MIT
Talent Level: L1
Tech Type: Others
Tech Source: Academic
Summary: BASEBALL is an early automatic question-answering system developed to respond to queries about baseball games. Described in the paper "BASEBALL: An Automatic Question-Answerer," this system was one of the pioneering efforts in natural language processing. It used a database of facts about baseball games and employed syntactic parsing to understand and answer user questions. BASEBALL demonstrated the feasibility of automated question answering by accurately retrieving information based on natural language queries, setting the groundwork for future advancements in AI-driven information retrieval and natural language understanding systems.
Name: NLI
Date: 0
Source: nan
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: Natural Language Inference (NLI) is the task of determining the logical relationship between two pieces of text, typically referred to as the premise and the hypothesis. It involves classifying whether the hypothesis can be inferred from the premise, often categorized into entailment, contradiction, or neutral.
Name: SciTail
Date: 2018
Source: AllenAI
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The SciTail dataset is a textual entailment dataset created from multiple-choice science exams and web sentences, designed to evaluate natural language inference models. It consists of premise-hypothesis pairs labeled as either entailment or neutral.
Name: QNLI
Date: 2018
Source: NYU, UWash, DeepMind
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Question Natural Language Inference (QNLI) dataset is derived from the Stanford Question Answering Dataset (SQuAD), where the task is to determine if a given sentence contains the answer to a question. It is used to train and evaluate models on question-answer entailment.
Name: MNLI
Date: 2018
Source: NYU
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Multi-Genre Natural Language Inference (MNLI) dataset consists of sentence pairs from a variety of sources, labeled for entailment, contradiction, or neutral relationships. It is designed to evaluate model performance across different genres of text.
Name: SNLI
Date: 2015
Source: Stanford
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Stanford Natural Language Inference (SNLI) dataset is a large collection of sentence pairs manually labeled for balanced classification with entailment, contradiction, and neutral categories. It is widely used for training and evaluating natural language understanding systems.
Name: RTE
Date: 2009
Source: FBK, CIU, NIST, CELCT
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The RTE-5 dataset is part of the Fifth PASCAL Recognizing Textual Entailment Challenge, containing pairs of text and hypothesis labeled as entailment or non-entailment. This dataset is used to evaluate models' abilities to determine if one sentence logically follows from another.
Name: Semantic Similarity
Date: 0
Source: nan
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: Semantic similarity refers to the degree of relatedness or similarity between two pieces of text based on their meaning. It involves quantifying the similarity of words, phrases, sentences, or documents using various techniques such as word embeddings, semantic models, or similarity metrics.
Name: QQP
Date: 2017
Source: Quora
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Quora Question Pairs (QQP) dataset contains pairs of questions from Quora labeled to indicate if they have the same intent or are semantically equivalent. It is used to train models for identifying duplicate questions.
Name: STS-B
Date: 2017
Source: Google, WashU, UBasque Country, USheffield
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Semantic Textual Similarity Benchmark (STS-B) dataset from SemEval-2017 contains sentence pairs annotated with similarity scores on a scale from 0 to 5. This dataset is used to train and evaluate models on their ability to predict the degree of semantic similarity between sentences.
Name: MRPC
Date: 2005
Source: Microsoft
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Microsoft Research Paraphrase Corpus (MRPC) consists of pairs of sentences, each annotated to indicate whether they are semantically equivalent (paraphrases) or not. It is used to train and test models for paraphrase detection and semantic similarity tasks.
Name: Text Classification
Date: 0
Source: nan
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: Text classification is the task of categorizing text documents into predefined classes or categories based on their content. It is a fundamental problem in natural language processing and involves training models to automatically assign labels to text data, such as sentiment analysis, topic classification, or spam detection.
Name: CoLA
Date: 2018
Source: NYU, FAIR
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Corpus of Linguistic Acceptability (CoLA) consists of English sentences annotated for grammatical acceptability, based on the judgments of expert linguists. It is used to evaluate models on their ability to distinguish between grammatically correct and incorrect sentences.
Name: SST-2
Date: 2013
Source: Stanford
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Stanford Sentiment Treebank (SST) dataset includes movie reviews annotated with sentiment labels ranging from very negative to very positive. It is commonly used for training and evaluating sentiment analysis models.
Name: GPT-2
Date: 2019
Source: OpenAI
Talent Level: L3
Tech Type: AI/ML
Tech Source: Industry
Summary: A scaled-up version of GPT with 1.5 billion parameters, showing significant improvements in text generation quality and diversity compared to its predecessor. This advancement demonstrated the potential of increasing model size for performance gains in NLP tasks.
Name: Pre-training
Date: 0
Source: nan
Talent Level: L2
Tech Type: AI/ML
Tech Source: Academic
Summary: Pre-training involves training a model on a vast corpus of text data to learn general language patterns and representations. This foundational training enables the model to perform well on a variety of tasks after further fine-tuning on more specific datasets.
Name: WebText
Date: 2019
Source: OpenAI
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The WebText dataset is a large-scale dataset introduced by OpenAI for training language models. It consists of a diverse range of internet text scraped from outbound links shared on Reddit.
Name: DragNet
Date: 2013
Source: nan
Talent Level: L3
Tech Type: Others
Tech Source: Industry
Summary: The Dragnet dataset is a collection of web pages designed for evaluating algorithms that extract main content from web pages, omitting ads, navigation, and other extraneous elements. It provides labeled examples to facilitate the development and benchmarking of content extraction methods, ensuring algorithms can accurately identify and isolate the primary content from various types of web pages.
Name: Newspaper Content Text
Date: 0
Source: nan
Talent Level: L3
Tech Type: Others
Tech Source: Industry
Summary: The Newspaper Content Text dataset is a compilation of data gathered from newspaper articles. The dataset may include various features such as the title of the article, author, publication date, the content of the article, and more.
Name: Reddit
Date: 2005
Source: Reddit
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The Reddit dataset is a large collection of data from the social media platform Reddit. It includes information about Reddit posts, comments, upvotes, downvotes, and other user interactions on the site.
Name: Common Crawl
Date: 2018
Source: Common Crawl
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The Common Crawl dataset is a large and regularly updated corpus of web crawl data that is freely available to anyone. This dataset contains raw web page data, metadata, and text.
Name: Multitask Learning
Date: 1997
Source: CMU
Talent Level: L3
Tech Type: AI/ML
Tech Source: Academic
Summary: Multitask learning is an approach where a single model is trained to perform multiple related tasks simultaneously. By sharing information and features across tasks, multitask learning aims to improve the performance of individual tasks through joint learning, leading to better generalization and efficiency.
Name: Meta Learning
Date: 0
Source: nan
Talent Level: L3
Tech Type: AI/ML
Tech Source: Academic
Summary: Meta-learning, also known as learning to learn, involves training models on a variety of tasks with the goal of enabling them to quickly adapt to new tasks or domains with minimal data. It explores techniques for learning effective learning strategies or representations that generalize across tasks.
Name: Zero Shot
Date: 0
Source: nan
Talent Level: L3
Tech Type: AI/ML
Tech Source: Academic
Summary: Zero-shot learning is a machine learning paradigm where a model is trained to recognize classes it has never seen before during training. It involves learning to generalize to new classes by leveraging auxiliary information or semantic embeddings.
Name: MAML
Date: 2017
Source: Berkeley, OpenAI
Talent Level: L3
Tech Type: AI/ML
Tech Source: Academic
Summary: Model-Agnostic Meta-Learning (MAML) is a meta-learning algorithm that aims to train models to adapt quickly to new tasks with minimal data. It involves learning a good initialization that can be fine-tuned efficiently for new tasks, facilitating rapid adaptation.
Name: MQAN
Date: 2018
Source: Salesforce
Talent Level: L3
Tech Type: AI/ML
Tech Source: Industry
Summary: The Multitask Question Answering Network (MQAN) is a neural network architecture introduced by Salesforce for the decaNLP challenge. It casts diverse NLP tasks as question answering over a context and combines attention with pointer mechanisms, allowing answers to be copied from the question or context or generated from a vocabulary.
Name: decaNLP
Date: 2018
Source: Salesforce
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: decaNLP is a framework for training and evaluating multitask models across ten diverse natural language processing tasks. It covers a wide range of NLP tasks, including translation, summarization, and question answering, aiming to encourage research in multitask learning and generalization.
Name: QA-SRL
Date: 2015
Source: UWash
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: Question Answering-Semantic Role Labeling (QA-SRL) is a task that involves identifying semantic roles in sentences and generating answers based on these roles. It combines the tasks of question answering and semantic role labeling to improve natural language understanding.
Name: QA-ZRE
Date: 2017
Source: AllenAI
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: Question Answering-Zero-shot Relation Extraction (QA-ZRE) is a task that combines question answering and relation extraction, where models are trained to answer questions about relations between entities in text without explicit supervision for relation extraction.
Name: WOZ
Date: 2016
Source: Cambridge
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: Wizard-of-Oz (WOZ) is a data collection technique where human operators simulate the behavior of an automated system to collect naturalistic interactions with users. The WOZ dataset referenced here, collected at Cambridge using this technique, is a restaurant-reservation dialogue corpus commonly used to train and evaluate dialogue state tracking, including in the decaNLP benchmark.
Name: WikiSQL
Date: 2017
Source: Salesforce
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: WikiSQL is a dataset for semantic parsing and question answering tasks, where models are trained to map natural language questions to SQL queries over a structured table. It is used to evaluate models on their ability to understand and generate SQL queries from text.
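A schematic of what one WikiSQL-style example pairs together (field names are illustrative, not the exact release format):

    example = {
        "question": "How many players are from New York?",
        "table": {"header": ["Player", "Hometown"],
                  "rows": [["A. Smith", "New York"], ["B. Jones", "Boston"]]},
        "sql": "SELECT COUNT(Player) FROM table WHERE Hometown = 'New York'",
    }
    # A semantic parser is trained to map example["question"] (plus the table
    # schema) to example["sql"].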
Name: Catastrophic forgetting
Date: 2019
Source: DeepMind
Talent Level: L3
Tech Type: AI/ML
Tech Source: Industry
Summary: Catastrophic forgetting refers to the phenomenon where a machine learning model forgets previously learned information when trained on new data. It is a challenge in lifelong or continual learning settings where models need to adapt to new tasks while retaining knowledge from previous tasks.
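One well-known mitigation is Elastic Weight Consolidation (EWC, DeepMind 2017), which penalizes movement of weights that were important for earlier tasks; a minimal sketch of the penalized loss (values below are hypothetical):

    import numpy as np

    def ewc_loss(task_b_loss, theta, theta_a_star, fisher, lam=1.0):
        # The quadratic penalty anchors weights with high Fisher information
        # (i.e., weights that mattered for task A) near their task-A optimum.
        penalty = 0.5 * lam * np.sum(fisher * (theta - theta_a_star) ** 2)
        return task_b_loss + penalty

    theta = np.array([1.0, 2.0])      # current weights while training task B
    theta_a = np.array([0.8, 2.5])    # weights after finishing task A
    fisher = np.array([5.0, 0.1])     # importance of each weight for task A
    print(ewc_loss(0.3, theta, theta_a, fisher))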
Name: Training Curriculum
Date: 2019
Source: DeepMind
Talent Level: L3
Tech Type: AI/ML
Tech Source: Industry
Summary: Training curriculum refers to a strategy for training machine learning models where training examples are presented to the model in a structured order, typically starting with simpler examples and gradually increasing in complexity. It is used to facilitate learning and improve the generalization performance of models.
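A minimal curriculum sketch: rank examples by a difficulty proxy (sentence length here, standing in for any task-specific measure) and widen the training pool stage by stage:

    examples = [
        "a cat",
        "the cat sat",
        "the cat sat on the warm mat",
        "cats, unlike dogs, often ignore commands entirely",
    ]
    by_difficulty = sorted(examples, key=len)

    for stage in range(1, len(by_difficulty) + 1):
        pool = by_difficulty[:stage]          # gradually admit harder examples
        for ex in pool:
            pass  # train_step(model, ex) -- hypothetical training call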
Name: Tokenization
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Tokenization is the process of breaking down a text into smaller units, such as words, subwords, or characters, known as tokens. It is a crucial preprocessing step in natural language processing tasks, enabling the representation of text data in a format suitable for machine learning algorithms.
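The two extremes are easy to show directly; subword tokenizers such as BPE sit between them, trading vocabulary size against sequence length:

    text = "Tokenization breaks text into units."

    word_tokens = text.split()    # word-level: short sequences, huge vocabulary
    char_tokens = list(text)      # character-level: tiny vocabulary, long sequences
    print(word_tokens[:3])        # ['Tokenization', 'breaks', 'text']
    print(char_tokens[:5])        # ['T', 'o', 'k', 'e', 'n']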
Name: BPE with Unicode
Date: 2019
Source: OpenAI
Talent Level: L3
Tech Type: Others
Tech Source: Industry
Summary: Byte Pair Encoding (BPE) with Unicode is an extension of the BPE algorithm that handles Unicode characters, allowing for the efficient tokenization of multilingual text. It is commonly used in natural language processing tasks such as machine translation and text generation.
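A minimal sketch of a single BPE merge step on a toy corpus; byte-level BPE (as used by GPT-2) starts from the 256 byte values rather than characters, so any Unicode string is representable with no out-of-vocabulary tokens:

    from collections import Counter

    words = [list("low"), list("lower"), list("lowest")]

    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1

    best = max(pairs, key=pairs.get)          # most frequent adjacent pair, here ('l', 'o')
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                out.append(w[i] + w[i + 1]); i += 2
            else:
                out.append(w[i]); i += 1
        print(out)    # ['lo', 'w'], ['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't']
    # A full tokenizer repeats this merge step until the target vocabulary size is reached.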
Name: Unicode
Date: 1991
Source: Unicode Consortium
Talent Level: L1
Tech Type: Data
Tech Source: Academic
Summary: Unicode 1.0, the first version of the Unicode Standard, was published by the Unicode Consortium in 1991. It laid the foundation for character encoding and representation standards used in modern computing systems, facilitating the consistent handling of text across different platforms and languages.
Name: Fine-tuning
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Fine-tuning is a transfer learning technique where a pre-trained model is further trained on a specific task or domain with task-specific data. It involves updating the parameters of the pre-trained model using a smaller dataset, allowing the model to adapt to the nuances of the target task while leveraging the knowledge learned from the pre-training phase.
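A minimal numeric sketch of the idea: start from "pre-trained" weights instead of random ones and adapt them on a small task dataset with a small learning rate (all data and values here are toy stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    w_pretrained = rng.normal(size=3)          # stand-in for a pre-trained checkpoint

    X = rng.normal(size=(10, 3))               # small task-specific dataset
    y = X @ np.array([1.0, -2.0, 0.5])

    w = w_pretrained.copy()                    # initialize from pre-training, not from scratch
    for _ in range(100):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w -= 1e-2 * grad                       # small steps: adapt without erasing prior knowledge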
Name: GLUE
Date: 2018
Source: NYU, UWash, DeepMind
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks, including sentiment analysis, textual entailment, and question answering. It serves as a standard for evaluating and comparing the performance of language models.
Name: Natural Q
Date: 2019
Source: Google
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The Natural Questions dataset contains real anonymized questions from Google search, each paired with a Wikipedia page and a corresponding answer span. It is used to train and evaluate question answering systems on their ability to find precise answers within long documents.
Name: CoQA
Date: 2018
Source: Stanford
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: The Conversational Question Answering (CoQA) dataset consists of question-answer pairs within a conversational context, focusing on answering questions based on a given passage. It evaluates models on their ability to understand and generate contextually relevant answers in a conversation.
Name: LAMBADA
Date: 2016
Source: UTrento, UAmsterdam
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: The LAMBADA dataset is designed to test language models on their ability to predict the last word of sentences that require a broad understanding of the context provided by preceding text. It emphasizes the need for models to grasp long-range dependencies in text.
Name: WikiText
Date: 2016
Source: Salesforce
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The WikiText dataset is a collection of Wikipedia articles curated for language modeling tasks, featuring long-form, coherent text with minimal editing. It is used to train and evaluate models on their ability to generate and predict natural language text.
Name: CBT
Date: 2015
Source: FAIR
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The Children's Book Test (CBT) dataset, created by Facebook AI Research (FAIR), consists of sentences from children's books with one word removed and multiple-choice options provided for the missing word. It is designed to evaluate language models on their ability to understand and predict the context of a given text.
Name: CNN/Daily Mail
Date: 2015
Source: Google
Talent Level: L3
Tech Type: Data
Tech Source: Industry
Summary: The CNN/Daily Mail dataset is a large collection of news articles paired with multi-sentence summaries, created to facilitate research in automatic summarization and reading comprehension. It is used to train models to generate concise summaries from longer texts.
Name: 1BW
Date: 2013
Source: Google
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: The 1 Billion Word Benchmark is a dataset composed of a large collection of sentences from news articles, designed to support research in language modeling. It aims to evaluate models on their ability to predict and generate fluent and coherent text.
Name: PTB
Date: 1992
Source: UPenn, NorthwesternU
Talent Level: L3
Tech Type: Data
Tech Source: Academic
Summary: The Penn Treebank (PTB) dataset contains text from the Wall Street Journal, annotated with syntactic structure and part-of-speech tags. It is a standard benchmark for evaluating models on tasks such as syntactic parsing and language modeling.
Name: GPT-3
Date: 2020
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Introduced by OpenAI in the 2020 paper "Language Models are Few-Shot Learners," GPT-3 scales to 175 billion parameters, far more than GPT-2, and showcases even more impressive capabilities in text generation, translation, and code writing. Its strong few-shot, in-context learning marked a significant leap forward in NLP capabilities.
Name: Pre-training
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Pre-training involves training a model on a vast corpus of text data to learn general language patterns and representations. This foundational training enables the model to perform well on a variety of tasks after further fine-tuning on more specific datasets.
Name: Filtered WebText
Date: 2019
Source: Common Crawl
Talent Level: L4
Tech Type: Data
Tech Source: Industry
Summary: Filtered WebText is a curated dataset derived from the Common Crawl, a publicly available web archive. It involves filtering and preprocessing raw web data to remove low-quality or irrelevant content, resulting in a high-quality text corpus suitable for training language models. This process ensures that the dataset includes diverse and informative text from reliable sources, enhancing the training data's overall quality. Filtered WebText is used to train various natural language processing models, providing them with a rich and diverse set of examples that help improve their language understanding, generation capabilities, and performance across a range of tasks.
Name: Books1, Books2
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Books1 and Books2 are internet-based book corpora, containing a random sampling of a small subset of the public-domain books that humanity has published and that are available online.
Name: WikiPedia
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: A large text dataset drawn from Wikipedia, the online encyclopedia.
Name: Large Batch Training
Date: 2018
Source: OpenAI
Talent Level: L4
Tech Type: Others
Tech Source: Industry
Summary: Large batch training involves using significantly larger batch sizes in training deep learning models to improve training efficiency and model performance. This approach leverages advanced optimization techniques and robust computational resources to accelerate the training process while maintaining or enhancing the accuracy and generalization capabilities of the models.
Name: Weight Decay
Date: 2017
Source: UFreiburg
Talent Level: L4
Tech Type: Others
Tech Source: Academic
Summary: Weight Decay is a regularization technique used in training neural networks to prevent overfitting by penalizing large weights. As detailed in the paper "Decoupled Weight Decay Regularization," this method involves adding a term to the loss function that is proportional to the sum of the squared weights of the model. By doing so, the training process encourages the model to keep its weights small, which helps in reducing the model's complexity and improving generalization to unseen data. Decoupled weight decay specifically separates the weight decay from the gradient update step, allowing for more flexible and effective regularization. This approach enhances the stability and performance of neural networks, particularly in large-scale and complex models.
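A sketch contrasting classic (coupled) L2 regularization with the decoupled update of AdamW, where the decay term bypasses the adaptive gradient rescaling; hyperparameters are illustrative:

    import numpy as np

    def sgd_step_coupled(w, grad, lr, wd):
        # Classic weight decay: the L2 term is folded into the gradient.
        return w - lr * (grad + wd * w)

    def adamw_step(w, grad, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
        # Decoupled weight decay: decay is applied to the weights directly,
        # outside the Adam moment estimates -- the key idea of AdamW.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
        return w, m, v

    w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    w, m, v = adamw_step(w, np.array([0.1, -0.2, 0.3]), m, v, t=1, lr=1e-3, wd=0.01)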
Name: Few Shot LM
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: A Few-Shot Language Model can understand and generate human-like text based on a small sample of input data. It reduces the necessity for large amounts of training data and is capable of performing tasks like translation, question answering, and summarization after being trained on diverse internet text.
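In practice the "small sample" is supplied in the prompt itself and no weights change. A schematic few-shot prompt (the model call is a hypothetical placeholder):

    prompt = (
        "Translate English to French.\n"
        "sea otter -> loutre de mer\n"
        "cheese -> fromage\n"
        "mint -> "                 # the model is expected to continue with the answer
    )
    # completion = language_model.generate(prompt)   # hypothetical API call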
Name: Sparse Transformer
Date: 2019
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: A Sparse Transformer is a deep learning model that utilizes a sparse attention mechanism, enabling it to efficiently process long sequences of data by focusing on relevant parts rather than the entire sequence. This makes it particularly useful for tasks involving large amounts of data like language translation, image recognition, and music synthesis.
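A sketch of one of the paper's factorized patterns (strided attention): each position attends to a local causal window plus every stride-th earlier position, so total attention cost grows roughly as n*sqrt(n) when the stride is about sqrt(n):

    import numpy as np

    def strided_sparse_mask(n, stride):
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            mask[i, max(0, i - stride + 1): i + 1] = True          # local causal band
            mask[i, np.arange(i % stride, i + 1, stride)] = True   # strided positions
        return mask

    print(strided_sparse_mask(8, 4).astype(int))   # 1 = attended, 0 = skipped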
Name: MAML
Date: 2017
Source: Berkeley, OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Academic
Summary: MAML (Model-Agnostic Meta-Learning) is a machine learning algorithm that aims to rapidly adapt to new tasks with minimal training data by learning a model initialization that can be fine-tuned effectively. It's model-agnostic in the sense that it can be applied to any model trained with gradient descent, making it highly versatile in application.
Name: Match Network
Date: 2016
Source: DeepMind
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Matching Networks are a meta-learning architecture introduced by DeepMind for one-shot classification. Given a small labeled support set, the model embeds the support examples and the query, then classifies the query via an attention (similarity-weighted) mixture over the support labels, allowing new classes to be recognized without further gradient updates.
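A minimal sketch of the attention-based classification at the heart of Matching Networks (toy embeddings; a real model would produce them with a learned encoder):

    import numpy as np

    def matching_predict(x, support_x, support_y, n_classes):
        # Softmax over cosine similarities weights each support label.
        sims = support_x @ x / (np.linalg.norm(support_x, axis=1) * np.linalg.norm(x))
        attn = np.exp(sims) / np.exp(sims).sum()
        probs = np.zeros(n_classes)
        for a, y in zip(attn, support_y):
            probs[y] += a
        return probs.argmax()

    support_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])   # few-shot support set
    support_y = np.array([0, 0, 1])
    print(matching_predict(np.array([0.8, 0.2]), support_x, support_y, 2))   # -> 0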
Name: Parameter Scaling
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Parameter scaling refers to increasing the number of trainable parameters in a neural network to improve its capability. In the GPT lineage, scaling parameters together with data and compute consistently improved performance and enabled new behaviors such as in-context learning, motivating the jump from GPT-2's 1.5 billion parameters to GPT-3's 175 billion.
Name: Scaling Law for NN
Date: 0
Source: OpenAI
Talent Level: L4
Tech Type: Others
Tech Source: Industry
Summary: Scaling laws for neural networks refer to the relationship between the resources used (such as data, model size, or training time) and the model's performance. Essentially, increasing these resources tends to improve the model's performance, but with diminishing returns, meaning that after a certain point, adding more resources leads to smaller and smaller improvements.
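The empirical form is a power law; the sketch below uses constants of the order reported in OpenAI's 2020 scaling-law study (illustrative only, not a fit to any particular model):

    def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
        # Loss falls as a power of parameter count N: L(N) = (N_c / N) ** alpha.
        return (n_c / n_params) ** alpha

    for n in [1e8, 1e9, 1e10, 1e11]:
        print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
    # Each 10x in parameters buys a smaller absolute improvement: diminishing returns.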
Name: Few Shot Test
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Few Shot Test is a method in machine learning where a model is trained to identify a new category based on a few examples, thereby testing the model's ability to generalize from limited data. It's a crucial test for AI systems, mimicking the human ability to learn new concepts from a few instances.
Name: SuperGLUE
Date: 2019
Source: NYU, FAIR, UWash, DeepMind
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: SuperGLUE is a benchmark designed to be more challenging than GLUE, consisting of a diverse set of natural language understanding tasks including coreference resolution, commonsense reasoning, and question answering. It aims to push the boundaries of current language models.
Name: HellaSwag
Date: 2019
Source: AllenAI, UWash
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: HellaSwag is a dataset designed for commonsense reasoning, containing multiple-choice questions where models must choose the most plausible continuation of a given context. It focuses on evaluating the ability of models to generate and understand coherent, sensible text.
Name: WinoGrande
Date: 2019
Source: AllenAI, UWash
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: WinoGrande is a dataset for commonsense reasoning, derived from the Winograd Schema Challenge, containing sentence pairs with ambiguous pronouns that require commonsense knowledge to resolve. It is used to train and test models on their ability to understand context and resolve ambiguities.
Name: Natural Q
Date: 2019
Source: Google
Talent Level: L4
Tech Type: Data
Tech Source: Industry
Summary: The Natural Questions dataset contains real anonymized questions from Google search, each paired with a Wikipedia page and a corresponding answer span. It is used to train and evaluate question answering systems on their ability to find precise answers within long documents.
Name: DROP
Date: 2019
Source: UCI, PekingU, AllenAI, UWash
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: The Discrete Reasoning Over Paragraphs (DROP) dataset consists of questions that require discrete operations such as addition, subtraction, or sorting to answer, based on information within paragraphs. It evaluates the ability of models to perform complex reasoning over textual data.
Name: ARC
Date: 2018
Source: AllenAI
Talent Level: L4
Tech Type: Data
Tech Source: Industry
Summary: The AI2 Reasoning Challenge (ARC) dataset includes multiple-choice science questions from standardized exams, designed to test advanced reasoning abilities in artificial intelligence systems. It is used to evaluate models on their understanding of scientific concepts and reasoning skills.
Name: TriviaQA
Date: 2017
Source: AllenAI
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: TriviaQA is a large-scale question answering dataset containing question-answer pairs from trivia websites, with corresponding evidence documents from Wikipedia and the web. It is designed to train and evaluate models on their ability to find and extract relevant information.
Name: LAMBADA
Date: 2016
Source: UTrento, UAmsterdam
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: The LAMBADA dataset is designed to test language models on their ability to predict the last word of sentences that require a broad understanding of the context provided by preceding text. It emphasizes the need for models to grasp long-range dependencies in text.
Name: StoryCloze
Date: 2016
Source: URochester, USNA, VT, Microsoft
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: The StoryCloze Test is a dataset for evaluating story understanding and script learning, where models are given the beginning of a story and must choose the correct ending from two options. It assesses the ability of models to generate coherent and contextually appropriate narrative continuations.
Name: WebQ
Date: 2013
Source: Stanford
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: The WebQuestions (WebQ) dataset contains questions sourced from Google queries, paired with answers from Freebase, a large knowledge graph. It is used to evaluate models on their ability to understand and answer natural language questions based on structured data.
Name: PTB
Date: 1992
Source: UPenn, NorthwesternU
Talent Level: L2
Tech Type: Data
Tech Source: Academic
Summary: The Penn Treebank (PTB) dataset contains text from the Wall Street Journal, annotated with syntactic structure and part-of-speech tags. It is a standard benchmark for evaluating models on tasks such as syntactic parsing and language modeling.
Name: GPT-4
Date: 2023
Source: OpenAI
Talent Level: L5
Tech Type: AI/ML
Tech Source: Industry
Summary: Released by OpenAI in March 2023, GPT-4 is a large multimodal model that accepts image and text inputs and produces text outputs. OpenAI's technical report withholds architectural details such as parameter count, but GPT-4 continues the trend of improved performance with increasing scale, reaching human-level results on many professional and academic benchmarks.
Name: Training
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Training involves feeding a model large amounts of data and adjusting its parameters through iterative optimization to minimize the difference between its predictions and the actual outcomes. This process allows the model to learn patterns and relationships within the data, improving its ability to make accurate predictions or generate relevant outputs when given new inputs.
Name: Multimodal LM
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Multimodal machine learning models process information from multiple sources, like text, images, or audio. By combining these modalities, they gain a richer understanding of the data compared to models using a single data type. This allows them to perform tasks like generating captions for images or translating spoken language while considering visual context.
Name: RLHF
Date: 2023
Source: OpenAI
Talent Level: L5
Tech Type: AI/ML
Tech Source: Industry
Summary: RLHF stands for Reinforcement Learning from Human Feedback. It's a technique used to improve the model by giving it rewards based on human input.
Name: MoE
Date: 2017
Source: Google, JagiellonianU
Talent Level: L5
Tech Type: AI/ML
Tech Source: Industry
Summary: MoE has many smaller, specialized networks ("experts") and a gating network. The gating network selects a few experts for each input, allowing the MoE layer to handle complex tasks with thousands of times more parameters than usual.
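A minimal top-k gating sketch: the gate scores all experts, only the k highest-scoring experts run, and their outputs are mixed by the renormalized gate weights (the experts here are toy linear maps):

    import numpy as np

    def moe_forward(x, experts, gate_w, k=2):
        scores = x @ gate_w                         # gating-network logits, one per expert
        top = np.argsort(scores)[-k:]               # indices of the k selected experts
        weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
        return sum(w * experts[i](x) for w, i in zip(weights, top))

    rng = np.random.default_rng(0)
    experts = [lambda x, W=rng.normal(size=(4, 4)): x @ W for _ in range(8)]
    gate_w = rng.normal(size=(4, 8))
    print(moe_forward(rng.normal(size=4), experts, gate_w))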
Name: Red Teaming
Date: 2022
Source: DeepMind, NYU
Talent Level: L5
Tech Type: AI/ML
Tech Source: Collaborative
Summary: Red Teaming, as explored in the DeepMind and NYU paper "Red Teaming Language Models with Language Models," is the practice of probing a model for harmful or undesired behavior before deployment. The paper automates this by using one language model to generate test prompts that elicit failures, such as offensive replies or leakage of training data, from a target model, so failure modes can be discovered and fixed at scale.
Name: Sparrow
Date: 2022
Source: DeepMind
Talent Level: L5
Tech Type: AI/ML
Tech Source: Industry
Summary: Sparrow is a dialogue agent developed by DeepMind, trained with reinforcement learning from human feedback to be helpful, correct, and harmless. It follows a set of behavioral rules and can support its factual claims with evidence retrieved from the web, reducing unsafe or inaccurate answers.
Name: Test Data
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Test data is a separate subset of data used to evaluate the performance of a trained model, ensuring it can generalize well to unseen examples. It provides an unbiased measure of the model's accuracy and effectiveness by comparing its predictions against known outcomes not previously seen during training.
Name: MMLU
Date: 2020
Source: Berkeley, Columbia
Talent Level: L5
Tech Type: Data
Tech Source: Academic
Summary: The Massive Multitask Language Understanding (MMLU) dataset includes a wide range of multiple-choice questions across various domains such as history, mathematics, and science. It is designed to evaluate the breadth and depth of a language model's general knowledge and reasoning abilities.
Name: HumanEval
Date: 2021
Source: OpenAI
Talent Level: L5
Tech Type: Data
Tech Source: Industry
Summary: The HumanEval dataset consists of programming problems designed to assess the code generation capabilities of language models. Each problem includes a natural language prompt and a set of test cases, used to evaluate the correctness and functionality of generated code.
Name: GSM-8K
Date: 2021
Source: OpenAI
Talent Level: L5
Tech Type: Data
Tech Source: Industry
Summary: The Grade School Math 8K (GSM-8K) dataset contains thousands of math word problems typically encountered in grade school. It is used to train and evaluate models on their ability to perform arithmetic and understand mathematical concepts in a natural language context.
Name: ETHICS
Date: 2020
Source: Berkeley, Columbia
Talent Level: L5
Tech Type: Data
Tech Source: Academic
Summary: The ETHICS dataset is designed to evaluate the ethical reasoning abilities of AI models. It includes scenarios and questions related to moral and ethical dilemmas, aiming to train and assess models on their understanding and application of ethical principles.
Name: TruthfulQA
Date: 2021
Source: UOxford, OpenAI
Talent Level: L4
Tech Type: Data
Tech Source: Collaborative
Summary: TruthfulQA is a dataset designed to evaluate the truthfulness and factual accuracy of language models when answering questions. It includes questions that are carefully crafted to test whether models produce truthful responses, addressing the challenge of misinformation in AI-generated content.
Name: RealToxicityPrompts
Date: 2020
Source: AllenAI
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: RealToxicityPrompts is a dataset created to study and mitigate the generation of toxic language by AI models. It consists of prompts that could lead to toxic completions, used to evaluate and improve the safety and robustness of language models against generating harmful content.
Name: HellaSwag
Date: 2019
Source: AllenAI, UWash
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: HellaSwag is a dataset designed for commonsense reasoning, containing multiple-choice questions where models must choose the most plausible continuation of a given context. It focuses on evaluating the ability of models to generate and understand coherent, sensible text.
Name: WinoGrande
Date: 2019
Source: AllenAI, UWash
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: WinoGrande is a dataset for commonsense reasoning, derived from the Winograd Schema Challenge, containing sentence pairs with ambiguous pronouns that require commonsense knowledge to resolve. It is used to train and test models on their ability to understand context and resolve ambiguities.
Name: DROP
Date: 2019
Source: UCI, PekingU, AllenAI, UWash
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: The Discrete Reasoning Over Paragraphs (DROP) dataset consists of questions that require discrete operations such as addition, subtraction, or sorting to answer, based on information within paragraphs. It evaluates the ability of models to perform complex reasoning over textual data.
Name: ARC
Date: 2018
Source: AllenAI
Talent Level: L4
Tech Type: Data
Tech Source: Industry
Summary: The AI2 Reasoning Challenge (ARC) dataset includes multiple-choice science questions from standardized exams, designed to test advanced reasoning abilities in artificial intelligence systems. It is used to evaluate models on their understanding of scientific concepts and reasoning skills.
Name: InstructGPT
Date: 2022
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Built upon GPT-3, InstructGPT is fine-tuned with human instruction and reinforcement learning for better adherence to instructions and factual grounding in responses. This addresses some of the limitations of large language models, such as potential biases and factual inaccuracies.
Name: Training
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Training involves feeding a model large amounts of data and adjusting its parameters through iterative optimization to minimize the difference between its predictions and the actual outcomes. This process allows the model to learn patterns and relationships within the data, improving its ability to make accurate predictions or generate relevant outputs when given new inputs.
Name: RLHF
Date: 2017
Source: DeepMind, OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: RLHF (Reinforcement Learning from Human Feedback) is a method that combines reinforcement learning with human preferences to train AI models more effectively. Detailed in the paper "Deep Reinforcement Learning from Human Preferences," this approach involves using human feedback to guide the training process, allowing the model to learn behaviors that align with human values and expectations. By incorporating human input into the reward function, RLHF improves the model's performance on tasks where purely algorithmic rewards might be insufficient or misaligned with desired outcomes. This method is particularly useful in applications requiring complex decision-making and nuanced understanding, such as robotics, game playing, and interactive AI systems.
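The reward model at the core of this pipeline is typically fit with a pairwise (Bradley-Terry style) loss over human comparisons; a minimal sketch:

    import numpy as np

    def preference_loss(r_chosen, r_rejected):
        # Push the reward of the human-preferred response above the rejected one:
        # loss = -log sigmoid(r_chosen - r_rejected).
        return np.log1p(np.exp(-(r_chosen - r_rejected)))

    print(preference_loss(r_chosen=2.0, r_rejected=0.5))   # small when the ranking is respected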
Name: Alignment Tax
Date: 2018
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Alignment Tax refers to the performance trade-off incurred when aligning a language model's behavior with human values and expectations: the potential reduction in a model's raw capability or efficiency caused by the additional constraints and fine-tuning needed to ensure its outputs meet ethical standards and user preferences. The "tax" is the cost of achieving more responsible and user-friendly AI behavior, balancing the pursuit of accuracy and robustness with the need for alignment in real-world applications.
Name: Agent Alignment
Date: 2018
Source: DeepMind
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Agent Alignment is the process of ensuring that an artificial intelligence (AI) system's actions and goals are in line with human values and intentions. As discussed in the paper "Scalable Agent Alignment via Reward Modeling: A Research Direction," this involves designing and training AI agents so that their behavior consistently meets the expectations and requirements of human users. One approach to achieving agent alignment is through reward modeling, where human feedback is used to shape the reward functions that guide the AI's learning and decision-making processes. This research direction aims to develop scalable methods to align increasingly complex AI systems with human values, ensuring their safe and beneficial deployment in real-world scenarios.
Name: PPO
Date: 2017
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Proximal Policy Optimization (PPO) is a type of reinforcement learning algorithm introduced by OpenAI in the paper "Proximal Policy Optimization Algorithms." PPO strikes a balance between simplicity and performance, making it a widely used approach for training AI agents. It improves upon previous policy gradient methods by using a novel objective function that prevents large, destabilizing updates to the policy, ensuring more stable and reliable learning. PPO employs a clipped surrogate objective to limit the change in the policy at each training step, which helps maintain the balance between exploration and exploitation. This approach has shown strong empirical performance across various domains, including robotics, game playing, and other complex decision-making tasks.
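A minimal sketch of the clipped surrogate objective itself; the probability ratio is clipped so one update cannot move the policy too far from the policy that collected the data:

    import numpy as np

    def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
        ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
        return np.minimum(ratio * advantage,
                          np.clip(ratio, 1 - eps, 1 + eps) * advantage)

    # With advantage 1 and a ratio of about 1.65, the objective is capped at 1.2:
    print(ppo_clip_objective(logp_new=-0.5, logp_old=-1.0, advantage=1.0))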
Name: RLHF
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: RLHF (Reinforcement Learning from Human Feedback) is a method that combines reinforcement learning with human preferences to train AI models more effectively. Detailed in the paper "Deep Reinforcement Learning from Human Preferences," this approach involves using human feedback to guide the training process, allowing the model to learn behaviors that align with human values and expectations. By incorporating human input into the reward function, RLHF improves the model's performance on tasks where purely algorithmic rewards might be insufficient or misaligned with desired outcomes. This method is particularly useful in applications requiring complex decision-making and nuanced understanding, such as robotics, game playing, and interactive AI systems.
Name: Value-aligned RL
Date: 2021
Source: Kentucky U
Talent Level: L4
Tech Type: AI/ML
Tech Source: Academic
Summary: Value-aligned Reinforcement Learning (RL) is a methodology that aims to train AI agents whose actions and policies are aligned with human values and societal norms. Detailed in the paper "Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior," this approach involves incorporating ethical guidelines and normative principles into the RL training process. By using reward functions that reflect human values, along with human feedback and normative data, value-aligned RL seeks to ensure that the resulting agents behave in ways that are consistent with desired moral and ethical standards. This field is critical for the development of AI systems that are safe, trustworthy, and beneficial for society.
Name: EXT5/EXMIX
Date: 2021
Source: Google
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: ExT5 (Extreme T5) and ExMIX are advanced models introduced in the paper "ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning." These models extend the T5 (Text-to-Text Transfer Transformer) framework by scaling it to handle a vast number of tasks simultaneously, leveraging multi-task learning to improve generalization and performance across diverse NLP tasks. ExT5 utilizes an extreme multi-task mixture (ExMIX), where the model is trained on a highly diverse set of tasks, allowing it to learn shared representations and transfer knowledge more effectively. This approach aims to push the boundaries of transfer learning, demonstrating that with sufficient scaling and task diversity, models can achieve superior performance on a wide range of benchmarks, from text generation and translation to comprehension and summarization.
Name: Cross-Task Gen
Date: 2021
Source: AllenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Cross-Task Gen is a framework introduced in the paper "Cross-Task Generalization via Natural Language Crowdsourcing Instructions." It focuses on enhancing the generalization capabilities of AI models across different tasks by utilizing natural language instructions collected from crowdsourcing. This approach leverages diverse human-provided instructions to train models that can understand and perform a wide variety of tasks, even those they have not been explicitly trained on. By generalizing from these instructions, Cross-Task Gen aims to create more flexible and adaptable models capable of transferring knowledge between tasks. This method demonstrates the potential of using natural language as a universal medium for task specification, significantly improving the versatility and robustness of AI systems.
Name: UnifiedQA
Date: 2021
Source: AllenAI, UWash
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: UnifiedQA is a question-answering system introduced in the paper "UnifiedQA: Crossing Format Boundaries With a Single QA System." This model is designed to handle multiple types of question-answering formats within a single framework, including extractive QA, abstractive QA, multiple-choice QA, and yes/no questions. UnifiedQA is built on top of T5 (Text-to-Text Transfer Transformer) and leverages the text-to-text paradigm to unify various QA tasks. By training on a diverse set of question-answering datasets, UnifiedQA can generalize across different formats and domains, providing accurate and coherent answers. This approach simplifies the development of QA systems and demonstrates robust performance across a wide range of benchmarks, highlighting its versatility and effectiveness in natural language understanding and generation.
Name: Recursive sum with HF
Date: 2021
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Recursive Summarization with Human Feedback (HF) is a technique outlined in the paper "Recursively Summarizing Books with Human Feedback." This approach involves generating summaries of large texts, such as books, through a recursive process that incrementally condenses information. Human feedback is incorporated at various stages to ensure the accuracy, coherence, and relevance of the summaries. By iteratively summarizing smaller sections and refining them based on human input, this method aims to produce high-quality summaries that capture the essential content and meaning of the original text. The combination of recursive summarization and human feedback enhances the model's ability to handle complex and lengthy documents, making it a valuable tool for efficient information extraction and content comprehension.
Name: Rank Preference Model
Date: 2021
Source: Anthropic
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: The Rank Preference Model is introduced in the context of the paper "A General Language Assistant as a Laboratory for Alignment." This model is designed to align AI behaviors with human preferences by ranking multiple outputs based on human feedback. Instead of relying solely on numerical reward signals, the Rank Preference Model incorporates direct human judgments about which outputs are better. This approach helps in fine-tuning AI systems to generate responses that are more aligned with human values and expectations. By systematically ranking outputs and learning from these preferences, the model improves its ability to produce desirable and appropriate responses, enhancing its alignment with human users and improving its performance in various applications.
Name: Summary from HF
Date: 2020
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: "Summary from Human Feedback" is a technique detailed in the paper "Learning to Summarize from Human Feedback." This method leverages human evaluations to improve the quality of automated text summaries. Instead of relying solely on predefined metrics, the model receives feedback from humans on the generated summaries, learning to produce more accurate, coherent, and relevant content. This iterative process involves presenting summaries to human reviewers, who then provide feedback used to refine the model's future outputs. By incorporating human judgment, the approach enhances the summarization model's ability to understand and prioritize the most critical information, leading to higher quality and more user-aligned summaries.
Name: Summary from HF
Date: 2019
Source: Darmstadt U
Talent Level: L4
Tech Type: AI/ML
Tech Source: Academic
Summary: "Summary from Human Feedback (HF)" is a method outlined in the paper "Better Rewards Yield Better Summaries: Learning to Summarise Without Human-Written Summaries." This approach focuses on improving the quality of machine-generated summaries by utilizing human feedback instead of relying solely on human-written summaries as references. The model is trained to optimize for rewards based on human evaluations, which assess the relevance, coherence, and informativeness of the generated summaries. By directly incorporating human preferences into the reward structure, this method enables the summarization model to learn more effectively, producing high-quality summaries that better align with human expectations and needs.
Name: LaMDA
Date: 2022
Source: Google
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: LaMDA (Language Models for Dialog Applications) is a conversational AI model developed by Google, as described in the paper "LaMDA: Language Models for Dialog Applications." LaMDA is designed to generate natural and engaging dialogue across a wide range of topics. Unlike traditional language models, LaMDA focuses specifically on dialogue applications, emphasizing the ability to maintain context, provide informative responses, and exhibit open-ended conversational capabilities. It leverages advanced machine learning techniques to understand the nuances of human language and generate responses that are contextually appropriate and coherent. LaMDA aims to improve the quality of human-computer interactions, making them more seamless and natural.
Name: Dialog Model
Date: 2019
Source: CalTech, Google
Talent Level: L4
Tech Type: AI/ML
Tech Source: Collaborative
Summary: The Dialog Model discussed in the paper "Towards Coherent and Engaging Spoken Dialog Response Generation Using Automatic Conversation Evaluators" focuses on improving the generation of responses in spoken dialogues. This model aims to create more coherent, contextually appropriate, and engaging interactions between humans and AI systems. It employs advanced natural language processing techniques to understand the context and nuances of a conversation, enabling it to generate responses that are not only accurate but also engaging and natural. The goal of this model is to enhance the overall quality of human-computer interactions, making them more fluid and human-like, and to address common challenges in dialogue systems such as maintaining context and handling diverse conversational topics.
Name: Reward Learning
Date: 2019
Source: OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Reward Learning, as detailed in the paper "Fine-Tuning Language Models from Human Preferences," is a technique used to enhance language models by incorporating human feedback into their training process. This method involves using human evaluations to shape the reward function, guiding the model to generate outputs that align more closely with human values and preferences. By fine-tuning the model based on human-provided feedback, Reward Learning helps the model learn which responses are more appropriate, relevant, and useful. This approach aims to improve the overall quality and alignment of language models, making them more effective and reliable in generating human-like and contextually suitable responses.
Name: Reward Model
Date: 2018
Source: Montreal U
Talent Level: L4
Tech Type: AI/ML
Tech Source: Academic
Summary: The Reward Model, as discussed in the paper "Learning to Understand Goal Specifications by Modelling Reward," is a framework used to train AI systems to comprehend and achieve specified goals by modeling rewards. This approach involves defining a reward function that captures the desired outcomes and behaviors, allowing the AI to learn which actions lead to achieving these goals. By simulating and optimizing for these rewards during training, the model can better understand and fulfill complex goal specifications. The Reward Model is particularly useful in reinforcement learning, where it helps align the AI's actions with the intended objectives, leading to more effective and goal-oriented behavior in various applications.
Name: Reward Function
Date: 2018
Source: Google, OpenAI
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: The Reward Function, as outlined in the paper "Reward Learning from Human Preferences and Demonstrations in Atari," is a crucial component in reinforcement learning systems that defines the goals and desired outcomes for an AI agent. It assigns numerical values to different actions or sequences of actions, guiding the agent towards behaviors that maximize cumulative rewards. In this context, reward functions are learned from human preferences and demonstrations, providing a more intuitive and aligned basis for training. By incorporating human input, the reward function ensures that the agent's actions are not only optimal from a computational perspective but also aligned with human expectations and values. This approach enhances the agent's performance and reliability in complex tasks, such as playing Atari games, by leveraging both expert demonstrations and feedback.
Name: MoE
Date: 2021
Source: Google
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: MoE (Mixture of Experts), as presented in the paper "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," is a model architecture designed to efficiently scale neural networks to massive sizes. MoE uses multiple expert subnetworks, where each expert is specialized for different parts of the input data. During training and inference, only a subset of these experts is activated, making the model computationally efficient despite its large size. The "Switch" mechanism dynamically routes input to the most relevant experts, improving both performance and scalability. This approach allows the creation of models with trillions of parameters while maintaining practical computational and memory requirements, achieving state-of-the-art results in various natural language processing tasks.
Name: Scaling LM
Date: 2021
Source: DeepMind
Talent Level: L4
Tech Type: AI/ML
Tech Source: Industry
Summary: Scaling Language Models, as explored in the paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher," refers to the practice of increasing the size and complexity of language models to improve their performance and capabilities. This process involves expanding the number of parameters, enhancing data quality and diversity, and optimizing training techniques to handle larger models. The paper provides an in-depth analysis of the benefits and challenges associated with scaling, such as improved accuracy, generalization, and the ability to perform a broader range of tasks. It also addresses potential issues like increased computational costs and the need for advanced hardware. The insights gained from training Gopher, a large-scale language model, contribute to the understanding of how to effectively scale models to achieve superior performance in natural language processing.
Name: Test Data
Date: 0
Source: nan
Talent Level: nan
Tech Type: nan
Tech Source: nan
Summary: Test data is a separate subset of data used to evaluate the performance of a trained model, ensuring it can generalize well to unseen examples. It provides an unbiased measure of the model's accuracy and effectiveness by comparing its predictions against known outcomes not previously seen during training.
Name: FLAN
Date: 2021
Source: Google
Talent Level: L4
Tech Type: Data
Tech Source: Industry
Summary: The Fine-tuned Language Net (FLAN) dataset by Google involves instruction-based fine-tuning of language models to improve their ability to follow complex instructions and generate accurate, context-aware responses. It is used to enhance the performance of models in understanding and executing diverse tasks.
Name: T0
Date: 2021
Source: BrownU, Hugging Face, others
Talent Level: L4
Tech Type: Data
Tech Source: Collaborative
Summary: The T0 dataset is part of a project that uses prompted datasets to train models on a wide range of tasks using a unified natural language interface. It leverages prompt-based learning to improve model generalization across various natural language processing tasks.
Name: TruthfulQA
Date: 2021
Source: UOxford, OpenAI
Talent Level: L4
Tech Type: Data
Tech Source: Collaborative
Summary: TruthfulQA is a dataset designed to evaluate the truthfulness and factual accuracy of language models when answering questions. It includes questions that are carefully crafted to test whether models produce truthful responses, addressing the challenge of misinformation in AI-generated content.
Name: RealToxicityPrompts
Date: 2020
Source: AllenAI
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: RealToxicityPrompts is a dataset created to study and mitigate the generation of toxic language by AI models. It consists of prompts that could lead to toxic completions, used to evaluate and improve the safety and robustness of language models against generating harmful content.
Name: CrowS-Pairs
Date: 2020
Source: NYU
Talent Level: L4
Tech Type: Data
Tech Source: Academic
Summary: CrowS-Pairs is a dataset designed to measure and reduce social biases in AI language models. It includes sentence pairs that exhibit various types of social biases, such as race or gender bias, and is used to assess and improve the fairness and impartiality of language models.
Data is the most important resource enabling the creation of large language models.
Dataset Name: General Language Understanding Evaluation benchmark
Organization(s): New York University (NYU), University of Washington (UW), DeepMind
Task(s):
GPT Level: 2.0
Description: The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks, including sentiment analysis, textual entailment, and question answering. It serves as a standard for evaluating and comparing the performance of language models.
Splits
Training:
Validation:
Testing :
Dataset Name: Super General Language Understanding Evaluation benchmark
Organization(s): New York University (NYU), Facebook AI Research (FAIR), University of Washington (UW), DeepMind
Task(s):
GPT Level: 3.0
Description: SuperGLUE is a benchmark designed to be more challenging than GLUE, consisting of a diverse set of natural language understanding tasks including coreference resolution, commonsense reasoning, and question answering. It aims to push the boundaries of current language models.
Splits
Training:
Validation:
Testing :
Dataset Name: Corpus of Linguistic Acceptability
Organization(s): New York University (NYU), Facebook AI Research (FAIR)
Task(s): Text Classification
GPT Level: 1.0
Description: The Corpus of Linguistic Acceptability (CoLA) consists of English sentences annotated for grammatical acceptability, based on the judgments of expert linguists. It is used to evaluate models on their ability to distinguish between grammatically correct and incorrect sentences.
Splits
Training: 8551
Validation: 1043
Testing : 1063
Dataset Name: Multi-Genre Natural Language Inference Corpus
Organization(s): New York University (NYU)
Task(s): Textual Entailment (Natural Language Inference)
GPT Level: 1.0
Description: The Multi-Genre Natural Language Inference (MNLI) dataset consists of sentence pairs from a variety of sources, labeled for entailment, contradiction, or neutral relationships. It is designed to evaluate model performance across different genres of text.
Splits
Training: 392702
Validation: Matched: 9815; Mismatched: 9832
Testing : Matched: 9796; Mismatched: 9847
Dataset Name: Microsoft Research Paraphrase Corpus
Organization(s): Microsoft Research
Task(s): Paraphrase Identification
GPT Level: 1.0
Description: The Microsoft Research Paraphrase (MRPC) dataset consists of pairs of sentences, each annotated to indicate whether they are semantically equivalent (paraphrases) or not. It is used to train and test models for paraphrase detection and semantic similarity tasks.
Splits
Training: 3668
Validation: 408
Testing : 1725
Dataset Name: Question-answering NLI
Organization(s): New York University (NYU), University of Washington (UW), DeepMind
Task(s): Textual Entailment (Natural Language Inference)
GPT Level: 2.0
Description: The Question Natural Language Inference (QNLI) dataset is derived from the Stanford Question Answering Dataset (SQuAD), where the task is to determine if a given sentence contains the answer to a question. It is used to train and evaluate models on question-answer entailment.
Splits
Training: 104743
Validation: 5463
Testing : 5463
Dataset Name: Quora Question Pairs
Organization(s): Quora
Task(s): Paraphrase Identification
GPT Level: 1.0
Description: The Quora Question Pairs (QQP) dataset contains pairs of questions from Quora labeled to indicate if they have the same intent or are semantically equivalent. It is used to train models for identifying duplicate questions.
Splits
Training: 363846
Validation: 40429
Testing : 390964
Dataset Name: Recognizing Textual Entailment
Organization(s): New York University (NYU), University of Washington (UW), DeepMind
Task(s): Textual Entailment (Natural Language Inference)
GPT Level: 1.0
Description: The RTE dataset, as used in the GLUE benchmark, combines data from a series of PASCAL Recognizing Textual Entailment challenges (RTE1, RTE2, RTE3, and RTE5). Each example pairs a text with a hypothesis labeled as entailment or non-entailment, and the dataset is used to evaluate whether models can determine that one sentence logically follows from another.
Splits
Training: 2490
Validation: 277
Testing : 3000
Dataset Name: Stanford Sentiment Treebank
Organization(s): Stanford University
Task(s): Text Classification
GPT Level: 1.0
Description: The Stanford Sentiment Treebank (SST) dataset includes movie reviews annotated with sentiment labels ranging from very negative to very positive. It is commonly used for training and evaluating sentiment analysis models.
Splits
Training: 67348
Validation: 872
Testing : 1821
Dataset Name: Semantic Textual Similarity Benchmark
Organization(s): Google Research, George Washington University, University of Basque Country, University of Sheffield
Task(s): Textual Similarity
GPT Level: 1.0
Description: The Semantic Textual Similarity Benchmark (STS-B) dataset, drawn from SemEval-2017 Task 1, contains sentence pairs annotated with similarity scores on a scale from 0 to 5. It is used to train and evaluate models on predicting the degree of semantic similarity between sentences.
Splits
Training: 5749
Validation: 1500
Testing : 1379
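STS-B is conventionally scored by correlating predicted similarity with the gold 0-5 annotations. A minimal sketch using SciPy (assumed installed) and toy numbers:

```python
# Sketch: Pearson correlation between gold and predicted similarity.
from scipy.stats import pearsonr

gold = [0.0, 1.5, 2.5, 4.0, 5.0]       # annotated similarity, 0-5 scale
predicted = [0.3, 1.2, 3.0, 3.8, 4.6]  # hypothetical model outputs

r, _ = pearsonr(gold, predicted)
print(f"Pearson r = {r:.3f}")
```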
Dataset Name: Winograd NLI
Organization(s): New York University (NYU), University of Washington (UW), DeepMind
Task(s): Text Classification
GPT Level: 2.0
Description: The Winograd NLI (WNLI) dataset is a sentence-pair entailment task converted from the Winograd Schema Challenge (WSC). Because each converted example is evaluated separately, a model's score on WNLI does not correspond systematically to its score on the original, unconverted task.
Splits
Training: 635
Validation: 71
Testing : 146
Dataset Name: Boolean Questions
Organization(s): University of Washington (UW), Google AI Language
Task(s): Text Classification
GPT Level: 3.0
Description: The BoolQ dataset is a question-answering benchmark consisting of naturally occurring yes/no questions paired with corresponding paragraphs from Wikipedia that contain the answer. Each question in the dataset is derived from real queries made by users, and the dataset is designed to evaluate the ability of models to understand and reason over textual information to provide accurate boolean answers.
Splits
Training: 9427
Validation: 3270
Testing : 3245
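As an illustration of how BoolQ items are typically posed to a language model, the sketch below builds a zero-shot yes/no prompt; the passage, question, and query_model call are invented stand-ins:

```python
# Sketch: turning a BoolQ-style (passage, question) pair into a
# yes/no prompt for a language model.
def build_boolq_prompt(passage: str, question: str) -> str:
    return (
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Answer (yes or no):"
    )

prompt = build_boolq_prompt(
    "The Amazon is the largest rainforest in the world.",
    "is the amazon the largest rainforest",
)
print(prompt)  # feed to any LM, e.g. answer = query_model(prompt)
```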
Dataset Name: CommitmentBank
Organization(s): Ohio State University (OSU), Carnegie Mellon University (CMU), University of Stuttgart
Task(s): Textual Entailment (Natural Language Inference)
GPT Level: 3.0
Description: The CommitmentBank (CB) dataset is a collection of English sentences paired with human annotations indicating the degree to which the author of each sentence is committed to the truth of the embedded proposition. It is designed to test models on their ability to understand nuanced linguistic phenomena related to commitment and entailment within various contexts.
Splits
Training: 250
Validation: 56
Testing : 250
Dataset Name: Choice of Plausible Alternatives
Organization(s): Indiana University, University of Southern California (USC)
Task(s): Question Answering
GPT Level: 3.0
Description: The Choice of Plausible Alternatives (COPA) dataset is designed to evaluate causal reasoning in natural language understanding. Each question presents a premise and asks the model to select the more plausible of two alternative causes or effects, assessing a model's ability to infer causality and understand logical relationships in text.
Splits
Training: 500
Validation:
Testing : 500
Dataset Name: Multi-Sentence Reading Comprehension
Organization(s): University of Pennsylvania (UPenn), University of California, Santa Cruz (UCSC), Saarland University
Task(s): Question Answering
GPT Level: 3.0
Description: The Multi-Sentence Reading Comprehension (MultiRC) dataset is a benchmark designed for evaluating machine reading comprehension across multiple sentences. It consists of questions paired with a context passage and multiple answer options, where each question may have more than one correct answer. The dataset aims to test a model's ability to comprehend and integrate information spread across several sentences to answer questions accurately.
Splits
Training: 27.2k
Validation: 4.85k
Testing : 9.69k
Dataset Name: Reading Comprehension with Commonsense Reasoning Dataset
Organization(s): Microsoft Research
Task(s): Text Classification
GPT Level: 3.0
Description: The Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a benchmark for evaluating reading comprehension systems on their ability to perform commonsense reasoning. It consists of passages from news articles, each paired with multiple cloze-style questions where a part of the passage is masked, and the task is to select the correct entity from a provided list. This dataset challenges models to integrate contextual understanding with commonsense knowledge to fill in the blanks accurately.
Splits
Training: 101k
Validation: 10k
Testing : 10k
Dataset Name: Word-in-Context
Organization(s): University of Cambridge, Tehran Institute for Advanced Studies (TeIAS), Cardiff University
Task(s): Text Classification
GPT Level: 3.0
Description: The Word-in-Context (WiC) dataset is designed to assess models' understanding of word meaning in different contexts. It consists of pairs of sentences where a target word appears in two different contexts, and the task is to determine if the word has the same meaning in both contexts or not. WiC aims to evaluate lexical semantic understanding and contextual reasoning capabilities of natural language understanding models.
Splits
Training: 5428
Validation: 638
Testing : 1400
Dataset Name: Winograd Schema Challenge
Organization(s): Allen Institute for Artificial Intelligence (AllenAI), University of Washington (UW)
Task(s): Text Classification
GPT Level: 3.0
Description: The Winograd Schema Challenge (WSC) dataset, also known as the Pronoun Disambiguation dataset, is designed to test machines' ability to resolve ambiguous pronouns by leveraging commonsense reasoning. It consists of sentences with a pronoun and two possible antecedents, where the task is to determine which antecedent the pronoun refers to based on the context provided. This dataset is specifically crafted to evaluate the capability of models to understand nuanced contextual cues and apply logical reasoning skills effectively.
Splits
Training:
Validation:
Testing : 273
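One common way to attack WSC-style items is candidate substitution: rewrite the sentence with each antecedent in place of the ambiguous span and keep the rewrite a scorer prefers. A sketch with a toy scorer standing in for a language-model likelihood:

```python
# Sketch: Winograd schema resolution by candidate substitution.
# `score` is a hypothetical stand-in for a language-model likelihood.
def resolve(sentence: str, span: str, candidates: list[str], score) -> str:
    rewrites = {c: sentence.replace(span, c, 1) for c in candidates}
    return max(candidates, key=lambda c: score(rewrites[c]))

# Toy scorer for illustration only: prefers rewrites mentioning "trophy".
toy_score = lambda text: text.count("trophy")

print(resolve(
    "The trophy doesn't fit in the suitcase because it is too big.",
    "it is too big",
    ["the trophy is too big", "the suitcase is too big"],
    toy_score,
))
```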
Dataset Name: Story Cloze Test
Organization(s): University of Rochester, United States Naval Academy (USNA), Microsoft Research, Virginia Tech (VT), The Institute for Human & Machine Cognition
Task(s): Commonsense Reasoning
GPT Level: 1.0
Description: The Story Cloze Test is a commonsense reasoning framework for evaluating story understanding, story generation, and script learning. The test requires a system to choose the correct ending to a four-sentence story.
Splits
Training:
Validation:
Testing :
Dataset Name: Stanford Question Answering Dataset
Organization(s): Stanford University
Task(s): Question Answering
GPT Level: 1.0
Description: The Stanford Question Answering Dataset (SQuAD) is a benchmark dataset for question answering tasks, consisting of questions posed on Wikipedia articles where the answer is a segment of text from the corresponding passage. It is widely used for training and evaluating question answering models.
Splits
Training: 87.6k
Validation: 10.6k
Testing :
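SQuAD systems are conventionally scored with exact match and token-level F1 over normalized text. A simplified sketch of both metrics:

```python
# Sketch: simplified SQuAD-style exact-match and token-F1 metrics.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop articles and punctuation, collapse whitespace.
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))            # True
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 2))  # 0.67
```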
Dataset Name: CNN/Daily Mail
Organization(s): IBM Watson, University of Montreal
Task(s): Text Summarization
GPT Level: 1.0
Description: The CNN/DailyMail dataset is a large-scale dataset for abstractive text summarization, consisting of news articles from CNN and the Daily Mail along with their corresponding human-written summaries. It is commonly used to train and evaluate models on their ability to generate concise, coherent summaries from lengthy and detailed news content.
Splits
Training: 287k
Validation: 13.4k
Testing : 11.5k
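Summarization on CNN/DailyMail is usually reported in ROUGE. A minimal sketch assuming the open-source rouge-score package is installed; the texts are toy stand-ins:

```python
# Sketch: ROUGE scoring for summarization with the `rouge-score` package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The council approved the new budget on Tuesday."
summary = "Council approves new budget."

scores = scorer.score(reference, summary)
for name, result in scores.items():
    print(name, f"F1={result.fmeasure:.2f}")
```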
Dataset Name: SciTail
Organization(s): Allen Institute for Artificial Intelligence (AllenAI)
Task(s): Textual Entailment (Natural Language Inference)
GPT Level: 1.0
Description: The SciTail dataset is a textual entailment dataset created from multiple-choice science exams and web sentences, designed to evaluate natural language inference models. It consists of premise-hypothesis pairs labeled as either entailment or neutral.
Splits
Training: 23.1k
Validation: 1.3k
Testing : 2.13k
Dataset Name: ReAding Comprehension Dataset From Examination
Organization(s): Carnegie Mellon University (CMU)
Task(s): Question Answering
GPT Level: 1.0
Description: The RACE dataset is a large-scale English reading comprehension dataset collected from English examinations for Chinese middle and high school students. It contains diverse passages paired with multiple-choice questions and is widely used to evaluate machine reading comprehension models.
Splits
Training: 87.9k
Validation: 4.89k
Testing : 4.93k
Dataset Name: Wizard of Oz
Organization(s): University of Cambridge, Apple Inc.
Task(s): Dialogue State Tracking
GPT Level: 2.0
Description: The Wizard-of-Oz (WOZ) dataset was collected with the Wizard-of-Oz technique, in which human operators simulate the behavior of an automated system to elicit naturalistic interactions with users. The resulting dialogues are used to train and evaluate dialogue systems, particularly for dialogue state tracking.
Splits
Training: 37.9k
Validation:
Testing : 2.32k
Dataset Name: Question-Answer Driven Semantic Role Labeling
Organization(s): University of Washington (UW)
Task(s): Question Answering
GPT Level: 2.0
Description: Question Answering-Semantic Role Labeling (QA-SRL) is a task that involves identifying semantic roles in sentences and generating answers based on these roles. It combines the tasks of question answering and semantic role labeling to improve natural language understanding.
Splits
Training: 6.41k
Validation: 2.18k
Testing : 2.2k
Dataset Name: Penn Treebank Project
Organization(s): University of Pennsylvania (UPenn), Northwestern University
Task(s): Text Generation
GPT Level: 2.0
Description: The Penn Treebank (PTB) dataset contains text from the Wall Street Journal, annotated with syntactic structure and part-of-speech tags. It is a standard benchmark for evaluating models on tasks such as syntactic parsing and language modeling.
Splits
Training: 42.1k
Validation: 3.37k
Testing : 3.76k
Dataset Name: LAnguage Modeling Broadened to Account for Discourse Aspects
Organization(s): University of Trento, University of Amsterdam
Task(s): Text Generation
GPT Level: 2.0
Description: The LAMBADA dataset is designed to test language models on their ability to predict the last word of sentences that require a broad understanding of the context provided by preceding text. It emphasizes the need for models to grasp long-range dependencies in text.
Splits
Training:
Validation:
Testing : 5.15k
Dataset Name: Question Answer - Zero Shot Relation Extraction
Organization(s): University of Washington (UW)
Task(s): Question Answering
GPT Level: 2.0
Description: Question Answering-Zero-shot Relation Extraction (QA-ZRE) is a task that combines question answering and relation extraction, where models are trained to answer questions about relations between entities in text without explicit supervision for relation extraction.
Splits
Training: 8.4M
Validation: 6k
Testing : 120k
Dataset Name: Children’s Book Test
Organization(s): Facebook AI Research (FAIR)
Task(s): Question Answering
GPT Level: 2.0
Description: The Children's Book Test (CBT) dataset, created by Facebook AI Research (FAIR), consists of sentences from children's books with one word removed and multiple-choice options provided for the missing word. It is designed to evaluate language models on their ability to understand and predict the context of a given text.
Splits
Training: 121k
Validation: 2k
Testing : 2.5k
Dataset Name: Question Answering in Context
Organization(s): Allen Institute for Artificial Intelligence (AllenAI), University of Washington (UW), Stanford University, UMass Amherst
Task(s): Question Answering
GPT Level: 3.0
Description: The Question Answering in Context (QuAC) dataset is designed for training and evaluating models on answering questions in a conversational context. It contains information-seeking dialogues in which a student asks questions about a hidden Wikipedia passage and a teacher answers using evidence from the text, totaling roughly 100,000 question-answer pairs across about 14,000 dialogues.
Splits
Training: 83568
Validation: 7354
Testing : 7353
Dataset Name: Discrete Reasoning Over Paragraphs
Organization(s): University of California, Irvine (UCI), Peking University, Allen Institute for Artificial Intelligence (AllenAI), University of Washington (UW)
Task(s): Question Answering
GPT Level: 3.0
Description: DROP (Discrete Reasoning Over Paragraphs) is a crowdsourced, adversarially created reading comprehension benchmark of roughly 96k questions. To answer, a system must resolve references in a question, possibly to multiple input positions, and perform discrete operations over them, such as addition, counting, or sorting.
Splits
Training: 77409
Validation: 9536
Testing :
Dataset Name: Natural Questions
Organization(s): Google Research
Task(s): Question Answering
GPT Level: 3.0
Description: The Natural Questions (NQ) dataset is a large-scale benchmark for question answering, consisting of real user questions from Google search paired with corresponding passages from Wikipedia that contain the answer. Each question is accompanied by a long passage, and the task is to identify the exact answer span within the passage or to determine that the passage does not contain an answer. This dataset is designed to evaluate a model's ability to understand and extract relevant information from long documents.
Splits
Training: 307372
Validation: 7830
Testing : 7842
Dataset Name: HellaSwag
Organization(s): University of Washington (UW), Allen Institute for Artificial Intelligence (AllenAI)
Task(s): Natural Language Inference
GPT Level: 3.0
Description: HellaSwag is a dataset for studying grounded commonsense inference. It consists of about 70k multiple-choice questions about grounded situations: each question comes from one of two domains, ActivityNet or WikiHow, with four answer choices about what might happen next in the scene. The correct answer is the real sentence describing the next event; the three incorrect answers are adversarially generated and human-verified, so that they fool machines but not humans.
Splits
Training: 39900
Validation: 10000
Testing : 10000
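Multiple-choice benchmarks like HellaSwag are commonly scored by asking a language model for the likelihood of each candidate ending and picking the best, often with length normalization. A sketch in which logprob is a hypothetical stand-in for a real model call:

```python
# Sketch: multiple-choice selection by length-normalized log-probability.
import math

def pick_ending(context: str, endings: list[str], logprob) -> int:
    scores = [logprob(context, e) / max(len(e.split()), 1) for e in endings]
    return max(range(len(endings)), key=scores.__getitem__)

# Toy scorer for illustration only; a real evaluation would sum the
# model's token log-probs for each ending given the context.
toy_logprob = lambda ctx, ending: -math.log(1 + len(ending))

idx = pick_ending(
    "A man pours batter into a waffle iron.",
    ["He closes the lid and waits.", "He throws the iron into the sea."],
    toy_logprob,
)
print("chosen ending:", idx)
```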
Dataset Name: OpenBookQA
Organization(s): Allen Institute for Artificial Intelligence (AllenAI), Heidelberg University
Task(s): Question Answering
GPT Level: 3.0
Description: OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension.
Splits
Training: 4957
Validation: 500
Testing : 500
Dataset Name: WinoGrande
Organization(s): Allen Institute for Artificial Intelligence (AllenAI), University of Washington (UW)
Task(s): Question Answering
GPT Level: 3.0
Description: WinoGrande is a dataset for commonsense reasoning, derived from the Winograd Schema Challenge, containing sentence pairs with ambiguous pronouns that require commonsense knowledge to resolve. It is used to train and test models on their ability to understand context and resolve ambiguities.
Splits
Training: 9248
Validation: 1267
Testing : 1767
Dataset Name: Physical Interaction: Question Answering
Organization(s): Allen Institute for Artificial Intelligence (AllenAI), Microsoft Research AI, Carnegie Mellon University (CMU), University of Washington (UW)
Task(s): Question Answering
GPT Level: 3.0
Description: The Physical Interaction: Question Answering (PIQA) dataset introduces the task of physical commonsense reasoning. Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including for robots that interact with the world and understand natural language. PIQA focuses on everyday situations, with a preference for atypical solutions, and is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials.
Splits
Training: 16000
Validation: 2000
Testing : 3000
Dataset Name: AI2 Reasoning Challenge
Organization(s): Allen Institute for Artificial Intelligence (AllenAI)
Task(s): Question Answering
GPT Level: 3.0
Description: The AI2 Reasoning Challenge (ARC) dataset includes multiple-choice science questions from standardized exams, designed to test advanced reasoning abilities in artificial intelligence systems. It is used to evaluate models on their understanding of scientific concepts and reasoning skills.
Splits
Training: 1.12k
Validation: 299
Testing : 1.17k
Dataset Name: TriviaQA
Organization(s): University of Washington (UW), Allen Institute for Artificial Intelligence (AllenAI)
Task(s): Question Answering
GPT Level: 3.0
Description: TriviaQA is a large-scale question answering dataset containing question-answer pairs from trivia websites, with corresponding evidence documents from Wikipedia and the web. It is designed to train and evaluate models on their ability to find and extract relevant information.
Splits
Training: 138000
Validation: 17900
Testing : 17200
Dataset Name: Adversarial Natural Language Inference
Organization(s): Facebook AI Research (FAIR)
Task(s): Natural Language Inference
GPT Level: 3.0
Description: The ANLI (Adversarial NLI) dataset is a challenging natural language inference dataset designed to test and improve the robustness of NLI models against adversarial attacks. It consists of three rounds of human-annotated examples, with each round progressively more challenging, aiming to push models to better generalize and handle difficult inference tasks. The dataset covers a variety of inference types and includes a diverse set of premises and hypotheses.
Splits
Training: train_r1=16946, train_r2=45460, train_r3=100459
Validation: dev_r1=1000, dev_r2=1000, dev_r3=1200
Testing : test_r1=1000, test_r2=1000, test_r3=1200
Dataset Name: WebQuestions
Organization(s): Stanford University
Task(s): Question Answering
GPT Level: 3.0
Description: The WebQuestions (WebQ) dataset contains questions sourced from Google queries, paired with answers from Freebase, a large knowledge graph. It is used to evaluate models on their ability to understand and answer natural language questions based on structured data.
Splits
Training: 3778
Validation:
Testing : 2032
Dataset Name: Massive Multitask Language Understanding
Organization(s): University of California, Berkeley (UCB), Columbia University, University of Chicago, University of Illinois Urbana-Champaign (UIUC)
Task(s): Question Answering
GPT Level: 4.0
Description: The Massive Multitask Language Understanding (MMLU) dataset includes a wide range of multiple-choice questions across various domains such as history, mathematics, and science. It is designed to evaluate the breadth and depth of a language model's general knowledge and reasoning abilities.
Splits
Training:
Validation:
Testing :
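MMLU is typically evaluated with few-shot multiple-choice prompts. A sketch of assembling such a prompt; the questions are invented examples:

```python
# Sketch: building a few-shot multiple-choice prompt in the style used
# for MMLU evaluation. The questions here are toy examples.
def format_question(question: str, choices: list[str], answer: str = "") -> str:
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

shots = [
    format_question("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
]
test = format_question("Which planet is largest?",
                       ["Mars", "Venus", "Jupiter", "Mercury"])
prompt = "\n\n".join(shots + [test])
print(prompt)  # the model should continue with the answer letter
```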
Dataset Name: EXAMS
Organization(s): Sofia University, Heidelberg University, Qatar Computing Research Institute
Task(s): Question Answering
GPT Level: 4.0
Description: The EXAMS dataset is a multilingual benchmark of high school examination questions spanning many subjects and languages. It is designed to assess a model's subject knowledge and its ability to reason over complex, exam-style problems across diverse knowledge domains.
Splits
Training:
Validation:
Testing :
Dataset Name: Crowdsourced Stereotype Pairs benchmark
Organization(s): New York University (NYU)
Task(s):
GPT Level: 4.0
Description: CrowS-Pairs is a dataset designed to measure and reduce social biases in AI language models. It includes sentence pairs that exhibit various types of social biases, such as race or gender bias, and is used to assess and improve the fairness and impartiality of language models.
Splits
Training:
Validation:
Testing : 1508
Dataset Name: Grade School Math 8K
Organization(s): OpenAI
Task(s):
GPT Level: 4.0
Description: The Grade School Math 8K (GSM-8K) dataset contains thousands of math word problems typically encountered in grade school. It is used to train and evaluate models on their ability to perform arithmetic and understand mathematical concepts in a natural language context.
Splits
Training: 7.5k
Validation:
Testing : 1k
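GSM-8K answers are usually graded by extracting the final number from the model's worked solution and comparing it to the gold answer. A minimal sketch:

```python
# Sketch: grade a GSM-8K-style answer by taking the last number in the
# model's worked solution as its final answer.
import re

def final_number(text: str) -> str:
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else ""

solution = ("Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. "
            "The answer is 36.")
print(final_number(solution) == "36")  # True
```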
Dataset Name: Real Toxicity Prompts
Organization(s): University of Washington (UW), Allen Institute for Artificial Intelligence (AllenAI)
Task(s):
GPT Level: 4.0
Description: RealToxicityPrompts is a dataset of 100k naturally occurring sentence-level prompts drawn from the web, released so that researchers can study and mitigate the risk of neural toxic degeneration in language models.
Splits
Training: 100k
Validation: 0
Testing : 0
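Toxicity of generated continuations can be measured with any of several classifiers; the sketch below assumes the open-source detoxify package as one stand-in option:

```python
# Sketch: scoring a continuation for toxicity with the `detoxify`
# package (assumed installed); other classifiers would work similarly.
from detoxify import Detoxify

model = Detoxify("original")
continuation = "I completely disagree with that decision."
scores = model.predict(continuation)
print(f"toxicity = {scores['toxicity']:.3f}")
```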
Dataset Name: HumanEval (Hand-Written Evaluation Set)
Organization(s): OpenAI, Anthropic AI, Zipline
Task(s):
GPT Level: 4.0
Description: The HumanEval dataset consists of programming problems designed to assess the code generation capabilities of language models. Each problem includes a natural language prompt and a set of test cases, used to evaluate the correctness and functionality of generated code.
Splits
Training:
Validation:
Testing : 164
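HumanEval results are usually reported as pass@k. The Codex paper that introduced the benchmark gives an unbiased estimator: with n samples per problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch:

```python
# Sketch: the unbiased pass@k estimator for code-generation benchmarks.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 12 of which pass the unit tests
print(round(pass_at_k(200, 12, 1), 3))
print(round(pass_at_k(200, 12, 10), 3))
```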
Dataset Name: ETHICS
Organization(s): University of California, Berkeley (UCB), Columbia University, University of Chicago, Microsoft
Task(s):
GPT Level: 4.0
Description: The ETHICS dataset is designed to evaluate the ethical reasoning abilities of AI models. It includes scenarios and questions related to moral and ethical dilemmas, aiming to train and assess models on their understanding and application of ethical principles.
Splits
Training: virtue=28246, deontology=18165, commonsense=13911, justice=21792, utilitarianism=13738
Validation:
Testing : virtue=4976, deontology=3597, commonsense=3886, justice=2705, utilitarianism=4808
Dataset Name: TruthfulQA
Organization(s): University of Oxford, OpenAI
Task(s):
GPT Level: 4.0
Description: TruthfulQA is a dataset designed to evaluate the truthfulness and factual accuracy of language models when answering questions. It includes questions that are carefully crafted to test whether models produce truthful responses, addressing the challenge of misinformation in AI-generated content.
Splits
Training:
Validation:
Testing : 817
Dataset Name: Fine-tuned LAnguage Net
Organization(s): Google Research
Task(s):
GPT Level: 4.0
Description: The Fine-tuned LAnguage Net (FLAN) collection from Google is a set of instruction-formatted tasks used to fine-tune language models so they better follow complex instructions and generate accurate, context-aware responses. It is used to enhance model performance in understanding and executing diverse tasks.
Splits
Training:
Validation:
Testing :
Dataset Name: Conversational Question Answering
Organization(s): Stanford University
Task(s): Question Answering
GPT Level:
Description: The CoQA (Conversational Question Answering) dataset is a large-scale collection designed to train and evaluate systems for answering questions in a conversational context. It includes over 127,000 questions based on passages from diverse domains, with each question-answer pair building on previous ones to simulate a natural dialogue. CoQA emphasizes understanding and maintaining context across multiple turns of conversation.
Splits
Training: 108k
Validation: 8k
Testing : 31k
Dataset Name: Common Crawl
Organization(s): Common Crawl Foundation
Task(s): LLM Pre-Training
GPT Level:
Description: Common Crawl is a massive dataset of text and code scraped from the public web. It's a popular choice for training LLMs because of its vast size and variety, reflecting the real-world language used across the internet. This exposure to diverse content helps LLMs become more versatile and informative.
Splits
Training:
Validation:
Testing :
Dataset Name:
Organization(s): OpenAI
Task(s): LLM Pre-Training
GPT Level:
Description: This dataset consists of a variety of books sourced from Project Gutenberg, representing a diverse collection of public domain literature. It provides a wide range of topics and styles, which helps in developing a model's understanding of complex language patterns and narrative structures.
Splits
Training:
Validation:
Testing :
Dataset Name:
Organization(s): OpenAI
Task(s): LLM Pre-Training
GPT Level:
Description: This dataset includes a selection of books from a proprietary corpus, containing contemporary published works. It offers high-quality, curated text with modern language usage, which enhances the model's ability to generate contextually relevant and coherent responses.
Splits
Training:
Validation:
Testing :
Dataset Name:
Organization(s): Stanford University
Task(s): LLM Pre-Training
GPT Level:
Description: The Reddit dataset used to train large language models typically consists of a vast collection of posts and comments from the Reddit platform. This dataset is valuable because it captures a wide range of informal conversational language, covering diverse topics, styles, and dialects. The conversational nature of the data helps in training models to understand and generate human-like dialogue and to engage in contextually relevant interactions.
Splits
Training:
Validation:
Testing :
Dataset Name: Wikipedia
Organization(s): Wikimedia Foundation
Task(s): LLM Pre-Training
GPT Level:
Description: The Wikipedia dataset used for training large language models is a comprehensive snapshot of English Wikipedia, encompassing millions of articles across various fields of knowledge. This dataset is valuable for its structured, well-written, and encyclopedic content, providing models with high-quality information and a broad understanding of numerous subjects. Wikipedia's consistent formatting and extensive coverage make it ideal for improving a model's ability to generate informative and accurate text.
Splits
Training:
Validation:
Testing :
Dataset Name: ExMix
Organization(s): Google Research, DeepMind
Task(s):
GPT Level:
Description: ExMix (Extreme Mixture) is a collection of 107 supervised NLP tasks spanning diverse domains and task families, introduced with Google's ExT5 work on extreme multi-task scaling. Training on this broad mixture is intended to produce more robust and generalizable language models.
Splits
Training:
Validation:
Testing :
Dataset Name: WebText
Organization(s): OpenAI
Task(s): LLM Pre-Training
GPT Level:
Description: WebText is OpenAI's web-scraped corpus used to pre-train GPT-2. It consists of text from web pages linked in Reddit posts that received at least 3 karma, a heuristic intended to filter for quality. The resulting mix of articles, blog posts, and other online writing reflects real-world language use across the internet.
Splits
Training:
Validation:
Testing :
Dataset Name: Stanford Natural Language Inference
Organization(s): Stanford University
Task(s): Natural Language Inference (NLI)
GPT Level:
Description: The SNLI dataset is a collection of English sentence pairs labeled as entailment, contradiction, or neutral. It's a popular benchmark for training and evaluating large language models (LLMs) in their ability to understand the relationships between sentences.
Splits
Training: 550k
Validation: 10k
Testing : 10k
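A closing sketch of loading SNLI, again assuming the Hugging Face datasets library; on its Hub, examples without a gold consensus label are encoded as label == -1 and are commonly filtered out:

```python
# Sketch: load SNLI and drop examples lacking a gold consensus label,
# which the Hugging Face Hub encodes as label == -1.
from datasets import load_dataset

snli = load_dataset("snli")
train = snli["train"].filter(lambda ex: ex["label"] != -1)
print(len(train), "labeled training pairs")
```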