ChatGPT Data

Data is the most important resource enabling the creation of large language models. The entries below catalog benchmark and pre-training datasets associated with successive GPT generations (the "GPT Level" field); fields marked N/A are not reported.

Dataset Name: General Language Understanding Evaluation benchmark

Organization(s): N/A

Task(s): N/A

GPT Level: 2

Description: The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks, including sentiment analysis, textual entailment, and question answering. It serves as a standard for evaluating and comparing the performance of language models.

Splits
Training: N/A
Validation: N/A
Testing: N/A
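
The individual GLUE tasks cataloged below (CoLA, MNLI, MRPC, QQP, RTE, SST-2, STS-B, QNLI, WNLI) are distributed as configurations of a single dataset on the Hugging Face Hub. A minimal loading sketch in Python, assuming the datasets package is installed:

from datasets import load_dataset

# Each GLUE task is a configuration of the "glue" dataset; the result is a
# DatasetDict with train/validation/test splits.
cola = load_dataset("glue", "cola")
print(cola)              # split names and sizes
print(cola["train"][0])  # a single labeled example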

Dataset Name: Super General Language Understanding Evaluation benchmark

Organization(s): N/A

Task(s): N/A

GPT Level: 3

Description: SuperGLUE is a benchmark designed to be more challenging than GLUE, consisting of a diverse set of natural language understanding tasks including coreference resolution, commonsense reasoning, and question answering. It aims to push the boundaries of current language models.

Splits
Training: N/A
Validation: N/A
Testing: N/A
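
The SuperGLUE tasks below (BoolQ, CB, CoPA, MultiRC, ReCoRD, WiC, WSC) follow the same pattern under the "super_glue" dataset name on the Hugging Face Hub; a sketch:

from datasets import load_dataset

# SuperGLUE tasks are configurations of the "super_glue" dataset.
boolq = load_dataset("super_glue", "boolq")
print(boolq["validation"][0])  # fields: question, passage, idx, label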

Dataset Name: Corpus of Linguistic Acceptability

Organization(s): N/A

Task(s): Text Classification

GPT Level: 1

Description: The Corpus of Linguistic Acceptability (CoLA) consists of English sentences annotated for grammatical acceptability, based on the judgments of expert linguists. It is used to evaluate models on their ability to distinguish between grammatically correct and incorrect sentences.

Splits
Training: 8551
Validation: 1043
Testing: 1063
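
GLUE scores CoLA with the Matthews correlation coefficient (MCC) rather than plain accuracy, since acceptable and unacceptable sentences are imbalanced. A worked sketch with scikit-learn, using illustrative label vectors:

from sklearn.metrics import matthews_corrcoef

# 1 = grammatically acceptable, 0 = unacceptable (labels are illustrative).
gold = [1, 1, 0, 1, 0, 1]
pred = [1, 0, 0, 1, 0, 1]
print(matthews_corrcoef(gold, pred))  # 1.0 = perfect, 0.0 = chance level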

Dataset Name: Multi-Genre Natural Language Inference Corpus

Organization(s): N/A

Task(s): Textual Entailment (Natural Language Inference)

GPT Level: 1

Description: The Multi-Genre Natural Language Inference (MNLI) dataset consists of sentence pairs from a variety of sources, labeled for entailment, contradiction, or neutral relationships. It is designed to evaluate model performance across different genres of text.

Splits
Training: 392702
Validation: Matched: 9815 Mismatched: 9832
Testing: Matched: 9796 Mismatched: 9847
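
Each MNLI example is a premise-hypothesis pair with a three-way label. A sketch of the record structure as exposed by the Hugging Face glue loader:

from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
ex = mnli["train"][0]
# Label ids in this loader: 0 = entailment, 1 = neutral, 2 = contradiction.
print(ex["premise"], ex["hypothesis"], ex["label"], sep="\n")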

Dataset Name: Microsoft Research Paraphrase Corpus

Organization(s): N/A

Task(s): Paraphrase Identification

GPT Level: 1

Description: The Microsoft Research Paraphrase (MRPC) dataset consists of pairs of sentences, each annotated to indicate whether they are semantically equivalent (paraphrases) or not. It is used to train and test models for paraphrase detection and semantic similarity tasks.

Splits
Training: 3668
Validation: 408
Testing: 1725

Dataset Name: Question Natural Language Inference (QNLI)

Organization(s): N/A

Task(s): Textual Entailment (Natural Language Inference)

GPT Level: 2

Description: The Question Natural Language Inference (QNLI) dataset is derived from the Stanford Question Answering Dataset (SQuAD): the task is to determine whether a given sentence contains the answer to a question. It is used to train and evaluate models on question-answer entailment.

Splits
Training: 104743
Validation: 5463
Testing: 5463

Dataset Name: Quora Question Pairs (QQP)

Organization(s): N/A

Task(s): Paraphrase Identification

GPT Level: 1

Description: The Quora Question Pairs (QQP) dataset contains pairs of questions from Quora labeled to indicate if they have the same intent or are semantically equivalent. It is used to train models for identifying duplicate questions.

Splits
Training: 363846
Validation: 40429
Testing: 390964

Dataset Name: Recognizing Textual Entailment

Organization(s): N/A

Task(s): Textual Entailment (Natural Language Inference)

GPT Level: 1

Description: The Recognizing Textual Entailment (RTE) dataset combines data from a series of PASCAL RTE challenges (RTE-1, RTE-2, RTE-3, and RTE-5), containing text-hypothesis pairs labeled as entailment or non-entailment. It is used to evaluate models' abilities to determine whether one sentence logically follows from another.

Splits
Training: 2490
Validation: 277
Testing: 3000

Dataset Name: Stanford Sentiment Treebank

Organization(s): N/A

Task(s): Text Classification

GPT Level: 1

Description: The Stanford Sentiment Treebank (SST) dataset includes movie reviews annotated with sentiment labels ranging from very negative to very positive. It is commonly used for training and evaluating sentiment analysis models; the splits listed correspond to the binary (SST-2) version used in GLUE.

Splits
Training: 67348
Validation: 872
Testing: 1821

Dataset Name: Semantic Textual Similarity Benchmark

Organization(s): N/A

Task(s): Textual Similarity

GPT Level: 1

Description: The Semantic Textual Similarity Benchmark (STS-B), drawn from the SemEval-2017 shared task, contains sentence pairs annotated with similarity scores on a scale from 0 to 5. It is used to train and evaluate models on their ability to predict the degree of semantic similarity between sentences.

Splits
Training: 5749
Validation: 1500
Testing: 1379
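
STS-B is scored by correlating predicted similarity with the gold 0-5 annotations; GLUE reports Pearson and Spearman correlation. A worked sketch with SciPy, using illustrative score vectors:

from scipy.stats import pearsonr, spearmanr

gold = [4.8, 0.5, 3.2, 2.0, 5.0]  # annotated similarity scores (illustrative)
pred = [4.5, 1.0, 3.0, 2.4, 4.9]  # model predictions (illustrative)
print(pearsonr(gold, pred)[0])    # Pearson r
print(spearmanr(gold, pred)[0])   # Spearman rho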

Dataset Name: Winograd Natural Language Inference (WNLI)

Organization(s): N/A

Task(s): Text Classification

GPT Level: 2

Description: The Winograd NLI (WNLI) dataset is converted from the Winograd Schema Challenge (WSC). Each converted example is evaluated separately, so there is no systematic correspondence between a model's score on this task and its score on the unconverted original task.

Splits
Training: 635
Validation: 71
Testing: 146

Dataset Name: Boolean Questions

Organization(s): N/A

Task(s): Text Classification

GPT Level: 3

Description: The BoolQ dataset is a question-answering benchmark consisting of naturally occurring yes/no questions paired with corresponding paragraphs from Wikipedia that contain the answer. Each question in the dataset is derived from real queries made by users, and the dataset is designed to evaluate the ability of models to understand and reason over textual information to provide accurate boolean answers.

Splits
Training: 0
Validation: 0
Testing: 1104

Dataset Name: CommitmentBank

Organization(s): N/A

Task(s): Textual Entailment (Natural Language Inference)

GPT Level: 3

Description: The CommitmentBank (CB) dataset is a collection of English sentences paired with human annotations indicating the degree to which the author of each sentence is committed to the truth of the embedded proposition. It is designed to test models on their ability to understand nuanced linguistic phenomena related to commitment and entailment within various contexts.

Splits
Training: 250
Validation: 56
Testing: 250

Dataset Name: Choice of Plausible Alternatives

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The Choice of Plausible Alternatives (CoPA) dataset is designed to evaluate causal reasoning in natural language understanding. It consists of multiple-choice questions where each question presents a premise and asks the model to select the more plausible of two alternative completions or causes. This dataset focuses on assessing a model's ability to infer causality and understand logical relationships in text.

Splits
Training: 500
Validation: N/A
Testing: 500

Dataset Name: Multi-Sentence Reading Comprehension

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The Multi-Sentence Reading Comprehension (MultiRC) dataset is a benchmark designed for evaluating machine reading comprehension across multiple sentences. It consists of questions paired with a context passage and multiple answer options, where each question may have more than one correct answer. The dataset aims to test a model's ability to comprehend and integrate information spread across several sentences to answer questions accurately.

Splits
Training: 27.2k
Validation: 4.85k
Testing: 9.69k

Dataset Name: Reading Comprehension with Commonsense Reasoning Dataset

Organization(s): N/A

Task(s): Text Classification

GPT Level: 3

Description: The Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a benchmark for evaluating reading comprehension systems on their ability to perform commonsense reasoning. It consists of passages from news articles, each paired with multiple cloze-style questions where a part of the passage is masked, and the task is to select the correct entity from a provided list. This dataset challenges models to integrate contextual understanding with commonsense knowledge to fill in the blanks accurately.

Splits
Training: 101k
Validation: 10k
Testing: 10k

Dataset Name: Word-in-Context

Organization(s): N/A

Task(s): Text Classification

GPT Level: 3

Description: The Word-in-Context (WiC) dataset is designed to assess models' understanding of word meaning in different contexts. It consists of pairs of sentences where a target word appears in two different contexts, and the task is to determine if the word has the same meaning in both contexts or not. WiC aims to evaluate lexical semantic understanding and contextual reasoning capabilities of natural language understanding models.

Splits
Training: 5428
Validation: 638
Testing: 1400

Dataset Name: Winograd Schema Challenge

Organization(s): N/A

Task(s): Text Classification

GPT Level: 3

Description: The Winograd Schema Challenge (WSC) dataset is designed to test machines' ability to resolve ambiguous pronouns by leveraging commonsense reasoning. It consists of sentences with a pronoun and two possible antecedents, where the task is to determine which antecedent the pronoun refers to based on the context provided. The dataset is specifically crafted to evaluate the capability of models to understand nuanced contextual cues and apply logical reasoning effectively.

Splits
Training: N/A
Validation: N/A
Testing: 273

Dataset Name: Story Cloze Test

Organization(s): N/A

Task(s): N/A

GPT Level: 1

Description: The Story Cloze Test is a commonsense reasoning framework for evaluating story understanding, story generation, and script learning. The test requires a system to choose the correct ending to a four-sentence story.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Stanford Question Answering Dataset

Organization(s): N/A

Task(s): Question Answering

GPT Level: 1

Description: The Stanford Question Answering Dataset (SQuAD) is a benchmark dataset for question answering tasks, consisting of questions posed on Wikipedia articles where the answer is a segment of text from the corresponding passage. It is widely used for training and evaluating question answering models.

Splits
Training: 87.6k
Validation: 10.6k
Testing: N/A
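
Each SQuAD example pairs a question and a context passage with answer spans given both as text and as character offsets into the passage. A sketch using the Hugging Face loader:

from datasets import load_dataset

squad = load_dataset("squad")  # SQuAD v1.1
ex = squad["train"][0]
print(ex["context"][:80])
print(ex["question"])
print(ex["answers"])  # {'text': [...], 'answer_start': [...]} character offsets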

Dataset Name: CNN/Daily Mail

Organization(s): N/A

Task(s): Text Summarization

GPT Level: 1

Description: The CNN/DailyMail dataset is a large-scale dataset for abstractive text summarization, consisting of news articles from CNN and the Daily Mail along with their corresponding human-written summaries. It is commonly used to train and evaluate models on their ability to generate concise, coherent summaries from lengthy and detailed news content.

Splits
Training: 287k
Validation: 13.4k
Testing: 11.5k
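
Summarization quality on CNN/DailyMail is conventionally reported with ROUGE. A minimal sketch using the Hugging Face evaluate package (requires the evaluate and rouge_score packages; the strings are illustrative):

import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the court rejected the appeal"],
    references=["the court rejected the appeal on friday"],
)
print(scores)  # rouge1 / rouge2 / rougeL F-measures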

Dataset Name: SciTail

Organization(s): N/A

Task(s): Textual Entailment (Natural Language Inference)

GPT Level: 1

Description: The SciTail dataset is a textual entailment dataset created from multiple-choice science exams and web sentences, designed to evaluate natural language inference models. It consists of premise-hypothesis pairs labeled as either entailment or neutral.

Splits
Training: 23.1k
Validation: 1.3k
Testing: 2.13k

Dataset Name: ReAding Comprehension Dataset From Examination

Organization(s): N/A

Task(s): Question Answering

GPT Level: 1

Description: The RACE dataset is a large-scale English reading comprehension dataset collected from English examinations for Chinese middle and high school students, designed for evaluating machine reading comprehension models. It contains diverse fictional and non-fictional passages, each paired with multiple-choice questions.

Splits
Training: 87.9k
Validation: 4.89k
Testing: 4.93k

Dataset Name: Wizard of Oz

Organization(s): N/A

Task(s): Dialogue State Tracking

GPT Level: 2

Description: The Wizard-of-Oz (WOZ) dataset was collected with the Wizard-of-Oz technique, in which a human operator simulates the behavior of an automated system to elicit naturalistic interactions from users. Data gathered this way is commonly used to train dialogue systems and conversational agents, particularly for dialogue state tracking.

Splits
Training: 37.9k
Validation: N/A
Testing: 2.32k

Dataset Name: Question-Answer Driven Semantic Role Labeling

Organization(s): N/A

Task(s): Question Answering

GPT Level: 2

Description: Question Answering-Semantic Role Labeling (QA-SRL) is a task that involves identifying semantic roles in sentences and generating answers based on these roles. It combines the tasks of question answering and semantic role labeling to improve natural language understanding.

Splits
Training: 6.41k
Validation: 2.18k
Testing: 2.2k

Dataset Name: Penn Treebank Project

Organization(s): N/A

Task(s): Text Generation

GPT Level: 2

Description: The Penn Treebank (PTB) dataset contains text from the Wall Street Journal, annotated with syntactic structure and part-of-speech tags. It is a standard benchmark for evaluating models on tasks such as syntactic parsing and language modeling.

Splits
Training: 42.1k
Validation: 3.37k
Testing: 3.76k

Dataset Name: LAnguage Modeling Broadened to Account for Discourse Aspects

Organization(s): N/A

Task(s): Text Generation

GPT Level: 2

Description: The LAMBADA dataset is designed to test language models on their ability to predict the last word of sentences that require a broad understanding of the context provided by preceding text. It emphasizes the need for models to grasp long-range dependencies in text.

Splits
Training: N/A
Validation: N/A
Testing: 5.15k

Dataset Name: Question Answering - Zero-Shot Relation Extraction

Organization(s): N/A

Task(s): Question Answering

GPT Level: 2

Description: Question Answering-Zero-Shot Relation Extraction (QA-ZRE) is a task that combines question answering and relation extraction, where models are trained to answer questions about relations between entities in text without explicit supervision for relation extraction.

Splits
Training: 8.4M
Validation: 6k
Testing: 120k

Dataset Name: Children’s Book Test

Organization(s): Facebook AI Research

Task(s): Question Answering

GPT Level: 2

Description: The Children's Book Test (CBT) dataset, created by Facebook AI Research (FAIR), consists of sentences from children's books with one word removed and multiple-choice options provided for the missing word. It is designed to evaluate language models on their ability to understand and predict the context of a given text.

Splits
Training: 121k
Validation: 2k
Testing: 2.5k

Dataset Name: Question Answering in Context

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The Question Answering in Context (QuAC) dataset is designed for training and evaluating models on answering questions in a conversational context. It contains around 14,000 information-seeking dialogues (roughly 100,000 question-answer pairs) collected in a simulated student-teacher setting, focusing on understanding and generating contextually relevant answers based on a given text passage.

Splits
Training: 83568
Validation: 7354
Testing: 7353

Dataset Name: Discrete Reasoning Over Paragraphs

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: DROP (Discrete Reasoning Over Paragraphs) is a crowdsourced, adversarially created reading comprehension benchmark of roughly 96k questions, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).

Splits
Training: 77409
Validation: 9536
Testing: N/A

Dataset Name: Natural Questions

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The Natural Questions (NQ) dataset is a large-scale benchmark for question answering, consisting of real user questions from Google search paired with corresponding passages from Wikipedia that contain the answer. Each question is accompanied by a long passage, and the task is to identify the exact answer span within the passage or to determine that the passage does not contain an answer. This dataset is designed to evaluate a model's ability to understand and extract relevant information from long documents.

Splits
Training: 307372
Validation: 7830
Testing: 7842

Dataset Name: HellaSwag

Organization(s): N/A

Task(s): Natural Language Inference

GPT Level: 3

Description: HellaSwag is a dataset for studying grounded commonsense inference. It consists of 70k multiple-choice questions about grounded situations: each question comes from one of two domains (ActivityNet or WikiHow) and has four answer choices about what might happen next in the scene. The correct answer is the (real) sentence for the next event; the three incorrect answers are adversarially generated and human-verified, so as to fool machines but not humans.

Splits
Training: 39900
Validation: 10000
Testing: 10000
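
Multiple-choice benchmarks like HellaSwag are commonly scored by asking a language model for the likelihood of each candidate ending and picking the most likely one. A simplified sketch with Hugging Face transformers and GPT-2, ranking full sequences by average token loss (the context and endings are illustrative; the official evaluation scores only the ending tokens conditioned on the context):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = "She cracked the eggs into the bowl and"
endings = [" whisked them with a fork.", " parked the bowl at the airport."]

def avg_nll(text: str) -> float:
    # Average per-token negative log-likelihood; lower = more plausible to the model.
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

print(min(endings, key=lambda e: avg_nll(context + e)))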

Dataset Name: OpenBookQA

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension.

Splits
Training: 4957
Validation: 500
Testing: 500

Dataset Name: WinoGrande

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: WinoGrande is a large-scale dataset for commonsense reasoning, inspired by the Winograd Schema Challenge, containing sentence pairs with ambiguous pronouns that require commonsense knowledge to resolve. It is used to train and test models on their ability to understand context and resolve ambiguities.

Splits
Training: 9248
Validation: 1267
Testing: 1767

Dataset Name: Physical Interaction: Question Answering

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The Physical Interaction: Question Answering (PIQA) dataset introduces the task of physical commonsense reasoning. Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including for robots that interact with the world and understand natural language. PIQA focuses on everyday situations with a preference for atypical solutions; the dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials.

Splits
Training: 16000
Validation: 2000
Testing: 3000

Dataset Name: AI2 Reasoning Challenge

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The AI2 Reasoning Challenge (ARC) dataset includes multiple-choice science questions from standardized exams, designed to test advanced reasoning abilities in artificial intelligence systems. It is used to evaluate models on their understanding of scientific concepts and reasoning skills; the splits listed correspond to the ARC-Challenge subset.

Splits
Training: 1.12k
Validation: 299
Testing: 1.17k

Dataset Name: TriviaQA

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: TriviaQA is a large-scale question answering dataset containing question-answer pairs from trivia websites, with corresponding evidence documents from Wikipedia and the web. It is designed to train and evaluate models on their ability to find and extract relevant information.

Splits
Training: 138000
Validation: 17900
Testing: 17200

Dataset Name: Adversarial Natural Language Inference

Organization(s): N/A

Task(s): Natural Language Inference

GPT Level: 3

Description: The ANLI (Adversarial NLI) dataset is a challenging natural language inference dataset designed to test and improve the robustness of NLI models against adversarial attacks. It consists of three rounds of human-annotated examples, with each round progressively more challenging, aiming to push models to better generalize and handle difficult inference tasks. The dataset covers a variety of inference types and includes a diverse set of premises and hypotheses.

Splits
Training: train_r1=16946, train_r2=45460, train_r3=100459
Validation: dev_r1=1000, dev_r2=1000, dev_r3=1200
Testing: test_r1=1000, test_r2=1000, test_r3=1200
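
ANLI's three rounds are exposed as separate splits by the Hugging Face loader, matching the split names listed above; a sketch:

from datasets import load_dataset

anli = load_dataset("anli")
print(anli)  # splits: train_r1..r3, dev_r1..r3, test_r1..r3
ex = anli["test_r3"][0]
# Labels: 0 = entailment, 1 = neutral, 2 = contradiction.
print(ex["premise"], ex["hypothesis"], ex["label"], sep="\n")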

Dataset Name: WebQuestions

Organization(s): N/A

Task(s): Question Answering

GPT Level: 3

Description: The WebQuestions (WebQ) dataset contains questions sourced from Google queries, paired with answers from Freebase, a large knowledge graph. It is used to evaluate models on their ability to understand and answer natural language questions based on structured data.

Splits
Training: 3778
Validation: N/A
Testing: 2032

Dataset Name: Massive Multitask Language Understanding

Organization(s): N/A

Task(s): Question Answering

GPT Level: 4

Description: The Massive Multitask Language Understanding (MMLU) dataset includes a wide range of multiple-choice questions across various domains such as history, mathematics, and science. It is designed to evaluate the breadth and depth of a language model's general knowledge and reasoning abilities.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Exams

Organization(s): N/A

Task(s): N/A

GPT Level: 4

Description: The Exams dataset is a comprehensive benchmark designed to evaluate the problem-solving capabilities of language models like GPT-4. It encompasses a diverse range of academic subjects and question types, simulating standardized exams and school assessments. This dataset aims to assess the model's proficiency across different knowledge domains and its ability to reason and solve complex problems.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Crowdsourced Stereotype Pairs benchmark

Organization(s): N/A

Task(s): N/A

GPT Level: 4

Description: CrowS-Pairs is a dataset designed to measure social biases in language models. It includes sentence pairs that exhibit various types of social bias, such as race or gender bias, and is used to assess the fairness and impartiality of language models.

Splits
Training: N/A
Validation: N/A
Testing: 1508

Dataset Name: Grade School Math 8K

Organization(s): N/A

Task(s): N/A

GPT Level: 4

Description: The Grade School Math 8K (GSM-8K) dataset contains thousands of math word problems typically encountered in grade school. It is used to train and evaluate models on their ability to perform arithmetic and understand mathematical concepts in a natural language context.

Splits
Training: 7.5k
Validation: N/A
Testing: 1k
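
Each GSM-8K reference solution ends with the final numeric answer after a "####" marker, which makes automatic grading straightforward. A minimal parsing sketch (the sample record is illustrative, in the dataset's format):

def extract_answer(solution: str) -> str:
    # GSM-8K reference solutions end with a line like "#### 72".
    return solution.split("####")[-1].strip().replace(",", "")

sample = "Natalia sold 48 clips in April and half as many in May.\n48 / 2 = 24\n48 + 24 = 72\n#### 72"
assert extract_answer(sample) == "72"
print(extract_answer(sample))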

Dataset Name: RealToxicityPrompts

Organization(s): University of Washington

Task(s): N/A

GPT Level: 4

Description: The RealToxicityPrompts (RTP) dataset, developed at the University of Washington and the Allen Institute for AI, contains about 100k naturally occurring sentence-level prompts drawn from web text, each scored for toxicity. It is used to evaluate how readily language models produce toxic continuations of otherwise ordinary prompts.

Splits
Training: 100k
Validation: 0
Testing: 0

Dataset Name: HumanEval (Hand-Written Evaluation Set)

Organization(s): OpenAI

Task(s): Code Generation

GPT Level: 4

Description: The HumanEval dataset consists of 164 hand-written programming problems designed to assess the code generation capabilities of language models. Each problem includes a natural language prompt, a function signature, and a set of unit tests used to evaluate the correctness and functionality of generated code.

Splits
Training: N/A
Validation: N/A
Testing: 164
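
HumanEval grades a model completion by executing the problem's unit tests against it (the basis of the pass@k metric). A minimal sketch of that check, using a hypothetical problem record shaped like the published dataset (fields: prompt, test, entry_point); real harnesses sandbox this step, since running untrusted model output with exec is unsafe:

problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}
completion = "    return a + b\n"  # model-generated function body (illustrative)

# Execute prompt + completion, then run the unit tests on the entry point.
namespace = {}
exec(problem["prompt"] + completion, namespace)
exec(problem["test"], namespace)
namespace["check"](namespace[problem["entry_point"]])
print("all tests passed")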

Dataset Name: ETHICS

Organization(s): Hendrycks et al.

Task(s): N/A

GPT Level: 4

Description: The ETHICS dataset is designed to evaluate the ethical reasoning abilities of AI models. It includes scenarios and questions related to moral and ethical dilemmas, aiming to train and assess models on their understanding and application of ethical principles.

Splits
Training: virtue=28246, deontology=18165, commonsense=13911, justice=21792, utilitarianism=13738
Validation: N/A
Testing: virtue=4976, deontology=3597, commonsense=3886, justice=2705, utilitarianism=4808

Dataset Name: TruthfulQA

Organization(s): N/A

Task(s): Question Answering

GPT Level: 4

Description: TruthfulQA is a dataset designed to evaluate the truthfulness and factual accuracy of language models when answering questions. It includes 817 questions that are carefully crafted to test whether models produce truthful responses, addressing the challenge of misinformation in AI-generated content.

Splits
Training: N/A
Validation: N/A
Testing: 817

Dataset Name: Fine-tuned LAnguage Net

Organization(s): Google

Task(s): N/A

GPT Level: 4

Description: The Fine-tuned Language Net (FLAN) dataset by Google involves instruction-based fine-tuning of language models to improve their ability to follow complex instructions and generate accurate, context-aware responses. It is used to enhance the performance of models in understanding and executing diverse tasks.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Conversational Question Answering

Organization(s): Stanford University

Task(s): Question Answering

GPT Level: N/A

Description: The CoQA (Conversational Question Answering) dataset is a large-scale collection designed to train and evaluate systems for answering questions in a conversational context. It includes over 127,000 questions based on passages from diverse domains, with each question-answer pair building on previous ones to simulate a natural dialogue. CoQA emphasizes understanding and maintaining context across multiple turns of conversation.

Splits
Training: 108k
Validation: 8k
Testing: 31k

Dataset Name: Common Crawl

Organization(s): N/A

Task(s): LLM Pre-Training

GPT Level: N/A

Description: Common Crawl is a massive dataset of text and code scraped from the public web. It's a popular choice for training LLMs because of its vast size and variety, reflecting the real-world language used across the internet. This exposure to diverse content helps LLMs become more versatile and informative.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Project Gutenberg

Organization(s): N/A

Task(s): LLM Pre-Training

GPT Level: N/A

Description: This dataset consists of a variety of books sourced from Project Gutenberg, representing a diverse collection of public domain literature. It provides a wide range of topics and styles, which helps in developing a model's understanding of complex language patterns and narrative structures.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: N/A

Organization(s): N/A

Task(s): LLM Pre-Training

GPT Level: N/A

Description: This dataset includes a selection of books from a proprietary corpus, containing contemporary published works. It offers high-quality, curated text with modern language usage, which enhances the model's ability to generate contextually relevant and coherent responses.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Reddit

Organization(s): N/A

Task(s): LLM Pre-Training

GPT Level: N/A

Description: The Reddit dataset used to train large language models typically consists of a vast collection of posts and comments from the Reddit platform. This dataset is valuable because it captures a wide range of informal conversational language, covering diverse topics, styles, and dialects. The conversational nature of the data helps in training models to understand and generate human-like dialogue and to engage in contextually relevant interactions.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Wikipedia

Organization(s): N/A

Task(s): LLM Pre-Training

GPT Level: N/A

Description: The Wikipedia dataset used for training large language models is a comprehensive snapshot of English Wikipedia, encompassing millions of articles across various fields of knowledge. This dataset is valuable for its structured, well-written, and encyclopedic content, providing models with high-quality information and a broad understanding of numerous subjects. Wikipedia's consistent formatting and extensive coverage make it ideal for improving a model's ability to generate informative and accurate text.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: ExMix

Organization(s): N/A

Task(s): N/A

GPT Level: N/A

Description: ExMix (Extreme Mixture) is a large collection of supervised NLP tasks spanning diverse task families, introduced alongside the ExT5 model for extreme multi-task pre-training. Training on such a broad mixture of tasks is intended to yield more robust and transferable models.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Web text

Organization(s): N/A

Task(s): LLM Pre-Training

GPT Level: N/A

Description: Web text is the vast collection of written content found on websites across the internet. It encompasses everything from articles and blog posts to social media updates, product descriptions, and online comments. This diverse mix of information reflects the real-world use of language and constantly evolves as new websites and content are created.

Splits
Training: N/A
Validation: N/A
Testing: N/A

Dataset Name: Stanford Natural Language Inference

Organization(s): Stanford

Task(s): Natural Language Inference (NLI)

GPT Level: N/A

Description: The SNLI dataset is a collection of English sentence pairs labeled as entailment, contradiction, or neutral. It's a popular benchmark for training and evaluating large language models (LLMs) in their ability to understand the relationships between sentences.

Splits
Training: 550k
Validation: 10k
Testing: 10k