Before diving into GenAI, it is important to have a solid foundation. Only with a sturdy base will you be able to build on your knowledge of AI. Skipping these fundamental elements can lead to confusion and hinder your progress. Here are some prerequisites you should have:
Multitask learning is an approach where a single model is trained to perform multiple related tasks simultaneously. By sharing information and features across tasks, multitask learning aims to improve the performance of individual tasks through joint learning, leading to better generalization and efficiency.
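To make the idea concrete, here is a minimal sketch of a multitask model in PyTorch: a single shared encoder feeding two task-specific heads. The layer sizes, vocabulary size, and the example tasks (sentiment and topic classification) are illustrative assumptions, not taken from any particular system.

```python
import torch
import torch.nn as nn

class MultitaskModel(nn.Module):
    """Shared encoder with one output head per task (illustrative sizes)."""
    def __init__(self, vocab_size=10000, hidden=256, num_sentiment_classes=2, num_topic_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Task-specific heads share the same encoded representation.
        self.sentiment_head = nn.Linear(hidden, num_sentiment_classes)
        self.topic_head = nn.Linear(hidden, num_topic_classes)

    def forward(self, token_ids, task):
        _, h = self.encoder(self.embed(token_ids))   # h: (1, batch, hidden)
        features = h.squeeze(0)
        if task == "sentiment":
            return self.sentiment_head(features)
        return self.topic_head(features)

model = MultitaskModel()
logits = model(torch.randint(0, 10000, (4, 12)), task="sentiment")  # shape (4, 2)
```

Training typically alternates batches from each task, so the shared encoder is pushed to learn features that are useful for all of them.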
Meta-learning, also known as learning to learn, involves training models on a variety of tasks with the goal of enabling them to quickly adapt to new tasks or domains with minimal data. It explores techniques for learning effective learning strategies or representations that generalize across tasks.
Zero-shot learning is a machine learning paradigm where a model is trained to recognize classes it has never seen before during training. It involves learning to generalize to new classes by leveraging auxiliary information or semantic embeddings.
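The following sketch shows the core mechanism: embed the input and each candidate class name into the same vector space and pick the most similar class, even for classes never seen during training. The character-hashing "embedding" below is only a stand-in for a real pretrained encoder, and the labels are illustrative assumptions.

```python
import numpy as np

def embed_text(text):
    # Stand-in for a real text encoder (e.g., a pretrained sentence model);
    # characters are hashed into a fixed-size vector purely for illustration.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def zero_shot_classify(text, candidate_labels):
    """Score each candidate label by similarity between input and label embeddings."""
    x = embed_text(text)
    scores = {label: float(np.dot(x, embed_text(label))) for label in candidate_labels}
    return max(scores, key=scores.get), scores

label, scores = zero_shot_classify("The striker scored twice in the final",
                                   ["sports", "politics", "cooking"])
print(label, scores)
```

With a real encoder, the class-name (or class-description) embeddings carry the auxiliary semantic information that lets the model generalize to unseen classes.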
Model-Agnostic Meta-Learning (MAML) is a meta-learning algorithm that aims to train models to adapt quickly to new tasks with minimal data. It involves learning a good initialization that can be fine-tuned efficiently for new tasks, facilitating rapid adaptation.
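As a rough illustration of MAML's two loops, the sketch below performs one meta-update in PyTorch: each task adapts the shared initialization with one inner gradient step, and the post-adaptation losses drive the outer update of that initialization. The tiny functional network, tensor shapes, and toy sine-regression tasks are illustrative assumptions.

```python
import torch

def forward(params, x):
    # Tiny two-layer network applied functionally so adapted parameters can be used.
    w1, b1, w2, b2 = params
    h = torch.tanh(x @ w1 + b1)
    return h @ w2 + b2

def maml_meta_step(params, tasks, inner_lr=0.01, outer_lr=0.001):
    """One meta-update: inner-loop adaptation per task, outer-loop update of the init."""
    meta_grads = [torch.zeros_like(p) for p in params]
    for x_support, y_support, x_query, y_query in tasks:
        support_loss = ((forward(params, x_support) - y_support) ** 2).mean()
        grads = torch.autograd.grad(support_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]   # inner step
        query_loss = ((forward(adapted, x_query) - y_query) ** 2).mean()
        task_grads = torch.autograd.grad(query_loss, params)          # outer gradient
        meta_grads = [mg + tg for mg, tg in zip(meta_grads, task_grads)]
    with torch.no_grad():
        for p, mg in zip(params, meta_grads):
            p -= outer_lr * mg / len(tasks)
    return params

def random_sine_task(n=10):
    # Toy regression task: fit a sine wave with a random phase (illustrative choice).
    phase = torch.rand(1) * 3.14
    xs, xq = torch.randn(n, 1), torch.randn(n, 1)
    return xs, torch.sin(xs + phase), xq, torch.sin(xq + phase)

params = [torch.randn(1, 32, requires_grad=True), torch.zeros(32, requires_grad=True),
          torch.randn(32, 1, requires_grad=True), torch.zeros(1, requires_grad=True)]
params = maml_meta_step(params, [random_sine_task() for _ in range(4)])
```

The key design choice is that the outer gradient flows through the inner update (via create_graph=True), so the initialization itself is optimized for fast adaptation.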
Multitask Question Answering Network (MQAN) is a neural network architecture designed to handle multiple NLP tasks framed uniformly as question answering. It attends jointly over a question and its context, enabling a single model to reason over the relevant evidence and generate accurate answers across diverse tasks.
decaNLP is a framework for training and evaluating multitask models across ten diverse natural language processing tasks. It covers a wide range of NLP tasks, including translation, summarization, and question answering, aiming to encourage research in multitask learning and generalization.
Training curriculum refers to a strategy for training machine learning models where training examples are presented to the model in a structured order, typically starting with simpler examples and gradually increasing in complexity. It is used to facilitate learning and improve the generalization performance of models.
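A minimal sketch of the idea follows: score each training example by difficulty, then train in stages that start with the easiest examples and progressively add harder ones. Using sentence length as the difficulty proxy is an illustrative assumption; any task-appropriate score can be substituted.

```python
def curriculum_order(examples, difficulty):
    """Sort training examples from easiest to hardest using a difficulty score."""
    return sorted(examples, key=difficulty)

def train_with_curriculum(examples, difficulty, num_stages=3, train_fn=None):
    """Train in stages; each stage widens the pool to include harder examples."""
    ordered = curriculum_order(examples, difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = int(len(ordered) * stage / num_stages)
        pool = ordered[:cutoff]          # easy examples first, harder ones added later
        if train_fn is not None:
            train_fn(pool)               # one training pass over the current pool

# Illustrative usage: sentence length stands in for difficulty.
sentences = ["a cat", "the cat sat on the mat",
             "the cat that the dog chased sat on the mat"]
train_with_curriculum(sentences, difficulty=len)
```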
Byte Pair Encoding (BPE) with Unicode is an extension of the BPE algorithm that handles Unicode characters, allowing for the efficient tokenization of multilingual text. It is commonly used in natural language processing tasks such as machine translation and text generation.
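The sketch below learns BPE merges over UTF-8 bytes, so every Unicode character is representable without a huge base vocabulary. The toy corpus and number of merges are illustrative assumptions; production tokenizers add details such as special tokens and word-boundary handling.

```python
from collections import Counter

def learn_bpe_merges(text, num_merges=10):
    """Greedily learn BPE merges over the UTF-8 bytes of each word."""
    words = [tuple(word.encode("utf-8")) for word in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent pair
        merges.append(best)
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append((w[i], w[i + 1]))    # fuse the pair into one symbol
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(tuple(out))
        words = merged
    return merges

# Non-ASCII characters are just byte sequences here, so multilingual text works.
print(learn_bpe_merges("låg låg lägre lägre lägre lägst", num_merges=5))
```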
The WebText dataset is a large-scale dataset introduced by OpenAI for training language models. It comprises a diverse range of internet text scraped from outbound links shared on Reddit.
???
The Newspaper Content Text dataset is a compilation of data gathered from newspaper articles. The dataset may include various features such as the title of the article, author, publication date, the content of the article, and more.
The Reddit dataset is a large collection of data from the social media platform, Reddit. It includes information about Reddit posts, comments, upvotes, downvotes, and other user interactions on the site.
The Common Crawl dataset is a large and regularly updated corpus of web crawl data that is freely available to anyone. This dataset contains raw web page data, metadata, and text.
Question Answering-Semantic Role Labeling (QA-SRL) is a task that involves identifying semantic roles in sentences and generating answers based on these roles. It combines the tasks of question answering and semantic role labeling to improve natural language understanding.
Question Answering-Zero-shot Relation Extraction (QA-ZRE) is a task that combines question answering and relation extraction, where models are trained to answer questions about relations between entities in text without explicit supervision for relation extraction.
Wizard-of-Oz (WOZ) is a data collection technique where human operators simulate the behavior of an automated system to collect naturalistic interactions with users. It is commonly used to gather training data for dialogue systems and conversational agents.
WikiSQL is a dataset for semantic parsing and question answering tasks, where models are trained to map natural language questions to SQL queries over a structured table. It is used to evaluate models on their ability to understand and generate SQL queries from text.
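To illustrate the shape of the task, here is one WikiSQL-style example as a Python record: a question over a structured table paired with the SQL query a model should produce. The field names and values are assumptions made for illustration and may differ from the released dataset format.

```python
# Illustrative WikiSQL-style record (field names are an assumption, not the
# official schema).
example = {
    "table": {
        "header": ["Player", "Team", "Goals"],
        "rows": [["Ada", "Blue", 12], ["Ben", "Red", 7]],
    },
    "question": "How many goals did Ada score?",
    "sql": "SELECT Goals FROM table WHERE Player = 'Ada'",
}

# A semantic parser is trained to map the question (plus the table header) to the
# SQL query, and is typically evaluated by executing the predicted query.
print(example["sql"])
```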
The Natural Questions dataset contains real anonymized questions from Google search, each paired with a Wikipedia page and a corresponding answer span. It is used to train and evaluate question answering systems on their ability to find precise answers within long documents.
The Conversational Question Answering (CoQA) dataset consists of question-answer pairs within a conversational context, focusing on answering questions based on a given passage. It evaluates models on their ability to understand and generate contextually relevant answers in a conversation.
The LAMBADA dataset is designed to test language models on their ability to predict the last word of sentences that require a broad understanding of the context provided by preceding text. It emphasizes the need for models to grasp long-range dependencies in text.
The WikiText dataset is a collection of Wikipedia articles curated for language modeling tasks, featuring long-form, coherent text with minimal editing. It is used to train and evaluate models on their ability to generate and predict natural language text.
The Children’s Book Test (CBT) dataset, created by Facebook AI Research (FAIR), consists of sentences from children’s books with one word removed and multiple-choice options provided for the missing word. It is designed to evaluate language models on their ability to understand and predict the context of a given text.
The CNN/Daily Mail dataset is a large collection of news articles paired with multi-sentence summaries, created to facilitate research in automatic summarization and reading comprehension. It is used to train models to generate concise summaries from longer texts.
The 1 Billion Word Benchmark is a dataset composed of a large collection of sentences from news articles, designed to support research in language modeling. It aims to evaluate models on their ability to predict and generate fluent and coherent text.
The Penn Treebank (PTB) dataset contains text from the Wall Street Journal, annotated with syntactic structure and part-of-speech tags. It is a standard benchmark for evaluating models on tasks such as syntactic parsing and language modeling.
GPT-2 is a scaled-up version of GPT with 1.5 billion parameters, showing significant improvements in text generation quality and diversity compared to its predecessor. This advancement demonstrated the potential of increasing model size for performance gains in NLP tasks.