Tokenization

ACME Inc., located in Santa Clara, California, is in the business of selling products and services to consumers and has a copious amount of language data: documents and voice recordings. Utilizing various AI technologies, including LLMs, speech-to-text, and text-to-speech, ACME Inc. wishes to serve its customers better, faster, and more efficiently. The most important preparation before ACME’s own proprietary LLM can be developed is constructing the vocabulary that the LLM is based on. Here is a detailed description of how we can create a vocabulary for ACME’s LLM.

All LLMs need to define and construct their own vocabulary before the AI models can be trained. The vocabulary of an LLM consists of tokens, not words. A token can be a phrase, a word, or a sub-word (a part of a word), but most LLM tokens are words and sub-words, not phrases. The size of a vocabulary in any language can be virtually infinite because names (proper nouns) can have virtually infinite variations. The process of creating a vocabulary for an LLM is called tokenization, and it is not trivial. Tokenization strategies often impact the performance of LLMs in many downstream tasks and need to be well thought out. There are a few algorithms for tokenization, but the most popular ones in the LLM arena are BPE (Byte Pair Encoding) and WordPiece. The two algorithms are very similar, differing slightly in their merging operation. To tokenize (create a vocabulary), we need an input text, base tokens, a desired vocabulary size, and a merging method.

Let’s explain the tokenization process using an example language, English. Let’s choose the US Constitution as the input text, the 26 letters of the alphabet as the base tokens, and the following merging instruction: merge the two consecutive tokens that appear together most often in the input text. Our starting point for the vocabulary is the 26 letters of the alphabet. At this stage, the word “the” decomposes into three tokens, “t”, “h”, and “e”, according to our vocabulary. Next, we scan through the input text (the Constitution) to see which two consecutive tokens appear together most often and merge them to create a new token. Let’s say the word “the” appears most often in the Constitution. The first merging step then merges the tokens “t” and “h” to create the new token “th”, and the next merging step merges “th” with “e” to create the token “the”. At this stage, we have 28 tokens in our vocabulary: the 26 letters, “th”, and “the”. We repeat this process until we reach the desired number of tokens in our vocabulary.

It is important for enterprises to develop their own vocabulary to enhance the performance of LLMs that cater to their own needs. Continuing the Constitution example, even though the phrase “United States of America” may not appear often, it will most likely be beneficial for the LLM we build to have “United States of America” as one token rather than four separate tokens. Because the tokenization strategy impacts the performance of LLMs, each enterprise’s tokenization strategy should be customized for its specific needs.
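Below is a minimal sketch of the BPE merge loop just described. The toy corpus and merge count are illustrative assumptions, not ACME’s actual data or tokenizer:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Represent each word as a tuple of single-character base tokens.
    words = Counter(tuple(w) for w in corpus.lower().split())
    vocab = {ch for w in words for ch in w}
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merged = best[0] + best[1]
        vocab.add(merged)                  # the new token joins the vocabulary
        # Rewrite every word, replacing the pair with the merged token.
        new_words = Counter()
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab

# Toy run: "the" dominates, so the first two merges produce "th" and then "the".
text = "the people of the united states the the"
print(sorted(bpe_merges(text, 2)))
```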
ACME’s data team identified the following phrases as very important to its business, so the decision has been made to create a token for each phrase below.
  • To infinity and beyond => 1 token
  • Hakuna Matata! => 1 token
  • Bibbidi Bobbidi Boo => 1 token
  • Circle of life => 1 token …
ACME’s LLM will treat each of these phrases as a single token; the phrases will never be broken apart and will always appear intact, improving the accuracy of downstream applications.
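As a sketch of how such protected phrases can be registered, the snippet below uses the Hugging Face tokenizer API as a stand-in for ACME’s proprietary tokenizer; the gpt2 checkpoint is only an illustrative base:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative base tokenizer
phrases = ["To infinity and beyond", "Hakuna Matata!",
           "Bibbidi Bobbidi Boo", "Circle of life"]
tokenizer.add_tokens(phrases)  # added tokens are matched before BPE runs
# (A model trained with this tokenizer must resize its embedding matrix,
# e.g. model.resize_token_embeddings(len(tokenizer)).)

print(tokenizer.tokenize("Hakuna Matata! What a wonderful phrase"))
# The added phrase surfaces as a single, unbreakable token.
```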

Text Summarization

Next, using ACME’s LLM trained on enterprise proprietary data, we start processing the language data that the business needs. First, a speech-to-text AI model converts all voice recording data into documents. Next, using the text summarization feature of the LLM that the GenAI team created, all documents (both native documents and speech-to-text converted documents) are processed to produce short and medium-length summaries. The text summarization process creates a summary for the entire document and at a level as granular as each paragraph within the document. Because ACME Inc. has millions of documents, this process would have taken months or years with hundreds of employees proficient in English, but the LLM that the AI team developed accomplished it in a few hours.
Text summarization is the process of condensing a larger piece of text, such as an article, essay, or document, into a shorter version while retaining the main ideas and key points. This condensed version, known as a summary, gives readers a quick overview of the original text without requiring them to read the entire document. After processing all the documents with the LLM to generate summaries for entire documents as well as for every paragraph within each document, ACME’s database team organized all the documents and paragraphs, indexed by their summaries, into ACME’s database, The Database, for easy access by employees and customers. Now ACME’s employees and customers alike can search The Database using natural language to access the document, or the part of a document, they want information on. Condensing a whole document or a part of a document using the LLM’s text summarization feature gives ACME Inc. the following tangible efficiency boosts:
  1. Organize: Using the summary of a document as an index into the database lets documents be organized much better
  2. Search: When anyone needs to access a document, searching becomes easier with document summaries
  3. Compare: Document summaries can be used to compare documents much more easily, boosting efficiency
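Below is a minimal sketch of this document-and-paragraph summarization step. It uses a public Hugging Face summarization checkpoint as a stand-in for ACME’s proprietary LLM; the model name and length limits are illustrative assumptions:

```python
from transformers import pipeline

# Public checkpoint as a stand-in for ACME's proprietary summarizer.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_document(paragraphs):
    """Return one summary per paragraph plus a summary of the whole document."""
    para_summaries = [
        summarizer(p, max_length=60, min_length=10)[0]["summary_text"]
        for p in paragraphs
    ]
    # Summarize the concatenated paragraph summaries to cover the full document.
    doc_summary = summarizer(
        " ".join(para_summaries), max_length=120, min_length=30
    )[0]["summary_text"]
    return doc_summary, para_summaries
```

Both the per-paragraph summaries and the document-level summary can then be stored as indexes into The Database.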

Semantic Similarity

ACME’s ability to manage language data efficiently through AI technology catapulted the enterprise to become one of the leaders in the industry, boosting its sales and increasing its number of customers. It now receives over ten thousand calls a day from current and prospective customers.

Because there are virtually infinite permutations of how natural language is spoken (the way customers ask about a product or service may differ widely from person to person, even when the product or service is the same), ACME Inc. hired and trained agents proficient in English to serve its customers. Here are a few examples of how customers might inquire about products and services:

  • I heard that you have a product that has feature “A”.
  • I’d like to get information about a product with “A” ability.
  • Having “A” is so cool. I have to have it!
  • Can you tell me about specification of “A” and if there is any product that has it?
  • I have been using Coyote’s product XYZ that can let me do “A”. Do you have a similar product?

But now they are inundated with customer calls, and their agents are unable to respond fast enough. ACME’s AI team yet again steps in to resolve the crisis with another LLM feature called Semantic Similarity.

Semantic similarity refers to how similar two pieces of text are in meaning, even if the wording is different. It’s not just about identical or matching words, but rather the underlying concepts and ideas they convey.
Semantic similarity is crucial in natural language processing (NLP) tasks, as it helps algorithms understand the context and meaning of text beyond just individual words.

For each product, service, or piece of information that ACME Inc. has, we add a few example statements of how a customer (or employee) would describe it. Because we will be using the Semantic Similarity feature of the LLM, the statements we add do not have to be exhaustive, which, by the way, is virtually impossible. We just need a few representative statements that people might use to ask about or describe each product, service, or piece of information.
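Here is a minimal sketch of this routing idea, using the sentence-transformers library as a stand-in for ACME’s LLM embeddings; the model name, catalog entries, and example statements are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# Public embedding model as a stand-in for ACME's LLM embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A few representative statements per product -- not exhaustive.
catalog = {
    "product_with_A": [
        'I heard that you have a product that has feature "A".',
        'I want information about a product with "A" ability.',
    ],
    "product_with_B": ['Does anything you sell support "B"?'],
}

def route_inquiry(inquiry):
    """Return the catalog entry whose examples are closest in meaning."""
    q = model.encode(inquiry, convert_to_tensor=True)
    best, best_score = None, -1.0
    for product, examples in catalog.items():
        ex = model.encode(examples, convert_to_tensor=True)
        score = util.cos_sim(q, ex).max().item()  # best-matching example
        if score > best_score:
            best, best_score = product, score
    return best, best_score

print(route_inquiry('Can you tell me about the specification of "A"?'))
```

A new inquiry is matched by meaning rather than exact wording, so phrasings never seen before still land on the right product.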


Question Answering

ACME’s various AI technologies and LLM features enable the enterprise’s customers and employees to access vast amounts of language data automatically via natural language instructions/prompts. Given a natural language instruction, we can now retrieve the relevant documents or paragraphs within a document. Before customers commit to purchasing a product or a service, there are typically questions to be asked and answered. There is another LLM feature that can help with this process, called Question Answering (QA).

Question answering (QA) is a branch of Artificial Intelligence (AI) that deals with automatically creating answers to questions posed in natural language. The goal is to develop systems that can understand the meaning of a question and retrieve or generate an accurate answer from a variety of sources, just like a human might do.

There are two main categories of question answering systems:

  • Open Domain QA: These systems aim to answer a wide range of questions from any domain, similar to a search engine. They rely on vast amounts of text data and require advanced techniques to understand the context and intent of a question.
  • Closed Domain QA: These systems are designed to answer questions within a specific domain, like customer service chatbots for a particular company. They typically have access to a well-defined knowledge base or dataset relevant to their domain, making answer retrieval more focused.

A typical closed-domain QA dataset is very similar to the reading comprehension questions that many school exams employ. One popular QA dataset is SQuAD (Stanford Question Answering Dataset). Each SQuAD example has three parts: a passage from Wikipedia, a question related to the passage, and the answer.

The passage and its related question become the input to an LLM like ChatGPT, and the answer is the output/label that ChatGPT is expected to generate. ChatGPT’s ability to answer questions based on a text depends on the quality of these QA triplets. The obvious scenario is that if someone asks ChatGPT a question it was trained on, such as the earlier example of the question regarding graupel, ChatGPT answers correctly. The less obvious scenario is when ChatGPT is given a text it has not seen during training and a person asks a question related to that text: can ChatGPT answer correctly? Well, it depends. If ChatGPT is trained with many permutations of text (permutations of vocabulary, sentence structure, and content), it can generalize to answer questions about text it has not seen.
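As a sketch of this setup, the snippet below runs an extractive, SQuAD-finetuned public model as a stand-in for a QA-capable LLM; the passage and question are illustrative:

```python
from transformers import pipeline

# Public SQuAD-finetuned checkpoint as a stand-in for a QA-capable LLM.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

passage = ("Graupel forms when supercooled water droplets freeze onto "
           "falling snowflakes, producing soft, opaque pellets.")
result = qa(question="How does graupel form?", context=passage)
print(result["answer"], round(result["score"], 3))
```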

In ACME Inc.’s case, the enterprise is more interested in solving its own enterprise-specific question answering task, which falls under Closed Domain QA. LLMs such as ChatGPT or Gemini, which have impressive abilities in Open Domain QA, would not help here because those models are not trained on ACME’s proprietary language data. The LLM that ACME’s AI team developed takes the retrieved documents or paragraphs and customers’ questions as input, and provides answers according to the information contained in those documents or paragraphs. Because ACME’s LLM has been trained to answer questions from documents or paragraphs (closed domain QA), it hallucinates far less and produces much less misinformation than general-purpose LLMs such as ChatGPT or Gemini. As you can see in the diagram, ACME’s LLM automates what used to be a manual process requiring many human agents to answer customers’ questions.

Sentiment Analysis

It is important to gauge whether ACME’s customers are happy and satisfied with the services provided. After acquiring customers’ approval to record their interactions with ACME’s customer service platform, AI technologies can be used to assess how well the platform is doing. If a customer uses the voice-driven service platform, all the voice data is converted into text by a speech-to-text AI model. If a customer uses the text-driven service platform, no conversion is necessary. Once the customer interaction text data is obtained, the LLM’s Sentiment Analysis feature can be used to check the satisfaction level of each customer interaction.

Sentiment Analysis refers to the process of automatically identifying the emotional tone of a piece of text, categorizing it as positive, negative, or neutral. It’s a crucial tool in Natural Language Processing (NLP).
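A minimal sketch of this grading step, assuming a public sentiment checkpoint as a stand-in for ACME’s LLM (the transcripts are made-up examples):

```python
from transformers import pipeline

# Public checkpoint as a stand-in for ACME's sentiment-capable LLM.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

transcripts = [
    "The agent resolved my issue in minutes. Fantastic service!",
    "I waited an hour and still got no answer.",
]
for t in transcripts:
    result = classifier(t)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    print(result["label"], round(result["score"], 3), "-", t)
```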

With the Sentiment Analysis capability of ACME’s LLM, the enterprise can obtain invaluable customer satisfaction data by grading millions of customer interactions automatically.

Overview

ACME Inc., located in Santa Clara, California, is in the business of selling products and services to consumers.  By utilizing various AI technologies, it wishes to achieve automation, efficiency, and cost reduction in its operations to deal with vast amounts of language data: documents and voice recordings.

Speech-to-text (speech recognition) AI Model:

Convert all voice recording data to text data using the speech recognition AI model.
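A minimal sketch of this conversion, assuming the open-source openai-whisper package as a stand-in for ACME’s speech recognition model (the audio file name is hypothetical):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")              # stand-in for ACME's ASR model
result = model.transcribe("customer_call.wav")  # hypothetical recording
print(result["text"])                           # transcript fed to the LLM pipeline
```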

Construct Vocabulary for LLM (Tokenization):

Using proprietary enterprise language data statistics, create an enterprise-specific LLM vocabulary to increase the performance of the LLM to be trained.

Train LLM with Enterprise Proprietary Data:

ACME’s LLM is trained on enterprise proprietary data for better performance and less hallucination or misinformation.

Text Summarization:

Summarize documents and paragraphs using the LLM and organize the language data in the enterprise database.

Semantic Similarity:

Automate all customer inquiries via the LLM’s Semantic Similarity feature.

Question Answering:

Automate customer questions with the LLM’s Question Answering feature.

Sentiment Analysis:

Gather millions of customer feedback and satisfaction reports automatically using the LLM’s Sentiment Analysis feature.