Natural Language Processing in the Age of ChatGPT –– Data Science Initiative Talk with Prof. Shira Wein

Following the advent of ChatGPT, natural language processing (NLP) has gained immense public attention and is in high demand. In fact, OpenAI, the company that developed ChatGPT, reached a valuation of $157 billion on October 2, 2024. As a new addition to the faculty at Amherst College, Assistant Professor of Computer Science Shira Wein is ready to bring her knowledge of and research on NLP to our community.

NLP refers to technologies that handle human language via text or speech. There are three main goals:

  1. To analyze or understand language, especially when applied to social science
  2. To aid human-human communication (e.g. translation)
  3. To aid human-machine communication (e.g. speech recognition)

The end applications of NLP vary in complexity, ranging from spell-checks on our phones to the well-known generative AI ChatGPT. However, ChatGPT was not the focus of Wein’s fall talk. “This talk is about language,” clarified Wein. 

Shifting gears, Wein began explaining four of the many linguistic problems that NLP systems must deal with:

Language is ambiguous.

Wein illustrated this problem by playing a video of Demi Lovato answering a journalist’s question:

“What is your favorite dish?”

“I like mugs.”

This aroused laughter among the audience. “Dish” has multiple meanings, but generally, when discussing a “favorite dish,” the intended meaning is a meal. Hence, researchers must develop solutions to address semantic ambiguity. 

In addition to the YouTube video, Wein uses the Groucho Marx example to illustrate that language can have ambiguous meanings.

Language data are sparse.

NLP researchers call datasets of text “corpora.” Some common NLP corpora include EuroParl (which covers 21 European languages), the Penn Treebank (Wall Street Journal data), and RedPajama (100 billion-plus text documents from the Internet). However, no matter how large a corpus is, it will always exclude words that are less commonly used in everyday contexts. This poses a challenge for NLP models, which must account for infrequent and zero-frequency words when making predictions. Additionally, many languages are “low-resource,” such as some Indigenous languages. Even among Indigenous languages, some have significantly less data than others, such as Plains Cree (spoken by only 34,000 people in Canada) compared to Southern Quechua (spoken by 6.9 million people in central Peru). It is harder to develop NLP models for these languages, where extant data are scarce.

Language is variable.

People use languages differently in different contexts. Suppose you train a part-of-speech tagger on the Penn Treebank, which mainly includes data from the Wall Street Journal. While this model could be appropriate in certain contexts, it would not work well for analyzing social media texts that include abbreviations and slang. Thus, choosing the right dataset to train on is essential to the success of your model.

Language is expressive.

The same meaning can be expressed in many ways, for instance by reordering a sentence’s subjects and objects. There may also be an intended meaning underlying the explicit meaning of a sentence. For example, in the late fall in New England, your roommate asks, “Is that window still open?” They may mean, “Please close the window.”

If language is so complicated, how does NLP address these challenges? 

Wein started by giving an overview of progress in NLP research. Over the last 80 years, the field has advanced by leveraging the large quantities of data available on the Internet and by improving algorithms, models, and hardware. From the ’50s to the ’90s, NLP relied on simple rule-based technologies and early statistical methods; one example is ELIZA, a simple rule-based chatbot that explored communication between humans and machines. Since the ’90s, a mix of linguistic features and more advanced statistical techniques has been applied to NLP, in particular with burgeoning interest in the “language model.” One example of a very simple language model is the n-gram language model.

Wein explains the n-gram language model using a chart that shows how it predicts texts’ probabilities.

Using this simple chart, Wein outlined how the n-gram language model works. Given the sentence “She sells seashells by the seashore,” a bigram model predicts the probability of “sells” given “she,” the probability of “seashells” given “sells,” and so on. A trigram model has more context: it predicts the probability of “seashells” given “she sells,” the probability of “by” given “sells seashells,” and so forth. N-gram models have many common applications, including auto text correction on your phone.
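
To make this concrete, here is a minimal Python sketch of how bigram probabilities can be estimated from counts. The helper function and the tiny one-sentence “corpus” are illustrative assumptions, not code from the talk.

```python
# Minimal sketch: estimate bigram probabilities P(w2 | w1) from raw counts.
from collections import Counter

def bigram_probs(tokens):
    # Count how often each word starts a bigram, and how often each adjacent pair occurs.
    unigrams = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "she sells seashells by the seashore".split()
probs = bigram_probs(tokens)
print(probs[("she", "sells")])  # 1.0 -- "sells" always follows "she" in this tiny corpus
```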

Regarding data sparsity, NLP researchers use “smoothing,” which adds “pseudocounts” to every possible n-gram count over the seen vocabulary. Without smoothing, if a single bigram has a probability of zero, then the entire sentence’s probability becomes zero when aggregating by taking the product of the individual probabilities, just because the model has never encountered that bigram before. For example, consider the previous sentence again but with a typo: “She sells seesells by the seashore.” Since “seesells” is misspelled and the model has probably never encountered it, a bigram model without smoothing would assign the probability of “seesells” given “sells” a value of zero. The entire sentence would then appear to make no sense to the model, even though we know there is just a typo. Hence, smoothing makes the model more flexible when it encounters unseen language usage.
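
Here is a minimal sketch of one common variant, add-one (Laplace) smoothing, continuing the same toy sentence; the function name is my own, and treating the typo as just another unseen word is a simplification of how real systems handle out-of-vocabulary items.

```python
# Minimal sketch: add-one (Laplace) smoothing for bigram probabilities.
from collections import Counter

def smoothed_bigram_prob(w1, w2, tokens, k=1):
    vocab = set(tokens)
    unigrams = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    # Pretend every possible bigram over the seen vocabulary occurred k extra times.
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * len(vocab))

tokens = "she sells seashells by the seashore".split()
print(smoothed_bigram_prob("sells", "seashells", tokens))  # ~0.29: a bigram the model has seen
print(smoothed_bigram_prob("sells", "seesells", tokens))   # ~0.14: unseen (the typo), but no longer zero
```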

To address variability, researchers train models on the same kind of data they are interested in predicting.

Regarding semantic expressivity, there are many ways to improve predictions in NLP, such as vector semantics. In vector semantics, words are represented as vectors in a multidimensional semantic space, where words with similar meanings should lie near one another. If we project these multidimensional vectors onto a 2D space, we can easily visualize the relationships between them.

A two-dimensional projection of embeddings for some words and phrases, which shows that words with similar meanings are nearby in space. The colors are added for explanation. Source: https://web.stanford.edu/~jurafsky/slp3/6.pdf
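
As a rough illustration of the idea (not of any specific system from the talk), here is a minimal sketch that represents a few words as made-up three-dimensional vectors and measures their closeness with cosine similarity; real embeddings such as word2vec or GloVe have hundreds of dimensions learned from large corpora.

```python
# Minimal sketch: words as vectors, with cosine similarity as "closeness in meaning."
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional vectors, hand-made so that the two drinks point in similar directions.
toy_embeddings = {
    "coffee": np.array([0.9, 0.1, 0.0]),
    "tea":    np.array([0.8, 0.2, 0.1]),
    "window": np.array([0.0, 0.9, 0.4]),
}
print(cosine(toy_embeddings["coffee"], toy_embeddings["tea"]))     # ~0.98: similar meanings
print(cosine(toy_embeddings["coffee"], toy_embeddings["window"]))  # ~0.10: unrelated words
```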

Neural NLP models have begun to dominate the traditional approaches. For instance, one commonly used neural model is the Seq2seq (a.k.a. Encoder-Decoder) model. The encoder takes an input (e.g. an English sentence) and produces a vector that provides context for the input; the decoder takes in that vector and uses it to produce an output (e.g. a Spanish translation of the original English sentence). This architecture is widely used for translating between languages, among other NLP applications.

Wein uses a diagram to show the mechanism of the Seq2seq model.
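
For readers curious what such a model looks like in code, here is a minimal PyTorch sketch of an encoder-decoder pair built from GRUs; the vocabulary sizes, dimensions, and random toy token IDs are illustrative assumptions, and a real translation system would add tokenization, training, and attention (discussed next).

```python
# Minimal sketch of a Seq2seq (encoder-decoder) model using GRUs in PyTorch.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) -> a single context vector summarizing the input
        _, hidden = self.rnn(self.embed(src_ids))
        return hidden  # shape (1, batch, hidden_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, context):
        # Start from the encoder's context vector and produce scores
        # over the target vocabulary at every output position.
        output, _ = self.rnn(self.embed(tgt_ids), context)
        return self.out(output)

# Toy usage: a 5-token "English" input and a 4-token "Spanish" prefix (random IDs).
encoder, decoder = Encoder(vocab_size=100), Decoder(vocab_size=120)
src = torch.randint(0, 100, (1, 5))
tgt = torch.randint(0, 120, (1, 4))
logits = decoder(tgt, encoder(src))
print(logits.shape)  # torch.Size([1, 4, 120])
```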

More recently, some neural NLP models leverage the “attention” mechanism, which allows a model to focus on the relevant parts of its input while producing each piece of its output. The most recent of these mechanisms is “self-attention,” which is adopted by the transformer architecture in deep learning and allows the model to capture relationships among distant elements in a sequence. This ability enables it to model complex patterns and contexts.
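
Here is a minimal NumPy sketch of the core computation, scaled dot-product self-attention, over a handful of toy token vectors; the dimensions and random projection matrices are assumptions for demonstration, and a transformer stacks many such layers with learned weights and multiple heads.

```python
# Minimal sketch: scaled dot-product self-attention over a short sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project each token into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly each token attends to every other
    weights = softmax(scores, axis=-1)         # each row is a distribution over positions
    return weights @ V                         # every output mixes information from all positions

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, each an 8-dimensional embedding
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (4, 8): one updated vector per token
```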

Then, is NLP solved? 

“No –– just think of examples of when ChatGPT is not working or Siri doesn’t quite understand you,” answered Wein. 

There are still many open problems in NLP, such as:

  • low-resource language translation
  • incorporating physical context to process dialogues
  • modeling language acquisition to understand how humans acquire language
  • computational social science that analyzes text to learn more about society

After giving an overview of NLP research, Wein elaborated on her research focus, which addresses the expression of language. Most of her work is multilingual with a focus on formally expressing the meaning of a text, as well as the meaning of translated documents (e.g. how meaning and form change through the process of translation). As there are so many problems still waiting to be solved, Wein invites aspiring students to take her NLP course in the spring, take part in her research lab, or look into the amazing online resources compiled by her peers.