
Discovering Useful Sentence Representations from Large Pre-Trained Language Models

by Nishant Subramani and Nivedita Suresh on August 20th, 2020


AI/ML Research at Scale AI:

Scale AI is dedicated to accelerating the development of AI applications. As part of this mission, Scale AI’s growing AI/ML Team aims to meaningfully contribute to the AI ecosystem with research and resources that can bring significant and practical benefits to our customers and the broader research community.

Research and experimentation happen throughout Scale. In some cases, our contribution is in areas where the bar has already been set high. Some examples include the open-source datasets we have released with various partners, such as PandaSet, nuScenes, and CoNLL Balanced. As our AI/ML team continues to grow, we aim to contribute in entirely new spaces by publishing cutting-edge research papers. In this blog post, we spotlight a research paper we recently submitted.

The research paper was a collaboration with Nivedita Suresh of Arrive. Arrive links multi-omics and AI cell characterization to discover hidden patterns in data and improve outcomes in precision medicine.

Conditioning Unconditional Language Models to Recover Sentences without Fine-Tuning:

Pretrained language models are used as encoders with widespread success for a wide variety of NLP tasks. These models learn useful representations that can be applied directly to certain tasks. Despite their success as encoders, there is little research today on whether these large pretrained language models can also be used as universal decoders. To be considered "universal," a decoder must have an implicit representation for any sentence, such that it can recover that sentence exactly when conditioned on its representation.

Task-specific decoders are typically trained from scratch for each task. In this research, we propose a method to learn useful representations of sentences using large, pretrained, transformer-based decoders, and investigate whether a universal decoder is even possible. If we had access to such a decoder, we could use it for a wide variety of sequence generation tasks without having to retrain or rebuild the decoder on new data. This could open up a new paradigm for low-resource machine translation, summarization, image captioning, and dialog generation.

The Sentence Space: How Language Models Represent Sentences

Pretrained language models like ELMo, BERT, and GPT-2 have replaced word vectors as off-the-shelf, general-purpose encoders for natural language understanding. These models are trained on large amounts of text data and represent a sentence as the sequence of hidden states produced by the final layer of the model. Representations in this sentence space are sequence-length dependent, which makes comparisons between sentences of differing lengths inequitable and makes it impossible to measure how well an unconditional language model works as a universal decoder.
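To make this concrete, here is a minimal sketch, using the Hugging Face transformers library rather than the paper's code, of what this sentence space looks like for GPT-2: each sentence becomes a sequence of final-layer hidden states whose shape depends on its length in tokens.

```python
# Sketch: in the "sentence space," a sentence is represented by the sequence of
# final-layer hidden states of a fixed pretrained language model, so the shape
# of the representation depends on the sentence's length in tokens.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

sentence = "Scale AI is dedicated to accelerating the development of AI applications."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state

# One hidden state per token: (1, sequence_length, hidden_size), i.e.
# (1, sequence_length, 768) for GPT-2 small -- the length dependence that makes
# comparing sentences of different lengths awkward.
print(hidden_states.shape)
```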

To resolve these issues and make analysis easier, we propose to reparameterize the original sentence space into a lower-dimensional, sentence-length-agnostic vector space. We do this by adding a bias term to the fixed language model and finding the representation that minimizes the cross entropy loss of the sentence. This reparameterization gives us the ability to project sentence tokens to representations (sentence encoding) and to recover sentences from those representations (sentence recovery) via the fixed language model. We develop three representation injection mechanisms and inject biases at three different locations.

We add a bias Z' to the embedding, to the transformer layers, and before the language modeling head in GPT-2. 'SA' refers to self-attention, 'LN' to layer normalization, 'FFN' to a fully-connected layer, and 'LM Head' to the last fully-connected layer.
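As a rough sketch of the embedding-level variant of this idea (not the released code), the snippet below keeps GPT-2 frozen and optimizes a single vector z so that, once projected and added as a bias to the token embeddings, it minimizes the cross entropy of the target sentence. The dimensionality, projection, and optimizer settings here are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: learn a sentence representation z for a frozen GPT-2 by injecting it as
# a bias on the token embeddings and minimizing the sentence's cross entropy.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)  # the language model stays fixed

sentence = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer(sentence, return_tensors="pt").input_ids

d_z = 256                                          # assumed representation size
proj = torch.nn.Linear(d_z, model.config.n_embd)   # maps z into the model's embedding space
z = torch.zeros(d_z, requires_grad=True)           # the sentence representation
optimizer = torch.optim.Adam([z, *proj.parameters()], lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    # Inject the bias: add the projected representation to every token embedding.
    embeds = model.transformer.wte(input_ids) + proj(z)
    loss = model(inputs_embeds=embeds, labels=input_ids).loss  # cross entropy of the sentence
    loss.backward()
    optimizer.step()

# After optimization, greedy decoding from the frozen model with proj(z) injected
# should reproduce the original sentence.
```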

Experiments:

Since we are testing the efficacy of a large pretrained language model as a universal decoder, we measure how well a representation can recover a sentence when injected into a language model. We measure performance across sentences from 4 different genres: books, news articles, Wikipedia, and movie dialogs. We conduct controlled experiments, varying the representation injection mechanism, representation injection location, initialization, and dimensionality of the representation.
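One way to score recovery is corpus-level BLEU between the decoded sentences and their originals. The snippet below sketches that metric with sacrebleu; the "recovered" strings are toy stand-ins for real decoder output.

```python
# Sketch: score recovered sentences against their originals with corpus BLEU.
from sacrebleu import corpus_bleu

def recovery_bleu(recovered, originals):
    """Corpus-level BLEU of decoded sentences against the sentences they encode."""
    return corpus_bleu(recovered, [originals]).score

originals = [
    "He walked into the library and asked for a map.",
    "The committee approved the budget on Tuesday.",
]
recovered = [
    "He walked into the library and asked for a map.",  # exact recovery
    "The committee approved a budget on Tuesday.",       # one-token error
]
print(f"BLEU: {recovery_bleu(recovered, originals):.1f}")
```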

Results:

We find that our representations recover sentences nearly perfectly, even across genres, achieving BLEU scores above 98.

BLEU score performance stratified by genre for different dimensionalities of Z.

In addition, we observe that the intrinsic dimension of the representation space is the model’s latent dimension, which indicates that the language model uses its entire capacity to represent sentences.

Plot of sentence length vs. BLEU score on the dataset.

Additionally, our interpolation experiments reveal that the representation space has some human-understandable meaning. Our learned representations seem to have some synonym awareness. In the first sentence pair example below, the word “tale” transforms to the word “story” and the word “long” transforms to “long-running” when referring to a war. In the second example, we observe some syntactic awareness at the 0.7 mixture level. The syntax of the first sentence is retained with mostly words from the second sentence.

Two linear interpolations between perfectly recovered pairs of representations. Pink indicates token overlap with the first sentence, while blue indicates token overlap with the second sentence.
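A minimal sketch of this interpolation setup (assumed details, not the paper's code): mix two learned representations at several ratios and decode greedily from the frozen model with the mixed bias injected at the embeddings. The random z_a and z_b below are placeholders for representations learned as in the optimization sketch above.

```python
# Sketch: linearly interpolate two sentence representations and decode each mix
# greedily from a frozen GPT-2 with the mixed bias added to the token embeddings.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

d_z = 256                                         # assumed representation size
proj = torch.nn.Linear(d_z, model.config.n_embd)  # same kind of projection as above
z_a, z_b = torch.randn(d_z), torch.randn(d_z)     # stand-ins for learned representations

@torch.no_grad()
def decode_with_bias(z, max_len=30):
    """Greedy decoding with proj(z) added to every token embedding."""
    ids = torch.tensor([[tokenizer.bos_token_id]])
    for _ in range(max_len):
        embeds = model.transformer.wte(ids) + proj(z)
        next_id = model(inputs_embeds=embeds).logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0, 1:])

for alpha in (0.0, 0.3, 0.5, 0.7, 1.0):
    z_mix = (1 - alpha) * z_a + alpha * z_b       # linear interpolation in representation space
    print(f"{alpha:.1f}: {decode_with_bias(z_mix)}")
```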

Broader Impact:

This work informs us that we can discover meaningful, sentence-length agnostic sentence representations, hinting at the possibility of a “universal” decoder. Such a decoder would improve low-resource sequence generation task performance and allow for considerable parameter sharing in memory and data-limited environments.

Conclusion:

For further details on how we set up the experiment, results, and analysis, you can find the full paper on arXiv. If you’re interested in joining our growing machine learning team, you can find all of the open positions on our careers page.
