Hypert: hypernymy-aware BERT with Hearst pattern exploitation for hypernym discovery

Yun, Geonil; Lee, Yongjae; Moon, A-Seong; Lee, Jaesung

doi:10.1186/s40537-023-00818-0

Research
Open access
Published: 12 September 2023

Journal of Big Data volume 10, Article number: 141 (2023)

Abstract

Hypernym discovery is challenging because it aims to find suitable hypernyms for a given hyponym from a predefined hypernym vocabulary. Existing hypernym discovery methods used supervised learning with word embeddings from word2vec. However, word2vec embeddings suffer from low quality for unseen or rare noun phrases because an entire noun phrase is embedded into a single vector. Recently, prompting methods have attempted to find hypernyms using pretrained language models with masked prompts. Although language models alleviate the problem of word embeddings, general-purpose language models are ineffective at capturing hypernym relationships. Considering the hypernym relationship to be a linguistic domain, we introduce Hypert, which is further pretrained using masked language modeling with Hearst pattern sentences. To the best of our knowledge, this is the first attempt at such further pretraining in the hypernym discovery field. We also present a fine-tuning strategy for training Hypert with special input prompts for the hypernym discovery task. The proposed method outperformed the comparison methods and achieved statistically significant results in three subtasks of hypernym discovery. Additionally, we demonstrate the effectiveness of the proposed components through an in-depth analysis. The code is available at: https://github.com/Gun1Yun/Hypert.

Introduction

Hypernymy denotes a hierarchical semantic relationship between an abstract term and its subordinate instances. For example, when asked to list examples of a vehicle, one may readily think of an automobile, a watercraft, and an aircraft. These entities are specific instances within the broader class of vehicle, which makes vehicle the hypernym and the entities mentioned above its hyponyms.

The hypernym relation holds significant importance in natural language processing (NLP). This semantic connection plays a crucial role in diverse NLP tasks, including question answering, ontology construction, textual entailment, and lexicon augmentation [1,2,3]. To facilitate these tasks, a large lexical database, WordNet, was introduced, which delineates semantic relations among words. However, constructing such a resource manually is labor-intensive and time-consuming. Consequently, numerous studies have endeavored to automatically extract hypernym relationships from corpora [4,5,6,7].

Hypernym discovery entails identifying all possible hypernyms for a given query within the vocabulary [8]. Recently, hypernym discovery studies have used word embeddings to capture the meaning of words and the relationships between them. These studies can be further categorized into two classes: word2vec embedding methods [9,10,11,12] and prompting methods [13,14,15,16]. The word2vec embeddings are based on the distributional hypothesis and map the meaning of a word into a vector space [17]. However, each noun phrase is mapped to a single vector, and thus noun phrases that are rare in the corpus can be poorly embedded.

The prompting method presents a potential solution to alleviate this issue, employing pretrained language models (PLMs) with a subword tokenization algorithm [18]. The prompt takes sentences as input, which consist of a query token x and a [MASK] token (e.g., “A/An x is a [MASK]”) [13]. Despite the promise that harnessing PLMs holds for addressing challenges associated with infrequent terms, certain limitations persist. Only one word can be predicted from a prompt with one [MASK] token, even though the gold hypernyms can consist of multiple words. Additionally, PLMs, in their original design, are not inherently geared towards discerning hypernym relationships. Hence, supplementary rounds of pretraining and subsequent fine-tuning are often employed to tailor the PLM to the specific demands of domain-specific tasks [19, 20].
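As a minimal sketch of how such a masked prompt can be assembled, the helper below fills the template quoted above. The function names `build_prompt` and `fits_single_mask` are hypothetical, and choosing the article from the first letter is a simplifying assumption; feeding the prompt to a masked language model is not shown.

```python
def build_prompt(query, mask_token="[MASK]"):
    """Fill the template "A/An x is a [MASK]" for a query term x."""
    # Assumption: pick the article from the first letter only (ignores cases like "hour").
    article = "An" if query[:1].lower() in ("a", "e", "i", "o", "u") else "A"
    return f"{article} {query} is a {mask_token}"


def fits_single_mask(hypernym):
    """Whitespace check: a multi-word gold hypernym cannot fill one mask slot."""
    return len(hypernym.split()) == 1


print(build_prompt("apple"))
print(build_prompt("sports car"))
print(fits_single_mask("motor vehicle"))
```

The whitespace check is only a necessary condition: with subword tokenization, even a single word may be split into several tokens and therefore may not be predictable from one [MASK] slot.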

Within the realm of hypernym discovery, the application of these processes (i.e., further pretraining and fine-tuning) remains unexplored. An inherent shortcoming of existing studies on hypernym discovery is their limited efficacy in capturing the semantics and hypernymic relationships of noun phrases. Approaches grounded in word2vec are vulnerable when confronted with noun phrases because each phrase is collapsed into a single vector. Even though prompting methods offer potential relief, general-purpose PLMs are trained on broad-spectrum sentences and are therefore poorly suited to capturing hypernymic information.

To address the limitations posed by existing approaches, in this work, we propose Hypert, a hypernymy-aware pretrained language model for hypernym discovery that harnesses Hearst pattern sentences. The method we present aligns with established domain adaptation practices for PLMs. To elucidate, the PLM is subjected to additional training using a corpus consisting of sentences based on the Hearst pattern, thereby heightening its sensitivity to hypernymic constructs. The Hearst pattern, though simple, possesses a solid foundation, thereby facilitating the construction of a corpus rich in hypernym relationships [21]. More specifically, we employ the extended Hearst patterns, which embody the core concept of the original Hearst patterns. Then, the augmented pretrained model undergoes fine-tuning on a dedicated dataset tailored for the hypernym discovery task, employing a specialized input prompt. By amplifying the hypernym awareness of language models through supplementary pretraining, our approach envisages an elevation in the efficacy of hypernym discovery, ultimately yielding enhanced performance.

The effectiveness of the proposed method is assessed through a comparison with conventional word2vec-based methods and the prompting approach on the SemEval-2018 Task 9 dataset [8]. The experimental results show that the proposed method significantly outperforms the comparison methods. Furthermore, an in-depth analysis is conducted to evaluate the effectiveness of the individual components of the proposed method. We also present and analyze the distribution of the Hearst patterns used in the corpus. Additionally, our investigation reveals the efficacy of further pretraining, which compares favorably with BERT [22]. To examine the robustness of our method, we conduct a comparative evaluation of prediction outcomes for rare words. Lastly, t-distributed stochastic neighbor embedding (tSNE) plots are presented to visualize the embedding space of the classification token ([CLS]) in the context of hypernymy.

The contributions of this work are summarized as follows:

We propose a further pretraining method that enhances the hypernymy awareness of language models by denoising Hearst pattern sentences from a corpus, together with a special input prompt for fine-tuning. The resulting model achieves the best performance among the comparison methods.
An in-depth analysis is conducted regarding the further pretraining. The results reveal that the proposed further pretraining method improves performance compared to BERT.
The proposed method addresses the low embedding quality of noun phrases that rarely appear, as demonstrated through a case study.

Related work

In traditional hypernym discovery and detection studies, pattern-based methods have been used to identify hypernym-hyponym pairs from a corpus. In the work of [1], the author defined lexico-syntactic patterns, called Hearst patterns, which can automatically filter hypernym-hyponym pairs from large corpora. For instance, a pattern like “y such as x” indicates that y is a hypernym of x. Because such pattern-based syntactic relation extraction is a good starting point, this seminal research has influenced many subsequent studies, which can be roughly divided into distributional similarity-based, ontological knowledge-based, and machine learning-based methods.
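To make the pattern idea concrete, the following minimal sketch matches a single Hearst pattern of the form “y such as x” with a regular expression. The function name `extract_such_as_pairs` and the restriction to single-word terms are illustrative assumptions; practical pattern-based systems operate on noun-phrase chunks and use many more patterns.

```python
import re

# Separators allowed between listed hyponyms: ", ", ", and ", " and ", " or ", ...
SEP = r"(?:, (?:and |or )?| and | or )"

# Illustrative only: one Hearst pattern ("Y such as X1, X2 and X3") over single words.
SUCH_AS = re.compile(rf"(?P<hyper>\w+),? such as (?P<hypos>\w+(?:{SEP}\w+)*)")


def extract_such_as_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper")
        hyponyms = re.split(SEP, match.group("hypos"))
        pairs.extend((hypo, hypernym) for hypo in hyponyms if hypo)
    return pairs


print(extract_such_as_pairs("He owns vehicles such as cars, boats and planes."))
```

In the example sentence, the term before “such as” (vehicles) is the hypernym and each listed term is one of its hyponyms. The sketch also shows why such patterns suffer from sparsity: a pair is found only if it happens to occur in a matching construction.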

Several studies based on distributional similarity are inspired by the distributional inclusion hypothesis [23,24,25], which assumes that the hypernym can substitute for its hyponym. Based on this hypothesis, researchers have proposed unsupervised distributional measures with asymmetric scoring functions. In the work of [26], WeedsPrec, a precision-based similarity measure, was proposed to quantify the weighted inclusion of a narrow term x in a broad term y (i.e., $\left<x\rightarrow {y}\right>$). Additionally, coWeeds [27], the geometric mean of cosine similarity and WeedsPrec, was proposed. In the work of [28], ClarkeDE, a variant of WeedsPrec, was proposed to compute the degree of inclusion. Alternatively, invCL [27] considers both the inclusion of a narrow term x in a broad term y and the non-inclusion of y in x using the ClarkeDE measure. However, methods based on the distributional inclusion hypothesis require specific similarity measures to identify semantic relationships. In addition, noun phrases that contain more than one word may suffer from sparsity, making it challenging to identify hypernym relationships.
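The inclusion measures above can be sketched in a few lines under their commonly cited definitions. The sketch assumes that each term is represented as a dictionary from context features to non-negative weights; the toy feature weights for `cat` and `animal` are invented for illustration and do not come from any corpus.

```python
import math


def weeds_prec(x, y):
    """WeedsPrec(x -> y): share of x's feature weight on features also seen with y."""
    total = sum(x.values())
    shared = sum(w for f, w in x.items() if f in y)
    return shared / total if total else 0.0


def clarke_de(x, y):
    """ClarkeDE(x -> y): like WeedsPrec, but each shared feature counts min(w_x, w_y)."""
    total = sum(x.values())
    overlap = sum(min(w, y[f]) for f, w in x.items() if f in y)
    return overlap / total if total else 0.0


def inv_cl(x, y):
    """invCL(x -> y): x is included in y AND y is not included in x."""
    return math.sqrt(clarke_de(x, y) * (1.0 - clarke_de(y, x)))


# Toy feature weights (hypothetical values).
cat = {"purr": 1.0, "eat": 2.0, "sleep": 1.0}
animal = {"eat": 3.0, "sleep": 2.0, "run": 3.0, "hunt": 2.0}

print(weeds_prec(cat, animal), weeds_prec(animal, cat))
print(inv_cl(cat, animal), inv_cl(animal, cat))
```

With these toy weights, the score in the direction cat → animal is higher than in the reverse direction for both measures, which is the asymmetry that these measures are designed to expose.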

Since the concept of word2vec embedding was introduced [17], researchers have attempted to use it as a word representation in supervised learning. In the work of [9], the authors demonstrated how to learn semantic hierarchies from word embeddings via projection learning. Projection learning achieved remarkable improvement in the Chinese hypernym detection task compared to other pattern-based and distributional methods by using learnable piecewise uniform projection matrices that map queries to various hypernym representations. In addition, RMM [11] repeatedly applies a shared-weight projection matrix to a given query with word2vec embeddings, assuming that hypernyms may come from various levels of the conceptual hierarchy. RMM exploits the attention mechanism [29] and residual connections [30] to capture the corresponding candidate hypernyms. Furthermore, SPON [12] uses word2vec embeddings with a simple neural network to enforce the properties of the hypernym relationship (asymmetry and transitivity) as soft constraints. However, these methods are vulnerable to rare or unseen words because they map noun phrases to a single vector. Moreover, traditional methods are highly inefficient when dealing with out-of-vocabulary terms because they either use random vectors or require retraining the entire word representation from scratch.
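The core idea of projection learning can be illustrated with a small sketch: a matrix maps the query embedding towards the region of its hypernyms, and candidates are then ranked by similarity to the projected vector. Everything below is an assumption for illustration: the two-dimensional embeddings are invented, the projection matrix is set by hand rather than learned, and cosine similarity is used as the score.

```python
import math


def project(matrix, vector):
    """Matrix-vector product: apply the projection to a query embedding."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def rank_hypernyms(query_vec, projection, candidates, top_k=15):
    """Rank candidate hypernyms by similarity to the projected query."""
    projected = project(projection, query_vec)
    scored = sorted(candidates.items(),
                    key=lambda item: cosine(projected, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]


# Toy data (hypothetical): in a real system the projection is learned from
# (query, hypernym) training pairs, not written by hand.
dog = [1.0, 0.0]
swap = [[0.0, 1.0], [1.0, 0.0]]
vocab = {"animal": [0.1, 0.9], "vehicle": [0.9, 0.1], "pet": [0.3, 0.7]}

print(rank_hypernyms(dog, swap, vocab, top_k=2))
```

Because the query is a single vector looked up from a fixed table, a rare or unseen noun phrase has either a poorly estimated vector or none at all, which is the weakness noted above.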

Ontological knowledge also aids hypernym discovery, as many distant supervision approaches rely on existing ontologies. [31] utilized BabelNet [32], a multilingual lexicalized semantic network and ontology, to extract sentences containing terms linked by hypernym relations within BabelNet. The sentences are incorporated into the training data when these terms exhibit hypernymic connections. With this training data, they built classifiers to determine whether a given sentence contains expressions indicative of hypernymic relationships. Similarly, [33] used BabelNet [32] and embedded pairs of its synsets (namely, term-hypernym pairs) into a sense embedding space [34]. For example, Apple and the concept company could form a term-hypernym pair. In that embedding space, the authors learned a hypernym transformation matrix over all term-hypernym pairs and then computed similarity values over the pairs.

The machine learning-based methods try to identify the patterns of hypernym relationships from given data. For example, HyperNET [35] used an LSTM-based network to recognize hypernym syntactic patterns by representing dependency tree paths as sequential data. Despite the many efforts to find hypernyms with syntactic patterns, pattern-based methods have sparsity problems in which hypernym-hyponym pairs that match the pattern are rare in the corpus [10, 36]. Recently, matrix factorization techniques, such as Singular Value Decomposition, have been used to mitigate the sparsity problem of pattern-based methods and showed improved results [21].

With the emergence of the transformer [37] in NLP, many researchers have recently used transformer-based PLMs, such as BERT [22], in various NLP tasks and applications. The transformer [37] is an encoder-decoder network architecture based solely on the attention mechanism, called self-attention, and does not rely on recurrence or convolutions. In addition, BERT [22] is an effective PLM that uses the encoder block of the transformer, can be fine-tuned for a wide range of NLP tasks, and achieved state-of-the-art results on 11 benchmark datasets. Following the success of the transformer and BERT, many BERT variants have been proposed [19, 20, 38,39,40]. For example, BioBERT [20] and FinBERT [19] improved their performance by further pretraining BERT with a domain-specific corpus.

To use a PLM for identifying hypernym relationships, [16] evaluated the ability of BERT through human language experiments called prompting and demonstrated proficient results in hypernym retrieval. Moreover, several studies have used human language experiments to evaluate the linguistic knowledge of PLMs [41,42,43]. For example, singular and plural prompts were used to probe the hypernymy knowledge of BERT [15]. In addition, various prompts, specifically Hearst patterns, natural sentences, and handwritten context, have been used to find hypernyms with BERT [14]. In the work of [13], the authors investigated the performance of general-domain and domain-adapted language models on financial hypernymy pair datasets by prompting masked language models. However, prompting approaches can yield unstable identification because the performance varies depending on the prompt type. In addition, due to the format of the prompt, only a single token from the vocabulary of the language model can be predicted.

Lastly, a group of studies hybridizes multiple approaches to maximize identification performance. CRIM [10] is a notable hybrid approach that combines pattern-based and projection learning methods. The supervised learning part of CRIM uses projection learning with multiple parallel projection matrices. The pattern-based part of CRIM uses Hearst patterns to assign weights to word2vec embeddings. Then, the cosine similarity is used as a score between two words. Like CRIM, [44] proposed a hybrid approach to discover hypernym relations using a pattern-based model and a distributional model. Their model begins by finding seed hypernyms using extended Hearst patterns and then adds the hypernyms of the nearest neighbors.

Inspired by the effectiveness and generality of PLMs, this study aims to find hypernyms using a language model further pretrained by masked language modeling (MLM) on Hearst pattern sentences. We employ a specially formatted input sentence consisting of noun phrases and special tokens. Moreover, projection learning is adopted to capture semantic relationships between noun-phrase embeddings.
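To illustrate what masked language modeling on a Hearst pattern sentence involves, the sketch below corrupts a tokenized sentence and records the original tokens as prediction targets. It follows the standard BERT masking recipe (select about 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged), which is an assumption here and not necessarily the exact configuration used for Hypert. Whitespace tokenization and the function name `mask_tokens` are likewise simplifications.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Return (corrupted inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)   # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)          # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


sentence = "vehicles such as cars , boats and planes".split()
inputs, labels = mask_tokens(sentence, vocab=sentence, mask_prob=0.5, seed=1)
print(inputs)
print(labels)
```

When a masked position falls on a hypernym or hyponym in such a sentence, the model has to recover it from the pattern context, which is the intuition behind using Hearst pattern sentences for further pretraining.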

Proposed method

This section introduces the proposed hypernym discovery system with Hypert. We illustrate the architecture of the proposed hypernym discovery system in Fig. 1. Data preparation (left) presents the dataset and generation of the input data for the model. Model training (middle) represents the model training and inference for Hypert. Prediction (right) sorts the model output, taking the top 15 candidates.

Data preparation

Table 1 Examples from the SemEval2018 Task9-hypernym discovery dataset
