Hypert: hypernymy-aware BERT with Hearst pattern exploitation for hypernym discovery


Hypernym discovery is challenging because it aims to find suitable instances for a given hyponym from a predefined hypernym vocabulary. Existing hypernym discovery methods used supervised learning with word embeddings from word2vec. However, word2vec embeddings suffer from low embedding quality for unseen or rare noun phrases because entire noun phrases are embedded into a single vector. Recently, prompting methods have attempted to find hypernyms using pretrained language models with masked prompts. Although language models alleviate the problem of word2vec embeddings, general-purpose language models are ineffective for capturing hypernym relationships. Considering the hypernym relationship to be a linguistic domain, we introduce Hypert, which is further pretrained using masked language modeling with Hearst pattern sentences. To the best of our knowledge, this is the first attempt in the hypernym relationship discovery field. We also present a fine-tuning strategy for training Hypert with special input prompts for the hypernym discovery task. The proposed method outperformed the comparison methods and achieved statistically significant results in three subtasks of hypernym discovery. Additionally, we demonstrate the effectiveness of the several proposed components through an in-depth analysis. The code is available at:


Hypernymy denotes a semantic relationship characterized by a hierarchical connection between an abstract term and subordinate instances. To illustrate, when presented with a directive to enumerate various exemplars of vehicle, one may readily evoke representations such as an automobile, a watercraft, and an aircraft. In this context, these entities materialize as specific manifestations falling within the broader classification of vehicle, thereby designating vehicle as the hypernym and the entities mentioned above as their respective hyponyms.

The hypernym relation holds significant importance in natural language processing (NLP). This salient semantic connection assumes a crucial role in diverse NLP tasks, including question answering, ontology construction, textual entailment, and lexicon augmentation [1,2,3]. To facilitate these tasks, a large lexical database, WordNet, was introduced, which delineates semantic relations among words. However, constructing such a resource manually is labor-intensive and time-consuming. Consequently, numerous studies have endeavored to automatically extract hypernym relationships from corpora [4,5,6,7].

Hypernym discovery entails identifying all possible instances of hypernyms for a given query within the vocabulary [8]. Recently, hypernym discovery studies have used word embeddings to capture the meaning and relationship between words. These studies can be further categorized into two classes: word2vec embedding methods [9,10,11,12] and prompting methods [13,14,15,16]. The word2vec embeddings are based on the distributional hypothesis and map the meaning of a word into a vector space [17]. However, each noun phrase is mapped to a single vector, and thus noun phrases that are rare in the corpus can be poorly embedded.

The prompting method presents a potential solution to alleviate this issue, employing pretrained language models (PLMs) with a subword tokenization algorithm [18]. The prompt takes sentences as input, which consist of a query token x and a [MASK] token (e.g., “A/An x is a [MASK]”) [13]. Despite the promise that harnessing PLMs holds for addressing challenges associated with infrequent terms, certain limitations persist. Only one word can be predicted from the prompt using one [MASK] token, even though the gold hypernyms can be multiple words. Additionally, it is essential to acknowledge that PLMs, in their original design, are not inherently geared towards discerning hypernym relationships. Hence, customary procedures involving supplementary rounds of pretraining and subsequent fine-tuning are often employed to tailor the PLM to the specific demands of domain-specific tasks [19, 20].

Within the realm of hypernym discovery, the application of these processes (i.e., pretraining and fine-tuning) remains unexplored. An inherent shortcoming of existing studies concerning hypernym discovery is their limited efficacy in deciphering the semantics and hypernymic relationships inherent in noun phrases. Approaches grounded in word2vec suffer vulnerability when confronted with noun phrases due to their amalgamation into singular vectors. Even though prompting methodologies offer potential relief, the all-encompassing nature of general-purpose PLMs renders them unsuited for comprehending hypernymic information, given their training on broad-spectrum sentences.

To address the limitations posed by existing approaches, in this work, we propose Hypert, a hypernymy-aware pretrained language model for hypernym discovery that harnesses Hearst pattern sentences. The method we present aligns with established domain adaptation practices for PLMs. To elucidate, the PLM is subjected to additional training using a corpus consisting of sentences based on the Hearst pattern, thereby heightening its sensitivity to hypernymic constructs. The Hearst pattern, though simple, possesses a solid foundation, thereby facilitating the construction of a corpus rich in hypernym relationships [21]. More specifically, we employ the extended Hearst patterns, which embody the core concept of the original Hearst patterns. Then, the augmented pretrained model undergoes fine-tuning on a dedicated dataset tailored for the hypernym discovery task, employing a specialized input prompt. By amplifying the hypernym awareness of language models through supplementary pretraining, our approach envisages an elevation in the efficacy of hypernym discovery, ultimately yielding enhanced performance.

The effectiveness of the proposed method is assessed through a comparison with conventional word2vec-based methods and the prompting approach, utilizing the SemEval-2018 task9 dataset [8]. The experimental results show that the proposed method significantly outperforms the comparison methods. Furthermore, an in-depth analysis is conducted to evaluate the effectiveness of individual components of the proposed method. We also present and analyze the distribution of the Hearst patterns utilized in the corpus. Additionally, our investigation reveals the efficacy of further pretraining, which compares favorably with BERT [22]. To assess the robustness of our method, we conduct a comparative evaluation of prediction outcomes for rare words. Lastly, t-distributed stochastic neighbor embedding (t-SNE) plots are presented to visualize the embedding space of the classification token ([CLS]) in the context of hypernymy.

The contributions of this work are summarized as follows:

  • We propose a further pretraining method to enhance the hypernymy awareness of language models by denoising Hearst pattern sentences from a corpus and using a special input prompt for fine-tuning to achieve the best performance among the comparison methods.

  • An in-depth analysis is conducted regarding the further pretraining. The results reveal that the proposed further pretraining method improves performance compared to BERT.

  • The proposed method solves the low embedding quality of noun phrases that rarely appear and demonstrates this through the case study.

Related work

In traditional hypernym discovery and detection studies, pattern-based methods have been used to identify hypernym-hyponym pairs from a corpus. In the work of [1], the author defined lexico-syntactic patterns, called Hearst patterns, which can automatically filter hypernym-hyponym pairs from large corpora. For instance, a pattern like “y such as x” indicates that y is a hypernym of x. Because such pattern-based syntactic relation extraction is a good starting point, this seminal research has influenced many subsequent studies, which can be roughly divided into distributional similarity-based, ontological knowledge-based, and machine learning-based methods.

Several studies based on distributional similarity are inspired by the distributional inclusion hypothesis [23,24,25], which assumes that the hypernym can substitute for its hyponym. Based on this hypothesis, researchers have proposed unsupervised distributional measures with asymmetric scoring functions. In the work of [26], WeedsPrec, a precision-based similarity measure was proposed to quantify the weighted inclusion of a narrow term x to a broad term y (i.e., \(\left<x\rightarrow {y}\right>\)). Additionally, coWeeds [27], the geometric average of cosine similarity with WeedsPrec, was proposed. In the work of [28], ClarkeDE, a variant of WeedsPrec, was proposed to compute the degree of inclusion. Alternatively, invCL [27] considered the inclusion of a narrow term x to a broad term y and the exclusion of y to x using the ClarkeDE measure. However, distributional inclusion hypothesis methods require specific similarity measures to identify semantic relationships. In addition, there may also be a sparsity problem for noun phrases that contain more than one word, making it challenging to identify hypernym relationships.

Since the concept of word2vec embedding was introduced [17], researchers have attempted to use it as a word representation in supervised learning. In the work of [9], the authors demonstrated how to learn semantic hierarchies from word embeddings via projection learning. Projection learning achieved remarkable improvement in the Chinese hypernym detection task compared to other pattern-based and distributed methods using learnable piecewise uniform projection matrices that map queries to various hypernym representations. In addition, RMM [11] repeatedly uses a shared weight projection matrix for a given query with word2vec embeddings, assuming that hypernyms may come from various conceptual hierarchy levels. RMM exploits the attention mechanism [29] and residual connection [30] to capture corresponding candidate hypernyms. Furthermore, SPON [12] uses word2vec embeddings with a simple neural network to enforce hypernym relationship properties, asymmetry, and transitivity, as a soft constraint. However, these methods are vulnerable to rare or unseen words because they map noun phrases to a single vector. Moreover, traditional methods are highly inefficient when dealing with out-of-vocabulary terms because they either use random vectors or require retraining the entire word representation from scratch.

Ontological knowledge also aids hypernym discovery, as many distant supervision approaches rely on existing ontologies. [31] utilized BabelNet [32], a multilingual lexicalized semantic network and ontology, to extract sentences containing terms linked by hypernym relations within BabelNet. The sentences are incorporated into the training data when these terms exhibit hypernymic connections. With this training data, they built classifiers to determine whether a given sentence contains expressions indicative of hypernymic relationships. Similarly, [33] used BabelNet [32] and embedded pairs of its synsets (namely, term-hypernym pairs) into a sense embedding space [34]. For example, Apple and the concept company could form a term-hypernym pair. In this embedding space, the authors learned a hypernym transformation matrix over all term-hypernym pairs and then computed similarity values over the pairs.

The machine learning-based methods try to identify the patterns of hypernym relationships from given data. For example, HyperNET [35] used an LSTM-based network to recognize hypernym syntactic patterns by representing dependency tree paths as sequential data. Despite the many efforts to find hypernyms with syntactic patterns, pattern-based methods have sparsity problems in which hypernym-hyponym pairs that match the pattern are rare in the corpus [10, 36]. Recently, matrix factorization techniques, such as Singular Value Decomposition, have been used to mitigate the sparsity problem of pattern-based methods and showed improved results [21].

With the emergence of the transformer [37] in NLP, many researchers have recently used transformer-based PLMs, such as BERT [22], in various NLP tasks and applications. The transformer [37] is an encoder-decoder network architecture based solely on the attention mechanism, called self-attention, and does not rely on recurrence or convolutions. In addition, BERT [22] is an effective PLM that can be fine-tuned for a wide range of NLP tasks using the encoder block of the transformer and has achieved state-of-the-art results on 11 benchmark datasets. Following the success of the transformer and BERT, many BERT variants have been proposed [19, 20, 38,39,40]. For example, BioBERT [20] and FinBERT [19] improved their performance by further pretraining BERT with a domain-specific corpus.

To use a PLM for identifying hypernym relationships, [16] evaluated the ability of BERT through human language experiments called prompting and demonstrated proficient results in hypernym retrieval. Moreover, several studies have used human language experiments to evaluate the linguistic knowledge of PLM [41,42,43]. For example, singular and plural prompts were used to probe the hypernymy knowledge of BERT [15]. In addition, various prompts, specifically Hearst patterns, natural sentences, and handwritten context, have been used to find hypernyms with BERT [14]. In the work of [13], the authors investigated the performance of general-domain and domain-adapted language models on financial hypernymy pair datasets using prompting masked language models. However, the prompting approaches can incur unstable identification because the performance varies depending on the prompt type. In addition, due to the format of the prompt, only one token within the vocabulary of the language model used can be predicted.

Lastly, a group of studies hybridizes multiple approaches to maximize identification performance. CRIM [10] is a notable hybrid approach that combines pattern-based and projection learning methods. The supervised learning part of CRIM uses projection learning with multiple parallel projection matrices. The pattern-based part of CRIM uses Hearst patterns to assign weight to word2vec embeddings. Then, the cosine similarity is used as a score between two words. Like CRIM, [44] proposed a hybrid approach to discover hypernym relations using a pattern-based and distributional model. Their model begins with finding seed hypernyms using extended Hearst patterns, then adds the hypernyms of the nearest neighbor.

Inspired by the effectiveness and generality of PLM, this study aims to find hypernyms using a language model further pretrained by masked language modeling (MLM) using Hearst pattern sentences. We employ a specially formatted input sentence consisting of noun phrases and special tokens. Moreover, projection learning is adopted to capture semantic relationships between noun-phrase embeddings.

Fig. 1

Architecture of the hypernym discovery system in this study. (Left) Data preparation used for SemEval-2018 task9 to generate input data. (Middle) Model training applied the Hypert of each subtask to initialize the decision model, adopting projection learning. (Right) Prediction sorts the top 15 output results

Proposed method

This section introduces the proposed hypernym discovery system with Hypert. We illustrate the architecture of the proposed hypernym discovery system in Fig. 1. Data preparation (left) presents the dataset and generation of the input data for the model. Model training (middle) represents the model training and inference for Hypert. Prediction (right) sorts the model output, taking the top 15 candidates.

Data preparation

Table 1 Examples from the SemEval2018 Task9-hypernym discovery dataset
Table 2 Dataset statistics for each subtask

This study employs the SemEval-2018 Task9 (Hypernym Discovery) [8] dataset for the benchmark. The dataset contains five subtasks: three subtasks for general purposes for three languages (English, Spanish, and Italian) and two domain-specific subtasks in English (Medical and Music). We considered three of these subtasks for the experiments: 1A (English), 2A (Medical), and 2B (Music). Table 1 presents the examples of the query-hypernym pair dataset. The queries and candidate hypernyms in the hypernym discovery task can be one-word, two-word, or three-word noun phrases. Each query can have up to 15 gold hypernyms. In addition, the query is given with a noun phrase and type of query, which is either a concept or entity.

Table 2 lists the statistics of each subtask. Each subtask comprises a corpus, vocabulary, and the training, validation, and testing sets. The corpus was used to train word embeddings. The vocabulary includes noun phrases that can be target hypernyms. The datasets contain query-gold hypernym pairs, where the number of train, the number of valid, and the number of test are the number of queries for training, validation, and testing, respectively, and the number of vocabulary represents the number of candidate hypernyms. More dataset details can be found in the work of [8].

Fig. 2

Generation of \(M_Q\) and \(M_C\) for \(S_{\text {P}}\). \(M_Q\) is 1 for query tokens and 0 for all other tokens; \(M_C\) is 1 for hypernym tokens and 0 for all other tokens

We used extended Hearst patterns to build a hypernym-related corpus from the given corpus (Footnote 1). In total, 47 patterns were used to extract hypernym-related sentences. Specifically, the numbers of extracted sentences for 1A, 2A, and 2B are 5M, 137K, and 153K, respectively, which correspond to 4%, 4%, and 3% of the total sentences for each subtask. Although there may be more efficient patterns, the patterns exploited in this study are sufficient to build a training corpus and achieve the best performance among the comparison methods.

We used a special input prompt sentence \(S_{\text {P}}\) as an input sentence of the PLM:

$$\begin{aligned} S_{\text {P}} = ``{} \texttt {[CLS]} \texttt {[TYPE]} \text { Q } \texttt {[SEP]} \text { C}'', \end{aligned}$$

where Q and C are the query and candidate hypernym terms, respectively. The [CLS] token is a special token that is always the first token of every input sequence. This special token is a classification token used as the aggregate sequence representation. The [SEP] token is also a special token used to separate sentences. In this study, we used the [CLS] token embedding to represent the hypernym relationship and the [SEP] token to separate the query and candidate hypernym terms. In the hypernym discovery task, the query is also given with its type: concept or entity. For instance, the query “fuse” is a concept, and the query “Louis Armstrong” is an entity. To provide type information for the input query, we added the [CON] and [ENT] special tokens, referring to the concept and entity types. Thus, the [TYPE] token of \(S_{\text {P}}\) is [CON] when the query type is a concept and [ENT] when the query type is an entity.
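
As a minimal sketch of how the prompt \(S_{\text {P}}\) could be assembled as a plain string, consider the helper below. The function name make_prompt follows the name used for Algorithm 1, but its signature, the type labels "concept" and "entity", and the candidate terms are assumptions made for illustration and are not taken from the released code.

```python
# Hypothetical helper: signature and type labels are illustrative assumptions.
TYPE_TOKENS = {"concept": "[CON]", "entity": "[ENT]"}


def make_prompt(query: str, candidate: str, query_type: str) -> str:
    """Return the string form of S_P = "[CLS] [TYPE] Q [SEP] C"."""
    if query_type not in TYPE_TOKENS:
        raise ValueError(f"unknown query type: {query_type!r}")
    return f"[CLS] {TYPE_TOKENS[query_type]} {query} [SEP] {candidate}"


# The candidate terms below are made up for illustration.
print(make_prompt("fuse", "device", "concept"))
print(make_prompt("Louis Armstrong", "musician", "entity"))
```

In practice, a tokenizer may insert some special tokens itself, so the string form above is only meant to make the layout of \(S_{\text {P}}\) concrete.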

However, both input terms, the query and candidate hypernym, were split into multiple tokens because of subword tokenization. Thus, the number of tokens of each term varies for each \(S_\text {P}\). The span of the subword tokens that correspond to each term must be identified to obtain word embeddings. To achieve this, we generated masking vectors \(M_Q, M_C\in \mathbb {R}^{l\times 1}\) as span information vectors, where l is the length of \(S_\text {P}\). Figure 2 illustrates the generated \(M_Q\) and \(M_C\) vectors for the given \(S_{\text {P}}\). Both are binary vectors consisting of 0 and 1: \(M_Q\) is the query token span vector, and \(M_C\) is the candidate hypernym token span vector. We set each masking vector to 1 within its token span and 0 elsewhere. For instance, \(M_Q\) was set to 1 for the query token span and 0 for the others.
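
The span vectors can be sketched as follows, assuming the query and candidate have already been split into subword pieces. The function name make_masking_vectors follows the name used for Algorithm 1; the argument layout and the subword splits in the example are hypothetical, and the sketch assumes the token order "[CLS] [TYPE] Q [SEP] C" with no trailing special token.

```python
from typing import List, Tuple


def make_masking_vectors(
    type_token: str, query_pieces: List[str], candidate_pieces: List[str]
) -> Tuple[List[str], List[int], List[int]]:
    """Return the token sequence of S_P with its span vectors M_Q and M_C."""
    tokens = ["[CLS]", type_token] + query_pieces + ["[SEP]"] + candidate_pieces
    q_start = 2                              # after [CLS] and [TYPE]
    q_end = q_start + len(query_pieces)      # exclusive end of the query span
    c_start = q_end + 1                      # skip [SEP]
    m_q = [1 if q_start <= i < q_end else 0 for i in range(len(tokens))]
    m_c = [1 if i >= c_start else 0 for i in range(len(tokens))]
    return tokens, m_q, m_c


# Hypothetical subword splits, chosen only to show spans of different lengths.
tokens, m_q, m_c = make_masking_vectors("[CON]", ["fu", "##se"], ["device"])
print(tokens)
print(m_q)  # [0, 0, 1, 1, 0, 0]
print(m_c)  # [0, 0, 0, 0, 0, 1]
```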

Hypert: hypernymy-aware BERT

Fig. 3

Overview of the further pretraining method. (Left) Sentence extraction used extended Hearst patterns for sentence retrieval. (Right) Masked language modeling exploits each extracted subtask corpus and creates the Hypert for each subtask

We introduce Hypert, a PLM for the hypernym relationship. The overall process of further pretraining is illustrated in Fig. 3. Sentence extraction (left) indicates the pattern retrieval process to build a pretraining corpus. In addition, MLM (right) uses extracted sentences and generates Hypert, the hypernymy-aware BERT, for each subtask.

We hypothesize that the language models can learn hypernym relation knowledge from specific sentences in this study. As with BioBERT [20], FinBERT [19], and DarkBERT [45], domain-specific tasks are significantly improved with a further pretraining using the domain corpus. By considering the hypernym relationship to be a specific domain, the language model can be further pretrained on the hypernymy-related corpus to improve hypernym relationship awareness. To achieve this, sentences representing a hypernym relation are required to construct a hypernymy-related domain corpus. The Hearst pattern is devised to detect hypernym-hyponym pairs from the corpus [1]. For example, if the sentence “mammal such as dog” matches “y such as x,” one of the Hearst patterns, then (dog, mammal) can be extracted as a hypernym relationship. We exploited the Hearst pattern to identify sentences that contain hypernym relationships. Sentences matching the Hearst patterns were extracted to build the hypernym-related corpus. When extracting sentences, only the part matching the pattern in the sentence was extracted, not the entire sentence containing the pattern.
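
To make the extraction step concrete, the following sketch matches a single pattern, “y such as x”, and keeps only the matched part of the sentence. It is a simplification under stated assumptions: the regular expression handles one-word terms only, whereas the study uses 47 extended Hearst patterns whose exact form is not reproduced here, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Simplified stand-in for one Hearst pattern ("y such as x"), one-word terms only.
SUCH_AS = re.compile(r"\b(?P<hyper>\w+) such as (?P<hypo>\w+)")


def extract_such_as(sentence: str) -> List[Tuple[str, Tuple[str, str]]]:
    """Return (matched text, (hyponym, hypernym)) for every pattern match."""
    return [
        (m.group(0), (m.group("hypo"), m.group("hyper")))
        for m in SUCH_AS.finditer(sentence)
    ]


print(extract_such_as("A mammal such as dog can be trained."))
# [('mammal such as dog', ('dog', 'mammal'))]
```

Only the matched text (the first element of each result) would go into the pretraining corpus; the pair is returned here just to show which term plays which role.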

Similar to BioBERT and FinBERT, Hypert is initialized with BERT, a pretrained model consisting of transformer encoder layers, before further pretraining [22]. We also employed the BERT tokenizer and added the special tokens [CON] and [ENT]. Subword tokenization in BERT splits words into multiple subtokens defined in the vocabulary pool of BERT, allowing rare or unseen words to be represented with proper subtoken embeddings. However, this advantage leads to the need for \(M_Q\) and \(M_C\), as mentioned previously. Meanwhile, further pretraining on a corpus related to hypernymy allows BERT to gain a multiple-perspective understanding of the critical information regarding the hypernym relationship between input query tokens and candidate tokens. While MLM effectively improves contextual hypernymy understanding, next sentence prediction is irrelevant to this task as we are unconcerned with the relationship between two consecutive sentences. Therefore, the pretrained model initialized with BERT\(_{\text {base}}\) is further trained using the constructed corpus from above without the next sentence prediction objective. Additionally, Hypert is generated separately for each constructed subtask corpus. In other words, there are three Hypert models, one each for 1A (Footnote 2), 2A (Footnote 3), and 2B (Footnote 4).

Fine-tuning and prediction

We present a fine-tuning and prediction process using Hypert. The output of Hypert \(H\in \mathbb {R}^{l\times d_{\text {model}}}\) is obtained by \(f(S_\text {P})\), where f is the proposed Hypert. The length of the input sentence tokens is l, and \(d_{\text {model}}\) is 768, the dimension of the BERT\(_{\text {base}}\) model:

$$\begin{aligned} H = f(S_{\text {P}}). \end{aligned}$$

The embedding of each term is computed by averaging the token embedding, which can be obtained by multiplying H and each span information vector described above, divided by the sum of masking vectors as follows:

$$\begin{aligned} \hat{e}_q = \frac{H^\top \cdot M_Q}{\sum _i^l{M_{Q_i}}} \end{aligned}$$


$$\begin{aligned} \hat{e}_c = \frac{H^\top \cdot M_C}{\sum _i^l{M_{C_i}}}, \end{aligned}$$

where \(\hat{e}_q, \hat{e}_c \in \mathbb {R}^{d_{\text {model}}\times 1}\) are embeddings of query and candidate hypernyms. In addition, the embedding of the [CLS] token \(\hat{e}_{\texttt {[CLS]}} \in \mathbb {R}^{d_{\text {model}}\times 1}\) is obtained by taking the first index of the final hidden state of Hypert H. An affine transformation is applied to reduce the dimensions of each embedding. Thus, \(e_{\texttt {[CLS]}}\) can be defined as follows:

$$\begin{aligned} e_\texttt {[CLS]} = W_{\texttt {[CLS]}}\cdot \hat{e}_{\texttt {[CLS]}}+b_{\texttt {[CLS]}} . \end{aligned}$$

The query embedding \(e_{q}\) can be given as follows:

$$\begin{aligned} e_q = W_Q\cdot \hat{e}_q + b_Q , \end{aligned}$$

and the candidate hypernym embedding is defined as follows:

$$\begin{aligned} e_c = W_C\cdot \hat{e}_c + b_C. \end{aligned}$$

The dimensions of the query embedding \(\hat{e}_q\) and candidate hypernym embedding \(\hat{e}_c\) are reduced to d with \(W_Q, W_C \in \mathbb {R}^{d\times d_{\text {model}}}\), and \(b_Q, b_C\in \mathbb {R}^{d\times 1}\). The embedding of the [CLS] token \(\hat{e}_{\texttt {[CLS]}}\) is used in the last layer, so its dimension is reduced to the number of projection matrices k with \(W_{\texttt {[CLS]}}\in \mathbb {R}^{k\times d_{\text {model}}}\), and \(b_{\texttt {[CLS]}}\in \mathbb {R}^{k\times 1}\). All W and b are learnable weights and biases.
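
The span averaging and the affine reduction can be illustrated with a small pure-Python sketch. The helper names span_mean and affine are hypothetical, and the toy sizes (l = 4, \(d_{\text {model}}\) = 2, d = 1) and weight values are chosen only so that the arithmetic can be checked by hand.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def span_mean(H: Matrix, mask: List[int]) -> Vector:
    """(H^T . M) / sum(M): average the rows of H selected by the 0/1 mask."""
    n = sum(mask)
    if n == 0:
        raise ValueError("mask selects no tokens")
    d_model = len(H[0])
    return [sum(H[i][j] * mask[i] for i in range(len(H))) / n for j in range(d_model)]


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """W . x + b, mapping x to a vector with len(W) entries."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


# Toy hidden states: 4 tokens, 2 dimensions each.
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
M_Q = [0, 1, 1, 0]                    # query span covers tokens 1 and 2
e_hat_q = span_mean(H, M_Q)           # [4.0, 5.0]
e_q = affine([[1.0, -1.0]], e_hat_q, [0.5])   # [-0.5]
e_hat_cls = H[0]                      # [CLS] embedding is the first row of H
print(e_hat_q, e_q, e_hat_cls)
```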

Previous studies have used projection learning for the supervised approach [9,10,11]. In this study, we adopted the projection learning method, using projection matrices \(\Phi\) to capture the relationship between the query and candidate hypernym embeddings produced by Hypert. The projection matrix was created by applying a normal distribution \(\mathcal {N}(0,\,1/d)\) as noise to the identity matrix as follows:

$$\begin{aligned} \Phi _i = I + \epsilon _i,\, \epsilon _i \sim \mathcal {N}(0,\, 1/d)\, , \end{aligned}$$

where I denotes an identity matrix, and \(\epsilon _i\in \mathbb {R}^{d\times d}\) represents the ith noise term sampled from a normal distribution. Each projection matrix \(\Phi _i\) was generated by adding the individual noise \(\epsilon _i\) to I.
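
A possible initialization of the k projection matrices is sketched below. The text does not say whether 1/d in \(\mathcal {N}(0,\,1/d)\) denotes the variance or the standard deviation; this sketch assumes the variance, so the standard deviation is \(\sqrt{1/d}\). The function name and the noise_std override are hypothetical.

```python
import random
from typing import List, Optional

Matrix = List[List[float]]


def init_projections(
    k: int, d: int, noise_std: Optional[float] = None, seed: int = 0
) -> List[Matrix]:
    """Return k matrices Phi_i = I + eps_i with i.i.d. Gaussian noise entries."""
    rng = random.Random(seed)
    # Assumption: 1/d is the variance of the noise, so std = sqrt(1/d).
    std = (1.0 / d) ** 0.5 if noise_std is None else noise_std
    return [
        [
            [(1.0 if r == c else 0.0) + rng.gauss(0.0, std) for c in range(d)]
            for r in range(d)
        ]
        for _ in range(k)
    ]


phis = init_projections(k=24, d=4)   # k = 24 as in the paper; d kept small here
print(len(phis), len(phis[0]), len(phis[0][0]))
```

Starting each matrix near the identity means that, before training, every projection leaves the query embedding almost unchanged, while the noise makes the k projections differ from one another.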

Then, the query embedding \(e_q\) was multiplied by each of the k square projection matrices \(\Phi _i \in \mathbb {R}^{d\times d}\) to obtain the projected vectors \(P_i\), where \(i\in \{1,..., k\}\). Each \(P_i\) is defined as

$$\begin{aligned} P_i = (\Phi _i \cdot e_q)^\top . \end{aligned}$$

Finally, the score vector \(s\in \mathbb {R}^{k\times 1}\) was computed using \(P\in \mathbb {R}^{k\times d}\), whose ith row is \(P_i\), and the candidate hypernym embedding \(e_c\) as follows:

$$\begin{aligned} s = P \cdot e_c . \end{aligned}$$
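
As a small worked sketch of the scoring step, the function below computes one score per projection matrix, \(s_i = (\Phi _i \cdot e_q)^\top \cdot e_c\). The function name and the toy matrices are illustrative assumptions; with the identity matrix the score reduces to the plain dot product of \(e_q\) and \(e_c\).

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def projection_scores(phis: List[Matrix], e_q: Vector, e_c: Vector) -> Vector:
    """Return s with s[i] = (Phi_i . e_q)^T . e_c for each projection matrix."""
    scores = []
    for phi in phis:
        p_i = [sum(w * v for w, v in zip(row, e_q)) for row in phi]  # Phi_i . e_q
        scores.append(sum(p * c for p, c in zip(p_i, e_c)))          # P_i . e_c
    return scores


identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
print(projection_scores([identity, swap], [1.0, 2.0], [3.0, 4.0]))  # [11.0, 10.0]
```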

The embedding of the [CLS] token was used for relation representation. To achieve this, \(e_{\texttt {[CLS]}}\) and s were concatenated to \(F\in \mathbb {R}^{2k\times 1}\) as follows:

$$\begin{aligned} F = [\ e_{\texttt {[CLS]}}\ ; \ s\ ]. \end{aligned}$$

The input prompt includes the query and a candidate hypernym; thus, the output of the proposed model is the probability that a hypernym relationship holds between them. Accordingly, the final layer is a feedforward network with a sigmoid activation function that outputs a value in [0, 1] as follows:

$$\begin{aligned} o = \sigma (W_o\cdot F + b_o), \end{aligned}$$

where o is the probability of a hypernym relationship, \(W_o\) and \(b_o\) represent learnable parameters, and \(\sigma\) denotes a sigmoid function.
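
The concatenation and output layer can be sketched as follows. The sketch assumes that \(W_o\) is a single row of length 2k and that \(b_o\) is a scalar, which matches a scalar output o but is not stated explicitly in the text; the function names and numeric values are illustrative.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def output_probability(
    e_cls: List[float], s: List[float], w_o: List[float], b_o: float
) -> float:
    """o = sigmoid(W_o . F + b_o) with F = [e_cls ; s]."""
    F = list(e_cls) + list(s)            # concatenation, length 2k
    if len(w_o) != len(F):
        raise ValueError("w_o must have one weight per entry of F")
    return sigmoid(sum(w * f for w, f in zip(w_o, F)) + b_o)


# Toy example with k = 2.
print(output_probability([0.0, 0.0], [0.0, 0.0], [1.0, 1.0, 1.0, 1.0], 0.0))  # 0.5
```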

Algorithm 1 Inference procedure of the proposed method

However, hypernym discovery aims to retrieve suitable hypernyms from a given predefined vocabulary. Therefore, as many input prompts as the number of all words in the predefined vocabulary are generated and calculated for one query. The number of o for each query equals the number of words in the predefined vocabulary. Each query has a maximum of 15 gold hypernyms; thus, we sorted the output, taking the top 15 candidates. The inference of the proposed method is described in Algorithm 1. In Line 4, \(make\_prompt\) is a function that generates \(S_\text {P}\) for the query Q and candidate hypernym \(c^{(i)}\), as demonstrated in Eq. 1 whereas the function \(make\_masking\_vectors\) of Line 5 produces masking vectors \(M_Q\) and \(M_C\) illustrated in Fig. 2 for a given \(S_\text {P}\).
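
The ranking step of the inference procedure can be sketched as below. The score_fn argument is a hypothetical stand-in for the full model (prompt construction, masking vectors, and the forward pass that yields o); here it is replaced by a lookup table with made-up scores so that the sketch runs on its own.

```python
from typing import Callable, List


def discover_hypernyms(
    query: str,
    vocabulary: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 15,
) -> List[str]:
    """Score every candidate in the vocabulary and return the top_n by score."""
    scored = [(score_fn(query, candidate), candidate) for candidate in vocabulary]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_n]]


# Made-up scores standing in for the model output o.
toy_scores = {"animal": 0.9, "mammal": 0.8, "vehicle": 0.1}
print(discover_hypernyms("dog", list(toy_scores), lambda q, c: toy_scores[c], top_n=2))
# ['animal', 'mammal']
```

Because one prompt is scored per vocabulary entry, the cost of a single query grows linearly with the vocabulary size.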

Experimental results

This section presents the performance of the proposed and conventional methods. In addition, it describes the experimental settings for the hypernym discovery dataset, evaluation measures, and employed statistical tests.

Experimental settings

The HuggingFace transformers library (Footnote 5) with PyTorch [46] was used for the implementations. The experiment was conducted using an Intel i9-10980XE, three NVIDIA GeForce RTX 3090 GPUs, and 128GB RAM. In addition, distributed training was employed by using Data Parallel functionality in PyTorch. In the further pretraining of Hypert, the BERT\(_{\text {base}}\) model (Footnote 6) initialized the PLM. We set the batch size, learning rate, and cosine scheduler warm-up steps to 216, 5e-5, and 500, respectively. The maximum training step was also limited to 10k for each subtask dataset.

After further pretraining, the proposed pretrained model was fine-tuned on the training dataset of each subtask. In the fine-tuning model, k was set to 24, and d was set to 200. We set the batch size to 32 and the maximum epoch to 15 for training. In addition, we used negative sampling because the dataset consists of only positive samples. For each positive sample, 50 negative samples were generated. The model with the best validation mean average precision (MAP) score epoch was used for testing. The loss function was set to binary cross-entropy loss and minimized using the AdamW optimizer [47]. The binary cross-entropy loss function is defined as follows:

$$\begin{aligned} L(q, c, t) = -\left( t\times \log (y) + (1-t)\times \log (1-y)\right) , \end{aligned}$$

where L, q, c, t, and y refer to the loss value, query, candidate hypernym, label, and prediction of the proposed model, respectively. The label is 0 for negative pairs and 1 for positive pairs. We conducted hold-out cross-validation for each experiment: the training, validation, and testing sets were combined and then randomly re-split in the same proportions as the given split. For each subtask dataset, the experiment was repeated 10 times, yielding 10 performance values for each measure.
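
The training signal can be sketched with two small helpers: one draws negative candidates for a query and one evaluates the binary cross-entropy, written with the conventional leading minus sign so that minimizing it raises the likelihood of the labels. Uniform sampling from the non-gold vocabulary is an assumption, since the text states only that 50 negatives are generated per positive sample; the function names and the clipping constant are likewise illustrative.

```python
import math
import random
from typing import List, Optional, Set


def sample_negatives(
    gold: Set[str],
    vocabulary: List[str],
    n_neg: int = 50,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw up to n_neg candidates that are not gold hypernyms (uniform; assumed)."""
    rng = rng or random.Random(0)
    pool = [c for c in vocabulary if c not in gold]
    return rng.sample(pool, min(n_neg, len(pool)))


def bce_loss(y: float, t: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for prediction y in (0, 1) and label t in {0, 1}."""
    y = min(max(y, eps), 1.0 - eps)      # clip to avoid log(0)
    return -(t * math.log(y) + (1 - t) * math.log(1.0 - y))


negatives = sample_negatives({"animal"}, ["animal", "vehicle", "plant", "tool"], n_neg=2)
print(negatives)
print(bce_loss(0.9, 1), bce_loss(0.9, 0))  # low loss for a correct label, high otherwise
```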

The proposed method was compared to three conventional hypernym discovery methods: RMM, SPON, and prompting BERT. Details of each method are provided below.

  • RMM [11]: This method utilizes a projection matrix with word2vec embeddings. The shared projection matrix is applied to the hyponym term embedding recurrently to obtain representations of words at higher concept levels. To obtain word2vec embeddings for RMM, we set the embedding dimension and window size to 200 and seven, respectively, and trained the embeddings for ten epochs with ten negative samples on each given corpus. Next, to train the RMM model, we set the batch size to 32 and the number of negative samples to 50. The maximum training epoch was set to 1,000 with a patience of 200. Lastly, the model with the best validation MAP was selected for testing. RMM was chosen for comparison because it is a representative method based on the projection matrix.

  • SPON [12]: This method creates a distance-to-satisfaction vector for a given hyponym and candidate hypernym. The output representation is subtracted from the candidate hypernym term. All the parameter settings and procedures for obtaining word2vec embeddings and training SPON are the same as the experimental settings of RMM. Again, the best validation MAP model was selected for testing. In our comparison, SPON was chosen because it effectively reflects asymmetricity and transitivity properties which are essential for identifying hypernym relations.

  • Prompting BERT (is-a) [14]: Prompting BERT generates a hypernym for a given prompt by predicting the [MASK] token. Because the original BERT is used directly, no additional fine-tuning is necessary. In our experiment, we considered the prompt “A/An x is a [MASK].” because of its simplicity in identifying hypernym relationships. The prompting strategy was chosen to compare Hypert against the original pretrained language model on our task. Another reason for choosing it is that, in contrast to RMM and SPON, it does not rely on distributional similarity such as word2vec.

  • Prompting BERT (such as) [14]: With the same experimental settings as Prompting BERT (is-a), we considered the prompt “A [MASK] such as A/An x.” (such as) because of its superior performance in identifying hypernym relationships.
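The two prompts used for the prompting BERT baselines can be built programmatically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the article is chosen by a rough first-letter heuristic, and lowercasing the article inside the “such as” prompt is an assumption. Passing the resulting string to a masked language model is left out.

```python
VOWELS = {"a", "e", "i", "o", "u"}


def indefinite_article(term: str) -> str:
    """Return 'An' if the term starts with a vowel letter, else 'A' (rough heuristic)."""
    return "An" if term[:1].lower() in VOWELS else "A"


def is_a_prompt(term: str, mask: str = "[MASK]") -> str:
    """Build the 'A/An x is a [MASK].' prompt for a query term x."""
    return f"{indefinite_article(term)} {term} is a {mask}."


def such_as_prompt(term: str, mask: str = "[MASK]") -> str:
    """Build the 'A [MASK] such as A/An x.' prompt (article lowercased: an assumption)."""
    return f"A {mask} such as {indefinite_article(term).lower()} {term}."


if __name__ == "__main__":
    print(is_a_prompt("open proxy server"))
    print(such_as_prompt("automobile"))
```

The mask string is a parameter because different tokenizers may use a different mask token.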

Next, we employed three evaluation measures as follows:

  • Mean average precision (MAP): The MAP is the mean, over all query words, of the average precision (AP), where the AP of a query averages the precision values obtained at the rank of each correct hypernym retrieved from the search space. The MAP is defined as

    $$\begin{aligned} \text {MAP} = \frac{1}{|Q|}\sum _{q\in Q}\text {AP}(q), \end{aligned}$$

    where Q and |Q| refer to the given set of query words and the size of the set, respectively, and AP(q) is the average precision for query q.

  • Mean reciprocal rank (MRR): The MRR is commonly used to evaluate the effectiveness of an information retrieval system [48, 49]. The reciprocal rank is the reciprocal of the rank of the first relevant (correct) outcome. The MRR is the average of the reciprocal ranks over the given query words and is defined as

    $$\begin{aligned} \text {MRR} = \frac{1}{|Q|}{\sum _{i=1}^{|Q|} \frac{1}{rank_i}}, \end{aligned}$$

    where \(rank_i\) refers to the rank position of the first correct hypernym for the i-th query.

  • Precision at k (P@k): The P@k metric is the precision of the top-k predicted hypernyms and is defined as

    $$\begin{aligned} \text {P@}k =\frac{\text {TP@}k}{(\text {TP@}k)+(\text {FP@}k)}, \end{aligned}$$

    where TP and FP refer to the numbers of true positives and false positives, respectively. Specifically, we set the cut-off threshold k to 1, 3, 5, and 15 in this study.
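The three measures above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the official scorer: AP is normalized by the number of gold hypernyms (the text does not spell out the normalization), each ranked list is assumed to contain no duplicates, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average the precision at each rank where a gold hypernym appears."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            total += hits / rank
    # Assumption: normalize by the number of gold hypernyms.
    return total / len(gold) if gold else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """Reciprocal of the rank of the first correct hypernym (0 if none)."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """TP@k / (TP@k + FP@k) over the top-k predictions."""
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    true_positives = sum(1 for candidate in top_k if candidate in gold)
    return true_positives / len(top_k)


def evaluate(predictions: Dict[str, List[str]],
             gold: Dict[str, Set[str]],
             ks: Sequence[int] = (1, 3, 5, 15)) -> Dict[str, float]:
    """Average each measure over all queries in the gold set."""
    queries = list(gold)
    if not queries:
        raise ValueError("gold must contain at least one query")
    n = len(queries)
    ranked = {q: predictions.get(q, []) for q in queries}
    scores = {
        "MAP": sum(average_precision(ranked[q], gold[q]) for q in queries) / n,
        "MRR": sum(reciprocal_rank(ranked[q], gold[q]) for q in queries) / n,
    }
    for k in ks:
        scores[f"P@{k}"] = sum(precision_at_k(ranked[q], gold[q], k) for q in queries) / n
    return scores
```

For instance, with the ranked list a, b, c, d and gold set {b, d}, the AP is (1/2 + 2/4)/2 = 0.5 and the reciprocal rank is 1/2.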

We compared the methods over the iterations using the Wilcoxon signed-rank test [50] because we are interested in whether the proposed method is superior to the comparison methods. Let \(d_i\) be the difference between the performance of the two methods on the ith iteration. The differences were ranked according to their absolute values, with the smallest \(|d_i|\) assigned rank 1; in the case of ties, average ranks were assigned. Let \({\textit{R}}^+\) be the sum of the ranks for the iterations on which the compared method outperforms the proposed method, defined as

$$\begin{aligned} \textit{R}^{+} = \sum _{d_i>0}{\textit{rank}}(d_i) + \frac{1}{2}\sum _{d_i=0}{\textit{rank}}(d_i) \end{aligned}$$

and \({\textit{R}}^-\) is the opposite, as follows:

$$\begin{aligned} \textit{R}^{-} = \sum _{d_i<0}{\textit{rank}}(d_i) + \frac{1}{2}\sum _{d_i=0}{\textit{rank}}(d_i) \end{aligned}$$

Then, according to the table of critical values for the Wilcoxon test, at a significance level of \(\alpha =0.05\) with \(N=10\), the difference between the compared methods is significant if min(\({\textit{R}}^+\), \({\textit{R}}^-\)) \(\le\) 8. In this case, the null hypothesis of equal performance is rejected.
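The rank sums and the decision rule described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the differences are passed in directly (which method is subtracted from which is left to the caller), and the critical value of 8 is the one stated in the text for \(N=10\) and \(\alpha =0.05\).

```python
from typing import List, Sequence, Tuple


def average_ranks(differences: Sequence[float]) -> List[float]:
    """Rank the differences by absolute value, giving tied values their average rank."""
    order = sorted(range(len(differences)), key=lambda i: abs(differences[i]))
    ranks = [0.0] * len(differences)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next absolute difference is tied.
        while (end + 1 < len(order)
               and abs(differences[order[end + 1]]) == abs(differences[order[start]])):
            end += 1
        average = (start + end) / 2 + 1  # mean of the 1-based ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks


def signed_rank_sums(differences: Sequence[float]) -> Tuple[float, float]:
    """Return (R+, R-); the ranks of zero differences are split evenly between the two."""
    ranks = average_ranks(differences)
    zero_share = 0.5 * sum(r for r, d in zip(ranks, differences) if d == 0)
    r_plus = sum(r for r, d in zip(ranks, differences) if d > 0) + zero_share
    r_minus = sum(r for r, d in zip(ranks, differences) if d < 0) + zero_share
    return r_plus, r_minus


def is_significant(r_plus: float, r_minus: float, critical_value: float = 8) -> bool:
    """Reject the null hypothesis if min(R+, R-) is at most the critical value."""
    return min(r_plus, r_minus) <= critical_value
```

As a sanity check, \({\textit{R}}^+\) and \({\textit{R}}^-\) always sum to \(N(N+1)/2\), which is 55 for \(N=10\).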

Table 3 Model performance on SemEval2018-task9 dataset
Table 4 Wilcoxon signed-rank test results of the proposed method against comparison methods for the 1A dataset with 10 iterations

Comparison results

Table 3 presents the results of the experiments on the three subtask datasets. This table contains the MRR, MAP, and precision at ranks \(k=\{1,\,3,\,5,\,15\}\) (P@k) of the proposed and comparison methods. The average performance of the holdout cross-validation with the corresponding standard deviation is presented for each evaluation measure and method, and the best performance among the methods is shown in bold. As listed in Table 3, the proposed method outperforms the comparison methods on all measures across the subtasks.

The MRR indicates how highly the first relevant item is ranked, suggesting that the proposed method identifies hypernyms more accurately than the other methods. For example, on the 1A dataset, the MRR value of the proposed method is 38.86. Compared to RMM, which uses word2vec embedding with projection learning, the average performance difference is 11.47. The MAP value of the proposed method is 24.17, which ranks first, and the difference in average performance from RMM, which ranks second, is 5.92. The MAP considers the precision of all related items. Therefore, the results indicate that the proposed method ranks more gold hypernyms highly than the other methods. The P@k results also support this. The same trends hold for the other subtask datasets.

Table 4 presents the results of the Wilcoxon signed-rank test of the proposed method against the comparison methods for 10 iterations on the 1A dataset. The table confirms that the proposed method significantly outperforms the other methods because all p-values are less than the significance level of \(\alpha =0.05\), rejecting the null hypothesis. For each evaluation measure, the winning method is marked in bold, and the p-values are presented in parentheses. For the 2A and 2B datasets, the Wilcoxon signed-rank test results are the same as those for 1A and are provided in Appendix 1.

In-depth analysis

This study introduces a further pretraining and fine-tuning process for hypernym discovery. Specifically, the pretraining phase uses MLM with extended Hearst patterns extracted from the given corpus, and the fine-tuning phase adopts projection learning with Hypert. To assess the influence of these choices, we examined several components of the proposed method. We discuss the effects of the proposed pretraining method and provide the results at each pretraining step. We also defined and evaluated two subgroups in the 1A subtask dataset to validate the robustness of Hypert against conventional methods. This study provides the pattern distribution with statistics and analyzes which patterns appeared frequently. We speculated that the proposed method could handle rare noun phrases. To support this, we present the prediction lists of the proposed method and comparison methods. Additionally, the tSNE plots of the \(e_{\texttt {[CLS]}}\) representation space are presented to analyze the effectiveness of using the [CLS] token as a hypernym relationship information vector.

Fig. 4 Distribution of patterns for each subtask

Table 5 Top five patterns and counts

The pattern analysis was conducted to determine which patterns effectively construct a hypernym-related corpus. The pattern distribution for the extracted sentences is displayed in Fig. 4. Almost all sentences were extracted using just five patterns. Table 5 presents the counts for the top five patterns. The “\(NP_y\) as \(NP_x\)” pattern accounted for more than 50% of the sentences for all subtasks, and the “\(NP_y\) such as \(NP_x\)” pattern accounted for more than 20%. Together, these five patterns comprised over 98% of the sentences.

Table 6 Comparison of evaluation measures on the Hypert and BERT models for each subtask
Table 7 Wilcoxon signed-rank test results for the Hypert and BERT models with 10 iterations
Table 8 Comparison of MRR, MAP, and P@1 results by number of pretraining steps
Table 9 Comparison of P@3, P@5, and P@15 results by number of pretraining steps

Table 6 presents the results of using the Hypert and BERT models. The results indicate that the proposed pretraining method improves performance on all subtask datasets across all evaluation measures. We also employed the Wilcoxon signed-rank test to confirm the superiority of the pretraining method; the results are provided in Table 7. Most results reject the null hypothesis of the Wilcoxon signed-rank test at a significance level of \(\alpha =0.05\), except for the 2A dataset.

Although we used 1k pretraining steps for all subtasks for fairness, we also varied the number of pretraining steps from 0k (without further pretraining) to 10k to observe how performance changes as the steps increase. Tables 8 and 9 detail the performance of the proposed further pretraining method at each 1k step.

In the 1A dataset, the model pretrained for 1k steps achieves the best MRR across all steps. Compared to the 0k-step model, which does not use further pretraining, the average performance difference is 2.24. Moreover, the results indicate that the proposed further pretraining method improves performance over the 0k-step model at every 1k step. We now discuss a reasonable choice of steps for each subtask. For the 1A dataset, the model with 9k pretraining steps seems reasonable considering the average performance and standard deviation. For the 2A dataset, although the 8k-step model shows the best performance on most measures, we consider 2k steps the better choice because of its low standard deviation and the negligible difference in performance between the two. For the 2B dataset, the results clearly favor 1k steps, with 6k steps a clear second. Note that, for fairness, we simply selected the 1k-step model for all subtask datasets. A comparison using the pretraining steps considered to give the best performance is provided in Appendix 2.


Table 10 Model performance on two subgroups in the 1A dataset

We defined and evaluated two subgroups of queries within the 1A dataset to assess the robustness of performance. One is the person group, and the other is the computer-software group. The person group consists of the query words for which “person” appears among the gold hypernyms in the test set. The computer-software group consists of the queries for which “computer” or “software” appears among the gold hypernyms in the test set. On average, the person group had 320 queries, and the computer-software group had 61 queries. We evaluated the proposed method and comparison methods on the two subgroups. The results are shown in Table 10.
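The subgroup construction can be sketched as a simple filter over the test-set gold hypernyms. The sketch below rests on assumptions: the function name and the toy data are hypothetical, and a keyword is matched against whole gold hypernym terms (a token-level match inside multi-word hypernyms would select a different set of queries).

```python
from typing import Dict, Iterable, List, Set


def select_subgroup(gold: Dict[str, Iterable[str]], keywords: Set[str]) -> List[str]:
    """Return the queries whose gold hypernyms contain at least one keyword term."""
    return [query for query, hypernyms in gold.items() if keywords & set(hypernyms)]


if __name__ == "__main__":
    # Toy gold data for illustration only (not taken from the dataset).
    toy_gold = {
        "query_a": ["person", "scientist"],
        "query_b": ["software", "computer program"],
        "query_c": ["fruit"],
    }
    person_group = select_subgroup(toy_gold, {"person"})
    computer_software_group = select_subgroup(toy_gold, {"computer", "software"})
    print(person_group, computer_software_group)
```

Each subgroup can then be scored with the same evaluation measures as the full test set.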

For the person group, the proposed method showed the best performance among the compared methods on all evaluation measures. The MRR value of the proposed method is 84.78, which is considerably higher than those of the other methods. For the computer-software group, the proposed method also outperforms the other methods. Thus, the proposed method consistently outperformed the compared methods on the two subgroups, which is similar to the result observed on the original 1A dataset. In detail, the performance on the person group is substantially higher than that on the computer-software group. A possible reason lies in the characteristics of the person group: in most cases, the gold hypernym “person” appears first in the gold hypernym list, and the number of gold hypernyms is small. In contrast, the gold hypernyms of the computer-software group vary much more than those of the person group, and most of them are multi-word terms, such as (“Xpdf”, “code, computer software, software package,...”), indicating that the hypernym relation is much more difficult to predict.

Table 11 Prediction results for each method for the rare noun phrase “open proxy server” (appearing nine times in the corpus)
Table 12 Prediction results for each method for the rare noun phrase “tempestuousness” (appearing once in the corpus)

To assess robustness on rare words, we compared the predictions of the proposed method with those of the word2vec-based methods for rare words. The test query “open proxy server,” which appears nine times in the given corpus, was used for analysis. The prediction lists of each method and the gold hypernyms are presented in Table 11. The gold hypernyms are shown in bold with the \(^{\star }\) symbol. Because the dataset is handcrafted, there may be more valid hypernyms than those annotated. Thus, using underlines, we marked words that we judged to be relevant to the hypernym relationship. Each prediction list was produced by sorting the candidates by the probabilities output by each model and taking the top 15 candidates. Hence, the earlier a word appears in the list, the more likely it is to be a hypernym.

Fig. 5 tSNE plots of the [CLS] token embedding representation for subtasks (a) 1A, (b) 2A, and (c) 2B. Blue indicates hypernymy, and red indicates non-hypernymy (please see color version)

The results reveal that the proposed method adequately predicts hypernyms for rare words. In addition, its predicted words, including the relevant words, are better than those of the other methods. Conversely, RMM and SPON, which are word2vec-based methods, perform poorly on low-frequency words. For example, SPON correctly predicts only one gold hypernym, which is ranked low on the list; except for “computer program\(^\star\),” SPON predicted the wrong words. Although RMM did not correctly predict any gold hypernyms, relevant words were present in its prediction list, such as “pseudonymized,” “spoofing attack,” and “IP address spoofing,” but none of them were hypernyms. Table 12 presents the prediction lists for “tempestuousness,” which appeared once in the corpus. The result also suggests that, for rare words, the prediction list of the proposed method contains more gold hypernyms than those of the other methods.

In addition, we explored the quality of the [CLS] token embeddings. We randomly selected hypernym pairs for each subtask from the testing set. The positive \(S_{\text {P}}\) were created from the selected pairs. The negative \(S_{\text {P}}\) were also generated by replacing a gold hypernym with a random candidate hypernym that is not gold. Then, each \(S_{\text {P}}\) was input into the proposed model to obtain \(e_{\texttt {[CLS]}}\) for each \(S_{\text {P}}\). Figure 5 depicts the tSNE plots of the \(e_{\texttt {[CLS]}}\) representation space for each subtask. Blue indicates the \(e_{\texttt {[CLS]}}\) of positive \(S_{\text {P}}\) and red indicates the \(e_{\texttt {[CLS]}}\) of negative \(S_{\text {P}}\). The hypernymy and nonhypernymy clusters are appropriately separated in all three plots, revealing that using the [CLS] token as a hypernym relationship information vector effectively identifies hypernym relationships.
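The construction of positive and negative pairs for this analysis can be sketched as follows. This is an illustrative sketch under assumptions: the function name is hypothetical, one negative is drawn per positive pair, and turning each pair into the special prompt \(S_{\text {P}}\) and encoding it with the model are omitted.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def build_pairs(gold: Dict[str, Set[str]],
                vocabulary: Sequence[str],
                rng: random.Random) -> List[Tuple[str, str, int]]:
    """Return (hyponym, hypernym, label) triples; label 1 is gold, 0 is a random non-gold candidate."""
    pairs: List[Tuple[str, str, int]] = []
    for hyponym, hypernyms in gold.items():
        # Candidates that are not gold hypernyms of this hyponym.
        non_gold = [candidate for candidate in vocabulary if candidate not in hypernyms]
        for hypernym in sorted(hypernyms):
            pairs.append((hyponym, hypernym, 1))
            if non_gold:
                # Replace the gold hypernym with a random non-gold candidate.
                pairs.append((hyponym, rng.choice(non_gold), 0))
    return pairs


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_gold = {"dog": {"animal", "mammal"}}
    toy_vocabulary = ["animal", "mammal", "vehicle", "fruit"]
    for pair in build_pairs(toy_gold, toy_vocabulary, random.Random(0)):
        print(pair)
```

Passing a seeded random.Random instance keeps the sampled negatives reproducible across runs.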


Hypernym discovery is challenging because it requires finding appropriate hypernyms for a given query from a large predefined pool of candidates. In addition, because the candidates contain noun phrases, they are difficult for conventional word2vec-based methods to handle. BERT can address this problem through subword tokenization. However, there have been no attempts to apply BERT to hypernym discovery using its widely adopted domain adaptation steps: pretraining and fine-tuning. Therefore, this study presents procedures for adapting BERT to the task by modifying the pretraining and fine-tuning stages.

We proposed MLM with Hearst pattern sentences as a further pretraining procedure to adapt BERT to the hypernymy domain. The proposed method outperformed the comparison methods on all evaluation measures and subtask datasets, and the Wilcoxon signed-rank test was employed to confirm its superiority. We also conducted an in-depth analysis to confirm the effectiveness of the proposed pretraining procedure, analyzed the distribution of the Hearst patterns used, and identified the effective patterns. The model with the proposed pretraining performs better than BERT without it. In addition, the case study demonstrated that the proposed method is more robust on rare words than the comparison models, and the subgroup analysis showed that it produces stable performance across subgroups. Furthermore, the tSNE plots were presented to illustrate the representation space of the special prompt component.

Despite the effectiveness of Hypert, its computational cost for inferring hypernym relationships can be higher than that of conventional methods such as Hearst pattern matching. Thus, when a large number of queries and candidates is considered, for example, 200,000 candidates for one query in this study, Hypert can be slower than conventional methods. For example, the proposed Hypert takes 15.52 seconds per query (s/q), whereas its counterparts RMM, SPON, and prompting BERT take 0.02, 0.02, and 0.46 s/q, respectively. In addition, the performance of Hypert may still be limited because we employed a general tokenizer instead of developing a domain-specific tokenizer for each of the general, medical, and music domains. Furthermore, a pretraining process may be required if Hypert is applied to a new domain, such as cyber security, because the proposed method is based on a general language model.

In the future, we would like to construct additional benchmark datasets for hypernym discovery because most studies on the task have reported that no benchmark dataset other than the SemEval2018 Task9-Hypernym Discovery dataset is available so far [11, 44]. Specifically, we would like to start by creating new datasets for Cyber Threat Intelligence (CTI) in the cyber security domain to evaluate the efficacy of Hypert on cybersecurity-oriented documents. In the field of CTI, extracting cyber threat insights from diverse data sources spanning multiple domains, including the Web, is an essential task. In addition, cyber threat information is predominantly communicated through written language in diverse CTI reports involving hypernymic relationships. For example, cyber security practitioners may seek to scrutinize CTI reports containing references to specific malware instances. In this context, hypernym discovery can be used to determine the category of a particular malware, and the proposed Hypert can be applied here. We would like to study this issue further.

Availability of data and materials

The datasets generated and/or analysed during the current study are available in the repository and










  1. Hearst MA. Automatic acquisition of hyponyms from large text corpora. In: COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics; 1992

  2. Yamane J, Takatani T, Yamada H, Miwa M, Sasaki Y. Distributional hypernym generation by jointly learning clusters and projections. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 1871–1879.

  3. Roller S, Erk K. Relations such as hypernymy: Identifying and exploiting hearst patterns in distributional vectors for lexical entailment. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2163–2172.

  4. Caraballo SA. Automatic construction of a hypernym-labeled noun hierarchy from text. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, College Park, Maryland, USA, 1999, pp. 120–126.

  5. Cederberg S, Widdows D. Using LSA and noun coordination information to improve the recall and precision of automatic hyponymy extraction. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 111–118

  6. Sabirova K, Lukanin A. Automatic extraction of hypernyms and hyponyms from russian texts. In: AIST (supplement), Citeseer 2014; pp. 35–40.

  7. Sheena N, Jasmine SM, Joseph S. Automatic extraction of hypernym & meronym relations in English sentences using dependency parser. Procedia Comput Sci. 2016;93:539–46.


  8. Camacho-Collados J, Delli Bovi C, Espinosa-Anke L, Oramas S, Pasini T, Santus E, Shwartz V, Navigli R, Saggion H. SemEval-2018 task 9: Hypernym discovery. In: Proceedings of The 12th International Workshop on Semantic Evaluation, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 712–724.

  9. Fu R, Guo J, Qin B, Che W, Wang H, Liu T. Learning semantic hierarchies via word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1199–1209

  10. Bernier-Colborne G, Barrière C. CRIM at SemEval-2018 task 9: a hybrid approach to hypernym discovery. In: Proceedings of The 12th International Workshop on Semantic Evaluation, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 725–731.

  11. Bai Y, Zhang R, Kong F, Chen J, Mao Y. Hypernym discovery via a recurrent mapping model. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 2912–2921

  12. Dash S, Chowdhury MFM, Gliozzo A, Mihindukulasooriya N, Fauceglia NR. Hypernym detection using strict partial order networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2020;34, pp. 7626–7633

  13. Peng B, Chersoni E, Hsu Y-Y, Huang C-R. Discovering financial hypernyms by prompting masked language models. In: Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022, 2022, pp. 10–16

  14. Hanna M, Mareček D. Analyzing bert’s knowledge of hypernymy via prompting. In: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2021, pp. 275–282

  15. Ravichander A, Hovy E, Suleman K, Trischler A, Cheung JCK. On the systematicity of probing contextualized word representations: the case of hypernymy in BERT. In: Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 88–102.

  16. Ettinger A. What bert is not: lessons from a new suite of psycholinguistic diagnostics for language models. Trans Assoc Comput Linguist. 2020;8:34–48.


  17. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013.


  18. Song X, Salcianu A, Song Y, Dopson D, Zhou D. Fast wordpiece tokenization. arXiv preprint. 2020

  19. Araci D. Finbert: financial sentiment analysis with pre-trained language models. arXiv preprint. 2019.

  20. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40.


  21. Roller S, Kiela D, Nickel M. Hearst patterns revisited: Automatic hypernym detection from large text corpora. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 358–363.

  22. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.

  23. Weeds J, Weir D. A general framework for distributional similarity. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003, pp. 81–88

  24. Cimiano P, Hotho A, Staab S. Learning concept hierarchies from text corpora using formal concept analysis. J Artif Intell Res. 2005;24:305–39.


  25. Geffet M, Dagan I. The distributional inclusion hypotheses and lexical entailment. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 107–114

  26. Weeds J, Weir D, McCarthy D. Characterising measures of lexical distributional similarity. In: COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, 2004, pp. 1015–1021

  27. Lenci A, Benotto G. Identifying hypernyms in distributional semantic spaces. In: * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 2012, pp. 75–79

  28. Clarke D. Context-theoretic semantics for natural language: an overview. In: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, 2009, pp. 112–119

  29. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint. 2014

  30. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  31. Kamel M, Trojahn C, Ghamnia A, Aussenac-Gilles N, Fabre C. A distant learning approach for extracting hypernym relations from wikipedia disambiguation pages. Procedia Computer Science, Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 21st International Conference, KES-2017, 6-8 September 2017, Marseille, France, vol. 112, 2017, pp. 1764–1773

  32. Navigli R, Ponzetto SP. Babelnet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif Intell. 2012;193:217–50.


  33. Espinosa-Anke L, Camacho-Collados J, Delli Bovi C, Saggion H. Supervised distributional hypernym discovery via domain adaptation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 424–435.

  34. Iacobacci I, Pilehvar MT, Navigli R. SensEmbed: Learning sense embeddings for word and relational similarity. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 95–105.

  35. Shwartz V, Goldberg Y, Dagan I. Improving hypernymy detection with an integrated path-based and distributional method. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 2389–2398.

  36. Yu C, Han J, Wang P, Song Y, Zhang H, Ng W, Shi S. When hearst is not enough: Improving hypernymy detection from corpus with distributional models. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6208–6217.

  37. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017.


  38. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: a robustly optimized bert pretraining approach. arXiv preprint. 2019.

  39. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. Albert: a lite bert for self-supervised learning of language representations. arXiv preprint. 2019.

  40. Sanh V, Debut L, Chaumond J, Wolf T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint. 2019.

  41. Talmor A, Elazar Y, Goldberg Y, Berant J. Olmpics-on what language model pre-training captures. Trans Assoc Comput Linguist. 2020;8:743–58.


  42. Jiang Z, Xu FF, Araki J, Neubig G. How can we know what language models know? Trans Assoc Comput Linguist. 2020;8:423–38.


  43. Petroni F, Rocktäschel T, Riedel S, Lewis P, Bakhtin A, Wu Y, Miller A. Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2463–2473.

  44. Held W, Habash N. The effectiveness of simple hybrid systems for hypernym discovery. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3362–3367.

  45. Jin Y, Jang E, Cui J, Chung J-W, Lee Y, Shin S. DarkBERT: A language model for the dark side of the Internet. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7515–7533.

  46. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint. 2019.

  47. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint. 2017.

  48. Wu J-C, Chang Y-C, Mitamura T, Chang JS. Automatic collocation suggestion in academic writing. In: Proceedings of the ACL 2010 Conference Short Papers, 2010, pp. 115–119

  49. Rodríguez-Fernández S, Anke LE, Carlini R, Wanner L. Semantics-driven recognition of collocations using word embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 499–505

  50. Wilcoxon F. Individual comparisons by ranking methods. In: Kotz S, Johnson NL (eds.) Breakthroughs in Statistics. New York: Springer, 1992, pp. 196–202.



This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-01341-003, Artificial Intelligence Graduate School Program (Chung-Ang University)).


This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-01341-003, Artificial Intelligence Graduate School Program (Chung-Ang University) and 2021-0-00766, Development of Integrated Development Framework that supports Automatic Neural Network Generation and Deployment optimized for Runtime Environment.).

Author information

Authors and Affiliations



GY: Conceptualization, Methodology, Formal analysis and investigation, Writing - original draft preparation, and Writing - review and editing. YL: Writing - review and editing. AM: Writing - original draft preparation. JL: Conceptualization, Methodology, Writing - review and editing, Funding acquisition, Resources, and Supervision. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Jaesung Lee.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix 1. Statistical test details

See Tables 13, 14, 15, 16 and 17.

Table 13 Wilcoxon signed-rank test results for the proposed method against the comparison models on the 2A dataset with 10 iterations
Table 14 Wilcoxon signed-rank test results for the proposed method against the comparison models on the 2B dataset with 10 iterations

Tables 13 and 14 report the Wilcoxon signed-rank test results for the proposed method against the comparison models on the 2A and 2B datasets. Every p-value is 1.95e-3, so the null hypothesis is rejected in all cases.
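A likely reason that every p-value equals 1.95e-3: with 10 paired iterations, an exact two-sided Wilcoxon signed-rank test has 2^10 = 1024 equally likely sign patterns under the null hypothesis, and the most extreme outcome (one method better in all 10 iterations) accounts for 2 of them, giving 2/1024 ≈ 1.95e-3. The sketch below checks this by brute-force enumeration. It is an illustration rather than the authors' code: the function name and the score lists are hypothetical placeholders, and it assumes there are no zero or tied differences.

```python
from itertools import product


def exact_wilcoxon_two_sided(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value by enumeration.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)

    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Test statistic: the smaller of the positive and negative rank sums.
    w_obs = min(w_plus, total - w_plus)

    # Under the null, each of the 2**n sign patterns is equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_obs:
            count += 1
    return count / 2 ** n


# Hypothetical scores: the first method wins in all 10 iterations.
proposed = [0.61, 0.63, 0.60, 0.64, 0.62, 0.65, 0.66, 0.67, 0.68, 0.69]
baseline = [0.50] * 10

print(exact_wilcoxon_two_sided(proposed, baseline))  # 0.001953125
```

Under these assumptions, 1.95e-3 is the smallest p-value the test can return for 10 iterations, so it signals that the proposed method was better in every iteration of the comparison.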

Appendix 2. Results of the best Hypert model for each subtask

Table 15 Comparison of evaluation measures on the best Hypert and BERT models for each subtask 
Table 16 Wilcoxon signed-rank test results for the Hypert and BERT models with 10 iterations

Table 15 compares the proposed method with the best Hypert and BERT models. The numbers of pretraining steps for the 1A, 2A, and 2B datasets are 9k, 2k, and 1k, respectively. The results of the Wilcoxon signed-rank test for all subtask datasets are provided in Table 16. The p-values reject the null hypothesis for all measures except a few P@k measures.

Appendix 3. Extended Hearst patterns

Table 17 Regular expressions of extended Hearst patterns used in this study

Table 17 lists the regular expressions of the extended hypernym syntactic patterns used in this study, where NP indicates a noun phrase and PRON represents a pronoun.
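To illustrate how a pattern of this kind can be applied, the sketch below implements a single classic Hearst pattern, "NP such as NP, NP and NP". It is not one of the expressions from Table 17: the regular expression, the function name, and the "NP_" token convention are assumptions made for illustration. The sketch also assumes that noun phrases have already been chunked and rewritten as single tokens with an "NP_" prefix.

```python
import re

# Assumption: noun phrases were chunked beforehand and rewritten as single
# tokens with a hypothetical "NP_" prefix, e.g. "NP_vehicles".
NP = r"NP_\w+"
SEPARATOR = r"(?:, and |, or |, | and | or )"

# "<hypernym NP> such as <hyponym NP>(, <hyponym NP>)* (and|or) <hyponym NP>"
SUCH_AS = re.compile(rf"({NP}),? such as ((?:{NP}{SEPARATOR}?)+)")


def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)[len("NP_"):]
        for hyponym in re.findall(NP, match.group(2)):
            pairs.append((hyponym[len("NP_"):], hypernym))
    return pairs


sentence = "NP_vehicles such as NP_automobiles, NP_watercraft, and NP_aircraft"
print(extract_pairs(sentence))
# [('automobiles', 'vehicles'), ('watercraft', 'vehicles'), ('aircraft', 'vehicles')]
```

Each extracted pair links a hyponym to its hypernym, the same relationship that the Hearst pattern sentences express.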

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Yun, G., Lee, Y., Moon, AS. et al. Hypert: hypernymy-aware BERT with Hearst pattern exploitation for hypernym discovery. J Big Data 10, 141 (2023).


  • Hypernym discovery
  • Hypernym relationship
  • Language model
  • Masked language modeling
  • Hearst pattern
  • Natural language processing