Modeling scientometric indicators using a statistical data ontology

Scientometrics is the field of study and evaluation of scientific measures such as the impact of research papers and academic journals. It is an important field because nowadays different rankings use key indicators for university rankings and universities themselves use them as Key Performance Indicators (KPI). The purpose of this work is to propose a semantic modeling of scientometric indicators using the ontology Statistical Data and Metadata Exchange (SDMX). We develop a case study at Tecnologico de Monterrey following the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. We evaluate the benefits of storing and querying scientometric indicators using linked data as a mean for providing flexible and quick access knowledge representation that supports indicator discovery, enquiring and composition. The semi-automatic generation and further storage of this linked data in the Neo4j graph database enabled an updatable and quick access model.


Page 2 of 17
Lopez-Rodriguez and Ceballos Journal of Big Data (2022) 9:9 research units. The actual process shown in Fig. 1, begins with the Research Office integrating information from diverse data sources such as Scopus and institutional databases. One example of a Scientometric Indicator is the Citation Count, which it is the sum of citations received to date by institutional outputs and answers the question of how much impact an institution's academic unit has [4]. This metric is defined by the Common European Research Information Formation (CERIF) that has been developed as a flexible model to describe any research data and information, both as a database model but also as a transfer method between repositories [5]. The research office receives the questions and interprets it with their knowledge and context to know what and where to search. A problem in the interpretation stage of the process is when a concept has several definitions, and it leads to wrong answers to the asked question, which would lead to bad decisions. After receiving the question and making an interpretation of it, the specialist performs statistical operations on scientific data following indicator formulas. Information is obtained from different sources and it is difficult to assure that it is up to date. Statistical operations are made and the specialist answers back to the requesting department. Scientometric Indicators are calculated each year in Tecnologico de Monterrey by the Research Office for decision making. Statistical information is gathered from heterogeneous and distributed sources to uncover insights, make predictions, and build smarter systems for the institution [6].

Semantic modeling
The Resource Description Framework (RDF) has formed a more systematic and comprehensive technical architecture in data and knowledge representation and processing. It has become one of the main forms for representing knowledge and ensures that the semantics of temporal data can be described accurately and flexibly, and also may help to realize the sharing in various applications [7].
Ontologies, on the other hand, provide an upper level of semantic representation. The main units of an ontology are concepts, relations between them, and their properties. Relations and properties are represented in the form of triplets (subject, predicate, object or literal value). Each concept has a universal resource identifier (URI) assigned to it [8].
The Statistical Data and Metadata Exchange (SDMX) ontology was initiated by seven international institutions to improve efficiency using technology for sharing and exchanging statistical data and metadata for interoperability. It represents data in flat files and as Extensible Markup Language (XML) to the definition of fact, dimension, and measure [9]. XML is a very flexible text format that plays an important role in the exchange of a wide variety of data on the Web and elsewhere [10].

Proposed approach
The main goal of our research is to build a model of Scientometric Indicators using RDF for the description of the resources by extending SDMX with a vocabulary appropriate for representing dimensions, attributes and values found in scientometrics indicators (e.g. schools, cites, papers). To evaluate our approach we will extract a sample of Scientometric Indicators used in the Research Office from Tecnologico of Monterrey. Data will be transformed manually to RDF and a tool will be constructed for automation. Another goal is to deploy this model in a graph database to evaluate queries and visualizations.
Our motivation is to have a reliable model that ensures data integration and interoperability by building a flexible, updatable, and easy-to-maintain model with quick access for several applications. Our approach is evaluated by performing scientometric indicator's discovery, enquiring and composition supported by the SDMX data model.
The research questions stated for this work are the following: How Scientometric Indicators can be represented using the SDMX model? Which dimensions need to be defined for modeling them? Which benefits do we obtain by using a semantic representation and storage?

Background
We revised previous works on which indicators are semantically modeled and enquired, as well as those where ontologies are used for representing science or scientometric data. Finally we introduce the semantic platform used for storing and enquiring linked data in our research. We conclude this section by identifying the gap our work contributes to close.

Semantic indicator modeling
Fox proposed modeling city indicators with a semantic approach in 2018 [11]. His approach includes key aspects such as membership extent, temporal extent, spatial extent, and measurement of populations. Fox uses the RDF Data Cube Vocabulary for specifying dimensions of city populations, but the SDMX standard was not incorporated as part of his solution. The evaluation of the ontology was divided into the representation of the population as the definition of indicators, consistency of indicator definitions against the interpretation of a city, and how it can be used to support data collection of a city.
A semantic approach can be implemented in several contexts, one of the most common scenarios is in statistical databases. In Thiry et al. [12] presented an interactive tool for a question answering system that accesses statistical databases and it follows the SDMX standard. In this research work, they take a look at understanding general dimensions from user questions and found that time and location were dimensions for this kind of data. This system was evaluated by testing queries and measuring the accuracy of the result in terms of detecting dimensions. Queries were tested using only one dimension. SDMX is used by many institutions and this research work represents a good approach for answering questions about the selected data.

Scientometric ontologies
Hu et al. converted data, originally stored in a relational database, collected from the Semantic Web Journal to RDF and published them as linked data [13]. This data contains an entire timeline for each paper along with metadata from the Semantic Web Journal (SWJ) unique open and transparent review process. This gives insights into scientific networks and new trends. The Bibliographic Ontology (BIBO) ontology was extended for capturing information about the paper's timeline. BIBO provides main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc.) on the Semantic Web [14].
In Osborne et al. [15] presented a novel approach for clustering authors according to their citation distribution. This work introduced the Bibliometric Data Ontology (BiDo) which allows an accurate representation of such clusters. BiDO is a modular ontology encoded using Ontology Web Language (OWL) 2, that allows the description of bibliometric data of people, articles, journals, and other entities described by Semantic Publishing and Referencing (SPAR) Ontologies in RDF [16]. BiDO has kinds of bibliometric data: numeric and categorical. Some measures such as citation count, e-index, and journal impact factor are available through BiDO's numeric property. Categorical data is for specifying categories describing the research career of authors.

Linked data platforms
A Linked Data Platform (LDP) provides a set of integration patterns for building RESTful HTTP services capable of reading and writing RDF data. Under this definition we can find applications like VIVO and Neo4j, that at some extent, enables the visualization of information stored in RDF format.
VIVO is an open source linked data platform that supports recording, editing, searching, browsing and visualizing scholarly activity. It encourages research discovery, expert finding, network analysis and assessment of research impact [17]. The VIVO ontology contains the schemas required for representing this information.
Neo4j is a native graph data store built from the ground up, to leverage not only data but also data relationships. It connects data as it is stored, enabling queries at high speed [18]. Neo4J uses native graph storage which provides the freedom to manage and store data in a highly disciplined manner. It is considered the most popular and used graph database worldwide, used in areas such as health, government, automotive production, military area, among others [19].
Stothers defines some advantages of using Neo4j graph database as follows [20]: • Well suited to storing information structures that are not well suited to relational databases, such as ontologies or networks. • Operational simplicity, especially in the use of relationships to avoid joining tables. • Ability to include properties in relationships and nodes. • Neo4j query language has the ability to present query results in multiple formats allowing creative insights in data interpretation. • Efficient queries and attractive interface lead to ease of use and an intuitive user experience.
To load RDF triplets to the graph database the plugin called n10semantics from the Neo4j labs needs to be installed. This plugin enables the use of RDF and its associated vocabularies for data interchange (OWL, RDFS, SKOS and others). This plugin is also use to build integration with RDF generation and consuming components [21]. Some functionalities that are included by installing this plugin are the following: • Import and Export RDF in multiple formats (Turtle, N-Triples, JSON, etc.) • Model mapping on import and export • Import and Export Ontologies in different vocabularies

• Graph validation • Basic inference
Cypher is the graph-optimized query language incorporated in Neo4j. It understands and takes advantage of connections (relationships) between data. It is inspired by SQL, with the addition of pattern matching borrowed from SPARQL and uses simple ASCII symbols to represent nodes and relationships, making queries easy to read and understand [22].

Summary
In the literature review presented above, we found several methodologies that can be applied for representing and enquiring scientometric indicators, but that must be integrated in a single solution. On the first place, we must select a semantic representation of statistical data. Some authors used BIDO for the representation of numerical data about citations and other specific indicators. We found an area of opportunity because it is a problem to talk about numerical data in one way and when looking at other works they handle it differently.
Using the SDMX will allow us to have a standardized way to represent any indicator. On the other hand, SDMX can be extended in terms of dimension, attributes and values appropriate for describing publications, citations, researchers, etc. This solves the problem of having data with a certain level of information such as properties of authors. In this way, we have flexibility for defining dimensions proper of Scientometric Indicators. Finally, a semantic approach must be evaluated in order to demonstrate its advantages over traditional approaches. SDMX has demonstrated to provide a consistent representation of multidimensional data, but it must permit to capture particular differences between indicator definitions (e.g. annual versus quinquennial time periods). Besides, we must assure that any person familiar with SDMX is capable of discover, enquire and compose scientometric indicators encoded with this data model. We also must provide high-performance on query answering so we used the Neo4j platform for this purpose.

Methodology
In this work, we will follow a methodology commonly used in data science projects. The name of the methodology is CRISP-DM and it stands for Cross Industry Standard Process for Data Mining. It is a process model with six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment) that naturally describe the data science life cycle. The Business Understanding phase focuses finding a common motivating business goal to maximize the uptime and efficiency of machines by using predictive analytics. The Data Understanding phase hypotheses for hidden information regarding the data mining project goal are formed based on experience and qualified assumptions. In the Data Preparation phase, the engineer collects the relevant data and prepares it for the actual data mining task. This includes the preprocessing, e.g. data reduction and filtering, as well as feature generation with respect to the data mining project goal.
The Modeling phase consists on a data mining workflow that is constructed to find the desired parameter settings for the selected algorithms and to execute the data mining task on the preprocessed data. In the Evaluation phase, the trained model is tested against real data sets within a production scenario and the data mining results are assessed according to the underlying business objectives. After successful evaluation of the trained model, it is deployed into production in the Deployment phase [23].

Data understanding
Data for this research work is taken from an official workbook of the Research Office that stores data in a tabular way. The file is frequently updated and we took the last version of March 2021, historical data stays the same unless an error is found in a past calculation. The list of Scientometric Indicators that appear on this worksheet is of more than 100 indicators and each one of them belongs to a category. Among these categories, we can find the following: Publications and Cites, Patents, Students, Researchers and Rankings.
The workbook lists each Scientometric Indicator with the person responsible for calculating it, historical data per year, if the indicator is evaluated by quinquennium it stores also the range of years evaluated and the actual value if it is already available. If the Scientometric Indicator also has a dimension such as school, level of education, researcher level, among others, the values of the indicator are described in terms of these dimensions.

Data preparation
For experimentation purposes, we took a sample of 10 Scientometric Indicators shown in Table 1. These indicators were selected to illustrate that both unidimensional and multidimensional indicators can be represented.
• Unidimensional. The first kind of indicators associate a value to a time period. • Multidimensional. The second kind of indicators associate a value to a time period and to another dimension(s) (e.g. Schools).
After selecting the sample of Scientometric Indicators to model, we will prepare the information as we need it to be able to automate its transformation. The first step is to make a manual conversion of the 10 Scientometric Indicators listed in the original worksheet and obtain as output 10 different CSV files with the values of each indicator stored in a tabular way. In Fig. 2 we can observe an example of the extraction of 3 Scientometric Indicators which are in turn stored in 3 different CSV files. The next step is to manually write the head of the RDF files. We are building an RDF file for each Scientometric Indicator listed in our sample. As Scientometric Indicators are different, the head and observations of the RDF file will differ from others. The format of the RDF file would be Turtle as is one of the valid formats that the Neo4j graph database accepts. A Turtle file allows writing down an RDF graph in a compact textual form. The RDF model uses triples consisting of a subject, a predicate and an object <s, p, o> to represent data [24]. We divided the RDF file format into Vocabularies, Dataset, Data Structure Definition (DSD), Measure and Dimension Properties, Concept Scheme, and Observations.

Vocabularies
Additionally to standard vocabularies for representing RDF/XML (rdf, rdfs, xsd) and ontologies (owl, skos), we incorporated the ontologies shown in Table 2. The last ontology (tec) was defined for publishing our definitions. Our objective is to create a model easy and flexible to expand in case of creating new scientometric indicators.

Dataset
In this section of the RDF file, we created a Universal Resource Identifier (URI) for each Scientometric Indicator. The object can also be identified by a literal for the representation of simple values such as strings or numbers [25]. An example is shown in Fig. 3. We can observe that we use the RDF property label to identify it quicker for future tasks like queries. The Data Structure Definition (DSD) is defined in the data cube ontology.

Data structure definitions (DSD)
The Data Structure Definition of a dataset, describes the components such as dimensions, attributes and measures [26]. In this section, the structure of the dimensions and measures of the Scientometric Indicator is defined. Here is where we define the components including the SDMX attribute as a measure unit; for instance, number of publications, number of cites, number of researchers, etc. We also defined the component of the SDMX dimension, which includes Schools. In Fig. 4 we can observe how both of them are defined for defining the quinquennial number of publications per school. We defined with a specific URI both measures and dimensions of the Scientometric Indicator. As we mentioned before, we have two kinds of Indicators, unidimensional (only time dimension) and multidimensional (time and other dimensions). We extended the property sdmx-measure:obsValue for representing measures such as the number of publications, citations, researchers and posdoctoral researchers in our indicators.
We also extended the property sdmx-dimension:refPeriod, used for representing temporal intervals. In our case we defined an annual and a quinquennial (5-years) time interval. Time interval instances were borrowed from the United Kingdom's reference data server (http:// refer ence. data. gov. uk/). For indicating that an indicator is calculated for a School we extended the basic SDMX dimension property and constrained its value to instances of the School class provided by the VIVO ontology. School instances are available at the VIVO website of our institution (https:// resea rch. tec. mx/).

Concept scheme
A concept scheme provides labels to concepts and realizes both hierarchical and associative links [27]. For this purpose we used SKOS, a concept-centric data model based on RDF that identifies concepts using URIs to make already available knowledge organization systems public on the Web in machine-readable formats [28]. SKOS is devoted to developing specifications and standards that support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems, and taxonomies within the framework of the Semantic Web. It provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF) [29]. In our case we reused common concept schemes such as the list of current Schools.

Observations
In this section, automation is performed to generate the observations of each Scientometric Indicator. Multidimensional data are generally referred to datasets characterized by more than two dimensions [30]. We generated two interactive python notebooks, one for multidimensional indicators and another one for indicators with one dimension. In the first step of Data Preparation, we extracted manually in different files each Scientometric Indicator. This python code receives as input the CSV file and reads it. The next step is to melt or unpivot the data to be able to iterate through it and generate the observations. In Fig. 5 we can observe an example of this procedure.
Before it enters the loop, it receives 3 parameters to build the observations. The parameters are the indicator name, measure name, and measure label. The label is parameterized in order to provide a self-contained description of the data point in natural language. Observations are built automatically and manually pass them to each file. An example of observations of the same Scientometric Indicator is shown in Fig. 6. After passing through all these steps for each Scientometric Indicator, we will have 10 RDF files ready for the modeling in a graph database.

Modeling
In this phase of the CRISP-DM methodology, we describe the load of these RDF files into the graph database Neo4j. We installed the plugin and proceed to initialize the graph with settings such as handle Vocab-Uris as shorten, overwrite multi-values, handle RDF types as labels and the remaining settings remained in their default value.
After the graph initialization we proceed to use a store procedure from the n10s plugin for importing the first RDF file containing the Scientometric Indicator. This store procedure is called import.fetch and we call it from the browser terminal of the graph database. The procedure receive as parameters the location of the RDF file and the RDF format, in our case Turtle. An example of the load is shown in Fig. 7. This procedure is repeated for all Scientometric Indicators. We did it manually to make sure all the triplets were loaded correctly, but this procedure can be automated. After completing all the list of Scientometric Indicators our graph database is ready to be evaluated with some queries. In Fig. 8 we show all the nodes and relationships stored in our graph database of Scientometric Indicators. Blue nodes represent observation nodes, i.e. indicator data points, hence predominating in the graph.

Results
In this section, we will proceed to evaluate our approach by demonstrating that knowing the SDMX data model is enough for enabling indicator discovery, enquiring and composition. To do so we define and test five queries using the Cypher language. At the end of this section, we also compare the number of nodes and relationships in our Neo4j graph database against the number of triplets uploaded to a graph database using RDF files in an Apache Jena Fuseki server to analyze the complexity of the data structure.

Indicator discovery
By knowing that indicators are encoded as SDMX datasets we can ask for what is measured, as shown in the Cypher query in Fig. 9. In this query is inspected the structure of all the indicators in order to list the available metrics (measures). The result of this query, shown in Table 3, has the four measures available in the indicator sample: counts of publications, citations, researchers and posdoc researchers.
Once we select an indicator we can investigate how it is reported, i.e. how it is broken down. For this we need to investigate along which dimensions is reported the measure. The query shown in Fig. 10 extends the previous query by selecting only those indicators that report publication counts and adding the dimensions associated  to it. The results show that publications are reported in annual or quinquennial periods, and at institutional level or broken down by school (see Table 4). Next we asked for the number of dimensions used for describing each indicator (see Fig. 11). In this way we can distinguish unidimensional from multidimensional indicators (see Table 5).

Indicator retrieval
Now we formulated a query for retrieving the values stored in a specific indicator. Figure 12 shows the query used for asking the number of publications made by the institution, reported in quinquennial periods. The results, shown in Table 6, include the quinquennial periods and the corresponding number of publications. Quinquennial   intervals are defined according to the specification provided by the United Kingdom's reference data server (http:// refer ence. data. gov. uk/).

Indicator composition
Finally we evaluated the capability of our approach for calculating a new indicator from those currently stored. Figure 13 shows how both quinquennial publications and quinquennial citations are retrieved for calculating a scientific impact indicator: citations per   publication. The results (see Table 7), show the time interval, the number of publications (docs), the number of cites (citations), and the number of citations per publication (cites_per_doc). We had to cast both publications and citations to float for making a correct calculation.

Storage optimization in Neo4j
In order to observe the behavior of our graph structure in Neo4j, we decided to make a comparison of the number of nodes and relationships in Neo4j against the number of triplets in a default graph database created in Apache Jena Fuseki server with all the Scientometric Indicator RDF files created for this work. Both graphs received as input the 10 RDF files of Scientometric Indicators. In Table 8 we can observe that the Neo4j graph database has 287 nodes and 864 relationships and in Table 9 we observe that an RDF graph has 1578 triplets. The Neo4j is 45% smaller than    the original RDF graph in terms of relationships/triplets. The reduction in the number of relationships in Neo4j is due to the absorption of RDF literal values into nodes.

Discussion
Previous ontologies such as VIVO and BIBO provide semantic definitions for researchers, institutions, schools and publications, but they do not provide an appropriate representation of scientometric indicators. Whereas BiDo defines some scientometric indicators, using a vocabulary for statistical data such as SDMX allows to describe each indicator and compose new ones using their description. By extending SDMX with definitions borrowed from the VIVO ontology we provide a mean for representing scientometric indicators. In this way, the detailed information used for calculating these indicators could be accessed and verified in the institutional VIVO instance. Unlike Fox's approach [11], we used the well-known SDMX data model, which would facilitate the adoption of our approach. On the other hand, we extended the work of Thyri et al. [12] by evaluation the representation of multidimensional indicators modeled with SDMX.
Furthermore, we evaluated the capability for discovering available indicators, retrieving the corresponding values and calculating new indicators using the existing ones. In contrast to a relational data model, a user only needs to know the basic structure of a SDMX model to discover which metrics are defined and how they are broken down.
In our deployment to Neo4j we only imported the basic definitions for time intervals and VIVO instances (e.g. Schools). Nevertheless, it is possible to import additional information available in the corresponding linked data platforms. This information includes, sequence order between time intervals, and the list of schools' faculty. The former can be used for calculating the increment from 1 year to another in a given indicator (e.g. publications), whereas the latter can be used for counting the number of faculty members and build a new indicator.

Conclusions
Scientometric Indicators are important for universities in terms of decision making because they are used for university rankings. By modeling scientometric indicators extending the Statistical Data and Metadata Exchange (SDMX) ontology we enabled indicator representation, discovery, enquiring and composition. We showed how to semi-automatically generate these indicators and link them to data published in institutional (VIVO) and international platforms (UK Government), which in turn can be used for making additional inferences. Using the Neo4j graph database we deployed an efficient solution for data access.
Future work of our approach include the development of a chatbot that answers natural language questions about Scientometric Indicators. A chatbot is are conversational agents that allow the user access to information and services through natural language dialogue, including text and voice [31]. The chatbot will provide natural language processing to obtain relevant information of the question such as the intent and its entities. The chatbot must be capable of identifying the indicator been asked for, extract the parameters of the query (dimensions) and build the Cypher query. The chatbot must support the incorporation of new indicators and new data points.