Internal dynamics of patent reference networks using the Bray–Curtis dissimilarity measure

Background: Patents are indicators of technological developments. The science & technology categories, to which they are assigned to, form a directed, weighted network where the links are the references between patents belonging to the respective categories. This network can be conceived as a kind of intellectual ecology, lending itself to mathematical analyses analogous to those carried out in numerical ecology. The non-metric Bray–Curtis dissimilarity, commonly used in quantitative ecology, can be used to describe the internal dynamics of this network. Results: While the degree-distribution of the network remained stable during the studied years, that of the sub-networks of with at least k links showed that k = 5 is a critical number of citations: this many are needed that the bias towards already highly cited works come into effect (preferential attachment). Using the d ij Bay-Curtis dissimilarity between nodes i and j , a surprising pattern emerged: the log-probability of a change in d ij during a quarter of year depended linearly, with a negative coefficient, on the magnitude of the change itself. Conclusions: The developed methodology could be useful to detect emerging technological developments, to aid decisions, for example, on resource allocation. The pattern found on the internal dynamics of the system depends on the categorisation of the patents, therefore it can serve as an indicator when comparing different catego-risation methods.


Introduction
With the advent of "Big Data", the bottleneck in predictive sciences is not finding relevant data but how to make sense of them.Studying patent databases may reveal patterns in technological advances that can be used to aid decision making, for example during resource allocations.
Patents are one of the most important indicators of human technological advancements.They are connected by references, thus generating links between the technological categories, too, to which the patents are assigned.This network can be conceived as a kind of intellectual ecology, lending itself to mathematical analyses analogous to those carried out in numerical ecology [1].
One of the most important concepts in trend-and pattern-recognition is the dissimilarity between two objects.The choice how to quantify it is not obvious and does not necessarily result in a distance-concept, for which the criteria are: (i) d(A,B) ≥ 0 and d(A,A) = 0 (i.e. the dissimilarity between two different objects A and B is a non-negative number and it is zero if A = B).

(ii) (A,B) = d(B,A) (symmetry) (iii) d(A,B) < d(A,C) + d(C,B), for any third C (triangle inequality)
The last requirement does not necessarily hold for many commonly used dissimilarity measures.It is difficult to achieve, for example, for transport routes, due to their typical hub-centred organisations, if the distance is measured by the duration of the travel between nodes, as a journey to a hub is generally faster than between non-hubs.
A record in a patent database contains certain attributes of a given patent, among others the scientific-technological category to which it has been assigned.Patents typically refer to other patents, which can be conceived as links between the respective technological categories.If a patent assigned to category c j refers to patents assigned to categories c 1 …c n , then we say that the latter categories influence c j .If we quantify the weight of this influence by the number of the respective references, then an 'influence-vector' can be assigned to each category, showing how much it is affected by the other categories.
Patent data have been used by several authors [2][3][4] to describe, possibly predict, technological changes.Érdi et al. [2] defined a citation vector v = [v 1 … v n ] for each patent, where an entry v i represented how many times patents from category i, were cited in the patent.After suitable normalisation, v i was seen as a weight to what extent the category c i influenced the patent.The authors grouped the patents according to the Euclidean distance between their citation vectors, then, by means of the Ward-algorithm [5], produced a dendrogram, the temporal dynamics of which were used to predict significant changes in the system.
A critical point in this approach is the use of the Euclidean metric, i.e., the distance concept was derived from the scalar product of vectors.We demonstrate, in what follows, the advantages of considering a non-metric dissimilarity measure instead.We show how the state and internal dynamics of technological developments can be described by network science methods and using the dissimilarity measure of Bray-Curtis [6].The transformation of Gower [7] will be used to visualize the state and temporal behaviour of the system in two dimensions.

Method
The Bray-Curtis (BC) dissimilarity measure Let the {c i } (i = 1…n) be technological categories and define the S = [s ij ] nxn citation matrix as follows: let s ij be the total number of patents that was put in the c j category and cited patents assigned to c i where i ≠ j.This s ij number, a non-negative integer, can be conceived as a quantification of the extent to which the c i category influenced c j , i.e., the weight of the c i → c j directed edge.As these can change with time, in fact we can also use the S(t) = [s ij (t)] nxn notation.Our rule sets the diagonal entries of the matrix S(t) to zero.
Quantify the dissimilarity between the c j and c k technological categories by means of their respective influence vectors following the idea of Bray and Curtis [6]: where, for the i = 1…n index of the summation, i ≠ j, and i ≠ k.
This way, again, we excluded the references of the categories to each other, as we want to see how similar category c j is to category c k , in terms of the composition of their influence vectors, from which we left out the direct link between them.We call d jk ∈ [0,1], the BC-dissimilarity assigned to the (c j ,c k ) category-pairs.The focus of our analysis is the temporal variation of the D BC = [d jk ] nxn dissimilarity matrix.
A subset of the categories will be called contracting in a time interval, if the above dissimilarity between any two members of the subset consistently decreases during that interval.This means that the composition of the respective influence vectors of the categories belonging to this subset is becoming more and more similar to each other.Note that the direct link between two categories do not affect their BC-dissimilarity, as the summations in Eq. ( 1) excludes the i = j and i = k cases.In other words, if the (c j ,c k ) categories get closer to each other, that does not necessarily mean that they would refer to each other at higher probability, but it is the composition of their two respective (1) influence-vectors (that excludes the direct c j → c k and c k → c j references) that is becoming similar with time.

Gower-visualisation of the Bray-Curtis dissimilarity between categories
Consider the G = [g kl ] nxn Gower-transformation of a D BC = [d kl ] nxn dissimilarity matrix: where d' kl = − 0.5 d kl . 2  The eigenvectors of G form a basis, with properties that can be utilized for visualization [1].Namely, as the G matrix has non-negative and different eigenvalues (λ 1 , λ 2 ….), consider the V matrix of its v k column-eigenvectors and take the W matrix of the ).As proved by Gower [7], the scalar product of any two row-vectors of the W = [w 1 , w 2 …] matrix will be equal to the dissimilarity between the two respective categories that can be visualized by these rowvectors with ordinary Euclidean distance between them.Besides, their first 2-3 components are much greater than the rest, which can therefore be omitted, for the sake of representing the categories by these first few components of the row vectors, i.e., in 2 or 3 dimensions.
The temporal evolution of the obtained points demonstrates how the dissimilarities increase or decrease with time, with the potential of identifying what combinations of technological areas emerge or tend to form clusters (Fig. 4.).

Data
We analysed a freely available dataset on patents registered in 2018-19 of the United States Patent and Trademark Office (USPTO).The database is updated quarterly on the PatentsView Web portal.Accordingly, in what follows, we use the notation q 1 = quarter 1 of 2018; … q 8 = quarter 4 of 2019.Data were downloaded from the 'Data Downloads Tables' (tables named 'ipcr' and 'uspatentcitation').Relevant data from the two tables were merged to establish the International Patent Classification (IPC) category of each patent.Each record of the compiled table represented a patent with (i) a unique ID of the actual patent; (ii) the IDs of the patents cited by the actual one; and (iii) the ID of the patents citing the actual one.Besides, the record also contained the IPC categories to which the actual patent was assigned and the date when the patent was approved.As we found different levels of patent classification, for the sake of simplicity, only the first level categories have been exploited.If the category or the approval date was not available, then the record was omitted.Altogether 115 categories (c i , where i = 1 … 115) were distinguished and linked by reference lists of the patents as described above.The categories were divided, in the database, into 8 groups denoted by A-H, and we followed this notation. (2)

Results
In each quarter year of 2018-19, more than 80,000 patents were processed and each of them was assigned to one of the 115 categories, so the number of nodes in the network is N = 115.Their total number of references were more than 300,000 in each quarteryear, about third of which was omitted because they cited patents that were assigned to the same category.This way, we constructed a weighted, directed network of categories, where the direction of an edge is from a cited category to a citing one and its weight is the number of respective citations.As follows from above, the sum of the weights (the entries of our S matrix) is ca.200,000.The weights span from one to several hundreds, the smaller numbers being much more frequent (the lowest number, 1, occurring > 1000 times), than the higher ones.The degree-distribution of the network nodes (i.e., the frequency diagram of the number of citations in the categories) follows the powerlaw as shown by Fig. 1.This indicates that the overall weighted citation network of categories falls in the commonly observed scale-free group [8], throughout all the 8 quarter of years.
Note that the non-weighted version of the network would have ca.5-6000 edges, out of the possible 13,000, so it is relatively dense.In this network, the average indegree of a node (i.e.how many other categories it cited at least once) would be ca.45-50, out of the possible maximum 114.
For the weighted network, which is our focus, we studied the properties of some weight-defined subnetworks.Figure 2 shows the distribution of the in-degrees of the nodes (i.e. the proportion, or relative frequency, of the incoming edges, with weights, within the total number of weighted edges), if only those edges count that represent exactly k citations, where k = 1, 2, 5.The distribution strongly depends on how many citations are needed to form an edge.With k increasing, the in-degree distribution transforms from unimodal to close-to-exponential.
That this finding is valid through the studied two years is demonstrated in Fig. 3.The in-degree distribution of the subnetwork with edges representing exactly 1 citation (Fig. 3a) is close to Poissonian for all the eight quarter-years.Recall that, while the degree distribution of an Erdős-Rényi random graph is Poissonian, that of a scale-free network follows the power-law [8].This suggests that citing a patent from another category only once can be just random, resulting in the Erdős-Rényi option.However, when the edges represent several (at least ca. 5 citations; see Fig. 3b), then the pattern is more reminiscent to that generated by the power-law.A logical explanation for this is that, if a category cites another one via at least five patents, then the principle of preferential attachment [8] is more detectable and the influence of the cited category increases according to this bias.The "preferential attachment" can be translated for our case as: the probability of citing a category is proportional with the number of citations that this category already has.It has been proven [8], that this mechanism leads to the linear pattern on the log-log scale shown by Fig. 1, towards which the exponential distribution shown in Fig. 3b is an intermediate step.
For each quarter-year, we calculated the Bray-Curtis dissimilarities between the category-pairs, thus creating the D BC (q i ) = [d BC (q i )] dissimilarity matrices as a function of the quarter-years and their differences with the Δd BC (q i ) entries: For demonstration, Fig. 4. shows the movement of two contracting subgroups, with two and three members respectively.The movement is projected onto a two-dimensional Gower-space.The dynamics of the points can be followed in 3D, too, on Additional file 1, as a time-lapse simulation.
Figures 1, 2 and 3 showed that the degree distribution of our network follow consistent patterns through 2018-19.However, this does not mean that the network is stationary.In fact, it hides significant internal dynamics as Fig. 5 shows.It suggests (3) Fig. 4 The internal dynamics of five categories, when their dissimilarity is quantified by the Bray-Curtis measure, during the studied period (8 quarter-years).The Bray-Curtis space is projected onto a two-dimensional, Euclidian, Gower-space Fig. 5 Log-probability, or log(relative frequency), histogram of the Δd BC changes in dissimilarities between categories in the patent database from the q 1 to the q 2 quarter-years.The figure shows that the further away two categories are in the BC-space, the more probable that they move closer to each other.So far this is expectable, but the linearity between the logarithm of the relative frequency of a change in the BC-dissimilarity and the BC-distance itself is surprising that, in a quarter-year, the log-probability of a category getting closer to / further from another category by a given Δd BC -measure is a linear function of that measure.The more intriguing this is because the BC-transformation is non-linear, and the BC-measure does not qualify for a distance concept.
It is an open question, answer to which is out of the scope of this paper, whether there is a mechanistic explanation behind the observed linear pattern.What we established was that the "significant" connections (i.e., edges representing at least five references) generated a close-to scale-free network through the 8 quarteryears, presumably driven by the mechanism of preferential attachment.The internal dynamics however is far from stationary and for all the seven histograms of the transitions made between the eight quarter-years showed the log-linear pattern of Fig. 5.

Discussion
Intellectual achievements are being built on each other, and we took the US Patent & Trademark Office database to analyse these interactions.Two critical simplifications were made when analysing the data: A/We only considered the first level of categorisation.B/We did not differentiate between the times of patent filing and patenting.These simplifications may affect the findings, but here our focus was the methodology rather than higher resolution analysis.
We constructed a weighted, directed network of categories where the weighted edges represent references between the patents belonging to the categories.For each node (category), an "influence vector" was assigned, composition of which characterising how other categories affect that node.The temporal changes in the (dis)similarity of the composition of these influence vectors were used to identify the dynamics of the constructed network, representing this way a sort-of evolving intellectual ecology.
A critical concept here is the measure of dissimilarity between categories.For this, we chose a non-metric dissimilarity measure, that of Bray-Curtis [6], which is commonly used in numerical ecology.
The developed methodology could be used for example to describe the emergence of new technological developments, or to support decisions on research and development resource allocation [9].

BC-differences between citation numbers
Quarter-years G05 G06 Fig. 6 Influence of the categories G05 and G06 on B25 and H05.Their component-wise BC-differences between the citation numbers show a decreasing trend over the studied five quarter-years A method for this could start with identifying contracting subnetworks, as demonstrated in the 2D Gower-space in Fig. 4. To that end, the categories H05 ("Electric techniques not otherwise provided for") and B25 ("Hand-held tools; Portable, power-driven tools; Handles for hand implements; Workshop equipment; Manipulators") were further explored using a multi-step filtering method.We searched for the categories they referred to, then for patents within the search results that could explain the observed convergence over the two-year-long observation time.
We found that the categories G05 (controlling, regulating) and G06 (computing, calculating, counting) had a significant impact on their H05 and B25 citing categories.For both cited categories (G05, G06), the BC dissimilarity measure from H05 and B25 showed a steady decrease during the studied period (Fig. 6).This trend my be expectable by an expert; however, the role of quantitative modelling is not only to find new patterns but also rank various scenarios, thus give an objective tool to technologically less experienced managers, who nonetheless may be responsible for making decisions, for example about investing in new areas.
When exploring the patent database, the linear pattern demonstrated by Fig. 5 is probably the most surprising finding.Intrigued, we downloaded the data of Microsoft Academic Graph containing-among others-scientific publication records, citation relationships and fields of study.We carried out the same analysis as in case of patent data, but no linear pattern was observed for the analogous distribution shown in 5. Therefore, our observation is not due to some properties of the BC-dissimilarity measure.The reason might be the way how the patents were categorised, though it is not clear why.
Nonetheless, as an application, the observed linearity could be a reference for other categorisation methods.However, while the preferential attachment principle [8] is an elegant explanation for the scale-free pattern shown by some of the previous results, analogous mechanistic reason for this last one seems neither straightforward nor intuitively predictable.

Fig. 1 Fig. 2
Fig.1The distribution of the weighted degrees of nodes (categories) throughout 2018-2019, on the loglog scale, follows the power-law

Fig. 3
Fig. 3 In-degree-distribution of the categories (nodes) for all the eight quarter-years (q1-q8).Only those edges were considered, which represent (a) exactly one citation; (b) exactly five citations.The dotted lines represent the (a) Poisson-(b) exponential distribution, each with a mean-representing parameter that was estimated by the average of the respective dataset