Large scale analysis of violent death count in daily newspapers to quantify bias and censorship

In this work we develop a series of techniques and tools to determine and quantify the presence of bias and censorship in newspapers. These algorithms are tested by analyzing the occurrence of the keywords ‘killed’ and ‘suicide’ (‘morti’ and ‘suicidio’ in Italian) and their changes over time, gender and reported location on the complete online archives (42 million records) of the major US newspaper (The New York Times) and the three major Italian ones (Il Corriere della Sera, La Repubblica, La Stampa). Using these tools, since the Italian language distinguishes between the female and male cases, we find the presence of gender bias in all Italian newspapers, with reported single female deaths being about one-third of those involving single men. Analyzing the historical trends, we show evidence of censorship in Italian newspapers both during World War 1 and during the Italian Fascist regime. Censorship in all countries during the World Wars and in Italy during the Fascist period is a historically ascertained fact, but so far there has been no estimate of the amount of censorship in newspaper reporting: in this work we estimate that about 75% of domestic deaths and suicides were not reported. This is also confirmed by statistical analysis of the distribution of the least significant digit of the number of reported deaths. We also find that the distribution function of the number of articles vs. the number of deaths reported in articles follows a power law, which is broken (with fewer articles being written) when reporting on few deaths occurring in foreign countries. The lack of articles is found to grow with geographical distance from the nation where the newspaper is being printed.
Whereas assessing the truth of a single article or debunking what are now called ‘fake news’ requires specific fact-checking and becomes more difficult as time goes by, these methods can be used in historical analysis and to evaluate quantitatively the amount of bias and censorship present in other printed or online publications, and can thus contribute to a quantitative assessment of the freedom of the press in a given country. Furthermore, they can be applied in wider contexts such as the evaluation of bias toward specific ethnic groups or specific accidents.


Introduction
Violent death is a dramatic event that can be objectively quantified in terms of the number of lives lost. Words such as 'killed', 'dead', 'casualty', 'suicide', etc. can thus be used to assess the presence of bias or censorship in reporting incidents involving specific locations, groups of people, or time.
Coverage of death in the mass media has been studied for the insight it offers on how societies perceive this dramatic event [1]; at the same time, selective death reporting can be used to shape and distort public opinion. A 6-months period study [2] of 1975 issues of the Register Guard and the Standard Times showed that newspapers tend to over-report more news-worthy death causes and overlook others that are considered less interesting by the public. The work considered the number of occurrences, the number of deaths reported, and the text surface area of the articles dealing with death and compared them with the statistical occurrences, finding that all forms of disease were under-reported whereas violent or catastrophic events were overrepresented. An even more biased selection along these lines was found analyzing the news selection by television and radio [3]. Studies on the presence of geographical bias in reporting accidental deaths have been conducted manually on small data sets: a 51-issues study (two months) of two German (Frankfurter Allgemeine Zeitung and Süddeutsche Zeitung) and two Australian newspapers (The Australian and The Sydney Morning Herald) [4] have shown a weak relationship with social proximity in the number of articles reporting deaths. Similarly, a 6-months study of the German tabloid Die Abendzeitung [5] reported evidence of a correlation between the number of accidental deaths and the length of the article as well as a negative correlation of the distance between the town where the newspaper was printed and the event reported. In this case, the analysis involved the study of the amount of printed space occupied by an article reporting of a fatal event. 
The authors did not find any significant correlation between the area taken by the article and the number of deaths reported [6] but showed an inverse relationship (linear in logarithmic scale) between the area of printed text normalized to the number of victims and the distance (in km) of the event from Munich, the town where Die Abendzeitung was printed. They also found a minimum threshold of victims, increasing with distance, needed to elicit a report in the newspaper. They thus likened the response of the paper to the logarithmic relationship between stimulus strength and response intensity in human perception [7].
By contrast, [8] analyzed the occurrence of pictures in Time and Newsweek magazines. The study involved ten issues in three years for each magazine, for a total of 60 issues. The work involved measuring the area occupied by the picture, determining the country of origin of the picture, and classifying its violent or non-violent nature. The study found that, although domestic pictures were more common than foreign ones (Time = 66.4%, Newsweek = 71.2%), images depicting death were printed more often when the event took place abroad (Time foreign: 31%; Newsweek foreign: 25%) than when it occurred in the USA (Time domestic: 12%; Newsweek domestic: 13%).
A more recent study [9] showed that direct pictographic representations of death are very rare in western newspapers, amounting only to 4.5% of the articles (997 stories in a two-month time frame, 357 with pictures).
Most of these studies have been performed by manual scanning of the newspapers or magazines, therefore restricting the scope of the possible analysis. With the digitization of printed literature, it is possible to perform systematic analysis on the entire archives of newspapers, magazines, and books. A large-scale analysis of the number of occurrences of notable personalities in books has shown evidence of censorship of various individuals during the Nazi regime [10]. The books have been digitally converted via OCR (Optical Character Recognition) [11] as part of the Google effort to digitize books and the occurrence of individual names has been compared with the lexicon of the various languages [12].
In this work, we have developed several algorithms that allow a systematic study of the occurrence of articles involving fatal accidents and of their gender/location. This makes it possible to assess the distribution laws and determine the presence of bias and censorship in reporting. We have used daily newspapers since they have the advantage of addressing wider audiences with higher frequency and a longer temporal consistency, often covering events usually ignored by books.
We have applied these algorithms to the full historical archives of the three major Italian newspapers, Il Corriere della Sera (Milan, CDS, 1876-2017), La Repubblica (Rome, REP, 1985-2017), and La Stampa (Turin, STA, 1867-2005), and of the major US newspaper, The New York Times (NYT, 1955-2017), searching for the articles containing selected keywords: 'killed' ('morti' in the Italian language, the mixed-gender plural term, as well as the singular female and male forms, 'morta' and 'morto' respectively) and 'suicide' ('suicidio'). These words have been chosen for their high rate of occurrence (3.1 million articles out of the 42 million comprising the newspaper archives) with respect to similar lemmas (dead, casualties...) and for their correspondence of usage, since both are used to indicate violent deaths. Most importantly, in the case of the plural form they contain the number of people involved. Therefore, in addition to a simple word count, it is also possible to study the distribution of the number of people killed, the location, and the category of victims.
The rest of the paper is organized as follows: the "Methods" section describes the archives, the retrieval and data processing steps, and the various algorithms. The "Results and discussion" section presents the results on the time behaviour, gender, censorship, the scaling law, and the last digit analysis. Perspectives and future work are discussed in the "Conclusions" section.
Figure 1 shows the steps taken to access, parse, and identify the articles, from the access of the online archives to the creation of the various data sets, one for each newspaper and lemma considered. All these tasks are accomplished using Python scripts. Accessing the online archive of each newspaper takes a few days to complete over the full time span, since access is mediated by a web-based interface that usually returns ten articles at a time. The storage and handling of the data sets are performed using the Root [13] framework and require a few minutes to complete. This C++ based environment was developed at CERN to deal efficiently with high volumes (≃ Pbyte) of data produced by accelerator-, space-, and ground/underground-based detectors. As such, it is especially suited to create the ntuples (TTrees in Root nomenclature) containing the data extracted from the newspaper articles. All the selections, histogramming, and fitting have also been performed in this environment.
Fig. 1 Scheme of the processing steps from the query on the newspaper archive to the creation of the database. Each query of a specific lemma on a newspaper results in a data set which is then subsequently analyzed

Methods
A scheme of the various types of algorithms used to analyze the data sets and the main information they provide is shown in Fig. 2.

Newspaper archives
Only printed editions have been considered and online articles have been excluded for consistency with pre-internet years. The newspapers have freely accessible online archives that cover a major part of their printing time. Out of the T ≃ 42 million articles comprising the four newspaper online archives, all the articles which matched the considered lemmas ('killed', 'morti', 'morta', 'morto', 'suicide', 'suicidio'; ≃ 3.2 million) have been extracted. See Table 1 for details on the data size of each newspaper. Details of the newspapers and archives are as follows:

Lemmas considered
This work concerns the study of the occurrence of the following terms:
• 'Killed' (in English) and 'morti' (plural, in Italian). Note that the literal translation of killed in the Italian language is uccisi, but this word is less frequent than morti and usually (but not always) employed in association with murder. Morti is more often associated with violent accidental death and thus has a wider correspondence in usage with killed. Articles with the keywords casualties/vittime occur with a lower frequency than killed/morti (10% to 30% less). The English term is more strongly associated with the dead or wounded during armed conflicts and has peaks during major US wars (Civil, WW1, WW2, Korea, Vietnam, etc.). In the case of the Italian newspapers, the correlation is lower, since vittime is more often used for people killed in accidents, natural disasters, and so on.
• 'Suicide' (in English) and 'suicidio' (in Italian).
• 'morta' and 'morto' (in Italian), respectively the feminine and masculine singular forms of 'morti'.
Some methods discussed below can be applied to any lemma, whereas the distribution law and last digit analysis are usable only with 'quantitative' keywords. Their application to other such keywords, e.g. 'dead', 'victims', and the already mentioned 'casualties', will be the subject of future work.

Construction of the data sets
1. Query and pre-processing: Each data set has been constructed by querying the web servers of the four newspapers with Python scripts that emulated manual user input (Fig. 1). The input had to be configured for each archive in order to enter the desired keyword and iterate the requests over the time interval of the newspaper archive. The html pages received in reply to the query were then saved locally. Their overall number depends on the number of articles present in each page: it ranges from ≃ 58,000 for NYT to ≃ 300 for CDS. As mentioned, this data acquisition/preprocessing phase over the newspaper archives takes 5 to 7 days to complete, depending on the newspaper considered. We could not access the full articles since they are often present only in image form (especially in the years 1850-1910) and the necessary work and resources would have been almost equivalent to those required for the digitization of the newspaper itself. Furthermore, the work would only have resulted in a marginally higher efficiency in event detection and would not have changed the results. We define an 'event' as a newspaper article retrieved upon the query of the keyword killed/morti. In our data sets, each event contains the newspaper name, the date, and the page (not present in NYT) of the article, the text of the title, and, if present, a part of the text of the article.
2. Parse: The html pages are then parsed with Python regular expressions that extract the relevant article information (title, snippet, page number, date, etc.). This phase takes a few minutes to complete.
3. Filter: Sometimes the query only returns a link with no usable text: for instance, a title can be empty (especially for issues of the nineteenth century), incomplete, or missing the requested word. These events have been discarded (in the case of STA, this restricts the database to the period 1910-2005 for most purposes).
4. Text analysis: The events are then analyzed to find the number k of people killed. This is searched in numeric, text, and hybrid form. To reduce classification errors, the value of k is searched close to the keyword in the forms: 'k killed', 'killed k', 'k attribute killed', 'killed attribute k'. Title and text are also parsed to search for the type of event (e.g. car, airplane crash, war, illness) and the people involved (e.g. children, women, ethnicity), which will be the subject of a following paper.
5. Geographical analysis: Several databases of world places have been used to associate the location of the event with its country of origin. Cases of homonymy (Florence, Paris, Cairo...) have been resolved by assigning the location of the more famous one at that time. For instance, 'Cairo' entries in the years 1861-1865 have been assigned to the US since the location appeared often during the American Civil War.
6. Exclusions: Duplicate entries, that is, articles identical in title, text, and date, are then removed, but different reports on the same event are considered as separate since it means that they have been considered worthy of more than one article. See Table 1 for the size of the datasets according to the various selections.
All events reporting deaths of animals (fish, herd, cows, etc.), usually associated with high k have been discarded.
From foreign events we also excluded all articles where the words 'Italian' or 'US'/'American' appear associated with foreign countries, thus removing foreign events where citizens from the corresponding newspaper-printing country were involved. This amounted to less than 1% in peacetime.
Once the parsing of the archives is completed, the remaining processing steps in this phase take a few minutes to execute and produce the database/root files for the subsequent analysis. This represents an increase of several orders of magnitude in speed with respect to any traditional, manual-scanning method, which was necessarily constrained to a limited amount of time and newspaper issues.
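The text analysis step (searching for k close to the keyword) can be illustrated with a minimal Python sketch. This is not the authors' code: the pattern names, the tiny `WORD_TO_NUM` vocabulary, and the choice of allowing at most one 'attribute' word between number and keyword are all assumptions made for illustration, and only the English keyword is handled.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny vocabulary of spelled-out numbers.
WORD_TO_NUM = {
    "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "twelve": 12,
}

# A number: digits with thousands separators, plain digits, or a known word.
NUM = r"(\d{1,3}(?:,\d{3})+|\d+|" + "|".join(WORD_TO_NUM) + r")"

# 'k killed' and 'k attribute killed' (at most one word in between).
BEFORE = re.compile(rf"\b{NUM}\s+(?:\w+\s+)?killed\b", re.IGNORECASE)
# 'killed k' and 'killed attribute k'.
AFTER = re.compile(rf"\bkilled\s+(?:\w+\s+)?{NUM}\b", re.IGNORECASE)


def extract_k(text: str) -> Optional[int]:
    """Return the number of people killed found next to the keyword, if any."""
    match = BEFORE.search(text) or AFTER.search(text)
    if match is None:
        return None
    token = match.group(1).lower().replace(",", "")
    return int(token) if token.isdigit() else WORD_TO_NUM[token]
```

Restricting the search to the immediate neighbourhood of the keyword trades recall for precision: a title such as 'Mayor killed' yields no value of k rather than a number picked up from elsewhere in the text. A sketch this simple would still mistake a year placed right before the keyword for a death count, so a real pipeline needs further checks.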

Errors
Statistical errors are due to fluctuations in the number of events present in a given bin of a given selection. Fitting algorithms take into account the errors associated with each bin to calculate the errors of the fitted parameters.
Sources of systematic errors can be due to OCR (optical character recognition) misidentification. This is more frequent for old issues where the quality of the scanned pages is lower and can result in a lower efficiency for the first years of the newspapers.
This can be assumed to be independent of the number, type, or location of people killed k so that the temporal behaviour and distribution laws should not be affected. See Additional file 1 for a discussion on systematic errors.

Results and discussion
The analysis of the data sets created above can yield information on how the newspapers consider the various events depending on geographic location, historical period, or gender. In this section, we describe the main algorithms employed and the results they provide. The various algorithms, the processing steps, and the key results derived are also schematized in Fig. 2. Figure 3 shows the yearly total number of articles, T(t). Major historical events can also affect this value: for instance, it increases in NYT during the US Civil War due to more articles being published and decreases in CDS and STA during the two World Wars due to a shortage of materials resulting in fewer pages being published.

Historical events
In the same figure the number of articles with the keyword 'killed', K(t), is also shown. From it, we can derive the normalized fraction of 'killed' events with respect to the total, R = K/T (shown in the same figure), a value more affected by historical events.
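As a small worked example of the normalized fraction R = K/T, the sketch below computes R for one year together with a statistical uncertainty. The binomial error model is an assumption made here for illustration (the text does not state how the quoted errors on R are obtained), and the function name and input numbers are hypothetical.

```python
import math


def killed_fraction(k_articles, total_articles):
    """Return (R, sigma_R) for K keyword articles out of T total articles.

    Assumes binomial fluctuations: sigma_R = sqrt(R * (1 - R) / T).
    """
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    r = k_articles / total_articles
    sigma = math.sqrt(r * (1.0 - r) / total_articles)
    return r, sigma


# Hypothetical year with 10,000 articles, 250 of which contain the keyword.
r, sigma = killed_fraction(250, 10_000)
print(f"R = {100 * r:.2f}% +/- {100 * sigma:.2f}%")
```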
For the events with a determined location, it is possible to separate domestic ($K_d$) from foreign ($K_f$) occurrences and assess how their relative importance evolved with time (Fig. 3). NYT reporting on foreign deaths grows over time, becoming more frequent than domestic reporting at the onset of WW1 and permanently from WW2 on. The Italian newspapers divide the reporting between domestic and foreign cases roughly equally, with CDS covering more foreign events in more recent years and STA and REP more domestic cases. The other relevant features are the foreign peaks during the World Wars and the drop in domestic deaths between 1923 and 1945 for STA and CDS, due to censorship by the Fascist government (see below).
During the American Civil War, the total number of NYT articles increases by 18.5% with respect to the interpolated values of $T$ between 1861 and 1865. $K$ increases to a maximum of 2.80 times the pre-conflict value ($K_{1863}/K_{1860} = 2.80$; the keyword 'casualties' has an increase of $C_{1863}/C_{1860} = 4.23$). Conversely, the number of suicides drops to a minimum of $S_{1864}/K_{1860} = 0.36$.

World War 1 (28/7/1914-11/11/1918, Italy from 23/5/1915, US from 6/4/1917)
Censorship was very strong in all countries involved in both World Wars, with the removal of all information that could be beneficial to the enemy: letters, reporting of battles and defeats, casualties etc. [18].
During wars the rate of suicides is known to decrease: this phenomenon is usually explained by the higher sense of purpose during the war effort [19][20][21] and is found to occur both when one's own country and when other countries are at war. However, the decrease of suicide reporting by newspapers is more prompt and intense (dropping to 1/3 of the pre-war value) than that recorded by statistics. Since in both countries this occurred in 1914, before either country was at war, we can conclude that this was not directly related to censorship. In Italy, the number of articles on suicides passes from … The drop before the US or Italy entered the war suggests attributing the lack of suicide reporting to a reduced interest by the editorial rooms rather than to censorship.

Fascist government in Italy (31/10/1922 -25/7/1943)
The case is different during the Fascist government in Italy. The Italian government of the time exercised strong censorship on the printed press and radio. On 14/7/1924 a circolare (note) from the then Minister of Interior Federzoni allowed the sequestering of copies of newspapers to 'prevent stirring up public opinion'. On 31/12/1924 all newspapers were sequestered and the directors replaced with ones affiliated with the regime. In October 1926, several daily newspapers were closed until the end of WW2, among them L'Unità, L'Avanti! and L'Ora [14,15].
Government censorship aimed to present an efficient state and thus had to remove all negative news. Censorship involved all media of the time: radio, theater, movies, books, and newspapers. Authors, especially those of Jewish origin but also those who were against the regime for political reasons, fell into disfavour. This 'targeted' censorship was similar to what occurred in Germany and was reported in [10], where prominent individuals were mentioned to a greater or lesser extent according to their race or standing with respect to the Nazi government.
On a wider scale, government guidelines [16,17] to newspapers required that crime reporting had to be compressed in a few lines, and suicides had to be ignored, with the result that articles involving domestic deaths and accidents almost disappeared from newspapers. With the datasets of CDS and STA it is possible to quantify the overall effect of Fascist censorship in the reporting of violent deaths [22,23].
We estimated the amount of censorship for articles with morti by interpolating the value of $R'$ between 1922 and 1946: between 1923 and 1943 there were 2,800 ± 200 domestic articles with at least two casualties missing for CDS and 2,900 ± 300 for STA. In both cases, they amount to 75% of the articles featuring domestic deaths printed in the same period.
A similar analysis on the k = 1 events (morto, morta, see Fig. 5) yields about 22,000 ± 1,300 articles censored by STA (60% of those printed in the period considered) and 29,000 ± 500 by CDS (20% of those printed in the period considered).
In the case of suicides (another forbidden topic during the regime [15]), Fig. 6 gives $S_{missing} = 17{,}000 \pm 600$ ($S_{missing}/S_{present} = 2.8$) for CDS and $4{,}100 \pm 300$ ($S_{missing}/S_{present} = 1.1$) for STA, in contrast with the growing rate of suicides at the time [20,24].
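The interpolation estimates quoted above can be illustrated with a minimal sketch. For simplicity it interpolates yearly article counts linearly between two reference years and sums the deficit in between, whereas the text interpolates the ratio R′. The function name and the toy numbers are hypothetical and do not come from the CDS or STA data sets.

```python
def missing_by_interpolation(counts, start, end):
    """Sum of (linearly interpolated - observed) counts for start < year < end.

    counts maps each year to the observed number of articles; the values at
    `start` and `end` are taken as the uncensored reference levels.
    """
    y0, y1 = counts[start], counts[end]
    missing = 0.0
    for year in range(start + 1, end):
        expected = y0 + (y1 - y0) * (year - start) / (end - start)
        missing += expected - counts[year]
    return missing


# Toy series: reference years 2000 and 2003, depressed counts in between.
toy = {2000: 100, 2001: 40, 2002: 50, 2003: 120}
deficit = missing_by_interpolation(toy, 2000, 2003)
printed = toy[2001] + toy[2002]
print(deficit, deficit / printed)  # missing articles and missing/printed ratio
```

The missing/printed ratio in the last line is the analogue of the $S_{missing}/S_{present}$ values quoted in the text.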

World War 2 (1/9/1939-2/9/1945, Italy from 10/6/1940 as part of the Axis and from 8/9/1943 as part of the Allies, USA from 7/12/1941)
Censorship in the US during WW2 relied mostly upon the self-censorship of news outlets: forbidden topics included weather and crop reports, correspondence, travel schedules and, naturally, troop movements [25]. The Office of Censorship released on January 15, … [18]. The total number of NYT articles remains essentially unchanged, with $R$ increasing only in 1940, from $R_{1938} = (2.02 \pm 0.04)\%$ to $R_{1940} = (2.50 \pm 0.04)\%$. This is due to an increased amount of reporting of casualties prior to the entrance into the conflict. This is confirmed by the value of $R' = K_{foreign}/K_{domestic}$, which rises from $R'_{1938} = 0.63 \pm 0.01$ to a maximum of $R'_{1940} = 0.76 \pm 0.01$, again before Pearl Harbor. Afterward, since US soil was not attacked, the amount of reporting of domestic events did not change, so we can conclude that it was not affected by censorship.
The effect of WW2 in Italy is mostly evident in the sharp drop in T and K in the later years of the war. This was both due to paper shortage and the bombings on Milano and Torino, where the newspapers were printed. After the armistice with the Allied forces (8/9/1943), Italy was split between the South, controlled by the Allies and La Repubblica di Salò in the North. After a short period free from Fascist government, CDS and STA are then aligned to the Nazi-controlled government of the North so the definition of 'domestic' and 'foreign' becomes fuzzier and the ratio R ′ increases.

Recent decades
If we consider the reporting of deadly accidents in the last decades we see that it is … For REP, the number of articles reporting violent death, $K$, increases with time but its percentage after 1995 decreases by $(dR/dt)_{1995-2005} = (-0.2 \pm 0.04)\%$/year. We note that this trend of decreasing coverage given to violent events in all three Italian newspapers is opposite to the growth of the perceived threat of violence in Italy [26], so this phenomenon cannot be ascribed directly to press coverage.

Gender bias
In Italian newspapers, where the language makes it possible to distinguish gender, we have also queried the databases for the terms 'morto'/'morta' (M), respectively the singular (k = 1) masculine and feminine forms in the Italian language. 'Morte', the feminine plural, could not be used since it also means 'death' in Italian and it would be difficult to semantically separate the two meanings. Besides, if both males and females are involved the term 'morti' is used in the Italian language, making the lemma 'morte' of little use.
From Fig. 5 we see that the amount of reporting of female deaths (morta) is only 30% of all k = 1 deaths. This has to be compared with the fatal accident standardized death rate (in 2005) in Italy of 36.1 (male) and 19.2 (female) per 100,000 deaths [27], and the probability for a 15-year old individual to die within 45 years, before reaching the age of 60 (45q15) [28] in 2010 of 7.9% for men and 4.1% for women.
Therefore, even accounting for the fact that male violent deaths are more frequent than female ones and that these events would be more likely to be reported in the news, this still hints at some amount of gender bias in reporting. The female/male ratio of ≃ 30% is present in all newspapers and roughly constant over the years, with only REP showing an increase in the reporting of female deaths of about 3%/year, from 22.6% in 1985 to 37.65% in 2007.
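The comparison between reported and actual deaths can be made explicit with a short piece of arithmetic. The sketch below turns the male and female figures quoted above into a female share of the total, which is the quantity directly comparable with the 30% share of 'morta' articles. The helper name is hypothetical, and treating the two statistics as proxies for the population of reportable events is an assumption.

```python
def female_share(female, male):
    """Fraction of the female + male total that is female."""
    return female / (female + male)


# Fatal accident standardized death rates (2005), from the text.
accident_share = female_share(19.2, 36.1)
# Probability of dying between ages 15 and 60 (45q15, 2010), from the text.
mortality_share = female_share(4.1, 7.9)

print(f"accidents: {accident_share:.3f}, 45q15: {mortality_share:.3f}")
```

Both statistics give a female share of roughly one third, slightly above the reported share of 30%.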

Scaling laws
The analysis of keywords associated with the number of persons involved makes it possible to build the distribution function of the number of articles, $N_k$, reporting k people killed. As shown in Fig. 7, the distributions for all four newspapers considered can be described by a single power law:
$$N_k = N(k) = A \cdot k^{-\gamma} \qquad (1)$$
in the range $2 \le k \le 10^6$. The sharp peaks in $N_k$ for values of k that are multiples of 10, 100, 1000... are due to the rounding up of the number of people reported to the nearest multiple of a power of 10 (see below).
The Minuit [30] package (now in its second release, Minuit2) has been used to perform the fits.
Fitting of the power laws has been analogous to the methods that we used in the fitting of cosmic ray spectra of the PAMELA space-borne detector [31][32][33] (see also ext. data therein). Tests with varying bin sizes have shown no significant change in the fitting results and values of γ.
See Abbreviations section for the mathematical details; a discussion of the fitting methods and error systematic can be found in Additional file 1.
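To make the fitting of N_k = A · k^(−γ) concrete, here is a minimal sketch that uses a weighted straight-line fit in log-log space. This is a deliberately simpler technique than the Minuit minimization used in the paper, not a reproduction of it. The function name is hypothetical, and the Poisson weighting (weight N_k for each log N_k point) is an assumption.

```python
import math


def fit_power_law(ks, counts):
    """Weighted least-squares fit of log N_k = log A - gamma * log k.

    Each point is weighted by N_k, i.e. 1/sigma^2 with sigma(log N_k) ~
    1/sqrt(N_k) for Poisson counts. Returns (A, gamma).
    """
    pts = [(math.log(k), math.log(n), float(n))
           for k, n in zip(ks, counts) if k > 0 and n > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty bins")
    sw = sum(w for _, _, w in pts)
    sx = sum(w * x for x, _, w in pts)
    sy = sum(w * y for _, y, w in pts)
    sxx = sum(w * x * x for x, _, w in pts)
    sxy = sum(w * x * y for x, y, w in pts)
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    intercept = (sy - slope * sx) / sw
    return math.exp(intercept), -slope


# Synthetic, noise-free spectrum with A = 5000 and gamma = 1.5.
ks = list(range(2, 200))
counts = [5000.0 * k ** -1.5 for k in ks]
amplitude, gamma = fit_power_law(ks, counts)
```

Empty bins are skipped because their logarithm is undefined. This biases a log-log fit at high k, where bins are sparsely populated, which is one reason a likelihood-based minimizer is usually preferred for real spectra.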
Power-law statistics have been found to describe the distribution of various natural phenomena, e.g. earthquakes [34], forest fires [35,36], the cluster size of tropical trees [37], and are usually thought to arise through positive growth feedback [38,39]. Galactic cosmic ray spectra also follow a power law, as a result of the statistical process of acceleration. Changes in the spectral indexes show the presence of additional sources or production phenomena [32,33].
[Figure caption fragment] … Table 2), the domestic/foreign ones with two power laws $\gamma_L$ ($2 \le k \le 10$) and $\gamma_H$ ($k > 10$). For all foreign events there is a break in the spectral index, $\gamma_L < \gamma_H$, due to missing events not being reported (from 91% for REP to 122% for CDS). In domestic events $\gamma_L > \gamma_H$ for Italian papers (over-reporting of low k events, from 47% in STA to 71% for CDS). In NYT the deviation for domestic events is smaller (over-reporting of 4%) than for foreign ones (under-reporting of 98%), hinting at a higher degree of internationalization of this publication (Table 3)
Power law distributions are also encountered in many human-related activities [40], from language distribution (Zipf law [41,42]), the number of casualties in wars [43,44], and ethnic violence [45]. Also in these cases, they have been shown to arise through a 'winner takes all' type of a competitive network where a few elements grow to acquire a very large size [46,47].
In newspapers, the distribution $N_k$ can be explained as the result of two main phenomena:
• The convolution of various violent events and accidents. Road, train, and air accidents, natural disasters and catastrophes each have their own frequency and probability distribution, usually unknown but decreasing as k increases. Articles with k ≳ 1000 often do not describe a specific accident, but rather summarize global phenomena such as war, deaths from illness (cancer, heart attack...) or automobile accidents per year, etc.
• The selection by the newsroom. The publishing criteria can change with time, location or censoring: events with higher k will have a higher probability of being selected for their importance. Conversely, foreign events, occurring in countries physically or socially far from the country where the newspaper is printed, will be more likely to be ignored, especially for low k.
Both phenomena can be approximated by a power law, with probability $P \propto k^{-\gamma_1}$ for an event to involve k casualties, and a probability (or efficiency) $\epsilon \propto k^{+\gamma_2}$ for the event to be picked by the newsroom. The overall probability is then $P_{tot} = \epsilon \cdot P \propto k^{-\gamma_1 + \gamma_2} = k^{-\gamma}$, with $\gamma = \gamma_1 - \gamma_2$. The values of γ of the four newspapers ($\gamma_{NYT} = 1.44 \pm 0.06$, $\gamma_{CDS} = 1.61 \pm 0.01$, $\gamma_{REP} = 1.42 \pm 0.07$, $\gamma_{STA} = 1.52 \pm 0.09$, Table 2) show that the editorial processes are similar in Italy and the US and converge to a narrow range of values, although the differences among them reflect the different emphasis given to high/low k events in the newspapers. A 'steep' spectrum, with a high γ, results in a higher relative number of articles with low k and vice-versa. Defining $W = N_H/N_L = N_{k>10}/N_{2 \le k \le 10}$ we see (Table 2) that it ranges between $W_{CDS} = 0.57$ (more articles with low k) and $W_{REP} = 1.00$ (fewer articles with low k), with $W_{STA} = 0.72$ and $W_{NYT} = 0.93$ having intermediate values (see Abbreviations section and Fig. 12 for a calculation of the value of W as a function of γ).
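The dependence of W on γ for a pure power law can be sketched by integrating $k^{-\gamma}$ over the two ranges. The continuum approximation (integrals instead of sums over integer k), the function name, and the default bounds are assumptions of this sketch. The calculation in the Abbreviations section may differ in detail.

```python
import math


def w_ratio(gamma, k_lo=2.0, k_break=10.0, k_hi=1e6):
    """W = N(k > k_break) / N(k_lo <= k <= k_break) for N_k ~ k**(-gamma).

    Uses the closed-form integral of k**(-gamma) over each range.
    """
    if math.isclose(gamma, 1.0):
        # The integral of 1/k is logarithmic.
        return math.log(k_hi / k_break) / math.log(k_break / k_lo)
    p = 1.0 - gamma
    high = (k_hi ** p - k_break ** p) / p
    low = (k_break ** p - k_lo ** p) / p
    return high / low


for g in (1.42, 1.44, 1.52, 1.61):
    print(g, round(w_ratio(g), 2))
```

A steeper spectrum (larger γ) puts relatively more articles in the 2 ≤ k ≤ 10 range and therefore gives a smaller W.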
The trend of the spectral index can be used to estimate the state of belligerence reported by the newspapers: the running average (current and four preceding years) of $\gamma(t)$ (Fig. 8) decreases in wartime, when the higher abundance of high k events results in a flatter distribution. Vice-versa in peacetime, when the distributions are dominated by low k events, γ increases due to a steeper distribution. Thus, local minima in $\gamma(t)$ are present in NYT during the US Civil War ($\gamma = 1.21$), the two World Wars ($\gamma_{WW1} = 1.26$, $\gamma_{WW2} = 1.30$) and the Vietnam War ($\gamma = 1.31$). CDS and STA also show similar local minima during WW1 ($\gamma_{CDS} \simeq 1.2$ and $\gamma_{STA} = 0.8$) and WW2 ($\gamma_{STA} = 1.76$, $\gamma_{CDS} = 1.91$), followed by a sharp increase in STA (more gradual in CDS) after 1945. It is also interesting to note that, notwithstanding the differences in γ, the trends of the spectral indexes of the newspapers are in good agreement with each other. This suggests that they all tend to react similarly to the conflicts occurring in the world (and, vice-versa, a discrepancy in the trend would indicate the presence of censorship).
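The running average used for $\gamma(t)$ (current year and four preceding years) can be sketched as follows. The function name is hypothetical, and dropping years for which the five-year window is incomplete is a design choice of this sketch, not something stated in the text.

```python
def trailing_mean(values_by_year, window=5):
    """Average of each year's value with the (window - 1) preceding years.

    Years whose window is not fully covered by the data are omitted.
    """
    out = {}
    for year in sorted(values_by_year):
        span = [values_by_year[y]
                for y in range(year - window + 1, year + 1)
                if y in values_by_year]
        if len(span) == window:
            out[year] = sum(span) / window
    return out


# Toy gamma(t) series (hypothetical values).
toy_gamma = {1900 + i: float(i) for i in range(7)}
smoothed = trailing_mean(toy_gamma)
```

Because the window only looks backwards, a dip in the smoothed curve lags the underlying yearly dip by up to four years, which should be kept in mind when aligning minima with the dates of conflicts.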

Geographical bias
In our case, using the articles where a location could be determined, we built distribution laws for domestic and foreign events. In Fig. 7 it is possible to see that the latter have a strong spectral break for $2 \le k \le 10$ (absent or less prominent in the former). Fitting the distribution function with two power laws, $\gamma_H = \gamma(k > 10)$ and $\gamma_L = \gamma(2 \le k \le 10)$, we see that $\Delta\gamma = \gamma_L - \gamma_H$ is always negative for foreign events, ranging from $\Delta\gamma_{REP} = -0.511 \pm 0.005$ to $\Delta\gamma_{CDS} = -1.09 \pm 0.01$. For Italian newspapers featuring domestic events, the distributions are closer to a pure power law (highest $\Delta\gamma = +0.28 \pm 0.02$ for CDS). In all these cases $\Delta\gamma$ is positive, implying that a higher emphasis is given to low k events. NYT has a smaller discrepancy between foreign ($\Delta\gamma = -0.672 \pm 0.003$) and domestic events ($\Delta\gamma = -0.42 \pm 0.01$). A plot of the value of γ vs $\Delta\gamma$ (Fig. 9) shows that the foreign and domestic categories are clearly separated in all four newspapers (Table 3). This suggests a difference in the editorial behaviour due to a lack of press coverage of accidents involving a small number of persons in foreign countries, considered to be not interesting enough to be reported in the press. An estimation of the under- or over-reporting of low k events can be provided by the extrapolation of $\gamma_H$ to $2 \le k \le 10$:
Table 3 Spectral index for the four newspapers separating domestic and foreign datasets, $\gamma_L = \gamma(2 \le k \le 10)$, $\gamma_H = \gamma(10 \le k \le 10^6)$. A negative $\Delta\gamma = \gamma_L - \gamma_H$ implies a lack of events for $2 \le k \le 10$, and vice-versa. The higher the absolute value, the higher the excess or defect of events. M is the excess (+)/defect (−) of articles with respect to what is expected from a pure power law (see text)

Newspaper
Loc.  with N L = 10 2 N k and α H coming from the power law fit of k ≥ 10 . Thus, M is the fraction of events with 2 ≤ k ≤ 10 missing ( M < 0 ) or exceeding ( M > 0 ) the value expected from a pure power law distribution ( M = 0 ). All four newspapers have M for ≃ −100% in the foreign case, meaning that the editorial room decides to print only one event out of two if it involves ten or fewer casualties in a foreign country. We also note that Italian newspapers tend to print more news of domestic events (from M dom = 16% of REP to M dom = 71% of CDS) than what expected from a pure powerlaw distribution. This can be attributed in part to a large domestic and local news section. Overall CDS has the largest difference in dealing with foreign and domestic events ( M CDS for = −122%-M CDS dom = +71% ) and the NYT the smallest ( M NYT for = −98% -M NYT dom = +4% ). See Abbreviations section and Fig. 13 for a calculation of the value of M as a function of δγ and γ H .
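The two-segment fit behind γ_L, γ_H and their difference γ_L − γ_H can be illustrated with a short sketch. It assumes a histogram N(k) ∝ k^(−γ) and uses an ordinary least-squares fit in log-log space, a simple stand-in for whatever fitting procedure was actually used; the function names and the synthetic histogram are hypothetical.

```python
import math


def power_law_index(points):
    """Least-squares slope of log N versus log k, returned as gamma = -slope."""
    xs = [math.log(k) for k, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


def spectral_break(histogram, k_break=10):
    """Return (gamma_L, gamma_H, gamma_L - gamma_H) for a {k: articles} map."""
    low = [(k, n) for k, n in histogram.items() if 2 <= k <= k_break and n > 0]
    high = [(k, n) for k, n in histogram.items() if k > k_break and n > 0]
    gamma_low = power_law_index(low)
    gamma_high = power_law_index(high)
    return gamma_low, gamma_high, gamma_low - gamma_high


# Synthetic histogram (not the paper's data): index 1.0 up to k = 10,
# index 2.0 above it, continuous at the break.
histogram = {k: 1000.0 * k ** -1.0 for k in range(2, 11)}
histogram.update({k: 100.0 * (k / 10) ** -2.0 for k in range(11, 201)})

gamma_low, gamma_high, delta_gamma = spectral_break(histogram)
print(round(gamma_low, 3), round(gamma_high, 3), round(delta_gamma, 3))
```

In this synthetic case the low-k segment is flatter than the high-k one, so the difference is negative, which is the signature the text associates with missing low-k (foreign) events. On real, sparse histograms a fit on binned or weighted data would likely be more robust than this unweighted version.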
In many nations, there are too few events reported to perform an acceptable fit with a power law. Therefore, for countries having at least 30 entries in a given newspaper, we used the ratio W = N_{2≤k≤10} / N_{k>10} as an indicator of the intrinsic importance assigned to a given country.
In Fig. 10, W_i is plotted as a function of the distance D_i between the capital of country i and Rome or Washington, according to the newspaper. The value of W tends to increase with distance (fewer events with low k and more with high k). The geographical bias appears to be stronger in Italian newspapers: a linear fit (Table 4) shows that the slopes are similar in Italian newspapers and higher by a factor of ≃ 5 compared to NYT, a sign of a higher internationalization of the US paper: (dW/dx)_CDS = (14 ± 1)%/1000 km, (dW/dx)_REP = (11 ± 1)%/1000 km, (dW/dx)_STA = (17 ± 1)%/1000 km and (dW/dx)_NYT = (2.7 ± 0.3)%/1000 km. Social proximity effects also play a role in defining the values of W for the various countries: as also visible in Fig. 10, American countries have lower W than equally distant Asian and African ones.
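The slope of W against distance can be obtained with an ordinary least-squares line, as in the sketch below. The (distance, W) pairs are synthetic and built to lie exactly on a line; the function name is hypothetical, and reading "%/1000 km" as the change of W × 100 per 1000 km is an assumption about the units.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (distance in km, W) pairs on an exact line; not the paper's data.
distance_km = [500.0, 1500.0, 4000.0, 9000.0]
w_ratio = [0.5 + 0.14 * d / 1000.0 for d in distance_km]

slope, intercept = linear_fit(distance_km, w_ratio)
# Assumed unit convention: percent change of W per 1000 km.
slope_pct_per_1000km = slope * 1000.0 * 100.0
print(f"dW/dx = {slope_pct_per_1000km:.1f}% per 1000 km")
```

With real data each country contributes one point, and the quoted uncertainties on the slopes would come from the fit covariance, which this unweighted sketch does not compute.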
If we limit the fit to countries in Europe (for Italian newspapers) and in America (for NYT) we find that the slopes are higher by a factor 2 to 4 than for those considering

Editorial rounding by excess of casualties as an additional tool to detect censorship
Newspapers often round up the number of casualties reported: this can be due to lack of knowledge, to simplify the headline, or to purposely increase emphasis and attract the attention of the reader. In the absence of tampering, we would expect the least significant digit of k, l_k, to follow a uniform distribution (P(l_k) = 1/10). However, Fig. 11, which shows the distribution of l_k for 10 < k ≤ 100, reveals that the values '0' and '5' are overabundant ('5' only in the Italian papers) and the others under-abundant with respect to the flat-distribution expectation. The deficit in the digits '6' to '9' is close to the excess of '0' (and similarly for '1' to '4' with '5'), confirming the artificial nature of the reported number of casualties. Overall, Italian newspapers have l_0 + l_5 between 40% and 46% and NYT has 32% (against the 20% expected for the sum of the l_0 and l_5 bins). All distributions (see Additional file 1: Table S1) are incompatible with the null hypothesis of a flat distribution (p < 0.01). This pattern is found in all newspapers and historical periods with one notable exception: during the Fascist government, the domestic distributions of STA and CDS do not exhibit the peaks at l_0 and l_5, which are still present in the corresponding foreign distributions of the same period. A χ² test allows us to reject the hypothesis that the foreign and domestic histograms of Italian newspapers follow the same distribution (p < 0.01). Furthermore, the domestic distributions of STA and CDS are the only ones that are not incompatible with the equiprobable one (Additional file 1: Figure S3 and Table S1). This implies that during the Fascist regime the editorial practice was to report the number of domestic casualties more faithfully, as an additional way of playing down these events, while at the same time exaggerating the number of foreign casualties.
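A last-digit test of this kind can be sketched as follows. The code computes the Pearson χ² statistic of the last-digit histogram against the flat expectation and compares it with the 1% critical value for 9 degrees of freedom (about 21.67). The function names and the toy "editorial rounding" rule are hypothetical, and the samples are synthetic.

```python
from collections import Counter

# Upper 1% critical value of the chi-square distribution with 9 degrees of freedom.
CHI2_CRITICAL_1PCT_9DOF = 21.666


def last_digit_counts(casualties, k_min=10, k_max=100):
    """Count the last digit of every k with k_min < k <= k_max."""
    counter = Counter(k % 10 for k in casualties if k_min < k <= k_max)
    return [counter.get(d, 0) for d in range(10)]


def chi2_against_uniform(counts):
    """Pearson chi-square statistic against P(l_k) = 1/10 for each digit."""
    expected = sum(counts) / 10
    return sum((observed - expected) ** 2 / expected for observed in counts)


def round_up_high_digits(k):
    """Toy editorial rule: numbers ending in 6-9 become the next multiple of ten."""
    return k + (10 - k % 10) if k % 10 >= 6 else k


exact = list(range(11, 101))  # every value once, so the last digits are flat
rounded = [round_up_high_digits(k) for k in exact]

for label, sample in (("exact", exact), ("rounded", rounded)):
    counts = last_digit_counts(sample)
    chi2 = chi2_against_uniform(counts)
    share_0_5 = (counts[0] + counts[5]) / sum(counts)
    verdict = "reject flat" if chi2 > CHI2_CRITICAL_1PCT_9DOF else "compatible with flat"
    print(label, counts, round(chi2, 1), round(share_0_5, 2), verdict)
```

In the rounded sample the counts lost by '6' to '9' reappear in '0', mirroring the compensation between deficits and excesses described in the text. Comparing two observed histograms (for example foreign vs domestic) would need a two-sample χ² test rather than this one-sample version.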
This phenomenon is similar to, but the mirror image of, Benford's law [48,49], which describes the statistical distribution of the most significant digit in several natural and man-made datasets [50]. Benford's law has been used to detect accounting [51] or election fraud [52], since values that are artificially altered do not follow it.
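For comparison, Benford's law in its standard form assigns probability log10(1 + 1/d) to a most significant digit d, whereas the last-digit analysis above uses a flat 1/10 expectation. The short sketch below tabulates the Benford probabilities; the function name is hypothetical.

```python
import math


def benford_probability(digit):
    """Benford's law for the most significant digit d = 1..9."""
    return math.log10(1 + 1 / digit)


benford = {d: benford_probability(d) for d in range(1, 10)}
uniform_last_digit = 1 / 10  # flat expectation for the least significant digit

for d, p in benford.items():
    print(d, f"{p:.3f}")
```

The leading digit '1' is expected about 30% of the time under Benford's law, while every last digit is expected 10% of the time under the flat hypothesis; in both cases a departure from the expected shape is what signals altered numbers.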

Conclusions
In this work we have developed a series of techniques to automatically scan the complete historical databases of daily newspapers for the occurrence of specific keywords related to accidents and death. We also devised various algorithms to analyze the dependence of these keywords on geographical location and historical period. Compared with traditional analysis, which consists of manual scanning of newspaper articles, these tools offer the advantage of being automatic and thus applicable to larger datasets spanning longer time periods. Indeed, these tools are suited for historical analysis to evaluate quantitatively the presence of bias or censorship in a given publication and its variation over time and political environment. This paper considered printed daily newspapers, but these techniques can also be used on online publications, news outlets, etc. The usage is limited to articles with quantitative keywords, where the event (accident, casualty, death, ...) is associated with the number of persons involved (the simpler word-occurrence methods can be used on a wider word set). Even so, this type of analysis is complementary to the assessment of 'fake news', since it has the advantage of being automatic and operating on large data sets. These tools can also contribute to the assessment of the freedom of the press in a given country.
Future work will extend the application of these tools to other lemmas such as casualties, wounded and victims. The analysis will be applied to differences in reporting between types of accidents, both of man-made origin (e.g. train, airplane, ship) and natural calamities. The structure of reporting as a function of the day of the week and the page location will also be considered. On a larger scope, other newspapers, magazines and online publications will be considered, extending the analysis to look for the presence and evolution of ethnic or national bias. A more long-term goal can also be the use of speech recognition methods to study the occurrence of these lemmas on radio and TV.