Large scale analysis of violent death count in daily newspapers to quantify bias and censorship

In this work we develop a series of techniques to quantify the presence of bias and censorship in newspapers. These algorithms are tested by analyzing the occurrence of the keywords 'killed' and 'suicide' ('morti', 'suicidio' in Italian) and their changes over time, gender, and reported location in the complete online archives (42 million records) of the major US newspaper (The New York Times) and the three major Italian ones (Il Corriere della Sera, La Repubblica, La Stampa). Because Italian distinguishes grammatically between female and male forms, we can also test for gender bias, which we find in all Italian newspapers, with reported single female deaths being about one-third of those involving single men. We show evidence of censorship in Italian newspapers both during World War 1 and during the Italian Fascist regime. Censorship in all countries during the World Wars and in Italy during the Fascist period is a historically ascertained fact, but so far there has been no estimate of the amount of censorship in newspaper reporting: in this work we estimate that about $75\%$ of domestic deaths and suicides were not reported. This is also confirmed by statistical analysis of the distribution of the least significant digit of the number of reported deaths. We also find that the distribution function of the number of articles vs. the number of deaths reported in articles follows a power law, which is broken (with fewer articles being written) when reporting on few deaths occurring in foreign countries. The lack of articles is found to grow with geographical distance from the nation where the newspaper is printed.


Introduction
Violent death is a dramatic event that can be objectively quantified in terms of the number of lives lost. Words such as 'killed', 'dead', 'casualty', 'suicide', etc. can thus be used to assess the presence of bias or censorship in reporting incidents involving specific locations, groups of people, or periods of time.
Coverage of death in the mass media has been studied for the insight it offers on how societies perceive this dramatic event [1]; at the same time, selective death reporting can be used to shape and distort public opinion. A six-month study [2] of 1975 issues of the Register Guard and the Standard Times showed that newspapers tend to over-report more newsworthy causes of death and overlook others that are considered less interesting by the public. The work considered the number of occurrences, the number of deaths reported, and the text surface area of the articles dealing with death and compared them with the statistical occurrences, finding that all forms of disease were under-reported whereas violent or catastrophic events were over-represented. An even more biased selection along these lines was found by analyzing the news selection by television and radio [3]. Studies on the presence of geographical bias in reporting accidental deaths have been conducted manually on small data sets: a 51-issue study (two months) of two German newspapers (Frankfurter Allgemeine Zeitung and Süddeutsche Zeitung) and two Australian ones (The Australian and The Sydney Morning Herald) [4] showed a weak relationship between social proximity and the number of articles reporting deaths. Similarly, a six-month study of the German tabloid Die Abendzeitung [5] reported evidence of a correlation between the number of accidental deaths and the length of the article, as well as a negative correlation with the distance between the town where the newspaper was printed and the event reported. In this case, the analysis involved the study of the amount of printed space occupied by an article reporting a fatal event.
The authors did not find any significant correlation between the area taken by the article and the number of deaths reported [6] but showed an inverse relationship (linear in logarithmic scale) between the area of printed text normalized to the number of victims and the distance (in km) of the event from Munich, the town where Die Abendzeitung was printed. They also found a minimum threshold of victims, increasing with distance, needed to elicit a report in the newspaper. They thus likened the response of the paper to the logarithmic relationship between stimulus strength and response intensity in human perception [7].
In contrast, [8] analyzed the occurrence of pictures in Time and Newsweek magazines. The study involved ten issues per year over three years for each magazine, for a total of 60 issues. The work involved measuring the area occupied by the picture, [...]
Table 1: Size of the online newspaper archives consulted and of the resulting datasets according to the various selection criteria.
The totality of the archives has been considered in this work. Figure 1 shows the steps taken to access, parse, and identify the articles, from the access of the online archives to the creation of the various data sets, one for each newspaper and lemma considered. All these tasks are accomplished using Python scripts. The access to the online archive of each newspaper takes a few days to complete over the full time span, since it is mediated by a web-based interface that usually returns ten articles at a time. The storage and handling of the data sets are performed using the Root [13] framework and require a few minutes to complete. This C++ based environment was developed at CERN to deal efficiently with the high volumes (Pbyte) of data produced by accelerator-, space-, and ground/underground-based detectors. As such, it is especially suited to create the ntuples (TTrees in Root nomenclature) containing the data extracted from the newspaper articles. All the selections, histogramming, and fitting have also been performed in this environment. A scheme of the various types of algorithms used to analyze the data sets and the main information they provide is shown in Figure 2.

Newspaper Archives
Only printed editions have been considered; online articles have been excluded for consistency with pre-internet years. The newspapers have freely accessible online archives that cover a major part of their publication history. Out of the $T \simeq 42$ million articles comprising the four newspaper online archives, all the articles matching the considered lemmas ('killed', 'morti', 'morta', 'morto', 'suicide', 'suicidio'), about 3.2 million, have been extracted. See Table 1 for details on the data size of each newspaper. Details of the newspapers and archives are as follows:

Lemmas considered
This work concerns the study of the occurrence of the following terms:
• 'Killed' (in English) and 'morti' (plural, in Italian). Note that the literal Italian translation of killed is uccisi, but this word is less frequent than morti and is usually (but not always) employed in association with murder.
Morti is more often associated with violent accidental death and thus has a wider correspondence in usage with killed. Articles with the keywords casualties/vittime occur with a lower frequency than killed/morti (10% to 30% less). The English term casualties is more strongly associated with the dead or wounded during armed conflicts and has peaks during major US wars (Civil War, WW1, WW2, Korea, Vietnam, etc.). In the case of the Italian newspapers, the correlation is lower, since vittime is more often used for people killed in accidents, natural disasters, and so on.
Some methods discussed below can be applied to any lemma, whereas the distribution law and the last-digit analysis are usable only with 'quantitative' keywords. Their application to other such keywords, e.g. 'dead', 'victims', and the already mentioned 'casualties', will be the subject of future work.

Construction of the Data sets
1. Query and pre-processing. Each data set has been constructed by querying the web servers of the four newspapers with Python scripts that emulated manual user input (Figure 1). The input had to be configured for each archive in order to enter the desired keyword and iterate the requests over the time interval of the newspaper archive. The html pages received in reply to the query were then saved locally. Their overall number depends on the number of articles present in each page: it ranges from 58,000 for NYT to 300 for CDS. As mentioned, this data acquisition / preprocessing phase over the newspaper archives takes 5 to 7 days to complete, depending on the newspaper considered.
We could not access the full articles since they are often present only in image form (especially in the years 1850-1910) and the necessary work and resources would have been almost equivalent to those required for the digitization of the newspaper itself. Furthermore, the work would only have resulted in a marginally higher efficiency in event detection and would not have changed the results.
We define an 'event' as a newspaper article retrieved upon the query of the keyword killed/morti. In our data sets, each event contains the newspaper name, the date, and the page (not present in NYT) of the article, the text of the title, and, if present, a part of the text of the article.
2. Parse. The html pages are then parsed with Python regular expressions that extract the relevant article information (title, snippet, page number, date, etc.). This phase takes a few minutes to complete.
3. Filter. Sometimes the query only returns a link with no usable text: for instance, a title can be empty (especially for issues of the 19th century), incomplete, or missing the requested word. These events have been discarded (in the case of STA, this restricts the database to the period 1910-2005 for most purposes).
4. Text Analysis. The events are then analyzed to find the number k of people killed. This is searched in numeric, text, and hybrid form. To reduce classification errors, the value of k is searched close to the keyword in the forms: 'k killed', 'killed k', 'k attribute killed', 'killed attribute k'. Title and text are also parsed to search for the type of event (e.g. car, airplane crash, war, illness) and the people involved (e.g. children, women, ethnicity), which will be the subject of a following paper.
5. Geographical analysis. Several databases of world places have been used to associate the location event to its country of origin. Cases of homonymy (Florence, Paris, Cairo...) have been resolved assigning the location of the more famous ones at that time. For instance, 'Cairo' entries in the years 1861-1865 have been assigned to the US since the location appeared often during the American Civil War.
6. Exclusions. Duplicate entries, that is, articles identical in title, text, and date, are then removed, but different reports on the same event are considered as separate, since their presence means that the event was considered worthy of more than one article. See Table 1 for the size of the datasets according to the various selections.
All events reporting deaths of animals (fish, herd, cows, etc.), usually associated with high k, have been discarded.
From foreign events we also excluded all articles where the words 'Italian' or 'US'/'American' appear associated with foreign countries, thus removing foreign events where citizens from the corresponding newspaper-printing country were involved. This amounted to less than 1% in peacetime.
Once the parsing of the archives is completed, the remaining processing steps in this phase take a few minutes to execute and produce the database / root files for the subsequent analysis. This represents an increase of several orders of magnitude in speed with respect to any traditional, manual-scanning method, which was necessarily constrained to a limited amount of time and newspaper issues.
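The text-analysis step (item 4 above) can be illustrated with a minimal Python sketch. It is not the code used in this work: the regular expressions, the short `WORD_TO_NUM` vocabulary, and the function names `parse_k` and `extract_k` are assumptions introduced here for illustration, and only English text with the keyword 'killed' is handled.

```python
import re

# Small illustrative vocabulary of number words (an assumption; a real
# list would be longer and would include the Italian forms as well).
WORD_TO_NUM = {
    "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "hundred": 100, "thousand": 1000,
}

# A number is either digits with thousands separators, plain digits,
# or one of the number words above.
NUM = r"(?P<k>\d{1,3}(?:,\d{3})+|\d+|" + "|".join(WORD_TO_NUM) + r")"

# 'k killed' and 'k attribute killed' (at most one intervening word).
BEFORE = re.compile(r"\b" + NUM + r"(?:\s+\w+)?\s+killed\b", re.IGNORECASE)
# 'killed k' and 'killed attribute k'.
AFTER = re.compile(r"\bkilled\s+(?:\w+\s+)?" + NUM + r"\b", re.IGNORECASE)


def parse_k(token):
    """Convert a matched token ('12', '1,200', 'five') to an integer."""
    token = token.lower()
    if token in WORD_TO_NUM:
        return WORD_TO_NUM[token]
    return int(token.replace(",", ""))


def extract_k(text):
    """Return the number of people killed found near the keyword, or None."""
    for pattern in (BEFORE, AFTER):
        match = pattern.search(text)
        if match:
            return parse_k(match.group("k"))
    return None
```

For example, `extract_k("12 people killed in train wreck")` returns 12, while a title without a number close to the keyword returns `None` and would be left without a value of k.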

Errors
Statistical errors are due to fluctuations in the number of events present in a given bin of a given selection. Fitting algorithms take into account the errors associated with each bin to calculate the errors of the fitted parameters.
Systematic errors can arise from OCR (Optical Character Recognition) misidentification. This is more frequent for old issues, where the quality of the scanned pages is lower, and can result in a lower efficiency for the first years of the newspapers.
This can be assumed to be independent of the number, type, or location of people killed k so that the temporal behaviour and distribution laws should not be affected. See the Supplementary Information for a discussion on systematic errors.

Results and Discussion
The analysis of the data sets created above can yield information on how the newspapers consider the various events depending on geographic location, historical period, or gender. In this section, we describe the main algorithms employed and the results they provide. The various algorithms, the processing steps, and the key results derived are also schematized in Figure 2. Figure 3 shows the yearly total number of articles, $T(t)$. Major historical events can also affect this value: for instance, it increases in NYT during the US Civil War due to more articles being published and decreases in CDS and STA during the two World Wars due to a shortage of materials resulting in fewer pages being published.

Historical events
The same Figure also shows the number of articles with the keyword 'killed', $K(t)$. From it, we can derive the normalized fraction of 'killed' events with respect to the total, $R = K/T$ (shown in the same Figure), a value more affected by historical events.
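As a minimal sketch of this normalization, the yearly ratio and a statistical uncertainty can be computed as follows. The binomial error formula is an assumption made here for illustration (the text does not state how the quoted errors on R were obtained), and the function name `ratio_with_error` is hypothetical.

```python
import math


def ratio_with_error(k_count, t_count):
    """Return R = K/T and its binomial standard error.

    Assumes each of the T articles independently contains the keyword
    with probability R (an illustrative assumption).
    """
    if t_count <= 0:
        raise ValueError("total number of articles must be positive")
    r = k_count / t_count
    sigma = math.sqrt(r * (1.0 - r) / t_count)
    return r, sigma


# Hypothetical year: 2,000 'killed' articles out of 100,000 in total.
r, sigma = ratio_with_error(2000, 100000)
```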
For the events with a determined location, it is possible to separate domestic ($K_d$) from foreign ($K_f$) occurrences and assess how their relative importance evolved with time (Figure 3). NYT reporting on foreign deaths grows over time to become more frequent than domestic at the onset of WW1 and permanently from WW2 on. The Italian newspapers divide the reporting between domestic and foreign cases roughly equally, with CDS covering more foreign events in the more recent years and STA and REP the domestic cases. The other relevant features are the foreign peaks during the World Wars and the drop in the domestic deaths between 1923 and 1945 for STA and CDS, due to censorship from the Fascist government (see below).
Figure 4 shows the relative contributions of the various continents and their evolution over time, with a gradual reduction of the coverage of European events and a growing importance of Asia after WW2. Some major occurrences are:
Censorship was very strong in all countries involved in both World Wars, with the removal of all information that could be beneficial to the enemy: letters, reporting of battles and defeats, casualties, etc. [18].
In NYT the value of $R = K/T$ increases from $R_{1913} = 0.361 \pm 0.001$ to $R_{1914} = 0.457 \pm 0.001$. The domestic event ratio $R = K_{\mathrm{dom}}/K_{\mathrm{total}}$ drops from $R^{NYT}_{1913} = 0.56 \pm 0.02$ to $R^{NYT}_{1914} = 0.39 \pm 0.01$, reaching a minimum of $R^{NYT}_{1917} = 0.32 \pm 0.01$, when the US declared war on Germany.
In Italy, the shortage of resources resulted in a reduction of the number of pages of CDS and STA from 8 (two double sheets) to 4 (one double sheet of paper). Consequently, $T$ decreased to $T^{CDS}_{1918}/T^{CDS}_{1915} = 0.48$ and $T^{STA}_{1918}/T^{STA}_{1915} = 0.67$. The effect of censorship is evident in STA: its value of $R$ drops from $R_{1915} = 0.32 \pm 0.02$ to a minimum of $R_{1917} = 0.13 \pm 0.01$ (for CDS it is more constant). In both newspapers, there is an even larger drop in domestic events (not necessarily due only to censorship but also to a lack of interest): from [...]
During wars the rate of suicides is known to decrease: this phenomenon is usually explained by the higher sense of purpose during the war effort [19,20,21] and is found to occur both when one's own country and when other countries are at war. However, the decrease in suicide reporting by newspapers is more prompt and intense (dropping to 1/3 of the pre-war value) than that recorded by statistics. Since in both countries this occurred in 1914, before either country was at war, we can conclude that it was not directly related to censorship. In Italy, the number of articles on suicides passes from S [...] The drop before the US or Italy entered the war suggests attributing the lack of suicide reporting to reduced interest in the newsrooms rather than to censorship.

Fascist government in Italy (31/10/1922 - 25/7/1943)
The case is different during the Fascist government in Italy. The Italian government of the time exercised strong censorship on the printed press and radio. On 14/7/1924 a circolare (note) from the then Minister of the Interior Federzoni allowed the sequestering of copies of newspapers to 'prevent stirring up public opinion'. On 31/12/1924 all newspapers were sequestered and the directors replaced with ones affiliated with the regime. In October 1926, several daily newspapers were closed until the end of WW2, among them L'Unità, L'Avanti! and L'Ora [14,15].
Government censorship aimed to present an efficient state and thus had to remove all negative news. Censorship involved all media of the time: radio, theater, movies, books, and newspapers. Authors, especially those of Jewish origin but also those who were against the regime for political reasons, fell into disfavour. This 'targeted' censorship was similar to what occurred in Germany, as reported in [10], where prominent individuals were mentioned to a greater or lesser extent according to their race or standing with respect to the Nazi government.
On a wider scale, government guidelines [16,17] to newspapers required that crime reporting had to be compressed in a few lines, and suicides had to be ignored, with the result that articles involving domestic deaths and accidents almost disappeared from newspapers. With the datasets of CDS and STA it is possible to quantify the overall effect of Fascist censorship in the reporting of violent deaths [22,23].
As a result, even though the value of $R = K/T$ remains more or less constant, [...] We estimated the amount of censorship for articles with morti by interpolating the value of $R$ between 1922 and 1946: between 1923 and 1943, 2,800 ± 200 domestic articles with at least two casualties are missing for CDS and 2,900 ± 300 for STA. In both cases, they amount to 75% of the articles featuring domestic deaths printed in the same period.
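The interpolation estimate can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: a straight line is drawn between the values of R in the two anchor years, and the number of missing articles is the sum over the intervening years of the difference between the interpolated and the observed R, multiplied by the yearly total T. The function name `missing_articles` and the numbers in the usage note are hypothetical.

```python
def missing_articles(years, observed_r, totals, start, end):
    """Estimate the articles missing strictly between `start` and `end`.

    R is linearly interpolated between the two anchor years; for each
    intervening year the deficit (expected R - observed R) is converted
    into a number of articles using that year's total T.
    """
    r = dict(zip(years, observed_r))
    t = dict(zip(years, totals))
    missing = 0.0
    for year in years:
        if start < year < end:
            frac = (year - start) / (end - start)
            expected_r = r[start] + frac * (r[end] - r[start])
            missing += (expected_r - r[year]) * t[year]
    return missing
```

With hypothetical values `years = [1922, 1923, 1924, 1925, 1926]`, `observed_r = [0.10, 0.02, 0.02, 0.02, 0.10]` and 1,000 articles per year, the call `missing_articles(years, observed_r, [1000] * 5, 1922, 1926)` gives about 240 missing articles.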
A similar analysis on the $k = 1$ events (morto, morta, see Figure 5) yields about 22,000 ± 1,300 articles censored by STA (60% of those printed in the period considered) and 29,000 ± 500 by CDS (20% of those printed in the period considered).
In the case of suicides (another forbidden topic during the regime [15]), Figure 6 gives $S_{\mathrm{missing}} = 17{,}000 \pm 600$ ($S_{\mathrm{missing}}/S_{\mathrm{present}} = 2.8$) for CDS and $4{,}100 \pm 300$ ($S_{\mathrm{missing}}/S_{\mathrm{present}} = 1.1$) for STA, in contrast with the growing rate of suicides at the time [20,24].
Overall we estimate that CDS censored 41,800 ± 1,000 articles and STA 36,000 ± 1,900 during this period, with an average of 1,990 ± 50 per year for CDS and 1,700 ± 90 per year for STA.
[...] [18]. The total number of articles of NYT remains essentially unchanged, with $R$ increasing only in 1940, from $R_{1938} = (2.02 \pm 0.04)\%$ to $R_{1940} = (2.50 \pm 0.04)\%$. This is due to an increased amount of reporting of casualties prior to the entrance into the conflict. This is confirmed by the value of the ratio $R = K_{\mathrm{foreign}}/K_{\mathrm{domestic}}$, which rises from $R_{1938} = 0.63 \pm 0.01$ to a maximum of $R_{1940} = 0.76 \pm 0.01$, again before Pearl Harbor. Afterward, since US soil was not attacked, the amount of reporting of domestic events was not affected, so we can conclude that it was not subject to censorship.
The effect of WW2 in Italy is mostly evident in the sharp drop in $T$ and $K$ in the later years of the war. This was due both to paper shortage and to the bombings of Milano and Torino, where the newspapers were printed. After the armistice with the Allied forces (8/9/1943), Italy was split between the South, controlled by the Allies, and La Repubblica di Salò in the North. After a short period free from Fascist government, CDS and STA were aligned to the Nazi-controlled government of the North, so the definition of 'domestic' and 'foreign' becomes fuzzier and the ratio $R$ increases.
For REP, the number of articles reporting violent death, $K$, increases with time, but its percentage after 1995 decreases by $(dR/dt)_{1995-2005} = (-0.2 \pm 0.04)\%$/year. We note that this trend of decreasing coverage given to violent events in all three Italian newspapers is opposite to the growth of the perceived threat of violence in Italy [26], so this phenomenon cannot be ascribed directly to press coverage.

Gender Bias
In Italian newspapers, where the language allows gender to be distinguished, we have also queried the databases for the terms 'morto'/'morta' ($M$), respectively the singular ($k = 1$) masculine and feminine forms in the Italian language. 'Morte', the feminine plural, could not be used since it also means 'death' in Italian and it would be difficult to semantically separate the two meanings. Besides, if both males and females are involved the term 'morti' is used in Italian, making the lemma 'morte' of little use.
From Figure 5 we see that the amount of reporting of female deaths (morta) is only 30% of all $k = 1$ deaths. This has to be compared with the fatal accident standardized death rate (in 2005) in Italy of 36.1 (male) and 19.2 (female) per 100,000 deaths [27], and the probability for a 15-year-old individual to die within 45 years, before reaching the age of 60 (45q15) [28], which in 2010 was 7.9% for men and 4.1% for women.
Therefore, even accounting for the fact that male violent deaths are more frequent than female ones and that these events would be more likely to be reported in the news, this still hints at some amount of gender bias in reporting. The female/male ratio of 30% is present in all newspapers and roughly constant over the years, with only REP showing an increase in the reporting of female deaths of about 3%/year, from 22.6% in 1985 to 37.65% in 2007.

Scaling laws
The analysis of keywords associated with the number of persons involved allows us to build the distribution function of the number of articles, $N_k$, reporting $k$ people killed. As shown in Figure 7, the distributions for all four newspapers considered can be described by a single power law, $N_k = \alpha k^{-\gamma}$, in the range $2 \le k \le 10^6$. The sharp peaks in $N_k$ for values of $k$ that are multiples of 10, 100, 1000... are due to the rounding up of the number of people reported to the nearest multiple of a power of 10 (see below).
The Minuit [30] package (now in its second release, Minuit2) has been used to perform the fits.
Fitting of the power laws has been analogous to the methods that we used in the fitting of cosmic ray spectra of the PAMELA space-borne detector [31], [32], [33] (see also ext. data therein). Tests with varying bin sizes have shown no significant change in the fitting results and values of $\gamma$. See Section 7 for the mathematical details; a discussion of the fitting methods and systematic errors can be found in the Supplementary Information.
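To make the meaning of the spectral index concrete, the following sketch estimates $\gamma$ and the normalization from a list of counts. It deliberately swaps in a simpler technique than the Minuit fits used in this work: a weighted least-squares line in log-log space, with weights equal to the counts (an assumption corresponding to Poisson errors, for which the uncertainty on $\log N_k$ is about $1/\sqrt{N_k}$). The function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(ks, counts):
    """Fit N_k = alpha * k**(-gamma) by weighted least squares on logs.

    Bins with zero counts are skipped. Each bin is weighted by its
    count, i.e. by 1/sigma^2 with sigma(log N) ~ 1/sqrt(N).
    Returns (gamma, alpha).
    """
    xs, ys, ws = [], [], []
    for k, n in zip(ks, counts):
        if k > 0 and n > 0:
            xs.append(math.log(k))
            ys.append(math.log(n))
            ws.append(n)
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return -slope, math.exp(intercept)
```

Applied to noiseless synthetic counts generated with $\gamma = 1.5$, the function recovers the input index; on real data the rounding peaks at multiples of 10 would bias such a simple fit unless the binning absorbs them.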
Power-law statistics have been found to describe the distribution of various natural phenomena, e.g. earthquakes [34], forest fires [35], [36], and the cluster size of tropical trees [37], and are usually thought to arise through positive growth feedback [38], [39].
Galactic cosmic ray spectra also follow a power law, as a result of the statistical process of acceleration. Changes in the spectral indexes show the presence of additional sources or production phenomena [32], [33].
Power law distributions are also encountered in many human-related activities [40], from language distribution (Zipf's law [41], [42]) to the number of casualties in wars [43], [44] and ethnic violence [45]. Also in these cases, they have been shown to arise through a 'winner takes all' type of competitive network where a few elements grow to acquire a very large size [46], [47].
In newspapers, the distribution $N_k$ can be explained as the result of two main phenomena:
• The convolution of various violent events and accidents. Road, train, and air accidents, natural disasters, and catastrophes each have their own frequency and probability distribution, usually unknown but decreasing as $k$ increases.
Articles with $k > 1000$ often do not describe a specific accident, but rather summarize global phenomena such as war, deaths from illness (cancer, heart attack...), or automobile accidents per year, etc.
• The selection by the newsroom. The publishing criteria can change with time, location, or censoring: events with higher $k$ will have a higher probability of being selected for their importance. Conversely, foreign events, occurring in countries physically or socially far from the country where the newspaper is printed, will be more likely to be ignored, especially for low $k$.
Both phenomena can be approximated by a power law, with a probability $P \propto k^{-\gamma_1}$ for an event to involve $k$ casualties and a probability (or efficiency) $\epsilon \propto k^{+\gamma_2}$ for the event to be picked by the newsroom. The overall probability is then $P_{\mathrm{tot}} = \epsilon \cdot P \propto k^{-\gamma_1 + \gamma_2}$, so that the observed index is $\gamma = \gamma_1 - \gamma_2$.
The trend of the spectral index can be used to estimate the state of belligerence reported by the newspapers: the running average (current and four preceding years) of $\gamma(t)$ (Figure 8) decreases in wartime, since the higher abundance of high $k$ events results in a flatter distribution. Vice versa, in peacetime the distributions are dominated by low $k$ events and $\gamma$ increases due to a steeper distribution. Thus, local minima in $\gamma(t)$ are present in NYT during the US Civil War ($\gamma = 1.21$), the two World Wars ($\gamma_{WW1} = 1.26$, $\gamma_{WW2} = 1.30$), and the Vietnam War ($\gamma = 1.31$). CDS and STA also show similar local minima during WW1 ($\gamma_{CDS} \simeq 1.2$ and $\gamma_{STA} = 0.8$) and WW2 ($\gamma_{STA} = 1.76$, $\gamma_{CDS} = 1.91$), followed by a sharp increase in STA (and a more gradual one in CDS) after 1945. It is also interesting to note that, notwithstanding the differences in $\gamma$, the trends of the spectral indexes of the newspapers are in good agreement with each other. This suggests that they all tend to react similarly to the conflicts occurring in the world (and, vice versa, a discrepancy in the trend would indicate the presence of censorship).
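The running average of $\gamma(t)$ over the current and four preceding years can be sketched as below. The treatment of the first four years of a series, which are averaged over the shorter available history, is an assumption made here, and the function name `trailing_average` is hypothetical.

```python
def trailing_average(values, window=5):
    """Average each value with the (window - 1) values preceding it.

    The first entries, which have fewer predecessors, are averaged
    over the values available so far (an illustrative choice).
    """
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged
```

For a yearly series of fitted indexes, `trailing_average(gammas)` returns a list of the same length in which each entry smooths out single-year fluctuations while following multi-year changes such as those occurring in wartime.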

Geographical bias
In our case, using the articles where a location could be determined, we built distribution laws for domestic and foreign events. In Figure 7 it is possible to see that the latter have a strong spectral break for $2 \le k \le 10$ (absent or less prominent in the former). Fitting the distribution function with two power laws, $\gamma_H = \gamma(k > 10)$ and $\gamma_L = \gamma(2 \le k \le 10)$, we see that $\Delta\gamma = \gamma_L - \gamma_H$ is always negative for foreign events, ranging from $\Delta\gamma_{REP} = -0.511 \pm 0.005$ to $\Delta\gamma_{CDS} = -1.09 \pm 0.01$. For Italian newspapers featuring domestic events, the distributions are closer to a pure power law (highest $\Delta\gamma = +0.28 \pm 0.02$ for CDS). In all these cases $\Delta\gamma$ is positive, implying that a higher emphasis is given to low $k$ events. NYT has a smaller discrepancy between foreign ($\Delta\gamma = -0.672 \pm 0.003$) and domestic events ($\Delta\gamma = -0.42 \pm 0.01$). A plot of the value of $\gamma$ vs $\Delta\gamma$ (Figure 9) shows that the foreign and domestic categories are clearly separated in all four newspapers (Table 3). This suggests a difference in the editorial behaviour due to a lack of press coverage of accidents involving a small number of persons in foreign countries, considered not interesting enough to be reported in the press.
Table 3: Spectral index for the four newspapers separating domestic and foreign datasets. $\gamma_L = \gamma(2 \le k \le 10)$, $\gamma_H = \gamma(10 \le k \le 10^6)$. A negative $\Delta\gamma = \gamma_L - \gamma_H$ implies a lack of events for $2 \le k \le 10$, and vice versa. The higher the absolute value, the higher the excess or defect of events. $M$ is the excess (+)/defect (-) of articles with respect to what is expected from a pure power law (see text).
An estimation of the under- or over-reporting of low $k$ events can be provided by the extrapolation of $\gamma_H$ to $2 \le k \le 10$: [...] with $N_L = \sum_{k=2}^{10} N_k$ and $\alpha_H$ coming from the power-law fit of $k \ge 10$. Thus, $M$ is the fraction of events with $2 \le k \le 10$ missing ($M < 0$) or exceeding ($M > 0$) the value expected from a pure power-law distribution ($M = 0$). All four newspapers have $M_{\mathrm{for}} \simeq -100\%$ in the foreign case, meaning that the editorial room decides to print only one event out of two if it involves ten or fewer casualties in a foreign country. We also note that Italian newspapers tend to print more news of domestic events (from $M_{\mathrm{dom}} = 16\%$ for REP to $M_{\mathrm{dom}} = 71\%$ for CDS) than expected from a pure power-law distribution. This can be attributed in part to a large domestic and local news section. Overall CDS has the largest difference in dealing with foreign and domestic events ($M_{CDS}$ [...]
In many nations, there are too few events reported to perform an acceptable fit with a power law; therefore, for countries having at least 30 entries in a given newspaper, we used the ratio $W = N_{2 \le k \le 10}/N_{k > 10}$ as an indicator of the intrinsic importance assigned to a given country.
Table 4: Slope of the linear fit of the ratio $N_{2 \le k \le 10}/N_{k > 10}$ as a function of the distance between Rome/Washington and the various world countries.
In Figure 10, $W_i$ is plotted as a function of the distance $D_i$ between the capital of the country $i$ and Rome/Washington, according to the newspaper. It is possible to see how the value of $W$ tends to increase with the distance (fewer events with low $k$ and more with high $k$).
The geographical bias appears to be stronger in Italian newspapers: a linear fit (Table 4) shows that the slopes are similar in Italian newspapers and higher by a factor of 5 compared to NYT, a sign of a higher internationalization of the US paper: $(dW/dx)_{CDS} = (14 \pm 1)\%$/1000 km, $(dW/dx)_{REP} = (11 \pm 1)\%$/1000 km, $(dW/dx)_{STA} = (17 \pm 1)\%$/1000 km, and $(dW/dx)_{NYT} = (2.7 \pm 0.3)\%$/1000 km.
Social proximity effects also play a role in defining the values of the various countries: as also visible in Figure 10, American countries have lower $W$ than equally distant Asian and African ones.
If we limit the fit to countries in Europe (for Italian newspapers) and in America (for NYT), we find that the slopes are higher by a factor of 2 to 4 than those obtained considering all world nations (Table 4), a sign that geographical distance plays a stronger role for countries that are socially close to either Italy or the US (although the different geography of the American continent plays a role in the different behaviour).
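The slope $dW/dx$ quoted above comes from a straight-line fit of $W_i$ against the distance $D_i$. A minimal sketch of such a fit is given below; it uses an unweighted least-squares line, which is an assumption (the text does not specify the weighting adopted), and the function name `fit_line` and the sample values in the usage note are hypothetical.

```python
def fit_line(xs, ys):
    """Unweighted least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two points are needed")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With distances expressed in units of 1000 km, e.g. `fit_line([1, 2, 5, 10], [0.64, 0.78, 1.20, 1.90])`, the returned slope reads directly as the change of $W$ per 1000 km (0.14, i.e. 14%/1000 km, for these invented points).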

Editorial rounding by excess of casualties as an additional tool to detect censorship
Newspapers often round up the number of casualties reported: this can be due to lack of knowledge, to simplify the headline, or to purposely increase emphasis and attract the attention of the reader. In the absence of tampering, we would expect the least significant digit of $k$, $l_k$, to follow a uniform distribution ($P(l_k) = 1/10$). However, in Figure 11, which shows the distribution of $l_k$ for $10 < k \le 100$, it is possible to see how the values '0' and '5' are overabundant ('5' only in the Italian papers) and the others under-abundant with respect to the flat-distribution expectation. The deficit in the digits '6' to '9' is close to the excess of '0' (and similarly for '1' to '4' with '5'), confirming the artificial nature of the reported number of casualties. Overall, Italian newspapers have a value of $l_0 + l_5 = 40$-$46\%$ and NYT has a value of 32% (with respect to the 20% expected for the sum of the $k_0 + k_5$ bins). All distributions (see Table T1 in the Supplementary Information) are incompatible with the null hypothesis of a flat distribution ($p < 0.01$). This distribution is found in all newspapers and historical periods with one notable exception: during the Fascist government, the domestic distributions of STA and CDS do not exhibit the peaks for $k_0$ and $k_5$, which are still present in the corresponding foreign distributions of the same period. A $\chi^2$ test allows us to reject the hypothesis that foreign and domestic histograms of Italian newspapers follow the same distribution ($p < 0.01$). Furthermore, the domestic distributions of STA and CDS are the only ones that are not incompatible with the equiprobable one (Figure S3 and Table T1 in the Supplementary Information). This implies that during the Fascist regime the editorial practice was to report the number of domestic casualties more faithfully, as an additional way of suppressing these events, while at the same time exaggerating the number of foreign casualties.
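The test against the flat distribution can be sketched as a Pearson $\chi^2$ computed on the last-digit histogram. This is an illustration, not the analysis code: the function names are hypothetical, and the comparison uses the standard tabulated critical value for 9 degrees of freedom at the 1% level (21.666) instead of computing a p-value.

```python
from collections import Counter

# Tabulated chi-square critical value for 9 degrees of freedom, p = 0.01.
CHI2_CRIT_9DOF_P01 = 21.666


def last_digit_chi2(ks, lo=10, hi=100):
    """Pearson chi-square of the last digit of k against a flat law.

    Only values with lo < k <= hi are used, as in the text.
    """
    digits = [k % 10 for k in ks if lo < k <= hi]
    if not digits:
        raise ValueError("no values in the selected range")
    counts = Counter(digits)
    expected = len(digits) / 10.0
    return sum((counts[d] - expected) ** 2 / expected for d in range(10))


def rejects_uniform(ks):
    """True if the flat-distribution hypothesis is rejected at p < 0.01."""
    return last_digit_chi2(ks) > CHI2_CRIT_9DOF_P01
```

A sample in which every last digit is equally represented gives $\chi^2 = 0$, while a sample with a strong excess of values ending in '0' exceeds the critical value and the flat hypothesis is rejected.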
This phenomenon is the mirror image of Benford's law [48,49], which describes the statistical distribution of the most significant digit in several natural and man-made datasets [

Conclusions
In this work we have developed a series of techniques to automatically scan the complete historical databases of daily newspapers for the occurrence of specific keywords related to accidents and death. We also devised various algorithms to analyze the dependence of these keywords on geographical location and historical period. Compared with traditional analysis, which consists of manually scanning newspaper articles, these tools have the advantage of being automatic and thus applicable to larger datasets spanning longer time periods. Indeed, they are suited to historical analysis aimed at evaluating quantitatively the presence of bias or censorship in a given publication and its variation over time and political environment. This paper considered printed daily newspapers, but these techniques can also be used on online publications, news outlets, etc. Their usage is limited to articles with quantitative keywords, where the event (accident, casualty, death...) is associated with the number of persons involved, although the simpler word-occurrence methods can be applied to a wider word set. This type of analysis is complementary to the assessment of 'fake news', since it has the advantage of being automatic and operating on large data sets. These tools can also contribute to the assessment of the freedom of the press in a given country.
Future work will extend the application of these tools to other lemmas such as casualties, wounded, victims. The analysis will also address differences in reporting between types of accidents, both of man-made origin (e.g. train/airplane/ship) and natural calamities. The structure of reporting as a function of the day of the week and the page location will also be considered. On a larger scope, other newspapers, magazines and online publications will be considered, extending the analysis to look for the presence and evolution of ethnic or national bias. A longer-term goal is the use of speech recognition methods to study the occurrence of these lemmas on radio and TV.

Abbreviations
• CDS - Il Corriere della Sera

The author also wishes to thank the four newspapers considered (The New York Times, Il Corriere della Sera, La Repubblica, La Stampa) for having their historical archives freely available for consultation: without these resources the paper could not have been written.

Appendix. Estimation of the missing events
For a power-law distribution of the number of articles $N(k)$ reporting that $k$ people have been killed, the total number of articles with two or more people killed fixes the normalization of the distribution for a given $N_{tot}$. In a single power law, the ratio $W$ of articles with fewer ($N_L$) and more ($N_H$) than 10 people dead depends only on $\gamma$: a high value of $\gamma$ implies an emphasis on articles with low $k$ and vice versa. For $\gamma = 1.43$ we have an equal number of articles with $2 \leq k \leq 10$ and $k > 10$ (Figure 12). For a broken power-law distribution, the excess ($M > 0$) or deficit ($M < 0$) with respect to a pure power law ($M = 0$) depends on the two values $\gamma_H$ and $\gamma_L$. They are shown in Figure 13.
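The single power-law relations of this appendix can be sketched as follows. The sketch assumes a continuous approximation (integrals over $k$ rather than sums over integer values) and introduces a normalization constant $A$; both are assumptions not stated in the text, but under them the condition $W = 1$ reproduces the quoted value $\gamma = 1.43$.

```latex
\begin{align}
N(k) &= A\,k^{-\gamma}, \qquad \gamma > 1,\ k \geq 2 \\
N_{tot} &= \int_{2}^{\infty} A\,k^{-\gamma}\,dk = \frac{A\,2^{1-\gamma}}{\gamma-1}
  \quad\Rightarrow\quad A = N_{tot}\,(\gamma-1)\,2^{\gamma-1} \\
W &= \frac{N_L}{N_H}
   = \frac{\int_{2}^{10} k^{-\gamma}\,dk}{\int_{10}^{\infty} k^{-\gamma}\,dk}
   = \frac{2^{1-\gamma}-10^{1-\gamma}}{10^{1-\gamma}} = 5^{\gamma-1}-1 \\
W &= 1 \ \Leftrightarrow\ \gamma = 1 + \frac{\ln 2}{\ln 5} \simeq 1.43
\end{align}
```

In this form $W$ increases monotonically with $\gamma$, in line with the statement that a high spectral index corresponds to an emphasis on articles with low $k$.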

Supplementary online material

Systematic error associated with the finite sample of newspapers
To assess the systematic error arising from the finite sample of events for a given selection, we performed the power-law fit on $N_{test} = 1000$ different subsets, each obtained by removing a percentage $P_{cut}$ of random events from those passing the cut. The fit was performed on each of these distributions and the resulting values of $\gamma$ were histogrammed. The sigma resulting from a Gaussian fit of each histogram of values of $\gamma$ at a given $P_{cut}$ can be used as an estimate of the systematic error associated with the spectral index $\gamma$.
Values of $5\% \leq P_{cut} \leq 50\%$ have been used to test the robustness of the algorithm against the finite size of the data set. As expected, the error grows more slowly with increasing $P_{cut}$ for larger samples. For instance, for the overall spectral index of the newspapers (see Figure 14) it goes from 0.002 for $P_{cut} = 5\%$ to 0.02 for $P_{cut} = 50\%$. If we assume a very conservative value of $P_{cut} = 30\%$ as the incompleteness of the data set, we can estimate the systematic error due to fitting to be 0.07 for the full dataset. Unless otherwise noted, this value has been added to the statistical errors in the plots.
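The random-removal procedure described above can be sketched as follows. This is a hedged illustration on synthetic data, not the analysis code: the fit is replaced by a continuous maximum-likelihood estimator of $\gamma$ (the fitting method is not specified in the text), the Gaussian fit of the histogram is replaced by a plain sample standard deviation, and all function names, the sample size and the number of subsets are hypothetical.

```python
import math
import random
import statistics


def sample_power_law(n, gamma, k_min, rng):
    """Draw n values from a continuous power law p(k) ~ k**(-gamma), k >= k_min."""
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]


def fit_gamma(values, k_min):
    """Maximum-likelihood spectral index of a continuous power law (assumed fit method)."""
    return 1.0 + len(values) / sum(math.log(v / k_min) for v in values)


def subsample_spread(values, k_min, p_cut, n_test, rng):
    """Standard deviation of gamma over n_test subsets with a fraction p_cut removed."""
    n_keep = int(round(len(values) * (1.0 - p_cut)))
    gammas = [fit_gamma(rng.sample(values, n_keep), k_min) for _ in range(n_test)]
    return statistics.stdev(gammas)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic "events": 5000 values drawn with a true spectral index of 2.0.
events = sample_power_law(5000, gamma=2.0, k_min=2.0, rng=rng)

gamma_full = fit_gamma(events, k_min=2.0)
spread_5 = subsample_spread(events, 2.0, p_cut=0.05, n_test=200, rng=rng)
spread_50 = subsample_spread(events, 2.0, p_cut=0.50, n_test=200, rng=rng)
```

The spread of the fitted $\gamma$ grows with $P_{cut}$, and repeating the sketch with a larger synthetic sample gives a smaller spread at the same $P_{cut}$.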

Distributions of least significant digit
For NYT, Figure 15 shows the $\chi^2$ test comparing various distributions of the least significant digit: domestic - foreign, domestic - flat, foreign - flat, all - flat (full dataset). From the resulting $\chi^2$ and p values it is possible to see that none of the distributions is compatible with the flat hypothesis. The analysis of the residuals and the QQ plot shows that the highest deviations occur in the digits '0' and '5'. Figures 16 and 17 show the corresponding tests; Figure 18 shows the $\chi^2$ test for REP.
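The domestic - foreign comparison can be sketched as a two-sample $\chi^2$ test between binned digit histograms with different totals. This is an illustrative sketch under assumptions: the statistic is the standard Pearson homogeneity form for two histograms, which may differ from the exact test used for the figures, and the two histograms below are invented toy data.

```python
# Tabulated chi-square critical value for 9 degrees of freedom, significance 0.01.
CHI2_CRIT_DF9_P01 = 21.666


def chi2_two_histograms(a, b):
    """Pearson chi-square for the hypothesis that two histograms share one distribution.

    Bins that are empty in both histograms are skipped (each skipped bin
    would also reduce the number of degrees of freedom by one).
    """
    n_a, n_b = sum(a), sum(b)
    chi2 = 0.0
    for a_i, b_i in zip(a, b):
        if a_i + b_i == 0:
            continue
        chi2 += (a_i * n_b - b_i * n_a) ** 2 / (n_a * n_b * (a_i + b_i))
    return chi2


# Toy least-significant-digit histograms (digits 0..9), illustrative only.
domestic = [90] * 10                                # no rounding peaks
foreign = [270, 90, 90, 90, 90, 90, 90, 90, 0, 0]   # excess of '0', deficit of '8', '9'

chi2_same_shape = chi2_two_histograms(domestic, [2 * c for c in domestic])
chi2_dom_for = chi2_two_histograms(domestic, foreign)
distributions_differ = chi2_dom_for > CHI2_CRIT_DF9_P01
```

Two histograms with the same shape but different totals give $\chi^2 = 0$, so the statistic is sensitive only to differences in shape, not in the number of articles.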
See also Table 5 for the P test values.

Magenta is the ratio $(k = 1)/(k \geq 2)$, (morto+morta)/morti. Blue is the percentage of female deaths, morta/(morto+morta). Both ratios are multiplied by $10^2$. The percentage of $k = 1$ events halves for CDS over time and is roughly constant for STA after 1890, with drops during WW1 and Fascism. The reporting of female deaths is constant at 30% for CDS and STA, and doubles from 20% to 40% for REP.

Table 2), the domestic/foreign ones with two power laws $\gamma_L$ ($2 \leq k \leq 10$) and $\gamma_H$ ($k > 10$). For all foreign events there is a break in the spectral index, $\gamma_L < \gamma_H$, due to missing events not being reported (from 91% for REP to 122% for CDS). In domestic events $\gamma_L > \gamma_H$ for Italian papers (over-reporting of low $k$ events, from 47% in STA to 71% for CDS). In NYT, instead, the decrease in $\gamma_L$ for domestic events is lower (over-reporting of 4%) than for foreign ones (under-reporting of 98%), hinting at a higher degree of internationalization of this publication (Table 3).

$\Delta\gamma = \gamma_L - \gamma_H$ vs $\gamma_H$ obtained from the power-law fit of the distributions of domestic and foreign events. $\Delta\gamma \simeq 0$ means a pure power law over the whole distribution, whereas a negative (positive) $\Delta\gamma$ implies a lack (excess) of articles with a small ($2 \leq k \leq 10$) number of deaths. A higher (lower) value of $\gamma_H$ implies more emphasis on low (high) values of $k$. In all newspapers, foreign distributions have a lower $\Delta\gamma$ than domestic ones and a lower value of $\gamma_H$, showing that high $k$ events have a higher importance than low $k$ ones. NYT has the smallest differences in $\Delta\gamma$, a sign of greater uniformity of treatment between domestic and foreign events.

The New York Times