Infoveillance of infectious diseases in USA: STDs, tuberculosis, and hepatitis

Big Data Analytics have become an integral part of Health Informatics over the past years, with the analysis of Internet data becoming increasingly popular in health assessment across various topics. In this study, we first examine the geographical distribution of the online behavioral variations towards Chlamydia, Gonorrhea, Syphilis, Tuberculosis, and Hepatitis in the United States by year from 2004 to 2017. Next, we examine the correlations between Google Trends data and official health data from the ‘Centers for Disease Control and Prevention’ (CDC) on said diseases, followed by estimating linear regressions for the respective relationships. The results show that Infoveillance can assist with exploring public awareness and accurately measure the behavioral changes towards said diseases. The correlations between Google Trends data and CDC data on Chlamydia cases are statistically significant at national level and in most of the states, while the forecasting performs well in many states. For Hepatitis, significant correlations are observed for several US states, while forecasting also exhibits promising results. On the contrary, several factors can affect the applicability of this forecasting method, as in the cases of Gonorrhea, Syphilis, and Tuberculosis, where the correlations are statistically significant in fewer states. Thus, this study highlights that the analysis of Google Trends data should be done with caution in order for the results to be robust. In addition, we suggest that the applicability of this method is neither trivial nor universal, and that several factors need to be taken into account when using online data in this line of research.
However, this study also supports previous findings suggesting that the analysis of real-time online data is important in health assessment, as it tackles the long procedure of data collection and analysis in traditional survey methods, and provides us with information that could not be accessible otherwise.

Google Trends [7], the most popular tool for retrieving online information, is widely used in health care research [8]. The main advantages of Google Trends data are that they are available in real time and that they reflect revealed rather than stated preferences [9]. Google Trends has been a useful tool for the analysis, monitoring, forecasting, and nowcasting of many health topics: seasonal [2,10], chronic [11][12][13][14], and infectious diseases [15][16][17], as well as outbreaks and epidemics, such as AIDS [18], Measles [19], Ebola [20,21], MERS [22], and the Zika Virus [23][24][25]. Online queries have been employed extensively for the analysis and forecasting of Influenza Like Illness, i.e., the flu [6][26][27][28], while interest in analyzing Google queries for vaccination-related topics has grown over the last couple of years [19][29][30][31]. Other topics in which Google Trends data have found significant applicability include the monitoring of cancer types and screenings [32][33][34][35], the relation between online queries and suicide rates [36][37][38][39], and the analysis of online interest in both legal [40][41][42] and illegal drugs [43,44].
Though Google Trends data have been much employed in forecasting, a gap exists in forecasting disease cases using said data. This gap could be mainly attributed to the low openness and availability of official health data, as well as to regional limitations due to Internet penetration and restrictions. Traditional methods, e.g., surveys and questionnaires, are time consuming in both data collection and analysis, so the results become available long after the period to which they refer. In addressing this drawback, online data have exhibited promising results in this line of research, i.e., showing that Internet data correlate with official health data and further examining the possibility of monitoring and forecasting diseases using data from online sources.
Towards examining novel, alternative methods of disease surveillance, this study provides an overview of the Infoveillance of five diseases, i.e., Chlamydia, Gonorrhea, Syphilis, Tuberculosis, and Hepatitis, using Google Trends data. Next, we explore the possibility of forecasting cases of said diseases in the US at both national and state level. All examined diseases are in the 2018 list of National Notifiable Conditions for Infectious Diseases, i.e., included in the CDC list for Surveillance Case Definitions [45], defined as: "a set of uniform criteria used to define a disease for public health surveillance. Surveillance case definitions enable public health officials to classify and count cases consistently across reporting jurisdictions" [46].
For the diseases included in the National Notifiable Infectious Diseases list, the monitoring and analysis of the effects and trends of said diseases is achieved via public health surveillance. Despite provisional data being available in shorter time frames, the official data on the diseases are published annually. This is a long procedure involving a chain of several health officials; hence the data are far from being real time [45].
Out of the notifiable diseases, Chlamydia is the most common one, and is also the most common sexually transmitted disease (STD). It occurs most frequently among young females, while most infected people have no symptoms. Chlamydia can have serious effects on a woman's health, even causing infertility. Chlamydia carries increased risks, such as acquiring HIV infection or passing the disease to the baby during delivery. There is a lack of awareness on the subject, while testing does not reach as many women as it should [47].
Gonorrhea is a very common STD, transmitted through the reproductive male and female parts, but also through the mouth and anus. As in the case of Chlamydia, Gonorrhea is mostly asymptomatic, can be passed from mother to child during childbirth, and could even result in infertility. It is prevalent in young adults and African Americans. Gonorrhea also increases the risk of getting HIV [48].
Syphilis is an STD with very serious effects on human health, mainly transmitted through sexual contact or direct contact with infected genitals, anus, and mouth. Congenital Syphilis, i.e., passing the disease from mother to baby, mostly occurs in Black and Hispanic mothers; it is a very serious complication of the disease and can result in stillbirth or death of the baby. As in Chlamydia and Gonorrhea, the infection of Syphilis increases the risk of HIV transmission. As the symptoms can point to several other diseases, diagnosis of Syphilis can take several months, or even years. The progression of the disease consists of three stages, i.e., the Primary Stage, the Secondary Stage, and the Latent Stage. Tertiary Syphilis can occur even 30 years after the initial infection and could result in death, while Neurosyphilis and Ocular Syphilis can occur at any stage of the infection, causing serious complications [49].
Tuberculosis (TB) is an infectious disease that mainly affects the lungs and could result in serious complications or death. The risk of TB is higher amongst those with weakened immune systems, for example, those with HIV. Tuberculosis is divided into TB disease and latent TB infection, in which the disease does not develop [50].
Hepatitis is an infectious disease resulting in the inflammation of the liver. It is mainly caused by one of the three most common viruses, i.e., Hepatitis A (HAV), Hepatitis B (HBV), or Hepatitis C (HCV). Hepatitis A is a vaccine preventable, highly contagious disease, and can be transmitted through food, drinks, stool, or through close contact with an infected person. It cannot result in a chronic disease, while it is usually not fatal. On the contrary, Hepatitis B and Hepatitis C can be either acute or chronic, while they can result in serious health issues, even death. Hepatitis B is also vaccine preventable, while for Hepatitis C there is no vaccine yet. Hepatitis B is most commonly transmitted through blood, semen, sexual contact, and needles, while Hepatitis C is most commonly met amongst those who share needles or other drug related equipment [51].
The rest of the paper is structured as follows: In "Data and methods", the data collection procedure and analysis are detailed, and in "Results", the results are presented. "Discussion" consists of the discussion of the analysis, while "Conclusions" presents the overall conclusions and further research suggestions.

Data and methods
Data used in this study are retrieved online by Google Trends [7] and are normalized over the selected period as follows: "Search results are proportionate to the time and location of a query: Each data point is divided by the total searches of the geography and time range it represents, to compare relative popularity. Otherwise places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0-100 based on a topic's proportion to all searches on all topics. Different regions that show the same number of searches for a term will not always have the same total search volumes" [52].
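The normalization described in the quotation can be illustrated with a minimal sketch. Google does not publish raw query counts, so the counts below are hypothetical, and the function name `normalize_trends` is an assumption made only for illustration. The sketch mirrors the two quoted steps (divide by total searches, then scale to 0-100) under the further assumption that the period with the highest share receives the value 100; it is not Google's actual implementation.

```python
def normalize_trends(term_counts, total_counts):
    """Scale a term's share of all searches to a 0-100 index.

    term_counts  -- hypothetical searches for the term in each period
    total_counts -- hypothetical total searches in the same periods
    """
    if len(term_counts) != len(total_counts):
        raise ValueError("both series must cover the same periods")
    # Step 1: relative popularity = term searches / all searches.
    shares = [t / total for t, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    # Step 2 (assumed): rescale so the period with the highest share is 100.
    return [round(100 * s / peak) for s in shares]


# Hypothetical yearly counts: the raw count grows every year, but the
# share of all searches peaks in the second year.
index = normalize_trends([50, 120, 150], [1000, 2000, 3000])
print(index)
```

The example shows why a larger raw search count does not necessarily produce a higher normalized value: only the share of total searches matters.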
Data on disease cases and rates are retrieved from CDC's AtlasPlus [53]. This database contains data for 6 infectious diseases, i.e., HIV/AIDS, Chlamydia, Gonorrhea, Syphilis, Tuberculosis, and Hepatitis. Following the well-performing forecasting results for AIDS [18], in this study we use data on the rest of the diseases included in AtlasPlus. The data retrieved for Hepatitis are from January 1st, 2004 to December 31st, 2015, while for the rest of the examined diseases the examined time frame is from January 1st, 2004 to December 31st, 2016. Note that the data may vary very slightly depending on the time of retrieval.
The steps towards examining the possibility of forecasting said diseases using Google Trends data are as follows. First, we provide an overview of the online interest variations on each of these diseases for the respective examined periods. Next, we visualize the geographical distribution of the online interest in each disease for all states for each individual year from 2004 to 2017. We then calculate the Pearson correlations between Google Trends data and the respective CDC data on each disease's cases. Finally, we estimate linear regressions for the examined diseases at both national and state level, in order to examine the possibility of forecasting said diseases using Google Trends data.
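The last two steps can be sketched in a few lines of plain Python. The text does not specify which software was used, so this is only an illustration under stated assumptions: the helper names `pearson_r` and `fit_line` are hypothetical, and the two short series are invented placeholder values, not data from Google Trends or the CDC.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares fit of y = alpha * x + beta; returns (alpha, beta, R^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    ss_res = sum((b - (alpha * a + beta)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return alpha, beta, 1 - ss_res / ss_tot


# Hypothetical data: yearly mean Google Trends index (independent variable)
# and yearly reported cases (dependent variable).
trends = [40, 45, 50, 55, 60]
cases = [1000, 1100, 1150, 1300, 1350]

r = pearson_r(trends, cases)
alpha, beta, r2 = fit_line(trends, cases)
print(f"r = {r:.4f}, y = {alpha:.2f}x + {beta:.2f}, R^2 = {r2:.4f}")
```

In practice the same pair of functions would be applied once to the national series and once per state.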

Results
This section presents the results for the five examined diseases, i.e., Chlamydia, Gonorrhea, Syphilis, Tuberculosis, and Hepatitis.

Chlamydia
It is evident that the online interest in the term 'Chlamydia' is significant throughout the examined period, i.e., from 2004 to 2017. In the US, the top related searches for the term 'Chlamydia' from 2004 to 2016 include: 'chlamydia symptoms' (100), 'chlamydia gonorrhea' (50), 'symptoms of chlamydia' (38), 'chlamydia men' (36), 'std chlamydia' (34), 'std' (33), 'chlamydia treatment' (33), 'treatment chlamydia' (33), 'chlamydia in men' (28), 'chlamydia infection' (26), 'chlamydia in women' (25), 'what is chlamydia' (24), 'chlamydia test' (22), 'chlamydia symptoms women' (19), 'chlamydia symptoms men' (18), 'chlamydia symptoms in women' (16), 'chlamydia symptoms in men' (16), 'chlamydia discharge' (15), 'chlamydia signs' (14), 'chlamydia cure' (13).
Table 1 presents the Pearson correlation coefficients between Google Trends data on the term 'Chlamydia' and official Chlamydia cases in each US state from 2004 to 2016. At national level, the correlation between the yearly averages of Google Trends data and yearly cases of Chlamydia from 2004 to 2016 is statistically significant (r = 0.9096, p < 0.01). The correlations are also statistically significant for all states, apart from Arkansas, Mississippi, Hawaii, North Dakota, and West Virginia.
The next step is to identify the relationship between Chlamydia cases and the online interest in the term. Table 2 presents the coefficients α and β and the respective R² for each of the linear regressions of the form y = αx + β estimated for the relationship between Chlamydia cases (dependent variable) and Google Trends data (independent variable). For the US, the equation describing the relationship is y = 9012x + 681655 with an R² of 0.8277.
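The significance level attached to a correlation such as the national Chlamydia value can be checked with a short calculation. The text does not state which test was used; the sketch below assumes the usual two-tailed t-test for a Pearson coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²), with n = 13 yearly observations (2004 to 2016). The critical values are taken from standard t tables for 11 degrees of freedom, and the function names are hypothetical.

```python
import math

N_YEARS = 13  # yearly observations from 2004 to 2016 (assumed sample size)

# Two-tailed critical values of Student's t for N_YEARS - 2 = 11 degrees
# of freedom, taken from standard t tables.
T_CRITICAL_DF11 = {0.10: 1.796, 0.05: 2.201, 0.01: 3.106}


def t_statistic(r, n=N_YEARS):
    """t statistic for testing whether a Pearson coefficient differs from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def significance_level(r):
    """Smallest tabulated level at which r is significant, or None."""
    t = abs(t_statistic(r))
    passed = [level for level, t_crit in T_CRITICAL_DF11.items() if t > t_crit]
    return min(passed) if passed else None


# National Chlamydia correlation reported in the text.
t_chlamydia = t_statistic(0.9096)
level = significance_level(0.9096)
print(f"t = {t_chlamydia:.2f}, significant at p < {level}")
```

Under these assumptions the statistic exceeds the 0.01 critical value, in agreement with the reported p < 0.01.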
Most of the respective models at state level also perform well, indicating that the forecasting of Chlamydia cases is possible using online search traffic data.

Gonorrhea
[...] (41), 'gonorrhea std' (40), 'treatment gonorrhea' (35), 'syphilis' (30), 'gonorrhea men' (28), 'herpes' (25), 'what is gonorrhea' (24), 'gonorrhea in women' (23), 'chlamydia and gonorrhea' (22), 'gonorrhea in men' (22), 'gonorrhea symptoms women' (19), 'gonorrhea discharge' (19), 'gonorrhea symptoms men' (18), 'gonorrhea test' (15), 'throat gonorrhea' (15), 'stds' (15).
Table 3 presents the Pearson correlation coefficients between Google Trends data on the term 'Gonorrhea' from 2004 to 2016 and data on Gonorrhea cases from the CDC for the same period. Contrary to Chlamydia, no statistically significant correlation is observed for the US (r = 0.0974, p > 0.1), while significant correlations are only observed in the states of Michigan, South Carolina, Alabama, California, Kentucky, [...]
Table 4 presents the coefficients α and β and the respective R² for each of the linear regressions. For the US, the estimated model is y = 325.28x + 334069 with an R² of 0.0095. In the three states for which significant correlations with p < 0.01 are observed, i.e., Illinois, Michigan, and South Carolina, the respective R² values of the linear regressions for Gonorrhea cases are 0.6867, 0.5966, and 0.6556.
The R² values of the estimated equations are not very high even in the states with significant correlations between online and official data on Gonorrhea, while for the US the R² is very low. Thus the forecasting of Gonorrhea cases using this method cannot be performed at this point.

Syphilis
The top related queries for the term 'Syphilis' from 2004 to 2016 in the US include: 'symptoms syphilis' (97), 'herpes' (37), 'gonorrhea' (36), 'symptoms of syphilis' (34), 'chlamydia' (33), 'std syphilis' (33), 'std' (32), 'what is syphilis' (31), 'syphilis pictures' (28), 'syphilis treatment' (27), 'tuskegee' (25), 'tuskegee syphilis' (25), 'syphilis rash' (24), 'syphilis test' (21), 'hiv' (17), 'tuskegee syphilis study' (16), 'syphilis penis' (15), 'syphilis disease' (15), 'syphilis in men' (14), 'stds' (14), 'gonorrhea symptoms' (13), 'chlamydia symptoms' (12), 'herpes symptoms' (12).
Table 5 presents the Pearson correlation coefficients between Google Trends data and numbers of Syphilis cases for each examined state. Data on Syphilis cases for calculating the Pearson correlations are retrieved from CDC AtlasPlus [53] by adding the 'Primary and Secondary Syphilis' cases to the 'Early Latent Syphilis' cases. 'Congenital Syphilis' cases are not included, as data are not available for most of the states for most of the years. However, when the Congenital Syphilis cases are added to the analysis, the correlations and the respective results remain significant in the same states. For the years where data for Early Latent Syphilis are not available, only data from 'Primary and Secondary Syphilis' cases are used.
For the US, the correlation between online data and Syphilis cases is statistically significant (r = 0.6478, p < 0.05). At state level, significant correlations are only observed in California, Illinois, Massachusetts, Utah, Arkansas, Colorado, DC, Minnesota, Nevada, New Hampshire, North Carolina, Iowa, Michigan, New York, Ohio, and Washington. The states of North Dakota, South Dakota, and Wyoming are excluded from further analysis due to the lack of complete datasets in all Syphilis subcategories. Table 6 presents the coefficients α and β and the respective R² for each of the linear regressions for Syphilis cases. For the US, the equation describing the linear relationship between online data and official Syphilis cases is y = 748.65x − 26929 with an R² of 0.4196, indicating that, though the model does not perform well at this point, we could see promising results in the future when more data are available.
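As a minimal sketch of how a fitted line of this form would be used, the reported national Syphilis coefficients can be evaluated for a given search index. The function name `predict_cases` and the input value 60 are hypothetical, chosen only for illustration; given the modest R², the output should be read as an illustration of the arithmetic rather than as a usable forecast.

```python
def predict_cases(trends_index, alpha, beta):
    """Evaluate the fitted line y = alpha * x + beta for one Google Trends value."""
    if not 0 <= trends_index <= 100:
        raise ValueError("Google Trends values lie between 0 and 100")
    return alpha * trends_index + beta


# Coefficients of the national Syphilis model reported in the text.
ALPHA, BETA = 748.65, -26929

# Hypothetical yearly mean search index, chosen only for illustration.
estimate = predict_cases(60, ALPHA, BETA)
print(round(estimate))

# Index value at which the fitted line crosses zero cases. Below it the
# line returns negative counts, so it should not be extrapolated that far.
zero_crossing = -BETA / ALPHA
print(round(zero_crossing, 2))
```

Because the intercept is negative, the line is only meaningful within the range of search-index values that were actually observed.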
The states where the estimated models perform relatively well are Illinois and Massachusetts, for both of which the correlations between online and official data are highly significant (p < 0.01). It is thus evident that, as in the case of Gonorrhea, Syphilis cases cannot be forecasted using this method at this point.

Tuberculosis
[...] (38), 'tuberculosis treatment' (32), 'symptoms of tuberculosis' (29), 'tuberculosis disease' (29), 'tb test' (19), 'tuberculosis vaccine' (18), 'tuberculosis causes' (14), 'who tuberculosis' (13), 'tuberculosis skin test' (13).
Table 7 presents the Pearson correlation coefficients (r) between Google Trends data and Tuberculosis cases for each of the states, while Table 8 presents the coefficients α and β and the respective R² for each of the linear regressions for Tuberculosis cases.
For the US, a statistically significant correlation is observed (r = 0.5672, p < 0.05) between the online interest in the term 'Tuberculosis' and official Tuberculosis cases. Statistically significant correlations with p < 0.01 are observed for the states of DC, Louisiana, and Wisconsin, with p < 0.05 for Illinois, Kentucky, Maryland, New Hampshire, Rhode Island, and Virginia, and with p < 0.1 for Alabama and California. Based on the calculated correlations, the respective estimated models are not expected to perform well in most of the states. For the US, the relationship between Google Trends data and Tuberculosis cases is described by y = 147.51x + 3787 with an R² of 0.3217. The only state with results suggesting that forecasting could be possible at this point is Michigan, with an R² of 0.6840. Therefore, as in the cases of Gonorrhea and Syphilis, Tuberculosis forecasting is not possible in all states at this point using this method.

Hepatitis
[...] (27), 'hepatitis c treatment' (26), 'what is hepatitis c' (26), 'what is hepatitis a' (23), 'hepatitis b symptoms' (22), 'viral hepatitis' (21), 'what is hepatitis b' (20), 'hepatitis a symptoms' (20), and 'hepatitis transmission' (17).
Table 9 presents the Pearson correlation coefficients (r) between Google Trends data and Hepatitis cases for each of the states. For calculating the correlations, the sum of the cases for Hepatitis A, Hepatitis B, and Hepatitis C is used. Where data are not available for a category, the sum of the remaining ones is used.
For the US, a statistically significant correlation was observed between Hepatitis cases and Google Trends data (r = 0.9583, p < 0.01). For Hepatitis A, a statistically significant correlation was observed between reported cases and Google data in the US (r = 0.9045, p < 0.01); the same holds for Hepatitis B (r = 0.8922, p < 0.01). On the other hand, for Hepatitis C cases, no correlation was observed with Google Trends data (r = −0.3089, p > 0.1), indicating that Hepatitis C does not contribute significantly to the high correlation between all Hepatitis cases and Google data. Table 10 presents the coefficients α and β and the respective R² for each of the linear regressions for Hepatitis cases for all US states, apart from DC, where full datasets are not available.
For the US, the equation describing the linear relationship between Hepatitis cases and Google Trends data is y = 261.44x − 8197.4 with an R² of 0.9184. The states of Arizona, Florida, Hawaii, New York, Pennsylvania, and Wisconsin exhibit well-performing forecasting results. Several other states have relatively high R² values, indicating that they will exhibit better results once more years of data are available. As depicted in Fig. 10, in 2016 the online interest in all states but Hawaii is very low. This can be attributed to the Hepatitis A outbreak in Hawaii in August 2016, possibly linked to raw scallops that were served at a Hawaiian restaurant [54]. Since Google Trends values are normalized, the spike in Hawaii makes the interest in the rest of the states appear very low. This constitutes a good example of how an unexpected event can (negatively) affect this method of forecasting, but also of how real-life events are immediately and accurately depicted in online searches. The latter is very significant for the real-time examination of epidemics and outbreaks.
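For a regression with a single predictor, R² equals the square of the Pearson coefficient r, so the national values reported in this section can be cross-checked against each other. The sketch below does this with the figures quoted in the text. The tolerance is an assumption that allows for rounding of the published four-decimal values.

```python
# (r, R^2) pairs for the national models, as reported in the Results.
REPORTED = {
    "Chlamydia": (0.9096, 0.8277),
    "Gonorrhea": (0.0974, 0.0095),
    "Syphilis": (0.6478, 0.4196),
    "Tuberculosis": (0.5672, 0.3217),
    "Hepatitis": (0.9583, 0.9184),
}

TOLERANCE = 1e-3  # assumed allowance for rounding of the published values

for disease, (r, r2) in REPORTED.items():
    gap = abs(r ** 2 - r2)
    status = "consistent" if gap < TOLERANCE else "check"
    print(f"{disease:<13} r^2 = {r ** 2:.4f}  reported R^2 = {r2:.4f}  {status}")
```

All five pairs agree within the stated tolerance, which is what one would expect if each national model is a simple linear regression on the Google Trends series alone.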

Discussion
The surveillance of diseases using information available online, i.e., Infoveillance, has become an integral part of Health Informatics over the past years. Internet data can provide a large amount of information that could not be accessed through traditional surveillance methods, such as questionnaires, surveys, and registries. New methods and approaches are constantly discovered and used in order to take advantage of what the Internet has to offer.
In this study, we assessed the online interest in the US at both national and state level in five infectious diseases, in order to show how Internet data can be used in the Infoveillance of said diseases, and explore the possibility of forecasting cases using online search traffic data.
Yearly data from the CDC Atlas website [53] were used, which are available up to 2015 or 2016 (depending on the disease) for Chlamydia, Gonorrhea, Syphilis, Tuberculosis, and Hepatitis. In the case of AIDS, the estimated forecasting models of AIDS prevalence in the US exhibited very good performance [18], supporting previous work on the subject suggesting that empirical relationships between online data and official health data exist, and highlighting the usefulness of this tool in health assessment.
As is evident from the geographical distribution of the online interest towards the examined diseases in each state per year since 2004, Google Trends data are an accurate and valuable way to measure public interest and awareness on the subject. This is essential especially for STDs, since innovative public surveillance methods, preventive measures, and increased public information via traditional and new channels can increase awareness, particularly in the regions where said diseases' rates are higher. Table 11 presents the US CDC reported cases for the diseases included in Atlas for the year 2016, apart from Hepatitis, for which data refer to the year 2015. As is evident, Chlamydia cases are by far the most numerous. The latter could explain why statistically significant correlations are observed between Google Trends data and reported Chlamydia cases in most US states, and why the forecasting models perform well. All diseases apart from Tuberculosis show an increase over the previous year, indicating that better forecasting, and for more diseases, will probably be possible in the future using this method. Table 12 presents the US yearly rates (per 100,000) for Chlamydia, Gonorrhea, Syphilis, and Tuberculosis from 2004 to 2016, and for Hepatitis from 2004 to 2015. For Hepatitis, the reported rate is the sum of the rates for Hepatitis A, Hepatitis B, and Hepatitis C, while for Syphilis, the rate is the sum of the rates for Primary and Secondary Syphilis, Early Latent Syphilis, and Congenital Syphilis.
As shown in Table 12, Chlamydia rates in the US are significantly higher than the rates for the rest of the examined diseases. This partly explains why Chlamydia cases exhibit such high correlations with online search traffic data and why the forecasting of Chlamydia is possible in many states using Google Trends data. For Syphilis and Tuberculosis, the rates included in Table 12 show that said diseases have very low rates, with Tuberculosis showing a downward trend since 2004. The low rates can partly explain why this method does not apply to these diseases. This is contrary to the case of Hepatitis, which may have the lowest number of reported cases (Table 11) and a downward rate trend (Table 12), but shows more promising results in forecasting. Based on the observations for Tuberculosis and Syphilis, however, and as significant correlations between Hepatitis cases and online queries are observed in 29 out of 50 states, there is a slight possibility that what is observed is a decrease in the significance of the reported results instead of a projected increase in the future. For Gonorrhea, the online behavioral assessment is not trivial, as it is a word that is often misspelled, mostly as 'Gonorrea', contrary to, e.g., AIDS, which is a word that is not misspelled, and for which the forecasting results exhibit good performance.
Many factors should be taken into account when using online search traffic data in health assessment, and the results should be interpreted carefully. This study is an overview of how infoveillance methods can be applied in monitoring and forecasting disease cases using online search traffic data. In this analysis, we highlight not only what studies in this field normally highlight, i.e., the usefulness of Internet data in the monitoring and forecasting of diseases' prevalence, but also provide examples of cases where this method does not work. In fact, we emphasize how the suitability of this method, along with the respective forecasting results, can be affected by low rates or other factors.
However, despite previous concerns on the reliability of using Google data as a means of disease monitoring [55], including the case of Google Flu Trends [56], which is no longer available [57], the use of Google Trends data in health and medicine has exhibited very promising results so far. Nevertheless, it is essential to understand that this method cannot be applied in every case, and, more importantly, that the methodology should be designed cautiously and that the results must always be interpreted accordingly. Taking these limitations into account, future research should focus on employing more detailed and complex mathematical modeling in order to improve the forecasting of diseases and epidemics, since, in order for all available information to be integrated in health research, online data and data from traditional sources should be combined [56].
The overall assessment of the diseases examined in this study indicates the usefulness of Google Trends as a tool for disease surveillance, providing real-time data and thus