
A comparison of prospective space-time scan statistics and spatiotemporal event sequence based clustering for COVID-19 surveillance

  • Fuyu Xu,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Computing and Information Science, University of Maine, Orono, ME, United States of America

  • Kate Beard

    Roles Conceptualization, Formal analysis, Investigation, Software, Supervision, Validation, Writing – review & editing

    kate.beard@maine.edu

    Affiliation School of Computing and Information Science, University of Maine, Orono, ME, United States of America

Abstract

The outbreak of the COVID-19 disease was first reported in Wuhan, China, in December 2019. Cases in the United States began appearing in late January 2020. On March 11, 2020, the World Health Organization (WHO) declared a pandemic. By mid-March COVID-19 cases were spreading across the US with several hotspots appearing by April. Health officials point to the importance of surveillance of COVID-19 to better inform decision makers at various levels and efficiently manage distribution of human and technical resources to areas of need. The prospective space-time scan statistic has been used to help identify emerging COVID-19 disease clusters, but results from this approach can encounter strategic limitations imposed by constraints of the scanning window. This paper presents a different approach to COVID-19 surveillance based on a spatiotemporal event sequence (STES) similarity. In this STES based approach, adapted for this pandemic context, we compute the similarity of evolving daily COVID-19 incidence rates by county and then cluster these sequences to identify counties with similarly trending COVID-19 case loads. We analyze four study periods and compare the sequence similarity-based clusters to prospective space-time scan statistic-based clusters. The sequence similarity-based clusters provide an alternate surveillance perspective by identifying locations that may not be spatially proximate but share a similar disease progression pattern. Taken together, results of the two approaches can help track the progression of the pandemic and support local or regional public health responses and policy actions taken to control or moderate the disease spread.

Introduction

The first reported case of Coronavirus disease 2019 (COVID-19) appeared in the US in Washington State in January 2020. Cases then began to appear around the country, creating an outbreak more severe than that experienced in the city of Wuhan, China, where the initial outbreak occurred [1], as well as in many European countries [2, 3]. By mid-March 2020 the outbreak had spread to many states and by late April over one million confirmed cases had been reported in the US.

To anticipate and detect outbreaks, the World Health Organization (WHO), many national and local health departments, and academic or other non-profit organizations continuously collected information about occurrences of COVID-19. Incident cases were cumulatively added to different online repositories [4–6]. Quick detection of emerging geographical clusters or space-time clusters of COVID-19 can aid public health agencies in prioritizing spatial locations for allocation of different kinds of medical resources, including testing kits, and in applying efficient and publicly acceptable interventions. Versions of space-time scan statistics have been widely used to identify significant clusters of various diseases [7–11] as well as in the current COVID-19 crisis [12, 13]. Space-time scan statistics use circular or elliptical scanning windows of a series of sizes in combination with varying time intervals to systematically scan a study area to detect clusters of disease cases. The Poisson based space-time scan statistic evaluates each scan window for numbers of cases and tests for locations exceeding the number of expected cases under a Poisson distribution.

The prospective Poisson space-time scan statistic has been successfully used for space-time surveillance of different epidemic diseases. As Kulldorff et al. proposed [9, 10], this method focuses on detecting emerging clusters that start at any time during the study period and remain identifiable at the current time (i.e., active or alive), which is the major difference compared to the retrospective space-time scan statistic. Jones et al. used this method to detect twelve “live” or emerging statistically significant (p-value ≤ 0.05) clusters of shigellosis in the city of Chicago [14], the results of which helped local health departments to prioritize the assignment and investigation of shigellosis cases. The prospective Poisson space-time scan statistic has also been utilized to identify emerging clusters in other diseases such as thyroid cancer among men in New Mexico (1973–1992) [9], syndromic surveillance [15], measles [16], and dengue fever [17]. More recently, it has been used to detect “active” clusters of COVID-19 confirmed cases in the United States [12, 18].

While the prospective space-time scan statistic is a good option for detecting emerging space-time clusters of infectious diseases, some limitations remain. The effectiveness of the circular scan window decreases as the shape of emerging clusters becomes more irregular. Detected clusters may contain locations without confirmed cases or with low relative risk as an artifact of the scanning process [10, 12, 19], although this limitation can be minimized by reporting the individual relative risk for the included locations in each cluster. For the Poisson model, the results depend on accurate data on the population at risk, which may be hard to obtain. Furthermore, the prospective space-time scan statistic, as an exploratory method, should be followed with other surveillance measures and more detailed investigation of transmission dynamics and pathogenic mechanics of COVID-19 to better understand detected emerging clusters [12].

While the prospective space-time scan statistic has demonstrated value for COVID-19 surveillance, the objective of this study was to demonstrate a different but complementary view of COVID-19 outbreak patterns. The space-time scan statistic detects hotspots but does not inform about locations that may be spatially disparate yet exhibit highly similar patterns in disease case count evolution. To capture this dynamic, we employed an event sequence similarity metric on the sequences of daily COVID-19 incidence rates by county. This event sequence similarity metric was then used to cluster counties exhibiting similarly evolving COVID-19 case histories. The resulting identification of locations exhibiting similar evolutionary patterns in the disease provides another aid for public health responses and understanding of disease dynamics. In the remainder of this paper, we describe this event sequence similarity metric as applied to COVID-19 daily incidence rates and compare it with results of the prospective Poisson space-time scan statistic. We use four time periods to illustrate progression of COVID-19 outbreaks through the lens of prospective space-time scan statistic generated clusters and event sequence similarity clusters. The two approaches provide different but complementary aids to COVID-19 surveillance. One identifies emerging spatial hotspots; the other identifies collections of locations that, for whatever reason, have statistically similar evolving COVID-19 incidence patterns.

Materials and methods

Data acquisition and processing

We accessed COVID-19 raw daily global collection data from the GitHub repository (https://github.com/CSSEGISandData/COVID-19) created and maintained by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [20]. The specific time series dataset for this research contains FIPS codes, state names, geolocations, and confirmed cumulative cases, starting from January 22, 2020 through selected ending dates. JHU CSSE continues to semi-automatically or automatically update their site daily (https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/).

County level population data for the USA were obtained from the national US Census with estimates for 2019. The ESRI ™ shapefiles of US states and counties used for Geographic Information System (GIS) mapping were downloaded from the TIGER geography portal (US Census Bureau) (https://www.census.gov/cgi-bin/geo/shapefiles/index.php).

We focused the analysis on the 48 contiguous states and Washington, D.C. The dataset was cleaned by filtering out the records without “FIPS” codes and names of counties, and with “FIPS” > 80000 (assigned with “Out of AL”, “Out of AK”, …, “Out of WY”). We combined the cleaned COVID-19 dataset with the U.S. census data at the county level through the “FIPS” codes and double checked the correctness of the spatial information (Latitude and Longitude). Because the COVID-19 dataset only contains cumulative case counts, we obtained the daily confirmed cases by subtracting the previous day’s number from the current day’s reported cumulative cases. The daily incidence rate for each county was obtained as daily confirmed cases divided by county population and multiplied by 10,000. We chose the data from the first wave of the COVID-19 pandemic in the US in 2020 for this study. The entire duration of the first wave is further divided into four analysis periods, considering the incubation time for the disease (mostly ranging from 1–14 days with an average of 5 days [21]) and the slow case increment at the beginning in January and February, 2020. The four analysis periods each start on January 22 and end roughly 2–4 weeks apart: an early period ending 1) March 13, and spiking periods ending 2) March 31, 3) April 19 and 4) May 20.
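The conversion from cumulative counts to daily incidence rates described above can be sketched as follows. This is a minimal illustration, not the authors' code (their processing was done in R); the function name `daily_incidence_rates` is hypothetical, and two choices are assumptions not stated in the text: the first day is differenced against zero, and negative differences (downward corrections in the cumulative series) are clipped to zero.

```python
def daily_incidence_rates(cumulative, population, per=10_000):
    """Turn one county's cumulative case counts into daily rates per `per` people."""
    rates = []
    previous = 0  # assumption: no cases before the first reported day
    for total in cumulative:
        new_cases = total - previous
        # Assumption: a drop in the cumulative count is a data correction,
        # so the day's new cases are clipped to zero.
        new_cases = max(new_cases, 0)
        rates.append(new_cases * per / population)
        previous = total
    return rates


# Example: a county of 10,000 people with cumulative counts 0, 2, 5, 5.
example = daily_incidence_rates([0, 2, 5, 5], population=10_000)
```

Each element of the returned list is one event in the county's event sequence used later in the similarity analysis.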

Prospective Poisson space-time scan statistic

We used the prospective Poisson space–time scan statistic as implemented in SaTScan (http://www.satscan.org/) to detect clusters of COVID-19 cases that remained active at the end of each study period. The space–time scan statistic (STSS) is briefly introduced here, and more details can be obtained from [9, 10, 12, 22]. With spatial scan statistics we can identify the locations of clusters of cases. A cluster can be defined as a set of points or regions, at a user defined granularity, with either high or low rates of incidence. For this study, the focus was high rates of COVID-19 incidence. Conceptually the STSS uses a cylinder as the scanning window, where the circular base of the cylinder captures the spatial dimension while the height represents a temporal interval. To identify space-time clusters at the county level, the center of the circular base is co-located with the centroid of each county. As the scan progresses, the radius of the circular base and the height of the cylinder change from lower bounds to spatial and temporal upper limits. Similar to [12], we set the maximum scanning window base to include up to 10 percent of the total population to avoid the potential of extremely large clusters (i.e., covering a quarter of the country), especially as may occur at the beginning stage of the epidemic, and the upper temporal bound to 50% of the entire study period. As each cylinder moves over the study area, it covers a different set of cases for different time intervals, which can be considered as potential emerging space-time cluster candidates. We set the cluster’s duration to a minimum of 2 days and required at least 5 incidents or confirmed cases of COVID-19 as described in [12].
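To make the scanning constraints concrete, the sketch below enumerates candidate cylinders for a single center county. It is an illustration of the idea only and does not reproduce SaTScan's implementation: the function name `candidate_cylinders` and the input layout are hypothetical, distances are computed on planar coordinates for simplicity, and the minimum of 5 cases would be applied afterwards when each candidate is evaluated. Because the scan is prospective, every cylinder ends on the last day of the study period.

```python
import math


def candidate_cylinders(center, counties, n_days, max_pop_share=0.10,
                        min_days=2, max_time_share=0.50):
    """Yield (member_ids, start_day, end_day) for one center county.

    counties: dict id -> (x, y, population), coordinates assumed planar.
    Days are 0-indexed and every cylinder ends on day n_days - 1.
    """
    total_pop = sum(pop for _, _, pop in counties.values())
    cx, cy, _ = counties[center]
    # Grow the circular base by adding counties in order of centroid distance.
    by_distance = sorted(
        counties,
        key=lambda cid: math.hypot(counties[cid][0] - cx, counties[cid][1] - cy),
    )
    max_days = int(n_days * max_time_share)
    members, pop = [], 0
    for cid in by_distance:
        pop += counties[cid][2]
        if pop > max_pop_share * total_pop:
            break  # base would exceed the population cap
        members.append(cid)
        for days in range(min_days, max_days + 1):
            yield tuple(members), n_days - days, n_days - 1
```

Running this for every county centroid gives the full set of candidate space-time windows whose observed and expected case counts are then compared.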

The age structure of a population will influence the incidence of disease, and deaths from COVID-19 are several times higher in older age groups as noted by others [12]. However, we were unable to access age and sex data at this time for cases in this study, so we could not adjust for age and sex. Assuming that COVID-19 incidence follows a Poisson distribution according to the county population, i.e., the assumed population at risk [9], the likelihood ratio test statistic and the relative risk for each scan cylinder were calculated based on the description in [7–9, 12]. The cylinder with the maximum likelihood ratio identifies the location with the most likely elevated risk for COVID-19. We used standard Monte Carlo simulations (999) in the SaTScan setting to calculate the statistical significance of detected clusters, with a p-value equal to or less than 0.05 being considered statistically significant. SaTScan computes the relative risk (RR) for each cluster and for individual counties. The RR for a county within a cluster can be calculated as in [18]:

$$RR = \frac{c/e}{(C - c)/(C - e)}$$

where c is the total number of cases in a county, C is the total number of observed cases in the conterminous US, and e is the expected number of cases in a county, calculated as

$$e = \frac{p_{cty}}{P}\,C$$

(p_cty is the population in a county, P is the total population). We used ESRI ArcGIS 10.6 (www.esri.com) GIS software to create cartographic representations for these detected emerging clusters at the county level.
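The quantities in this subsection can be illustrated with a short sketch. It assumes the commonly used forms for the Poisson scan statistic: expected cases proportional to population share, relative risk as the ratio of inside to outside rates, and a log likelihood ratio that is zero unless observed cases exceed expected cases. The function names are hypothetical, and the p-value is the usual Monte Carlo rank estimate rather than SaTScan's internal code.

```python
import math


def expected_cases(county_pop, total_pop, total_cases):
    """Expected cases if risk were uniform across the whole population."""
    return county_pop / total_pop * total_cases


def relative_risk(c, e, C):
    """Risk inside (c/e) divided by risk outside ((C - c)/(C - e)).

    Undefined when c == C (no cases outside); callers should handle that case.
    """
    return (c / e) / ((C - c) / (C - e))


def poisson_log_likelihood_ratio(c, e, C):
    """Log likelihood ratio for a high-rate window; 0 when c <= e."""
    if c <= e:
        return 0.0
    llr = c * math.log(c / e)
    if C > c:
        llr += (C - c) * math.log((C - c) / (C - e))
    return llr


def monte_carlo_p_value(observed_max, simulated_maxima):
    """Rank of the observed statistic among the simulated maxima."""
    rank = 1 + sum(1 for s in simulated_maxima if s >= observed_max)
    return rank / (len(simulated_maxima) + 1)
```

With 999 simulated data sets, the smallest attainable p-value is 1/1000 = 0.001, which occurs when the observed maximum exceeds every simulated maximum.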

Event sequence similarity-based cluster analysis

Our event sequence similarity approach focuses on the temporal evolution of events occurring at fixed locations. In this study, an event corresponds to the COVID-19 daily incidence rate for a county, and a COVID-19 event sequence for a county is the sequence of daily incidence rates covering a specific study period. We compute the similarity of these county level COVID-19 event sequences using a time ordered Jaccard measure [23–25]. Briefly, this measure uses all co-occurrence time points between two event sequences es1 and es2, and calculates the similarity between two events at the co-occurrence timestamp based on their level of measurement. The similarity between two counties’ COVID-19 event sequences is calculated from the following terms:

simcounty(es1, es2)–Similarity between county level event sequences es1 and es2,

es1j, es2j–the event values for two corresponding co-occurring events in es1 and es2 at timestamp j.

lev(es1j), lev(es2j)–the relative event levels of two corresponding co-occurring events in es1 and es2 at timestamp j, respectively,

C–the total number of co-occurring timestamps,

Abs(lev(es1j)–lev(es2j))–absolute value of difference between relative event levels of two corresponding co-occurring events in es1 and es2 at timestamp j,

|es1 ∪ es2|–Cardinality of the union of two event sequences es1 and es2.
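One way the terms defined above could combine into a time ordered Jaccard measure is sketched below. The exact formula is an assumption inferred from the symbol definitions, not copied from the source: the sketch sums, over the co-occurring timestamps, one minus the absolute difference in relative event levels, and divides by the cardinality of the union of the two sequences. How relative levels are derived from incidence rates is not reproduced here, so the function (with the hypothetical name `sequence_similarity`) takes levels already scaled to the range 0 to 1.

```python
def sequence_similarity(lev1, lev2):
    """Similarity of two event sequences given as dict timestamp -> level in [0, 1].

    A timestamp is present in a dict only if the county has an event then.
    """
    union = set(lev1) | set(lev2)
    if not union:
        return 0.0  # assumption: two empty sequences have no similarity
    co_occurring = set(lev1) & set(lev2)
    # Each co-occurring timestamp contributes 1 for identical levels and
    # proportionally less as the levels diverge.
    matched = sum(1.0 - abs(lev1[t] - lev2[t]) for t in co_occurring)
    return matched / len(union)


# Two counties sharing one event day out of three distinct event days.
a = {1: 0.5, 2: 1.0}
b = {2: 0.5, 3: 1.0}
score = sequence_similarity(a, b)  # (1 - 0.5) / 3
```

Under this form, identical sequences score 1 and sequences with no shared event days score 0, which matches the behavior of an ordinary Jaccard index on the timestamp sets.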

We then used the computed COVID-19 event sequence similarity measures between counties as the metric for hierarchical clustering [26]. All similarity computations and clustering tasks were implemented in R. The hierarchical clustering was performed using the hclust R function with the ward.D2 linkage method. The optimal number of clusters was evaluated using the elbow method [27–29]. This method supports selection of the number of clusters at which the total within-cluster sum of squares (WSS) no longer improves appreciably. In a plot of number of clusters versus WSS, the optimal cluster number is visually associated with the point at which the WSS curve flattens.
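The elbow criterion described above is a visual judgment, but a simple numeric stand-in can illustrate it. The sketch below is not the authors' R workflow: `within_cluster_ss` and `elbow_k` are hypothetical helpers, points are plain coordinate lists, and the 10 percent threshold is an arbitrary assumption chosen for illustration.

```python
def within_cluster_ss(points, labels):
    """Total within-cluster sum of squares for labelled points (lists of floats)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def elbow_k(wss_by_k, min_gain=0.10):
    """Smallest k after which one more cluster reduces WSS by less than
    min_gain times the WSS at the smallest k considered."""
    ks = sorted(wss_by_k)
    base = wss_by_k[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if wss_by_k[k] - wss_by_k[k_next] < min_gain * base:
            return k
    return ks[-1]


# WSS drops sharply up to k = 3 and then flattens.
chosen = elbow_k({1: 100.0, 2: 60.0, 3: 30.0, 4: 27.0, 5: 25.0})
```

In practice the WSS values would come from cutting the hierarchical clustering tree at each candidate number of clusters, and the numeric choice would be checked against the plotted curve.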

Comparison of prospective space time scan and event sequence similarity-based clusters

To support comparison of the two methods, we used the counties identified by the prospective space-time scan statistic as having relative risk > 1 as the counties for analysis with the sequence similarity metric. All other counties not included in this set were labeled as OC, meaning outside clusters. We include them in Figs 3, 6, 9 and 12 in the graphs of incidence curves for each study period to show their temporal incidence pattern as a baseline.
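The selection step above amounts to a filter on relative risk followed by averaging the excluded counties into a baseline curve. A minimal sketch, with hypothetical function names and input layouts (a mapping from county FIPS code to RR for counties inside detected clusters, and one equal-length case curve per county):

```python
def select_counties(all_fips, cluster_rr, threshold=1.0):
    """Split counties into those analysed (RR > threshold) and the OC group.

    cluster_rr maps FIPS -> relative risk for counties inside scan clusters;
    counties absent from it are treated as outside clusters.
    """
    selected = sorted(f for f in all_fips if cluster_rr.get(f, 0.0) > threshold)
    outside = sorted(f for f in all_fips if cluster_rr.get(f, 0.0) <= threshold)
    return selected, outside


def mean_curve(curves):
    """Day-by-day average of equal-length case curves."""
    n = len(curves)
    return [sum(day_values) / n for day_values in zip(*curves)]


selected, outside = select_counties(
    ["01001", "01003", "01005"], {"01001": 2.4, "01003": 0.8}
)
baseline = mean_curve([[0, 1, 2], [0, 3, 4]])
```

The averaged OC curve then serves as the reference line against which each similarity cluster's average trend is read.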

Results

Space-time clusters and sequence similarity-based clusters at county level: Study period 1 (1/22-3/13/2020)

In this early period, COVID-19 was just appearing in the US with the first case reported in Snohomish County Washington on January 19. For this period, the prospective space-time scan statistic identified 11 statistically significant (p-value < 0.05) clusters shown graphically in Fig 1 and summarized in Table 1. These clusters, aside from one in California and two in New York, are generally quite large and counties within them with RR > 1 are few and generally spatially dispersed. Because of the generally large size of these clusters, identifying the spatial specificity of an outbreak is limited.

Fig 1. COVID-19 space-time scan hotspots in the United States at the county level from 1/22-3/13/2020.

https://doi.org/10.1371/journal.pone.0252990.g001

Table 1. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-3/13/2020 at the county level.

https://doi.org/10.1371/journal.pone.0252990.t001

Based on the elbow evaluation method, 8 event sequence similarity-based clusters were defined for this period (Fig 2). Fig 3 shows the map representation of these clusters along with their temporal profiles. Members of Cluster 3 that include counties in Washington State, California and New York show the earliest onset and the fastest case accumulation. Members of Cluster 5 show an early onset that initially tracks Cluster 3 but then abruptly flattens and then decreases in early March. Members of this cluster include 3 counties in California and one in Minnesota. Cluster 2 members show a delayed occurrence in cases but an extremely fast case accumulation over a few days. The 8 members of this cluster are generally in isolated rural settings in Colorado, Oklahoma, Wyoming, South Dakota, Wisconsin, Louisiana and Indiana. Members of Cluster 6 showed initiation of cases at approximately the same time as Cluster 2 but levelled off quickly at a lower incidence rate. The cluster containing counties in New York suggests initial points of entry and situations conducive to rapid acceleration of cases such as high density or tight knit communities. A pairwise comparison of cluster numbers for the 1st study period from these two approaches can be found in S1 Table.

Fig 2. Elbow method evaluation and hierarchical clustering results for the 1st period.

Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 3.

https://doi.org/10.1371/journal.pone.0252990.g002

Fig 3. Sequence similarity-based COVID-19 clusters along with average temporal trends at the county level through 3/13/2020.

This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 1. The average temporal trends of cumulative cases for STES clusters 1–8 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

https://doi.org/10.1371/journal.pone.0252990.g003

Space-time clusters and sequence similarity-based clusters at county level: Study period 2 (1/22-3/31/2020)

Results from the prospective space-time scan statistics analysis for the second study period (through March 31) identified twenty-four space-time clusters of COVID-19 as statistically significant (Fig 4 and Table 2). This period shows a growing emergence of spatial clusters across the US, but generally more consolidated clusters as the number of cases grows. The space-time clusters are smaller than in the first period and several detected clusters contain a single county (cluster radius = 0). This period shows a shift toward more clusters appearing in the interior US relative to the coasts.

Fig 4. COVID-19 space-time scan statistic detected hotspots in the United States at county level through 3/31/2020.

https://doi.org/10.1371/journal.pone.0252990.g004

Table 2. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-3/31/2020 at the county level.

https://doi.org/10.1371/journal.pone.0252990.t002

For this second study period the sequence similarity clustering resulted in 8 clusters based on the elbow method evaluation (Fig 5). Fig 6 shows the map of these clusters and their temporal signatures. For this period, only three clusters deviate from the outside cluster (OC) set pattern. Cluster 7 shows the most rapid increase in cases. Members of this cluster include Miami, San Jose, Los Angeles area counties, Chicago, Detroit, New Orleans and New York metropolitan counties. Members of Cluster 8 show a slower increase in cases. Some of these members appear in a group across New Jersey and Pennsylvania, and around Baltimore, Denver and Seattle. Cluster 4 follows a similar trajectory with some concentrations around New Orleans, Columbus, Georgia, and Indianapolis. Members of this cluster also appear in more isolated rural settings in Arizona, Oklahoma and South Dakota. A pairwise comparison of cluster numbers for the 2nd study period from these two approaches can be found in S2 Table.

Fig 5. Elbow method evaluation and hierarchical clustering results for the 2nd period.

Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 6.

https://doi.org/10.1371/journal.pone.0252990.g005

Fig 6. Sequence similarity-based COVID-19 clusters along with average temporal trends at county level during 1/22/2020-3/31/2020.

This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 4. The average temporal trends of cumulative cases for STES clusters 1–8 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

https://doi.org/10.1371/journal.pone.0252990.g006

Space-time clusters and sequence similarity-based clusters at county level: Study period 3 (1/22-4/19/2020)

For the third study period, the prospective space-time scan statistic detected 47 statistically significant clusters (p≤0.05) as shown in Fig 7. Associated cluster characteristics are shown in Table 3. In this period more clusters are emerging in the southern US, with additional new pockets in Montana and a cluster covering Nebraska and South Dakota. Metropolitan New York remains an active cluster and a more condensed Mid-Atlantic coast cluster has emerged. We see additional consolidation in the size of clusters, with 25 appearing as a single county.

Fig 7. COVID-19 space-time scan statistic detected hotspots in the United States at county level through 4/19/2020.

https://doi.org/10.1371/journal.pone.0252990.g007

Table 3. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-4/19/2020 at the county level.

https://doi.org/10.1371/journal.pone.0252990.t003

For the third study period, ten sequence similarity-based clusters were selected using the elbow method (Fig 8). Fig 9 shows the map of these clusters and their temporal profiles. Cluster 8 shows a distinct early and more rapid accumulation of cases. Many members of this cluster were members of Cluster 7 in the previous study period. These members include Chicago, Detroit metropolitan area, Miami, Philadelphia, and metropolitan New York counties. Notable members of the previous period's Cluster 7 that are missing from Cluster 8 are San Jose, Los Angeles and Las Vegas. Cluster 9 shows a group with the next most rapidly developing number of cases. Within this group, some members appear concentrated around metropolitan New York, Philadelphia, Baltimore and Washington DC, and Denver. Cluster 10, as the third most rapidly emerging cluster for this period, has members in a halo-like pattern around metropolitan New York, Philadelphia and New Orleans. Other members, however, appear in more isolated rural settings in New Mexico, Utah, and Washington State. This group includes the Hopi, Zuni, Navajo and Yakima national reservations. Two other clusters to note in this period are Cluster 7 and Cluster 2, which show later initiation times in terms of case accumulation but appear to be accelerating at the end of the study period. Many of these members show a concentration in southern Indiana and western Kentucky respectively, with another grouping of Cluster 7 members appearing in southwestern Georgia on the border with Alabama. A complete pairwise comparison of cluster numbers for the 3rd study period from these two approaches can be found in S3 Table.

Fig 8. Elbow method evaluation and hierarchical clustering results for the 3rd period.

Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 9.

https://doi.org/10.1371/journal.pone.0252990.g008

Fig 9. Sequence similarity-based COVID-19 emerging clusters along with average temporal trends at county level during 1/22-4/19/2020.

This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 7. The average temporal trends of cumulative cases for STES clusters 1–10 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

https://doi.org/10.1371/journal.pone.0252990.g009

Space-time clusters and sequence similarity-based clusters at county level: Study period 4 (1/22-5/20/2020)

For the fourth study period ending on May 20, 2020, the prospective space-time scan statistic identified 87 statistically significant clusters. Table 4 provides the characteristics of these 87 space-time clusters that were active as of May 20, 2020. From Fig 10 we can observe that in this period clusters continued to emerge in southern states and more clusters emerged in the mountain west. The previous cluster covering Nebraska and South Dakota has expanded into Iowa, North Dakota and Minneapolis. The metropolitan New York cluster has consolidated, and the prior period mid-Atlantic cluster has consolidated to an emerging cluster around Philadelphia.

Fig 10. Prospective space-time scan statistic detected clusters of COVID-19 incidents during the study period of 1/22/2020-5/20/2020.

https://doi.org/10.1371/journal.pone.0252990.g010

Table 4. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-5/20/2020 at the county level.

https://doi.org/10.1371/journal.pone.0252990.t004

In this fourth period, using the sequence similarity-based clustering, we selected 10 clusters based on the elbow method evaluation (Fig 11). Fig 12 presents a map of these clusters and their temporal signatures. In this period, Cluster 8, which includes Miami, Chicago, Detroit, Los Angeles, Philadelphia and New York metropolitan counties, is the fastest growing in terms of cases. Clusters 7 and 9 start out with similar increases in cases, but Cluster 7 members show a levelling off in early May relative to Cluster 9. Cluster 10 shows a delayed start but a steady increase starting in early April. Cluster 5 shows a different trajectory in that it shows a much slower start to case accumulation but then exhibits a sharp increase starting in mid-April, increasing more rapidly than Clusters 10 and 7. Cluster 4 initially falls below the outside cluster “OC” group but then shows a sharp jump and more rapid accumulation. More detailed information on pairwise comparison of cluster numbers for the 4th study period from these two approaches can be found in S4 Table.

Fig 11. Elbow method evaluation and hierarchical clustering results for the 4th period.

Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 12.

https://doi.org/10.1371/journal.pone.0252990.g011

Fig 12. Sequence similarity-based COVID-19 clusters along with average temporal trends at county level during 1/22-5/20/2020.

This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 10. The average temporal trends of cumulative cases for STES clusters 1–10 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

https://doi.org/10.1371/journal.pone.0252990.g012

Discussion

For this study we compared two approaches for COVID-19 surveillance. In combination, the two approaches provide complementary views that can offer a more comprehensive picture of surveillance information to further aid public health analysis and monitoring. The space-time scan statistic identifies emerging clusters as locations where the observed number of cases most exceeds the expected number of cases in space-time based on the underlying population. This approach provokes questions of why the disease is emerging at such a location during a period of time. For disease progression, where the temporal pattern is equally important, similarity in the sequence of daily incidence rates adds valuable information as it points to locations where the disease is progressing in a similar fashion. This view provokes questions of why these sometimes spatially dispersed locations are behaving in a similar way.

An initial working hypothesis for the STES sequence similarity metric in an environmental monitoring context was that locations that are spatially close are more likely to exhibit similar event sequences. While this is borne out in some instances in this pandemic context, we found that in all study periods, similar sequence patterns of COVID-19 cases can be quite spatially separated. This result suggests that spatial proximity is not always a driver of sequence similarity. It has been reported that socio-economic or demographic characteristics could explain the different transmission rates or patterns between communities and locations [30]. Because members of these clusters share similar temporal disease progressions, questions arise as to whether they share some similar underlying characteristics such as similar population density, similar populations at risk, similar changes in surveillance programs, or possibly similar intervention strategies at work.

Sequence similarity Cluster 3 in the first study period, which covers the first appearance of COVID-19 in the US, shows the earliest and fastest accumulating number of cases, suggesting initial points of entry. Members of this cluster include Snohomish and King counties in Washington State, several California counties in the San Francisco Bay area, and Bronx, Kings, Queens, Nassau, and New York counties in New York state, which align with the known entry points on the east and west coasts. Seemingly unusual members in this cluster are Johnson County, Iowa; Kershaw County, South Carolina; Williamson, Tennessee; and Douglas, Nebraska. An interesting question is why this last subgroup of locations shares a similar profile with the coastal points of entry. Sequence similarity-based Cluster 2 in the first period is another interesting collection which is very spatially dispersed. Most of the members are rural communities that include Sheridan, Wyoming; Davison, South Dakota; Jackson, Oklahoma; Hancock, Indiana; Pitkin, Colorado; Caddo, Louisiana; and Pierce, Wisconsin. The temporal profile for this group is initially flat until mid-March, at which point it shows a very rapid accumulation of cases. Such spatially dispersed cluster members that exhibit similar behaviours are targets for further investigation of potential contextual similarities. Of particular interest from epidemiological and health policy perspectives are spatially dispersed cluster members that exhibit similar flattening or decreasing patterns, as these would be interesting to explore to understand whether they have similar demographic characteristics or shared similar intervention measures.

We note that the sequence similarity clusters suggest some connections which are not conveyed by the scan statistic clusters. For example, in the third study period the scan statistic results indicate several new clusters. An examination of the sequence similarity clusters in this period indicates that several members of Cluster 10 were First Nation or tribal reservations. In other words, several of the spatially dispersed reservations across the west showed a similar onset and progression in COVID-19 cases.

Another difference between the two approaches is that the sequence similarity-based clusters, starting in the third period, begin to show evidence of a spatial diffusion effect. For example, members of Cluster 8, the cluster with the earliest and fastest accumulating case sequences, often appear to be surrounded by or in close spatial association with members of the next closest lagging group, Cluster 9. A similar pattern appears between Cluster 8 and Cluster 9 members in the fourth study period.

Recent research has pointed to different continents of origin for the introduction of COVID-19 into the US [31, 32]. Genomic epidemiology research supports the belief that isolates from China primarily seeded the original COVID-19 outbreak on the US West Coast and that European isolates seeded the pandemic in New York (and the US East Coast) [33]. Given the connectivity suggested by the sequence similarity-based approach, there may be opportunities to combine it productively with phylogenetic tracing and transmission pathway studies [34].

We recognize that both approaches can be impacted by limitations in data collection. Several publications have noted reporting lags, although these are most problematic with respect to death reports rather than daily reported case counts [35–38]. There is clearly the potential for inaccuracies in data collection covering many different jurisdictions. If, for example, reports of new cases from a jurisdiction are delayed by a day or two, this could change the similarity in the sequences of county daily case counts. However, given the length of the study periods here, we expect lags of one to two days to have a minor impact.
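The expectation that short reporting lags matter less over longer study periods can be illustrated with a small numerical sketch. It is only an illustration under stated assumptions: it uses a weighted Jaccard ratio (sum of element-wise minima over sum of element-wise maxima) as a stand-in similarity rather than the STES metric itself, and the linearly growing daily counts are synthetic. The helper names `weighted_jaccard` and `delay` are hypothetical.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two equal-length, non-negative sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero sequences are treated as identical.
    return 1.0 if denominator == 0 else numerator / denominator


def delay(seq, days):
    """Shift reports later by `days`, padding the start with zeros and keeping the length."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return ([0] * days + list(seq))[:len(seq)]


# Synthetic daily counts that grow linearly (an assumption, not observed data).
short_period = list(range(10))   # 10-day window
long_period = list(range(50))    # 50-day window

for label, counts in (("10 days", short_period), ("50 days", long_period)):
    similarity = weighted_jaccard(counts, delay(counts, 1))
    print(label, "similarity to a one-day-delayed copy:", round(similarity, 3))
```

Under these assumptions the one-day delay lowers the similarity noticeably for the 10-day window and only slightly for the 50-day window. Real case curves are not linear, so the size of the effect will differ.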

Supporting information

S1 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering with the dataset from 1/23/2020 to 3/13/2020.

The two sets of results are merged on US county FIPS codes. The table also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

https://doi.org/10.1371/journal.pone.0252990.s001

(XLSX)

S2 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering with the dataset from 1/23/2020 to 3/31/2020.

The two sets of results are merged on US county FIPS codes. The table also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

https://doi.org/10.1371/journal.pone.0252990.s002

(XLSX)

S3 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering with the dataset from 1/23/2020 to 4/19/2020.

The two sets of results are merged on US county FIPS codes. The table also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

https://doi.org/10.1371/journal.pone.0252990.s003

(XLSX)

S4 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering with the dataset from 1/23/2020 to 5/20/2020.

The two sets of results are merged on US county FIPS codes. The table also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

https://doi.org/10.1371/journal.pone.0252990.s004

(XLSX)

S5 Table. The minimal data set underlying the results described in this manuscript.

https://doi.org/10.1371/journal.pone.0252990.s005

(CSV)

References

  1. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506. pmid:31986264
  2. Saglietto A, D’Ascenzo F, Zoccai GB, De Ferrari GM. COVID-19 in Europe: the Italian lesson. Lancet. 2020;395(10230):1110–1. pmid:32220279
  3. Danon L, Brooks-Pollock E, Bailey M, Keeling MJ. A spatial model of CoVID-19 transmission in England and Wales: early spread and peak timing. medRxiv [Preprint]. 2020 medRxiv 20022566 [posted 2020 Feb 14; cited 2020 Sept 10]. Available from: https://www.medrxiv.org/content/10.1101/2020.02.12.v1.
  4. Alamo T, Reina DG, Mammarella M, Abella A. Open data resources for fighting covid-19. arXiv 200406111 [Preprint]. 2020 [posted Apr 13; last revised May 11; cited Sept 10]. Available from: https://arxiv.org/abs/04.06111.
  5. Latif S, Usman M, Manzoor S, Iqbal W, Qadir J, Tyson G, et al. Leveraging data science to combat covid-19: A comprehensive review. IEEE Transactions on Artificial Intelligence. 2020;1(1):85–103.
  6. Moorthy V, Restrepo AMH, Preziosi M-P, Swaminathan S. Data sharing for novel coronavirus (COVID-19). Bulletin of the World Health Organization. 2020;98(3):150. pmid:32132744
  7. Kulldorff M. A spatial scan statistic. Communications in Statistics-Theory and methods. 1997;26(6):1481–96.
  8. Kulldorff M. Spatial scan statistics: models, calculations, and applications. Scan statistics and applications: Springer; 1999. p. 303–22.
  9. Kulldorff M. Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2001;164(1):61–72.
  10. Kulldorff M, Heffernan R, Hartman J, Assuncao R, Mostashari F. A space-time permutation scan statistic for disease outbreak detection. PLoS Med. 2005;2(3):e59. pmid:15719066
  11. Khan D, Rossen LM, Hamilton BE, He Y, Wei R, Dienes E. Hot spots, cluster detection and spatial outlier analysis of teen birth rates in the U.S., 2003–2012. Spatial and Spatio-temporal Epidemiology. 2017;21:67–75. pmid:28552189
  12. Desjardins MR, Hohl A, Delmelle EM. Rapid surveillance of COVID-19 in the United States using a prospective space-time scan statistic: Detecting and evaluating emerging clusters. Applied Geography. 2020;118:102202. pmid:32287518
  13. Qi H, Xiao S, Shi R, Ward MP, Chen Y, Tu W, et al. COVID-19 transmission in Mainland China is associated with temperature and humidity: A time-series analysis. Science of the Total Environment. 2020:138778. pmid:32335405
  14. Jones RC, Liberatore M, Fernandez JR, Gerber SI. Use of a prospective space-time scan statistic to prioritize shigellosis case investigations in an urban jurisdiction. Public Health Reports. 2006;121(2):133–9. pmid:16528945
  15. Yih WK, Deshpande S, Fuller C, Heisey-Grove D, Hsu J, Kruskal BA, et al. Evaluating real-time syndromic surveillance signals from ambulatory care data in four states. Public Health Reports. 2010;125(1):111–20. pmid:20402203
  16. Yin F, Li X, Ma J, Feng Z. The early warning system based on the prospective space-time permutation statistic. Wei Sheng Yan Jiu (in Chinese: Journal of Hygiene Research). 2007;36(4):455–8. pmid:17953214
  17. Duczmal LH, Moreira GJ, Burgarelli D, Takahashi RH, Magalhães FC, Bodevan EC. Voronoi distance based prospective space-time scans for point data sets: a dengue fever cluster analysis in a southeast Brazilian town. International Journal of Health Geographics. 2011;10(1):29.
  18. Hohl A, Delmelle E, Desjardins M, Lan Y. Daily surveillance of COVID-19 using the prospective space-time scan statistic in the United States. Spatial and Spatio-temporal Epidemiology. 2020:100354.
  19. Li M, Shi X, Li X, Ma W, He J, Liu T. Sensitivity of disease cluster detection to spatial scales: an analysis with the spatial scan statistic method. International Journal of Geographical Information Science. 2019;33(11):2125–52.
  20. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infectious Diseases. 2020;20(5):533–4. pmid:32087114
  21. He W, Yi GY, Zhu Y. Estimation of the basic reproduction number, average incubation time, asymptomatic infection rate, and case fatality rate for COVID-19: Meta-analysis and sensitivity analysis. Journal of Medical Virology. 2020;92(11):2543–50. pmid:32470164
  22. Kulldorff M, Mostashari F, Duczmal L, Katherine Yih W, Kleinman K, Platt R. Multivariate scan statistics for disease surveillance. Statistics in Medicine. 2007;26(8):1824–33. pmid:17216592
  23. Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat. 1901;37:547–79.
  24. Sun S-B, Zhang Z-H, Dong X-L, Zhang H-R, Li T-J, Zhang L, et al. Integrating Triangle and Jaccard similarities for recommendation. PLoS One. 2017;12(8):e0183570. pmid:28817692
  25. Ayub M, Ghazanfar MA, Maqsood M, Saleem A, editors. A Jaccard base similarity measure to improve performance of CF based recommender systems. 2018 International Conference on Information Networking (ICOIN), Chiang Mai, Thailand; 2018, pp. 1–6.
  26. Ros F, Guillaume S. A hierarchical clustering algorithm and an improvement of the single linkage criterion to deal with noise. Expert Systems with Applications. 2019;128:96–108.
  27. Syakur M, Khotimah B, Rochman E, Satoto B. Integration k-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conference Series: Materials Science and Engineering. 2018;336(1):012017.
  28. Gustriansyah R, Suhandi N, Antony F. Clustering optimization in RFM analysis based on k-means. Indones J Electr Eng Comput Sci. 2020;18(1):470–7.
  29. Zambelli AE. A data-driven approach to estimating the number of clusters in hierarchical clustering. F1000Research. 2016;5. pmid:28408972
  30. Dowd JB, Andriano L, Brazel DM, Rotondi V, Block P, Ding X, et al. Demographic science aids in understanding the spread and fatality rates of COVID-19. Proceedings of the National Academy of Sciences. 2020;117(18):9696–8. pmid:32300018
  31. Gonzalez-Reiche AS, Hernandez MM, Sullivan MJ, Ciferri B, Alshammary H, Obla A, et al. Introductions and early spread of SARS-CoV-2 in the New York City area. Science. 2020;369(6501):297–301. pmid:32471856
  32. Worobey M, Pekar J, Larsen BB, Nelson MI, Hill V, Joy JB, et al. The emergence of SARS-CoV-2 in Europe and North America. Science. 2020;370(6516):564–70. pmid:32912998
  33. Deng X, Gu W, Federman S, Du Plessis L, Pybus OG, Faria NR, et al. Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California. Science. 2020;369(6503):582–7. pmid:32513865
  34. Zhang W, Govindavari JP, Davis BD, Chen SS, Kim JT, Song J, et al. Analysis of genomic characteristics and transmission routes of patients with confirmed SARS-CoV-2 in Southern California during the early stage of the US COVID-19 pandemic. JAMA Network Open. 2020;3(10):e2024191. pmid:33026453
  35. Casella F. Can the COVID-19 epidemic be controlled on the basis of daily test reports? IEEE Control Systems Letters. 2020;5(3):1079–84.
  36. Aliprantis D, Tauber K. Measuring deaths from COVID-19. Economic Commentary. 2020;18:1–7.
  37. Angelopoulos AN, Pathak R, Varma R, Jordan MI. On identifying and mitigating bias in the estimation of the COVID-19 case fatality rate. Harvard Data Science Review. 2020;Special Issue 1-COVID-19.
  38. Kogan NE, Clemente L, Liautaud P, Kaashoek J, Link NB, Nguyen AT, et al. An early warning approach to monitor COVID-19 activity with multiple digital traces in near real time. Science Advances. 2021;7(10):eabd6989. pmid:33674304