
Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

  • Oscar Beijbom ,

    obeijbom@ucsd.edu

    Affiliation Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States of America

  • Peter J. Edmunds,

    Affiliation Department of Biology, California State University, Northridge, Northridge, CA, United States of America

  • Chris Roelfsema,

    Affiliation Biophysical Remote Sensing Group, School of Geography, Planning and Environmental Management, University of Queensland, St. Lucia, QLD, Australia

  • Jennifer Smith,

    Affiliation Integrative Oceanography Division, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, United States of America

  • David I. Kline,

    Affiliation Integrative Oceanography Division, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, United States of America

  • Benjamin P. Neal,

    Affiliations Integrative Oceanography Division, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, United States of America, Catlin Seaview Survey, Global Change Institute, University of Queensland, St. Lucia, QLD, Australia

  • Matthew J. Dunlap,

    Affiliation Joint Institute for Marine and Atmospheric Research, University of Hawaii at Manoa, Honolulu, HI, United States of America

  • Vincent Moriarty,

    Affiliation Department of Biology, California State University, Northridge, Northridge, CA, United States of America

  • Tung-Yung Fan,

    Affiliation National Museum of Marine Biology and Aquarium, Checheng, Taiwan, Republic of China

  • Chih-Jui Tan,

    Affiliation National Museum of Marine Biology and Aquarium, Checheng, Taiwan, Republic of China

  • Stephen Chan,

    Affiliation Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States of America

  • Tali Treibitz,

    Affiliation Charney School of Marine Sciences, University of Haifa, Haifa, Israel

  • Anthony Gamst,

    Affiliation Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States of America

  • B. Greg Mitchell,

    Affiliation Integrative Oceanography Division, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, United States of America

  • David Kriegman

    Affiliation Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States of America

Abstract

Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time-consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys.

Introduction

Coral reefs provide habitat to a wide diversity of organisms, and substantial economic and cultural benefits to coastal communities [1,2]. These functions are threatened by global declines in coral cover caused by a wide diversity of natural and anthropogenic disturbances including global climate change and ocean acidification [3]. The decline in coral cover has been dramatic, with > 80% decrease in the Caribbean over the last four decades [4], and 1–2% loss each year in the Indo-Pacific between 1997 and 2003 [5]. These rates of decline are forecast to increase [6], and extensive surveys are urgently needed to better understand the “coral reef crisis” [7] and contextualize effective ecosystem-based management for this critical ecosystem [8].

Reef surveys have traditionally been performed in situ by scuba divers skilled in marine ecology and capable of identifying and counting taxa underwater. In situ surveys enable accurate observations, but they are time-consuming and allow only small areas of the reef to be surveyed. Additionally, they are dependent on the skill of the experts conducting the surveys, and the data are usually not available for re-analysis. In situ surveys were largely replaced by image-based surveys when high-quality underwater cameras became available at an economic price in the early 1960s [9], coincident with the proliferation of scuba as a tool for underwater research. Image-based surveys are advantageous as they allow faster data collection and provide a permanent record that can be analyzed for organism abundance [10] and demographic properties [11]. However, quantifying organisms in benthic photographs is more challenging than in situ inspection, due to limited image resolution, variable lighting conditions, water turbidity, and the inability to interact physically with the benthos. Although some efforts have been made to quantify accuracy and inter-annotator variability in surveys conducted using underwater video transects [12,13], there is little information describing the variability associated with annotations of coral reef survey images.

Since the turn of the new millennium, the application of image-based tools for marine ecology has changed dramatically in three domains. First, improvements in digital photography have allowed large numbers of images to be gathered at increasing resolution. Second, the capabilities of computers and software to manipulate, store, and analyze images have increased by orders of magnitude. Third, advances in robotics and control theory have enabled the construction of autonomous underwater vehicles (AUVs), remotely operated vehicles (ROVs), and imaging sleds that can capture thousands of survey images in a single deployment [14–17]. However, the capacity to analyze images has not advanced at a pace commensurate with the capacity to collect them. This has created a ‘manual-annotation’ bottleneck between the rapid image acquisition and the quantitative data needed for ecological analysis.

Generating quantitative ecological information from underwater images typically involves random point annotation to estimate the percent cover of the substrata of interest. To generate these data, substratum types are identified, usually manually by an expert in that ecosystem, for a number of randomly selected locations in each image. Statistically valid sampling of benthic habitats with image- and point- based tools requires careful attention to the primary purpose of the study, choice of the experimental and statistical approaches necessary to answer the questions being addressed, and the statistical power (i.e., a function of sample size, variance, and desired difference to be detected) required to test the hypotheses of interest. These experimental design components are not the focus of this paper, and interested readers are referred to the many excellent texts on these subjects [10,18]. Instead, we focus on the means and accuracy by which quantitative information can be manually and automatically extracted from underwater images using random point annotation, specifically for near-shore tropical marine environments.

Point annotations are typically performed using manual annotation software like Coral Point Count with Excel Extensions (CPCe) [19], photoQuad [20], pointCount99 [21], Biigle (biigle.de) or Catami (catami.org), which facilitate the annotation process by providing user interfaces and tools for the export of relevant data. Recently, there have been several efforts to automate point annotation of benthic survey images [22–26], but the implementation of these tools has been hindered by two issues. First, while human experts have been annotating benthic images for decades, the accuracy of this procedure (i.e., intra-annotator error) remains largely unknown, as does the extent to which results vary between annotators (i.e., inter-annotator error). This information is critical as it provides a baseline against which the efficacy of current and future automated annotation methods can be assessed. Second, there is scope to develop “hybrid” annotation modes, which automatically annotate a significant portion of the data, but defer the most uncertain identification decisions to human annotators. The value of such modes will depend on the quantitative means by which “uncertain” is evaluated, and on the appropriate trade-off between accuracy and efficiency for the question at hand.

In this paper we focus on codifying the criteria for implementing an existing automated image analysis tool [24] for coral reef survey images. Specifically, we sought to answer two questions: (1) what is the baseline variation among human experts in the analysis of benthic communities that we would hope to equal or improve upon through automated methods, and (2) what is the appropriate framework for optimizing the trade-off between the high accuracy of a human annotator and the high efficiency of an automated annotator? To answer the first question, we designed a study in which we quantified the variability among multiple experts in the analysis of benthic images from coral reefs in Moorea (French Polynesia), Nanwan Bay (Taiwan), the northern Line Islands, and Heron Reef (Great Barrier Reef), and compared their results to those obtained through computer-based automated analysis. To address the second question, we then altered the requirements for more (or less) human supervision in the modes of annotation in order to quantify the trade-off between accuracy and efficiency. Finally, a key contribution of this work is that the methods developed for automated annotations are incorporated in the random point annotation tool of CoralNet (coralnet.ucsd.edu), which is publicly available.

Materials & Methods

Digital photoquadrats from coral reefs in four locations throughout the Pacific, with 671–3472 images per location, were used to provide a diverse model system for testing our analytical tools (Table 1). These images were originally annotated by experts using 24–200 random point locations superimposed on each image. From each image set at each location, 200 images were randomly selected and designated as the Evaluation set; the remainder were designated the Reference set. The four Evaluation sets were then re-annotated independently, and without prior knowledge of the initial annotations, by each of six human experts at 10 point locations in each image, sub-sampled at random from the original point locations. Following the human annotation, an implementation of a computer vision algorithm [24] was then used to automatically annotate the same 10 points per image. The four Reference sets were used to train the automated annotator, and were made available as training material for the human experts. The multiple sets of annotations were used to evaluate intra- and inter-annotator variability and to evaluate the proposed operational modes of automated annotation. All images and data used in this study are made available at doi:10.5061/dryad.m5pr3.
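The split into Evaluation and Reference sets can be expressed as in the minimal Python sketch below. This is illustrative only (not the authors' code), and the function and variable names are hypothetical.

```python
# Hypothetical sketch of the Evaluation/Reference split described above.
import random

def split_location(image_ids, n_eval=200, seed=0):
    """Randomly designate n_eval images as the Evaluation set;
    the remaining images form the Reference set used for training."""
    rng = random.Random(seed)
    shuffled = list(image_ids)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_eval], shuffled[n_eval:]

# Example: evaluation_ids, reference_ids = split_location(moorea_image_ids)
```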

Study locations

Digital photoquadrats were obtained from projects monitoring coral reef community structure in the outer and fringing reefs of Moorea (French Polynesia), Kingman, Palmyra, Tabuaeran and Kiritimati atolls (northern Line Islands), Nanwan Bay (Taiwan), and the platform reefs at Heron Reef (Great Barrier Reef, Australia). These study locations were selected because they offered legacy data involving large numbers of images that had been annotated with equivalent random point methodologies by experts with extensive experience in identifying benthic organisms from photographs at their respective locations. In each location, multiple species of scleractinian corals, macroalgae, crustose coralline algae, and various non-coral invertebrates densely populate benthic surfaces, and photoquadrats are characterized by complex shapes, diverse surface textures, and intricate boundaries between dissimilar taxa. Additionally, water turbidity and light attenuation degrade colors and image clarity to varying degrees for the four image sets, presenting a challenging task for both manual and automated annotation. It should be noted, however, that these are all typical conditions, and these image sets represent typical survey images taken for purposes of coral reef ecology.

The four locations also represent the great variation commonly found within and among photographic surveys of shallow (< 20 m depth), Pacific coral reefs. This variation includes differences among locations in species diversity and their colony morphologies, variation in camera equipment (e.g., angle of view, and resolution), distance between camera and benthos (and whether the distance was constant among photographs), and the mechanism employed to compensate for the depth-dependent attenuation of sunlight (i.e., through the use of strobes and/or manual white balance adjustment of camera exposures). The photographs from Moorea, the Line Islands and Nanwan Bay were recorded using framers to hold the camera perpendicular to, and at a constant distance from, the sea floor. Underwater strobes were used in Moorea to restore surface color and remove shadows from images, and in the Line Islands, image-colors were adjusted through manual adjustment of the white balance for each series of images. Neither strobes nor color correction were used to record photoquadrats in Nanwan Bay. Finally, at Heron Island, the reef was recorded using a camera (without strobes or white balance correction) that was hand-held above the reef using a weighted line suspended below the camera to maintain an approximately fixed distance to the sea floor [27]. Refer to Fig 1, and S1 Fig for sample images from the locations, to Table 1 for a data summary, and to S1 Appendix for additional details on the survey locations.

Fig 1. Sample photoquadrats.

Sample photoquadrats acquired and annotated as part of long-term monitoring projects on shallow coral reefs (≤ 17-m depth). (A) Moorea, French Polynesia, acquired with strobes and framer (50 x 50 cm); (B) Palmyra, northern Line Islands, acquired using a manual white balance and framer (65 x 90 cm); (C) Nanwan Bay, Taiwan acquired with framer (35 x 35 cm) but neither strobes nor white balance; (D) Heron Reef, GBR, acquired without framer, strobes or white balance.

https://doi.org/10.1371/journal.pone.0130312.g001

Label-set

When this study began, the photoquadrats from each of the four locations had already been manually annotated by the local coral reef experts using four different label-sets defined by the respective experts. To make comparisons of annotator accuracies between locations, a consensus label-set was created to which the four original label-sets were mapped (Table 2). The consensus label-set consisted of 8 scleractinian genera and 1 ‘other scleractinians’ label; 3 algal functional groups (macroalgae, crustose coralline algae (CCA), and turf algae); and 9 other labels that included sponges, sand, and the hydrozoan Millepora. In our analysis we also considered all coral labels together as a coral functional group. The labels of the consensus label-set were chosen to enable a one-to-one or many-to-one mapping from the original label-sets in Moorea, Line Islands, and Nanwan Bay. Corals were not resolved to genus level in the original Heron Reef label-set and all Archived coral annotations were therefore mapped to the generic ‘other scleractinians’ label for this location.

Manual Annotations

As part of the present analysis, the photoquadrats from each of the four locations were manually re-annotated by six human experts. The local expert familiar with each location was designated the ‘Host’, and the other five experts with less familiarity with the specific locations were designated ‘Visitors’. All six were experts in identifying corals and benthic taxa at coral reefs in the tropical Pacific Ocean. The annotations completed by the Hosts prior to the present study as part of the original ecological analyses were denoted ‘Archived’. One to six years had passed since the original annotations were made, and thus we reasoned that the Hosts would not be biased by their own original annotations. We emphasize that there are four Hosts in this study; one for each location, and that the Hosts for each location re-annotated the same points in the same images that they had annotated previously themselves. Intra-annotator variability could thus be measured by comparing the Host and Archived annotations. All of the present annotations were performed using the random point annotation tool of CoralNet (S2 Fig). To assist the experts in the new annotations, the images and Archived annotations of the Reference sets from each location were stored in CoralNet and used as a virtual learning tool to improve identification of the benthic taxa (S2 Fig). All images were scored using 10 points per image, so with 200 images per location and 4 locations, each annotated by 6 experts, this study generated 48,000 manual point-annotations. The manual annotation effort required on average 1 minute per image, for a total annotation time of approximately 15 hours per expert. This annotation time was divided into several 1–3 hour sessions over 1–4 weeks depending on the preference of the annotator.

Automated Annotations

We previously developed an automated annotation system for coral reef survey images [24]. In this system, the texture and color of a local image patch around a location of interest (i.e., one of the randomly selected points) are encoded as a count of ‘visual words’, which is a quantization of the visual appearance space of the image. The encoded visual information is then used, together with a set of labels, to train the automated annotator, using the Support Vector Machine (SVM) algorithm [28].
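To illustrate the ‘visual words’ idea, the sketch below builds a small bag-of-visual-words encoder with scikit-learn. It is a simplified stand-in, not the texture-and-color pipeline of [24]; the descriptor choice and codebook size are assumptions for illustration only.

```python
# Simplified bag-of-visual-words sketch (illustrative; not the pipeline of [24]).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_codebook(descriptors, n_words=128, seed=0):
    """descriptors: (N, D) array of local texture/color descriptors from training patches."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(descriptors)

def encode_patch(codebook, patch_descriptors):
    """Represent one image patch as a normalized histogram of visual-word counts."""
    words = codebook.predict(patch_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```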

The method of [24] was modified in two ways to increase the accuracy and reduce the runtime. First, the vector quantization step of [24] was replaced by Fisher Encoding [29], which was recently shown to improve classification accuracy for various image classification tasks [30]. Second, the kernel-based SVM of [24] was replaced by a linear SVM, which has significantly lower runtime [28,31]. The final method used in this work required 20 seconds to pre-process each image, the training of the SVM required 5–20 minutes for each location depending on the size of the Reference set for that location, and automated annotation of an unknown image from the Evaluation set required < 1 second. The aforementioned times were measured on a single computational core.

The color and texture information of the images in the Evaluation and Reference sets was encoded as a 1920 dimensional ‘feature’ vector for the annotated point in each image [29]. The feature vectors from the Reference set were paired with the Archived annotations to form a training-set, which was used to train a one-versus-rest linear SVM (S1 Appendix). This training was performed separately for each location, so that the training-set from each location was used to train an SVM, which was then used to automatically annotate the images in the Evaluation set from the same location.
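A minimal sketch of this per-location training and annotation step is given below, assuming the 1920-dimensional feature vectors and Archived labels are already available as arrays (variable names are illustrative); scikit-learn's LinearSVC stands in for the one-versus-rest linear SVM.

```python
# Hedged sketch: one linear SVM per location, trained on Reference-set features
# paired with Archived labels, then applied to that location's Evaluation set.
from sklearn.svm import LinearSVC

def train_and_annotate(features_ref, labels_ref, features_eval, C=1.0):
    """features_*: (n_points, 1920) arrays; labels_ref: (n_points,) label strings."""
    clf = LinearSVC(C=C)                           # one-versus-rest by default
    clf.fit(features_ref, labels_ref)
    predictions = clf.predict(features_eval)
    scores = clf.decision_function(features_eval)  # per-class scores, reused by Alleviate
    return predictions, scores

# Trained separately for each location, e.g.:
# for loc in ("moorea", "line_islands", "nanwan", "heron"):
#     preds[loc], scores[loc] = train_and_annotate(X_ref[loc], y_ref[loc], X_eval[loc])
```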

Since the images in the Evaluation set were selected randomly from the set of available images, the mean percent cover of each label was similar in the Reference and Evaluation sets. This means that approximately correct cover estimates of the Evaluation set could be generated by randomly annotating point locations in the same proportions as present in the Reference set. However, in other situations where the automated annotator is trained on data from, for example, a previous year or another location, the proportions of the different types of annotations (e.g., percent coral) will likely differ between training data (here: Reference set) and test data (here: Evaluation set). The following pre-processing step was applied to ensure that any conclusions from this study, with respect to the efficacy of the automated annotator to estimate percent covers, would be valid in such situations. In this pre-processing step, a randomly selected subset of the non-coral annotations in the Reference sets was discarded, so that the proportion of coral annotations increased by 10%. For example, there were 19,566 coral and 74,634 non-coral annotations in the Moorea Reference set, and therefore 20.77% was coral. A random subset of 8,566 non-coral annotations was discarded so that 66,068 non-coral annotations remained. This resulted in a 10% increase in the estimate of coral abundance (22.85%).
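The subsampling step can be written as below; this is a sketch under the assumption that the Reference annotations are available as a boolean coral/non-coral array. Applied to the Moorea numbers above (19,566 coral, 74,634 non-coral), it keeps roughly 66,068 non-coral annotations, up to rounding.

```python
# Hedged sketch of the pre-processing step: discard a random subset of non-coral
# Reference annotations so that the coral proportion rises by 10% (relative).
import numpy as np

def subsample_noncoral(is_coral, boost=1.10, seed=0):
    """is_coral: boolean array over Reference annotations; returns indices to keep."""
    is_coral = np.asarray(is_coral, dtype=bool)
    rng = np.random.default_rng(seed)
    coral_idx = np.flatnonzero(is_coral)
    noncoral_idx = np.flatnonzero(~is_coral)
    p_new = boost * is_coral.mean()                      # target coral proportion
    n_keep = int(round(len(coral_idx) * (1.0 - p_new) / p_new))
    kept = rng.choice(noncoral_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([coral_idx, kept]))
```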

Post-processing of annotations

Two inconsistencies in the Hosts’ and Visitors’ annotations were noted and corrected. First, some expert annotators mapped points on the images that fell on transect hardware to the ‘Transect hardware’ label, while others mapped them to substratum types inferred to occur beneath the hardware, based on the substrata immediately adjacent to it. Such discrepancies were due to incomplete instructions to the experts rather than identification difficulties. Therefore, any point locations that two or more of the experts mapped to ‘Transect hardware’ were considered to be transect hardware, and all annotations for that point location were changed to ‘Transect hardware’ in post-processing.

The second inconsistency arose with the scoring of images from Moorea because neither the original label-set in this location, nor the Archived annotations, contained the ‘Bare space’ label of the consensus label-set. Instead, in this location all “bare space” was effectively covered with CCA and therefore was mapped to a ‘CCA’ label. However, in the annotation of the present study, some experts mapped these areas to ‘Bare space’ as the best match to this state in the consensus label-set. To facilitate a comparison between the multiple sets of annotations, all ‘Bare space’ annotations for the Moorea location were changed to ‘CCA’ in post-processing.

Estimating Annotator Variability

Using the Archived annotations as baseline, the accuracy of annotator a was estimated using the Cohen’s kappa statistic, denoted κ^a [32]. Additionally, κ^a_Ψ denoted the Cohen’s kappa of the binary classification task between a set of labels, Ψ (e.g., corals, macroalgae, or Acropora), and the labels not in Ψ. For example, κ^Host_coral denoted the Hosts’ accuracy for the task of discriminating between coral and non-coral. For notational brevity, the superscript is henceforth dropped when it is clear from the context which annotator is considered. A one-sample Kolmogorov-Smirnov test was performed to determine normality of κ. This test yielded p-values < 0.001 for all labels, annotators and locations, indicating non-normality and the consequent need for non-parametric tests. Differences in κ (using the full label-set) between the Hosts and Visitors were assessed using non-parametric Mann-Whitney U tests, and differences between the four functional groups (coral, macroalgae, CCA, and turf algae) were assessed using a Kruskal-Wallis test, both using the four locations as repeated trials.
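For concreteness, the two kappa variants can be computed as in the sketch below (illustrative only), using scikit-learn's cohen_kappa_score; the label group Ψ is passed as a set of label names.

```python
# Hedged sketch: Cohen's kappa against the Archived baseline, for the full label-set
# and for the binary task defined by a label group Psi (e.g., all coral labels).
from sklearn.metrics import cohen_kappa_score

def kappa_full(archived, annotator):
    return cohen_kappa_score(archived, annotator)

def kappa_group(archived, annotator, psi):
    """psi: set of labels treated as the positive class (e.g., a set of coral genera)."""
    return cohen_kappa_score([lbl in psi for lbl in archived],
                             [lbl in psi for lbl in annotator])
```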

Additionally, again using the Archived annotations as baseline, a confusion matrix, Q, was estimated for each location and annotator. Each confusion matrix Q had 20 rows and 20 columns, and the value at row r, column c indicates the fraction of annotations originally labeled r by Archived that were classified as label c by the Host, the Visitors, or the automated annotator, respectively.
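A row-normalized confusion matrix of this form can be estimated as follows (a sketch; the label order is assumed to be the 20-label consensus set).

```python
# Hedged sketch: 20 x 20 confusion matrix Q, where row r gives the fraction of points
# Archived labeled r that the compared annotator assigned to each column label c.
import numpy as np
from sklearn.metrics import confusion_matrix

def estimate_Q(archived, annotator, label_set):
    counts = confusion_matrix(archived, annotator, labels=label_set).astype(float)
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1.0)   # avoid division by zero for unused rows
```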

Modes of Automated Annotation

In this section, the proposed operational modes for semi- and fully- automated annotation are detailed.

The Alleviate Operational Mode.

A one-versus-rest automated classification procedure generates a vector of classification scores corresponding to each classification decision [33]. The score vector can be used to estimate the certainty of the automated classifier for an unseen example [34]. For example, if all scores are low and similar to each other, the classifier is more likely to assign an incorrect label [34]. In such a situation, an alternative is to let the classifier abstain from making an automated annotation decision and instead defer to a human expert. Such a procedure enables a trade-off between the amount of human effort and the accuracy of the final annotations.

The classification scores were used in an operational mode which we denote ‘Alleviate’ because it alleviates the workload for the human annotator. The scores were denoted s_{i,k}(m), where subscript i indicates the image, subscript k the point location, and m the class label. For a given point in an image in the Evaluation set, the vector of 20 scores, one for each class, was denoted [s_{i,k}]. Using Alleviate, an automated annotation was only assigned when, for a given point location, there existed a score s_{i,k}(m) such that s_{i,k}(m) > ϵ, for some threshold, ϵ. More formally, the Alleviate annotations were defined as

  y^Alleviate_{i,k} = y^Auto_{i,k} if max_m s_{i,k}(m) > ϵ, and y^Host_{i,k} otherwise,

where y^Auto_{i,k} was the automated annotation of point k in image i, and y^Host_{i,k} the corresponding Host annotation. We denoted by λ(ϵ) the ‘level of alleviation’: the fraction of samples classified by the automated annotator for a certain threshold, ϵ. A level of alleviation of λ(ϵ) = 50% was used in the final analysis of Alleviate.
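The rule above can be sketched as follows (illustrative; array names are assumptions). The helper picks a threshold ϵ that yields a desired level of alleviation λ by thresholding the top classification score per point.

```python
# Hedged sketch of Alleviate: keep the automated label when the top classification
# score clears the threshold epsilon, otherwise defer to the expert annotation.
import numpy as np

def alleviate(scores, auto_labels, expert_labels, epsilon):
    """scores: (n_points, n_classes); auto_labels/expert_labels: (n_points,) label arrays."""
    confident = scores.max(axis=1) > epsilon
    merged = np.where(confident, auto_labels, expert_labels)
    return merged, confident.mean()             # annotations and the realized lambda

def threshold_for_level(scores, target_lambda=0.5):
    """Choose epsilon so that roughly a fraction target_lambda of points are auto-annotated."""
    top = np.sort(scores.max(axis=1))
    return top[int((1.0 - target_lambda) * (len(top) - 1))]
```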

The same alleviation procedure was also used, in one analysis, to combine the Visitors’ and the automated annotations. However, unless explicitly stated otherwise, Alleviate denoted the combination of the Hosts’ and the automated annotations throughout this paper.

The Abundance Operational Mode.

While the automated annotations are, in general, less accurate than those provided by human experts [23,24,26], they have the benefit of being, in contrast to a human expert, deterministic, meaning that the automated annotator will always make the same classification decision for the same image. If the confusion matrix is known for the automated annotator, it can be used to generate annotations that are accurate in aggregate (such as abundance of taxa) even though individual decisions might be erroneous [35,36]. The corrected abundances will, by design, be unbiased, but with larger variation than those based on the automated annotations [35].

Using the abundance correction method of [35,36], a fully automated deployment mode, Abundance, was defined. In this mode, cover estimates for each image i were calculated as

  ĉ_i = (Q′)^(−T) c_i,   (1)

where c_i(m) denoted the cover in image i of class m based on the automated annotations, Q′ was a confusion matrix estimated through a 20-fold cross-validation on the Reference set, and the −T superscript denotes matrix transpose followed by inverse. Note the difference between Q′, which was estimated for the automated annotator from the Reference set and used in Abundance, and Q, which was estimated for all annotators from the Evaluation sets and used to evaluate the final performance.
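Eq 1 amounts to solving a small linear system per image, as in the sketch below (illustrative; assumes the row-normalized confusion matrix Q′ and the automated cover vector are given as numpy arrays).

```python
# Hedged sketch of the Abundance correction (Eq 1): multiply the automated cover
# vector by the inverse transpose of the cross-validated confusion matrix Q'.
import numpy as np

def abundance_correct(cover_auto, Q_prime):
    """cover_auto: (n_classes,) automated cover fractions for one image;
    Q_prime: (n_classes, n_classes) row-normalized confusion matrix."""
    return np.linalg.solve(Q_prime.T, cover_auto)   # equivalent to inv(Q'.T) @ cover_auto
```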

Performance evaluation based on cover estimates

The estimates of ecological composition based on the operational modes were contrasted with the estimates based on the Archived, Hosts’, and Visitors’ annotations. For each annotator (or mode) a, the cover of label m in image i was denoted c^a_i(m), defined as the fraction of the annotated points in image i assigned label m:

  c^a_i(m) = (1/10) Σ_k 1[y^a_{i,k} = m].

In addition, c^Abundance_i(m) was derived from the automated annotations using Eq 1. The pairwise differences from the Archived cover estimates for each image were denoted

  e^a_i(m) = c^a_i(m) − c^Archived_i(m).

The average estimation error (bias) for annotator (or mode) a and label m over the 200 images in a certain location was denoted

  ē^a(m) = (1/200) Σ_i e^a_i(m).

The Mean Absolute Error (MAE) was defined as the mean absolute value of ē^a(m), calculated for a particular substratum and group of annotators across the four locations (e.g., the macroalgae cover as estimated by the Hosts).
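The quantities above translate directly into code; the sketch below assumes per-image label arrays (10 points per image) for an annotator and for Archived, with names chosen for illustration.

```python
# Hedged sketch of the cover-based error metrics defined above.
import numpy as np

def cover(labels_per_image, m):
    """Fraction of points per image assigned label m; labels_per_image: (n_images, n_points)."""
    return (np.asarray(labels_per_image) == m).mean(axis=1)

def location_bias(annotator_labels, archived_labels, m):
    """Average estimation error for one annotator and label over a location's images."""
    diffs = cover(annotator_labels, m) - cover(archived_labels, m)   # e_i(m) per image
    return diffs.mean()

# MAE for a substratum and annotator group across locations, e.g.:
# mae = np.mean([abs(location_bias(a[loc], arch[loc], "CCA")) for loc in locations])
```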

The differences were also used to test the null hypothesis that, for each label, annotator, and dataset, the mean of e^a_i(m) is zero. A one-sample Kolmogorov-Smirnov test was first performed to determine normality of e^a_i(m). This test yielded p-values < 0.001 for all labels (or label groups), annotators, and locations, indicating non-normality and the consequent need for a non-parametric test. A permutation t-test was therefore performed at the 95% confidence level with a Bonferroni correction for eight comparisons (Alleviate, Abundance, Host, and 5 Visitors) [37]. The differences were also used to estimate a 95% confidence interval for all contrasts using the percentile-t bootstrap procedure [37].
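For illustration, a sign-flip permutation test of the zero-mean null hypothesis and a plain percentile bootstrap interval can be written as below. Note this is a simplified sketch; the study used a permutation t-test and a percentile-t bootstrap [37].

```python
# Hedged, simplified sketch: sign-flip permutation test and percentile bootstrap CI
# for the per-image cover differences e_i(m) of one annotator, label, and location.
import numpy as np

def permutation_pvalue(diffs, n_perm=10000, seed=0):
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diffs)))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    return (perm_means >= observed).mean()

def bootstrap_ci(diffs, n_boot=10000, alpha=0.05, seed=0):
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    means = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```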

CoralNet

As a key contribution of this work, we make implementations of the described modes of operation (Alleviate, Abundance, and Refine (S1 Appendix)) available on CoralNet (coralnet.ucsd.edu). CoralNet is a software module that has been designed as a repository and online annotation tool for benthic survey images, and it allows users to upload and annotate survey images with a user-defined label-set. Annotations are performed using a random point annotation interface similar to that of CPCe [19]. CoralNet also offers tools for browsing images and annotations as well as for viewing and exporting estimated cover statistics. The implementation of Alleviate allows users to determine the level of alleviation based on the accuracy of the automated annotator, and Refine is implemented with 5 suggested labels per point location.

Results

Annotator Accuracy

Substratum identification accuracies.

Annotator accuracy was measured as Cohen’s kappa, κ, relative to the Archived annotations. The Hosts’ accuracies differed among the functional groups (H = 10.08, df = 3, p = 0.018). Specifically, the accuracy for each functional group was: κcoral = 89.7±1.2%, κmacroalgae = 71.0±4.0%, κCCA = 51.0±9.3%, and κturf algae = 61.6±6.1% (mean ± SE, n = 4 locations, Fig 2, S1 Table), with the majority of confusion occurring among algal groups (S3 Fig). The Hosts’ accuracy for common coral genera (i.e., with > 10 Archived annotations) was κcoral genera = 79.4±4.2% (n = 19, S1 Table). Coral genera were most commonly confused with other coral genera, in particular with the general label ‘other scleractinian’. However, there was also notable confusion between turf algae and CCA (S3 Fig).

Fig 2. Annotation accuracy.

Accuracy as measured by Cohen’s kappa for the Hosts’ annotations at various levels of alleviation, and for the Visitors’ (mean ± SD, n = 5) annotations. The left edge of the alleviation curve corresponds to automated annotations only, and the right edge corresponds to the Hosts’ annotations only. The first column shows accuracy calculated from the full label-set. Columns two to five show accuracy for the task of discriminating functional groups: coral, macroalgae, crustose coralline algae (CCA), and turf algae, respectively, versus the rest. Columns six, seven, and eight show accuracy for the three dominant coral genera. Genus-level identification for Heron Reef is not provided, as Archived annotations were not available at this resolution.

https://doi.org/10.1371/journal.pone.0130312.g002

Relationship between abundances and identification accuracies.

The Hosts’ accuracies were not correlated with the benthic covers (r = 0.089, df = 33, p = 0.61), although there were a few outliers recorded at low (< 5%) cover (Fig 3A). The annotation accuracy of corals (when considered as a group) was > 80% in all four locations; accuracy of algal groups was > 60%, except for three samples with < 12% cover; and accuracy of coral genera was generally > 80%, except for six samples with < 3% cover.

Fig 3. Comparison between accuracy and percent cover.

Comparison between mean percent cover and annotation accuracies, displayed as (A) Cohen’s kappa for the Hosts, and (B) the difference in Cohen’s kappa between Alleviate and the Hosts, plotted against the percent cover of coral genera and functional groups. Data are drawn from all four locations, except for the coral genera, for which data were not available for Heron Reef.

https://doi.org/10.1371/journal.pone.0130312.g003

Inter-annotator variability.

The Visitors’ annotation accuracies were lower than the Hosts’ (U = 26, n = 4, p = 0.029). Specifically, the accuracy for each functional group was: κcoral = 84.0±1.0%, κmacroalgae = 58.7±2.5%, κCCA = 35.5±3.7%, and κturf algae = 43.3±3.6% (mean ± SE, n = 20, Fig 2, S1 Table). The Visitors’ accuracy for common coral genera (i.e., with > 10 Archived annotations) was κcoral genera = 58.6±2.9% (n = 95, S1 Table). As with the Hosts, the principal confusion occurred among the algal groups, and the principal confusion among coral genera occurred with the ‘other scleractinian’ label (S3 Fig). The Visitors’ lower accuracy was also evident in the lower weight along the confusion-matrix diagonals compared with those of the Hosts (S3 Fig).

Modes of Automated Annotation

The accuracy of the automated annotation method was: κcoral = 63.5±4.3%, κmacroalgae = 48.5±11.7%, κCCA = 28.3±16.3%, and κturf algae = 43.8±4.2% (mean ± SE, n = 4, Fig 2, S1 Table). This drop in accuracy compared to the Hosts (e.g., over 20% for κcoral) suggested that a direct application of the automated annotation method of [24] would not yield reliable cover estimates of benthic substrata. Instead, performance of the automated annotation method was evaluated within the context of the proposed operational modes.

Semi- automated annotation using Alleviate.

Alleviate is a semi-automated annotation mode, in which the automated annotator has the option of deferring difficult decisions to a human expert annotator. The ratio of points that are automatically classified was denoted ‘level of alleviation’, λ. Because the expert annotators were more accurate than the automated annotator, the accuracy of Alleviate increased as λ decreased, and the highest accuracies were observed when λ = 0%, i.e. when all annotations were done by the Hosts (Fig 2). However, the trade-off curves between accuracy and automation allow for large alleviation with sustained high accuracy (Fig 4). Specifically, using the full training-set of available expert-annotated images in the present study, λ = 38–55% incurred < 5% decrease in κcoral compared to the Hosts’ annotations (Fig 4A). Similarly, when alleviating the Visitors’ annotations, λ = 38–65% incurred < 5% decrease in κcoral (Fig 4B). Moreover, a supplementary analysis established that λ = 32–45% incurred < 5% decrease in accuracy when only 5,000 manual point-annotations were used for training, which corresponds to ~ 25–200 training images for each of the four locations (S1 Appendix).

Fig 4. Alleviation levels for Hosts and Visitors.

Accuracy as measured by Cohen’s kappa for the task of discriminating corals from other labels, κcoral at various levels of Alleviation for the four studied locations. The subplots indicate: (A) accuracy of combined Hosts’ and automated annotations, and (B) accuracy of combined Visitors’ and automated annotations, both compared against the Archived annotations. The surfaces in (B) indicate the maximum and minimum combined accuracy among the five Visitors, and the solid lines indicate the mean. The black x on each curve indicates the point where κcoral was 5% lower than its maximum value (i.e. a 5% drop compared to the κcoral of the (A) Hosts, and (B) the mean of the Visitors).

https://doi.org/10.1371/journal.pone.0130312.g004

The Mean Absolute Errors (MAEs) of cover estimates obtained using Alleviate at λ = 50%, compared to cover estimates from the Archived annotations, were 1.3 ± 0.4% for coral, 2.0 ± 0.8% for macroalgae, 3.6 ± 1.0% for CCA, and 5.6 ± 1.7% for turf algae (mean ± SE, n = 4 locations, Fig 5). The 1.3% coral cover MAE corresponds to a relative error of ~ 4–6% at the 22–31% coral cover recorded across the four locations (Fig 6). These results should be viewed in the context of the Hosts’ MAEs, which were 1.3 ± 0.6% for coral, 1.5 ± 0.4% for macroalgae, 3.4 ± 2.0% for CCA, and 5.0 ± 1.6% for turf algae (mean ± SE, n = 4, Fig 5); and the Visitors’ MAEs, which were 2.2 ± 0.4% for coral, 4.2 ± 1.2% for macroalgae, 7.6 ± 2.1% for CCA, and 11.9 ± 2.0% for turf algae (mean ± SE, n = 20, Fig 5). The Alleviate MAE for the three dominant coral genera found in Moorea, the Line Islands, and Nanwan Bay was 0.5 ± 0.2% (n = 9, Fig 5). Again, this was similar to the Hosts’ MAE of 0.5 ± 0.2% (n = 9, Fig 5), and lower than the Visitors’ MAE of 1.3 ± 0.3% (n = 45, Fig 5). Notably, the Alleviate cover estimates for the three dominant coral genera in each location differed from the Archived cover estimates for only one genus: Pocillopora in Nanwan Bay (Fig 6, S2 Table). Alleviate was also not different from the Archived cover estimates with respect to coral cover as a functional group in any of the locations (Fig 6, S2 Table).

Fig 5. Percent cover estimation differences.

Differences in percent cover estimates between the Host and Visiting experts as well as the Abundance and Alleviate operational modes and the Archived annotations. Differences are displayed as mean with 95% confidence interval (n = 200 images) for functional groups (coral, macroalgae, CCA, and turf) and the three most abundant coral genera in each location: (A) Moorea, (B) the northern Line Islands, (C) Nanwan Bay, and (D) Heron Reef. Archived describes reference annotations, performed by a local expert in each location, against which the other results were evaluated. Abundance and Alleviate refer to fully and semi-automated annotation modes. Host refers to re-annotation by the same local expert. Visitors refer to annotations completed by five coral biology experts who do not regularly work in those locations. Genus-level identification for Heron Reef is not provided, as Archived annotations were not available to this resolution.

https://doi.org/10.1371/journal.pone.0130312.g005

Fig 6. Percent cover estimates.

Percentage cover estimates of functional groups (coral, macroalgae, CCA, and turf algae) and the three most abundant coral genera in each location as determined by different annotation methods. Values displayed as mean ± SE (n = 200 images) for each location: (A) Moorea, (B) the northern Line Islands, (C) Nanwan Bay, and (D) Heron Reef. Archived describes reference annotation, performed by a local expert in each location, against which the other results were evaluated. Abundance and Alleviate refer to fully and semi-automated annotation modes. Host refers to re-annotation by the same local expert. Visitors refer to annotations completed by five coral ecology experts who do not regularly work in the locations. Yellow dots indicate significant differences at 95% confidence between cover estimates from Archived versus other annotations for that label and dataset. Genus-level identification for Heron Reef is not provided, as Archived annotations were not available to this resolution.

https://doi.org/10.1371/journal.pone.0130312.g006

The difference in accuracy κAlleviate − κHost (at λ = 50%) was not correlated with benthic cover (r = 0.25, df = 33, p = 0.14, Fig 3B). The differences for coral as a group were around -5% for the four locations, and between -9% and +3% for the coral genera (a positive difference means that κAlleviate was higher than κHost). The only exception was Platygyra, where the difference was -28.0% at 1.6% cover and -15.3% at 0.5% cover in Nanwan Bay and the Line Islands, respectively. The differences for algal functional groups were > -10%, except for macroalgae and CCA in Nanwan Bay, where the differences were -19.8% at 6.5% cover, and -15.6% at 3.9% cover, respectively.

Fully automated annotation using Abundance.

Using the abundance correction method of [35], the Abundance operational mode can be deployed without any manual annotations. The MAE of cover estimates obtained in the Abundance mode was 1.8 ± 1.5% for coral, 3.4 ± 2.3% for macroalgae, 4.1 ± 2.1% for CCA, and 7.2 ± 4.5% for turf algae (mean ± SE, n = 4, Fig 5). The Abundance cover MAE of the three dominant coral genera of Moorea, the Line Islands, and Nanwan Bay was 1.5 ± 0.5% (n = 9, Fig 5). As with Alleviate, these errors should be viewed in the context of the human errors.

Discussion

Annotator Accuracy

We have estimated the inter- and intra- annotator variability of human experts for annotating coral reef survey images. Quantifying this variability is critical to contextualize the performance of automated annotation methods. For example, if the accuracy of human experts were low for a certain substratum, the automated annotation accuracy would be expected to be equally low (since automated annotation methods are commonly trained on archived, manually annotated data). Conversely, for labels where the human annotator accuracy is high (e.g., for coral genera), our results establish baselines against which newer generations of automated annotation systems can be compared.

Intra-annotator variability.

Annotator accuracy was measured using Cohen’s kappa (κ) for intra- and inter- annotator agreement. While κ is widely used to assess annotator agreement for categorical data [38–40], there is no absolute scale against which values of κ can be gauged. For example, [41] characterized κ < 0 as indicating no agreement, 0–20% as slight agreement, 21–40% as fair agreement, 41–60% as moderate agreement, 61–80% as substantial agreement, and > 81% as almost perfect agreement. In contrast, [42] characterized κ < 40% as poor agreement, 40–75% as fair-to-good agreement, and > 75% as excellent agreement. Using these interpretations, the Host κcoral of 89.7±1.2% should be considered an “excellent” or “almost perfect” agreement between the Host and Archived annotations. Similarly high accuracies have previously been noted for the self-consistency of human annotations of corals versus other substrata [13,43]. Ninio et al. investigated accuracy (as defined by agreement with in-situ observations) for video transects from coral reefs, and observed 96% accuracy for identifying hard corals [13].

The intra-annotator accuracy of identifying algal substrata was lower than for corals. In particular, for CCA and turf algae, Cohen’s kappa was 51.0±9.3% and 61.1±6.1%, respectively, although this agreement can still be considered ‘moderate’ or ‘fair to good’ [41,42]. The lower accuracy of algal classification may be due to methodological limitations in discriminating algae in planar RGB photoquadrats at the resolutions used in the present study (Table 1), but also to the label definitions of the consensus label-set. For example, the consensus label-set defined turf algae as a “multi-specific assemblage < 1 cm in height”, while macroalgae were “larger algae > 1 cm in height” (Table 2). Given the planar nature of the photoquadrats, such distinctions can be hard to apply consistently across photographs. Ninio et al. observed 80% annotation accuracy for algae (as a ‘main benthic’ group) [13], which is higher than our estimates of accuracy for the algal functional groups. However, comparing accuracy for algae as a ‘main benthic’ group with accuracy for algal functional groups (macroalgae, turf algae, and CCA) is misleading, since the majority of the algal classification errors occurred among the algal groups (S3 Fig). Merging the algal functional groups used in this study, the Host accuracy of discriminating algae (macroalgae, turf algae, and CCA combined) versus everything else was κalgae = 71.1±1.6% (mean ± SE, n = 4), which is similar to the 80% accuracy recorded in [13].

The Hosts’ accuracies were not correlated with the abundance of the substratum of interest, and were > 60%, except for five rare substrata occupying < 5% of the benthos (Fig 3). The only additional exception was turf algae in Moorea, with 43% accuracy at 11.3% cover. These results agree with the findings of Ninio et al., who noted highly variable precision when mean covers were < 3%, but “markedly decreased” variability at higher covers (i.e., > 3%) [13].

Inter-annotator variability.

Inter-annotator accuracy was lower than intra-annotator accuracy. Differences were particularly large for CCA and turf algae, where the accuracies were 35.5±3.7% and 43.3±3.6%, respectively, which can be considered “fair” [41,42]. This lower accuracy resulted in cover estimation errors by the Visitors that commonly exceeded 10% (Fig 5). These results contrast with those of Ninio et al., where the inter-annotator variability contributed only ± 0.5% to the confidence intervals for the benthic group ‘Algae’ [13]. We believe this contrast has two principal causes. First, the multiple human annotators of Ninio et al. were all familiar with, and trained in, the ecology of the study location [13], while the Visitors of the present study were less familiar with the local ecology and had not (with a few exceptions) been physically present at the respective locations. Second, as noted above, Ninio et al. did not report cover estimation for algal functional groups, but only in aggregate. Still, it is clear from our results that rigorous training of expert annotators is critical to achieve reliable manual annotation of the turf and CCA algal groups, in particular where multiple experts are involved in the annotation process. An alternative is to use complementary imaging techniques, such as fluorescence photography [44,45], to make the annotation task less ambiguous.

Modes of Automated Annotation

Our results indicate that an automated annotator, deployed in our semi-automated annotation mode, Alleviate, can make 50% of the annotation decisions without affecting the quality of the percent cover estimates (Figs 5 and 6), even when limited data are available for training the automated annotator (S1 Appendix). In particular, Alleviate cover estimates for the three dominant coral genera in each location, and for the coral functional group, were not different from the Archived cover estimates (Fig 6). The only exception was Pocillopora in Nanwan Bay, which may be due to an erroneous Archived cover estimate, since the Host’s, and three of the Visitors’, cover estimates also differed (Fig 6). Since the investigated locations exhibit a wide variety of photographic methodologies and habitat structures (Table 1), we believe these results generalize to other coral reef survey locations, suggesting a wide applicability of Alleviate for reducing manual annotation work. We further believe that the deployment of Alleviate may increase the accuracy of the human annotators by allowing them to focus their attention on a subset of the annotation decisions. This effect has been demonstrated for semi-automated annotation of plankton samples [46], but would need to be verified in a future user-study for the present application.

Differences in accuracy between Alleviate and the Hosts were generally small (< 10%), and not correlated with the substratum abundances (Fig 3). Two outliers were noted. First, the accuracy of Alleviate for Platygyra was 15–29% lower than for the Hosts (recorded at < 2% cover). This may be due to the limited amount of available training data for Platygyra (due to the low cover), and the visual similarity to other massive or encrusting corals with meandroid or submeandroid corallum morphologies. Second, the accuracy of Alleviate was 20% and 16% lower for macroalgae and CCA in Nanwan Bay, respectively (both recorded at < 7% cover). This may be due to a combination of the limited amount of available training data and limitations in the photographic methodology, indicated by the low accuracy of the Host: 60.4% and 30.3% for macroalgae and CCA, respectively (Fig 2, S1 Table).

Alleviate uses the classification scores to decide when to make an automated annotation and when to defer to a human expert. To the best of our knowledge, this approach has not been utilized for annotation of coral reef survey images. However, a similar approach was used for classification of plankton images acquired using imaging in-flow cytometry, where the specificity of the classifier increased for all classes when only considering decisions where the classifier was > 65% confident [47,48].

We have also evaluated the efficacy of a fully automated annotation mode, Abundance. Our results indicate that Abundance generates unbiased cover estimates, but with larger standard errors (Fig 5), which is to be expected from the design of the abundance correction method [35]. Large standard errors are undesirable, but can be reduced by collecting and automatically analyzing more images. In fact, the introduction of fully automated annotation can inform the underlying random sampling survey design: for example, more images of coral reefs could be collected in the field (which is often relatively easy and cost effective) in order to compensate for the larger variance of fully automated annotation modes such as Abundance [49].

As with Alleviate, the underlying principle of Abundance has been utilized in other applications. Indeed, it was originally proposed for automated plankton classification [35], and it has been commonly utilized for that purpose [47]. It has also been utilized for sentiment analysis from text corpora [36].

In order for Abundance to generate unbiased cover estimates, the confusion matrix for the data that is being sampled must be known [35]. In our experiments, the training data (i.e., the Reference Sets) were drawn from the same underlying probability density as the sampled data (i.e., the Evaluation Sets), and the confusion matrices could thus be estimated from the training data. If, however, the training data and the sampled data are drawn from different locations, sites, or years, the probability densities may be different [50,51]. In such situations a subset of the sampled data must be annotated in order to estimate the confusion matrix, and Abundance can no longer be considered fully automated. However, it may still offer significant savings compared to Alleviate or fully manual annotation.

Conclusions

We have established baseline accuracies of human expert annotations of coral reef survey images, and compared them to a recent method for automated annotation. Our results indicate that the accuracy of human experts varies with the type of benthic substrata. Annotations of coral genera have low inter- and intra- annotator variability, while annotations of algal groups, in particular turf and CCA algae, from those same survey images, have much larger intra- and inter- annotator variability. This suggests the need for development of photographic methodology for visualizing turf and CCA algal groups, and for rigorous training of expert annotators.

We have proposed two modes of operation in which methods for automated annotation can be deployed semi- or fully- automated. Our results indicate that cover estimates from the semi-automated annotation mode, Alleviate, are of similar quality to cover estimates from manual annotations, while the cover estimates of the fully-automated mode, Abundance, are unbiased but have higher variance. The appropriate deployment mode will depend on the specific application. A reef manager, for example, might utilize Abundance for rapid assessment of reef health, while a benthic ecologist might prefer Alleviate, which requires greater manual effort but is more accurate. We expect the reduction of annotation time to continue as improved methods for automated annotation become available. Implementations of the proposed modes of operation (Alleviate, Abundance, and Refine (S1 Appendix)) are available to the public on CoralNet (coralnet.ucsd.edu).

Supporting Information

S1 Appendix. Supplementary analysis and information.

Content: (1) Details of Coral Reef Survey Locations; (2) Classification Using Linear Support Vector Machines; (3) Importance of Training Size for Alleviate; (4) Refine: a Supplementary Operational Mode. S1 Appendix also includes figures related to the appendix.

https://doi.org/10.1371/journal.pone.0130312.s001

(PDF)

S1 Fig. Sample photoquadrats.

Sample photoquadrats drawn from the long-term coral reef projects that served as test data for the present analysis. (First row) 4 photoquadrats (50 × 50 cm) from Moorea; (second row) 3 photoquadrats (65 × 90 cm) from the Line Islands; (third row) 4 photoquadrats (35 × 35 cm) from Nanwan Bay (Taiwan); (bottom row) 3 photoquadrats (50 x 65 cm) from Heron Reef (GBR).

https://doi.org/10.1371/journal.pone.0130312.s002

(PDF)

S2 Fig. Screenshots from CoralNet.

A) Graphical user interface used to create the Hosts’ and Visitors’ annotations. B) Browse tool used by the Visitors to learn about the images and the label-set from previous annotations. The screen-shot shows the result of a user searching for all Pocillopora annotations from the Line Islands dataset.

https://doi.org/10.1371/journal.pone.0130312.s003

(PDF)

S3 Fig. Confusion matrices.

Confusion matrices for Moorea, Line Islands, Nanwan Bay and Heron Reef. Values at row r, column c in the matrices indicate the ratio of annotations originally labeled by Archived as label r now classified by the Host, Visitors, and automated annotator, respectively as label c. The numbers on the right margin indicate the total count of each row. For brevity, all annotations of the Visitors are merged into a single confusion matrix and only labels for which more than 10 annotations were assigned by any of the annotators were included.

https://doi.org/10.1371/journal.pone.0130312.s004

(PDF)

S1 Table. Tabulated accuracies.

Annotation accuracies as measured by Cohen’s kappa for the automated annotator (Aut.), Alleviate at λ = 50% (ALL.), Hosts, and Visitors. All accuracies are measured as compared to the Archived annotations. The first row in each location is the accuracy of the full confusion matrix, while the other rows indicate the accuracy of binary classification between the indicated label or label group and the other labels. These labels or label groups are: functional groups coral, macroalgae, crustose coralline algae (CCA), and turf algae, followed by the dominant coral genera (i.e. with > 10 Archived annotations), and the hydrozoan Millepora if present in that location. The coral genera are ordered by percent cover based on the Archived annotations. Note that coral genera are not included for Heron Reef because the annotations were not resolved to genus level in the original study. The rightmost column shows the percent cover based on the Archived (Arch.) annotations.

https://doi.org/10.1371/journal.pone.0130312.s005

(PDF)

S2 Table. Tabulated p-values.

Probabilities that the estimated cover from a set of annotations and for a certain label and location is the same as the cover estimated from the Archived annotations. Probabilities (p-values as estimated from the permutation t-test) in red italics are below 0.05 / 8 = 0.00625, which means that the null hypothesis can be rejected at a 95% confidence level with a Bonferroni correction for eight repeated measurements. This implies that the particular set of annotations is unreliable for cover estimation. For each location, the first four rows are the functional groups: coral, macroalgae, crustose coralline algae (CCA), and turf algae, followed by the dominant coral genera (i.e. with > 10 Archived annotations), and the hydrozoan Millepora if present in that location. The coral genera are ordered by percent cover based on the Archived annotations. Note that coral genera are not included for Heron Reef because the annotations were not resolved to genus level in the original study. Columns ABU and ALL are the Abundance and Alleviate annotation modes respectively, and V1 to V5 are the Visitors. The rightmost column shows the percent cover based on the Archived (Arch.) annotations.

https://doi.org/10.1371/journal.pone.0130312.s006

(PDF)
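For readers unfamiliar with how such p-values can be obtained, the sketch below illustrates one common permutation-test variant: per-quadrat cover differences between two paired sets of annotations are randomly sign-flipped to build a null distribution of the mean difference. This is an illustration of the general approach under that assumption, not a restatement of the exact test statistic used in the study; the names paired_permutation_pvalue, cover_a, and cover_b are hypothetical.

import numpy as np

def paired_permutation_pvalue(cover_a, cover_b, n_perm=10000, seed=0):
    # cover_a, cover_b: per-quadrat cover estimates (%) for one label from two
    # annotation sets of the same quadrats. Under the null hypothesis of equal
    # cover, swapping the two annotations within a quadrat is equally likely,
    # which corresponds to flipping the sign of the per-quadrat difference.
    rng = np.random.default_rng(seed)
    diffs = np.asarray(cover_a, dtype=float) - np.asarray(cover_b, dtype=float)
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)
        if abs((signs * diffs).mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Bonferroni-corrected significance threshold for eight comparisons per location:
alpha_corrected = 0.05 / 8   # = 0.00625, the cutoff used in S2 Table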

Author Contributions

Conceived and designed the experiments: OB PJE. Performed the experiments: CR JS DIK BPN MJD VM TYF CJT SC. Analyzed the data: OB PJE TT AG BGM DK. Contributed reagents/materials/analysis tools: PJE CR JS TYF. Wrote the paper: OB PJE CR DIK BPN TT DK. Designed the software used in analysis: OB SC.
