Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Overcoming the problem of multicollinearity in sports performance data: A novel application of partial least squares correlation analysis

  • Dan Weaving ,

    Roles Conceptualization, Formal analysis, Methodology, Project administration, Writing – original draft, Writing – review & editing

    d.a.weaving@leedsbeckett.ac.uk

    Affiliations Institute for Sport, Physical Activity and Leisure, Leeds Beckett University, Leeds, West Yorkshire, United Kingdom, Leeds Rhinos Rugby League club, Leeds, United Kingdom

  • Ben Jones,

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliations Institute for Sport, Physical Activity and Leisure, Leeds Beckett University, Leeds, West Yorkshire, United Kingdom, Yorkshire Carnegie Rugby Union club, Leeds, United Kingdom, The Rugby Football League, Leeds, United Kingdom

  • Matt Ireton,

    Roles Data curation, Writing – review & editing

    Affiliations Institute for Sport, Physical Activity and Leisure, Leeds Beckett University, Leeds, West Yorkshire, United Kingdom, Warrington Wolves Rugby League club, Warrington, United Kingdom

  • Sarah Whitehead,

    Roles Data curation, Methodology, Writing – review & editing

    Affiliations Institute for Sport, Physical Activity and Leisure, Leeds Beckett University, Leeds, West Yorkshire, United Kingdom, Leeds Rhinos Rugby League club, Leeds, United Kingdom

  • Kevin Till,

    Roles Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Institute for Sport, Physical Activity and Leisure, Leeds Beckett University, Leeds, West Yorkshire, United Kingdom, Leeds Rhinos Rugby League club, Leeds, United Kingdom, Yorkshire Carnegie Rugby Union club, Leeds, United Kingdom

  • Clive B. Beggs

    Roles Conceptualization, Formal analysis, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Institute for Sport, Physical Activity and Leisure, Leeds Beckett University, Leeds, West Yorkshire, United Kingdom

Abstract

Objectives

Professional sporting organisations invest considerable resources collecting and analysing data in order to better understand the factors that influence performance. Recent advances in non-invasive technologies, such as global positioning systems (GPS), mean that large volumes of data are now readily available to coaches and sport scientists. However analysing such data can be challenging, particularly when sample sizes are small and data sets contain multiple highly correlated variables, as is often the case in a sporting context. Multicollinearity in particular, if not treated appropriately, can be problematic and might lead to erroneous conclusions. In this paper we present a novel ‘leave one variable out’ (LOVO) partial least squares correlation analysis (PLSCA) methodology, designed to overcome the problem of multicollinearity, and show how this can be used to identify the training load (TL) variables that influence most ‘end fitness’ in young rugby league players.

Methods

The accumulated TL of sixteen male professional youth rugby league players (17.7 ± 0.9 years) was quantified via GPS, a micro-electrical-mechanical-system (MEMS), and players’ session-rating-of-perceived-exertion (sRPE) over a 6-week pre-season training period. Immediately prior to and following this training period, participants undertook a 30–15 intermittent fitness test (30-15IFT), which was used to determine a players ‘starting fitness’ and ‘end fitness’. In total twelve TL variables were collected, and these along with ‘starting fitness’ as a covariate were regressed against ‘end fitness’. However, considerable multicollinearity in the data (VIF >1000 for nine variables) meant that the multiple linear regression (MLR) process was unstable and so we developed a novel LOVO PLSCA adaptation to quantify the relative importance of the predictor variables and thus minimise multicollinearity issues. As such, the LOVO PLSCA was used as a tool to inform and refine the MLR process.

Results

The LOVO PLSCA identified the distance accumulated at very-high speed (>7 m·s-1) as being the most important TL variable to influence improvement in player fitness, with this variable causing the largest decrease in singular value inertia (5.93). When included in a refined linear regression model, this variable, along with ‘starting fitness’ as a covariate, explained 73% of the variance in v30-15IFT ‘end fitness’ (p<0.001) and eliminated completely any multicollinearity issues.

Conclusions

The LOVO PLSCA technique appears to be a useful tool for evaluating the relative importance of predictor variables in data sets that exhibit considerable multicollinearity. When used as a filtering tool, LOVO PLSCA produced a MLR model that demonstrated a significant relationship between ‘end fitness’ and the predictor variable ‘accumulated distance at very-high speed’ when ‘starting fitness’ was included as a covariate. As such, LOVO PLSCA may be a useful tool for sport scientists and coaches seeking to analyse data sets obtained using GPS and MEMS technologies.

Introduction

Professional sporting organisations invest considerable resources collecting and analysing data to better understand the factors that influence athletic performance. Recent advances in wearable technology and computing power mean that large volumes of data are now readily available to the applied practitioner [1]. However, while this data is becoming easier to collect, analysing it can be a challenging task, particularly when sample sizes are small (i.e. limited by squad size) and the data is highly correlated–something that can lead to instability when applying standard least squares regression techniques, making it difficult to draw firm inference [23]. With respect to this, global positioning system (GPS) and micro-electrical-mechanical-system (MEMS) data can be particularly problematic [45]. GPS and MEMS are often used to measure an athlete’s movement, from which speed, distance travelled, and acceleration can be computed using standard mathematical algorithms. For example, a player’s velocity and acceleration are simply the first and second derivatives of the distance travelled. Consequently, these variables are not independent, but instead are highly correlated. It is therefore not surprising that strong correlations have been reported between variables widely used to assess training load (TL) [45].

Whilst individual measured variables are collected when acquiring performance data, these are often grouped together to represent latent constructs such as ‘fitness’, ‘fatigue’ or ‘technical-tactical performance’. For example, a rugby league coach might represent ‘attacking performance’ using variables such as the ‘number of metres run’, ‘tackle breaks’ or ‘kick return metres’ [6]. Conversely, a sports scientist might assess the TL imposed using data acquired from GPS or MEMS [4]. However, quantifying the relationships within and between these various latent constructs can be challenging because the individual variables used to quantify, for example, TL in athletes, have been shown in a meta-analysis to possess substantial correlations [5], making them prone to multicollinearity issues [3,5]. Nevertheless, when attempting to understand relationships between the latent constructs used in sports performance, multicollinearity issues are often ignored, with the tendency being instead to rely on univariate analysis [710]. However, univariate analysis has the inherent drawback that it assumes that the variables are independent and does not allow for any covariance within the data, something that can be a major weakness. Furthermore, it is not possible using univariate analysis to ‘capture’ any information that may be associated with the covariance between the variables, something that might be particularly important when considering the effect that changes in prescribed training mode across a training period might have on the relationships between TL variables [45, 1112]. Consequently, when analysing TL data it is important to allow for covariance in the data in order to ensure that appropriate inference is drawn and that erroneous conclusions are not reached.

When more sophisticated analysis is used, multiple linear regression (MLR) is generally the tool used by practitioners to quantify the strength of relationships within TL data. It is common practice when performing MLR to retain only those predictor variables that ‘confidently explain’ the behaviour of the response (dependent) variable, with variables that fail to reach a required level of significance excluded. While this approach works well when the predictor variables are weakly correlated with each other, problems can occur when strong correlations are present and variable inflation factors (VIFs) are >10 [1315]. Multicollinearity over-inflates the standard errors associated with the respective regression coefficients, causing the p-values to become very sensitive to changes in model specification, resulting in the whole process becoming unstable. Consequently, multiple competing models may be produced, making it difficult to be confident about any inference drawn from the various models [2]. In contrast to MLR, singular value decomposition (SVD) is immune to multicollinearity because it produces a set of orthogonal composite variables that are completely uncorrelated [1618]. Although not a statistical technique per se, SVD underpins other techniques such as principal component analysis (PCA) and partial least squares correlation analysis (PLSCA) [1920]. PLSCA in particular appears to have considerable potential with regard to the analysis of small data sets that exhibit multicollinearity (e.g. sports performance data sets). Because PLSCA incorporates SVD, it has the great advantage that it is both immune to multicollinearity, and unlike MLR, can cope with situations where the number of predictor variables exceeds the number of observations. Unlike MLR, which involves one response (dependent) variable and multiple predictor (independent) variables, PLSCA sub-divides the data into two blocks (sub-groups) each containing one or more variables, and then uses SVD to establish the strength of any relationship that might exist between the two component sub-groups [21]. It does this by using SVD to determine the inertia (sum of the singular values) of the covariance matrix of the sub-groups under consideration [17,19,21]. Given that the singular values are proportional to the magnitude of any effect [17], the higher the value of the singular value inertia observed, the greater the amount of shared information between the respective sub-groups and the stronger the relationship between the two. PLSCA is usually accompanied by a one-tailed permutation test generally involving 10,000 random permutations in order to establish the null distribution of the possible inertias and the likelihood (the odds) of the observed relationship occurring by chance [21].

Although PLSCA has been widely used in neuroimaging [17,1920,22], it has not been used in sport science, with the result that its potential in this discipline has not been exploited. We therefore hypothesised that PLSCA might be a useful tool with which to analyse small sport performance data sets. As sport scientists regularly seek to identify the TL variables that best relate to training outcome measures, we designed the study presented here with the aim of investigating the relationship between the accumulation of TL (quantified using multiple collection methods), and 30–15 intermittent fitness test (30-15IFT) performance achieved by 16 professional youth rugby league players following a 6-week training period. The overall aim of the study was to establish whether or not PLSCA, coupled with a novel ‘leave one variable out (LOVO)’ adaptation, might be a useful tool for analysing TL data sets where multicollinearity issues are problematic. More specifically, we wanted to know if the above methodology could identify which accumulated TL variables best relate to 30-15IFT performance following 6-weeks of training.

Leave one variable out PLSCA example

In order to explain the linear algebra underpinning PLSCA and to highlight the LOVO adaptation, we shall first consider a small dataset (Table 1) containing publicly available data collected from the twelve teams competing in the European Super League during the 2017 season (https://rugby-league.com/superleague/stats/club_stats). In this data set we have two outcome variables for the season: total league points accumulated and total match score difference; and three performance related variables: the number of missed tackles; the number of tackle busts; and the number of clean breaks. If for example, we wanted to establish whether or not a relationship exists between these two groups of variables, we could perform PLSCA to assess the amount of shared information common to the two and use this to quantify the strength of any relationship.

thumbnail
Table 1. League outcome and match performance data for the teams in the European Super League (season 2017).

https://doi.org/10.1371/journal.pone.0211776.t001

PLSCA is generally performed by first mean centring and standardizing the data to unit variance [23] and then dividing it into two matrices, in this case a [12×2] matrix, X, containing the variables total league points accumulated and total score difference, and a [12×3] matrix, Y, containing the number of missed tackles, the number of tackle busts, and the number of clean breaks. The pattern of relationships between the columns of X and Y can be stored in a [3×2] covariance matrix, R, [21] which is computed as follows: (1)

SVD of the matrix, R, yields three orthogonal matrices: a [3×2] left singular vector matrix, U, containing the saliences (weights) for matrix, Y; a [2×2] right singular vector matrix, V, containing the saliences (weights) for matrix, X; and S, a [2×2] diagonal matrix containing the singular values [21].

(2)

The quantity of common information shared between X and Y can be directly quantified by computing the inertia, ø, (i.e. the sum of all the non-zero singular values) common to the two sub-groups, as follows [21]: (3) where; n is the number of non-zero singular values of matrix R; and si is ith diagonal element of matrix S. Singular value inertia is an effect size, which directly relates to the strength of the relationship between the two sub-groups X and Y (i.e. the greater the magnitude of the inertia to more shared information common to X and Y).

The statistical significance of the calculated inertia value can then be assessed using a permutation test in which the rows of Y are randomly permutated 10,000 times [21], to produce the null distribution of all the possible inertia values that could occur just by chance, as shown in Fig 1. From this it can be seen that an inertia value of ≥18.638 only occurred 11 times in 10,000 simulations. This equates to a p-value of 0.0011 and indicates that, perhaps unsurprisingly, a very strong relationship exists between the match performance and season outcome variables, something that is unlikely to have occurred by chance.

thumbnail
Fig 1. Singular value inertial value (indicated by dotted line) computed from the observed data and the null-distribution of the inertia computed using a permutation test with 10,000 permutations.

https://doi.org/10.1371/journal.pone.0211776.g001

Having established that a strong relationship exists between the performance variables and season outcome, it is possible to interrogate the data further to identify which variables are most influential. In order to do this in a systematic manner, we developed a novel LOVO strategy, mirroring similar approaches adopted for SVD [18] and random forests [23], which involved repeating the PLSCA several times, with a different Y variable excluded from the analysis on each occasion. By observing the effect of each successive omission on the magnitude of the inertia, it is possible to assess the relative contribution of each performance variable to league outcome. In the example above, when the ‘number of missed tackles’ is omitted from the analysis, the inertia falls only marginally to 18.178, suggesting that this variable is not particularly influential. However by comparison, when the ‘number of tackle busts’ and the ‘number of clean breaks’ are omitted, then the inertia falls to 13.854 and 11.786 respectively, indicating that these two variables are much stronger predictors of league outcome.

Materials and methods

Participants

An observational research study was conducted in which the accumulated TL of sixteen male professional youth rugby league players (age [y]: 17.7 ± 0.9; height [cm]: 179.6 ± 5.5; body mass [kg]: 87.0 ± 8.8) was quantified via GPS, MEMS and session-rating-of-perceived-exertion (sRPE) during a 6-week pre-season training period. Immediately prior to and following this training period, participants undertook the 30–15 intermittent fitness test (30-15IFT), which was used to determine a players ‘starting fitness’ and ‘end fitness’. The content of the training and testing periods was prescribed by the coaching staff and included 3 to 4 field-based sessions per week comprising technical-, tactical-, sprint-, interval- and small-sided-games-based-training. The total number of recorded field training sessions was 273 with players participating in 17 ± 3 sessions. All procedures performed in the study were in accordance with the Leeds Beckett University ethics board which approved of the study prior to data collection and which conformed to the 1964 Helsinki Declaration. Informed consent was obtained from all participants prior to data collection.

Procedures

The 30-15IFT was conducted on artificial turf following two days of complete rest and prior to any additional training as per previous methods [24]. Players possessed familiarity with the 30-15IFT as part of their regular monitoring practices. The 30-15IFT comprised 30 second shuttles run over 40m, with 15 seconds of recovery. The speed of the test was controlled by an audible sound. At this time of the sound, players were required to be within a 3 m tolerance zone at either end or the middle of the 40 m shuttle. The start speed of the test was 8 km·h-1 and increased by 0.5 km·h-1 following each successive shuttle. The test terminated when players were no longer able to maintain the required speed of the test or when they did not reach the 3 m tolerance zone on three consecutive occasions. The last completed velocity during the test was taken as v30-15IFT. The v30-15IFT achieved prior to the start of the training period was deemed to be each players ‘starting fitness’ whilst the v30-15IFT achieved at the completion of the six-week training period was deemed to be each players ‘end fitness’.

All external training load measures were collected concurrently during each training session using 10 Hz GPS devices with in-built 100 Hz tri-axial accelerometer, gyroscope and magnetometer (Optimeye S5, Catapult Innovations, Melbourne, Victoria; firmware version: 7.17). This data was downloaded into specialist software (Catapult Sprint v5.1.7, Catapult Innovations, Melbourne, Victoria). The device was positioned between the scapulae within a manufacturer designed vest according to typical procedures. Each player wore the same unit for each session to limit potential between-unit variability in the data collected [25]. The mean number of satellites and horizontal dilution of precision (HDOP) during the data collection period was 15 ± 3 and 0.8 ± 0.6 respectively [25].

Derived from GPS, the total distance (m) covered during the ~17 training sessions was further differentiated into the distances (m) covered at arbitrary speed thresholds of low- (0 to 3 m∙s-1; SZ1), moderate- (3.1 to 5 m∙s-1; SVZ2), high- (5.1 to 7 m∙s-1; SZ3) and very-high-speeds (> 7.1 m∙s-1; SZ4). The minimum effort duration for each of the speed zones were set at 1 second [26]. An individualised high-speed threshold (IndSZ) was also calculated for each player, which was defined as the distance covered above the terminal speed achieved during the 30-15IFT prior to commencement of the training programme. Between the players, this speed threshold ranged from 4.58 to 5.41 m·s-1.

Derived from the tri-axial accelerometer, total session PlayerLoad is a modified vector magnitude and is expressed as the square root of the sum of the squared instantaneous rate of change in acceleration in each of the three axes (X, Y, and Z) and divided by 100 [27]. This was further differentiated into four zones relating to low- (0 to 1 AU; PLZ1), moderate- (1.1 to 2 AU; PLZ2), high- (2.1 to 3 AU; PLZ3) and very-high (>3 AU; PLZ4) accumulation of PlayerLoad. All PlayerLoad variables were expressed in arbitrary units (AU). PlayerLoad has previously been shown to possess acceptable reliability [27]. sRPE was calculated for each player during the study period using the method of Foster et al. [28]. Exercise intensity for sRPE was determined using the Borg category ratio-10 scale, with players providing this ~15 to 30 minutes following the cessation of the session [28]. This was then multiplied by the training-session duration to calculate the sRPE training load in AU.

Statistical analysis

In order to assess the strength of the relationships between the variables, we first undertook Pearson correlation analysis and then performed MLR analysis with ‘end fitness’ as the response variable and all the other variables included as predictors. VIF values were then calculated for each of the respective predictor variables, with those >10 identified as being particularly problematic [1315].

In the study, PLSCA was used as a filtering tool to by-pass any multicollinearity problems and identify those variables that were most influential in predicting improvements in fitness. This involved performing a baseline PLSCA (as described above) with the variables divided into two groups: an ‘output’ sub-group containing the variables ‘starting fitness’ and ‘end fitness’, and a ‘predictor’ sub-group comprising all the TL variables. We included both the ‘fitness’ variables in the ‘output’ group, because collectively they contained more information about improvements in fitness than would have been the case if just the difference in the fitness level had been used. Once the baseline inertia (i.e. the calculated inertia with all the predictor variables included in the model) and its associated p value were calculated, the whole PLSCA process was then repeated with one predictor variable omitted from the analysis and the new inertia and p value noted. This process was repeated with a different predictor variable omitted each time (as described above), until the contribution of all the variables had been evaluated individually. Having done this, those variables that were deemed influential were used to construct refined PLSCA and MLR models. Improvements derived from refining the baseline PLSCA model were assessed using the Chi- square test and Cramer’s V. MLR analysis was performed with 'end fitness' as the response variable and 'starting fitness' as a covariate, as recommended by Allison [29]. All analysis was undertaken using a combination of in-house algorithms written in Matlab (version R2016b: utilising the ‘Statistics and machine learning’ toolbox) (Math-Works, Natick, MA) and R (version 3.3.2: utilising the packages: ‘psych’; ‘car’; and ‘pracma’) (open source software). For all analyses, p values <0.05 were deemed significant.

Results

The TL descriptive results are presented in Table 2 along with the study data collected for each of the 16 subjects.

thumbnail
Table 2. Training load (TL) data and termination speed during 30–15 intermittent fitness test for 16 professional youth rugby league players.

https://doi.org/10.1371/journal.pone.0211776.t002

Pearson correlation analysis (Table 3) revealed multiple strong correlations between the predictor variables in the data, suggesting that the data exhibited considerable multicollinearity, something that was confirmed by the extremely high VIF values obtained when MLR analysis was performed (Table 4). From Table 4 it can be seen that all but one (i.e. sRPE) of the predictor variables exhibited a VIF>10, with most having values >1000. However, despite numerous strong relationships in the data, ‘end fitness’ was only significantly positively correlated with the variables ‘SZ4’ (r = 0.738, p = 0.001) and ‘starting fitness’ (r = 0.784, p<0.001) (Table 3).

thumbnail
Table 3. Results of the Pearson correlation analysis between training load variables, starting fitness and end fitness.

https://doi.org/10.1371/journal.pone.0211776.t003

thumbnail
Table 4. Baseline multiple linear regression model with end fitness as the response variable, showing the calculated variable inflation factors (VIFs).

https://doi.org/10.1371/journal.pone.0211776.t004

The results of the LOVO PLSCA (Table 5) revealed that the greatest decrease in measured (computed) inertia compared with baseline occurred when the variables SZ3 (1.945) and SZ4 (5.926) were omitted from the PLSCA model, indicating that these were the most influential variables, as illustrated by Fig 2. When only these predictor variables were used to construct the PLSCA model (Table 6), the p value achieved was 0.015, indicating that the matrix containing the predictor variables SZ3 and SZ4 shared a considerable amount of information with the output matrix containing the variables ‘starting fitness’ and ‘end fitness’ and that therefore they were likely to be the best predictors of end fitness. Given that ‘starting fitness’ can be treated as a covariate, this implies that SZ3 and SZ4 were likely to be the best predictors of end fitness.

thumbnail
Fig 2. Variable importance plot showing the decrease in singular value inertia attributable to each predictor variable.

https://doi.org/10.1371/journal.pone.0211776.g002

thumbnail
Table 5. Results of the LOVO PLSCA showing the effect on singular value inertia of omitting variables one at a time.

https://doi.org/10.1371/journal.pone.0211776.t005

When the variables SZ3 and SZ4 were subsequently used, together with ‘starting fitness’ as a covariate, to construct linear regression models (Table 7), it was found that the multicollinearity problems disappeared, with all the VIF values being <2.5. Furthermore, the models produced were particularly strong, both exhibiting adjusted r2 values >0.7. Of the two models produced, the one containing just the predictor variable SZ4 and ‘starting fitness’ (i.e. MLR Model 1) was the strongest, possessing an Akaike information criterion (AIC) value of 28.3 (lower than the AIC value of 30.1 for MLR Model 2, which also included variable SZ3), indicating that these two variables were the most influential in predicting end fitness, corroborating the results of the LOVO PLSCA.

thumbnail
Table 7. Results of refined MLR models with respective variable inflation factors.

https://doi.org/10.1371/journal.pone.0211776.t007

Discussion

The overall aim of the study was to evaluate the extent to which PLSCA might be helpful when analysing TL data that exhibited considerable multicollinearity. As such, we wanted to identify the TL variables that best related to 30-15IFT performance in young rugby league players following 6-weeks of training. With respect to this, the specific findings of the current study revealed perhaps unsurprisingly, that ‘starting fitness’ is an important covariate of ‘end fitness’, with a strong positive correlation between the two–something that others have observed [3031]. The strongest regression model (Table 7; MLR Model 1) suggests that professional youth rugby league players with a lower starting fitness require a lower accumulation of distance at very-high speed (> 7 m·s-1) (compared to players with a higher starting fitness) to elicit a comparable incremental improvement in end fitness (e.g. +1 km·h-1 in v30-15IFT) following 6-weeks of training. This model suggests, for example, that a professional youth rugby league player with a starting v30-15IFT of 17.5 km·h-1 would require an accumulation of 350m at very-high speed over 6-weeks to improve their v30-15IFT by 1 km·h-1 compared to 1050m for a player with a starting fitness of 20.5 km·h-1. As such, this regression model could be used to translate TL data (in conjunction with starting fitness) into practical targets for the applied practitioner working with youth rugby league players. However, it is important to note that this relationship (and associated MLR model) was observed within a single team, meaning the variability between players in the accumulated distances at very-high-speed are specific to the context of the training modalities prescribed by the coaching staff at this club [32]. We therefore recommend that future researchers conduct randomised control trials with appropriate comparator arms in order to consolidate or refute our findings regarding the importance of the interaction between the distance accumulated at very-high speed and a players starting ‘fitness’ to improving prolonged intermittent running ability in team sport athletes.

Consistent with previous research [45,811], the findings of the current study indicate that variables commonly used to assess TL tend to be highly correlated (Table 3) and thus contain considerable shared information. As such, they violate assumptions regarding multicollinearity (Table 4), which can be problematic when performing MLR analysis [1314]. While multicollinearity issues can be addressed by removing variables with a ‘high VIF’ value, this has the disadvantage that it is rather piecemeal and involves making subjective decisions regarding VIF exclusion criteria and the variables to be excluded. Alternatively, principal component regression (PCR) can be used [3334]. However, while PCR is immune to multicollinearity, it has the great disadvantage that it requires the construction of composite predictor variables, which are difficult to interpret. In response to this, we developed the LOVO PLSCA methodology presented above as an alternative for analysing multicollineated data sets. The results of the current study suggest that the LOVO PLSCA strategy is well suited to the analysis of highly correlated sports performance data, suggesting that it might be a useful tool for researchers and practitioners seeking to better understand the factors that influence sports performance.

The LOVO PLSCA approach echoes that of the ‘decrease in Gini impurity’ strategy frequently used with random forests to quantify variable importance [23]. As such, it represents a new orthogonal approach for quantifying variable importance that is immune to multicollinearity and can be used as a variable filtering tool prior to MLR. Furthermore, unlike MLR, which struggles when the number of predictor variables exceeds the number of subjects or observations [35]., PLSCA is not affected by this problem. With PLSCA it is possible to explore the relationship between multiple predictor variables and multiple response variables, enabling complex relationships within the data to be evaluated–something that may be of great value when investigating broad latent constructs such as ‘strength’, ‘fatigue’ or ‘technical-tactical performance’. Say for example, we wanted to evaluate the relationship between, ‘fatigue’ status (measured using the variables: ‘change in countermovement jump height’; ‘perceived recovery’; and ‘6 second watt bike test’) and ‘physical performance’ (measured using the variables: ‘v30-15IFT’; ‘40 metre maximal sprint speed’; and ‘3 repetition maximal squat and bench press’). With PLSCA it would be possible to investigate the relationship between the four ‘physical performance’ variables and the three ‘fatigue’ variables, something that would be difficult using a more conventional approach.

Although LOVO PLSCA can be used to assess the relative importance of predictor variables, because it is not a regression technique it cannot explicitly predict the response variables from a set of predictor variables [19]. In order to do this a related technique, partial least squares regression (PLSR), has been developed [19,21]. While consideration of PLSR is beyond the scope of the current paper, it is worth noting that PLSR shares many similarities with PCR in so much that both techniques are primarily used for prediction and require the construction of composite predictor variables, albeit using different methodologies. As such, PLSR suffers from the same drawback as PCR, namely that the models produced are difficult to interpret because the predictors are not the original measured variables. By comparison, PLSCA, when used as a filtering tool and combined with MLR, overcomes any multicollinearity problems and is much easier to interpret.

Conclusions

The findings of the current study demonstrate that multicollinearity is a major limiting factor, which has the potential to compromise analysis of TL data. However, this problem can be overcome by using an orthogonal PLSCA approach, which is immune to multicollinearity, thus enabling the user to quantify the strength of the relationships between the respective variables. Using LOVO PLSCA we were able to identify those variables that were most influential in explaining improvements in player fitness. This enabled us to remove irrelevant variables and so overcome any multicollinearity issues. This allowed us to produce a robust MLR model for predicting ‘end fitness’, from which we inferred that ‘starting fitness’ and the accumulation of distance at ‘very-high speed’ across a 6-week period of training were the most influential predictors of end fitness in professional youth rugby league players. As such, PLSCA appears to be a useful tool for filtering out irrelevant information and identifying those variables that should be included prior to any given MLR analysis.

Acknowledgments

The authors would like to thank Professor Charles Taylor (University of Leeds) for his helpful suggestions regarding the PLSCA methodology and Melissa Bargh for her assistance in the data collection.

References

  1. 1. Cardinale M & Varley MC. Wearable training-monitoring technology: applications, challenges and opportunities. Int J Sports Physiol Perform. 2017;12: S255–S262. pmid:27834559
  2. 2. Graham MH. Confronting multicollinearity in ecological multiple regression. Ecology. 2003;84(11): 2809–2815.
  3. 3. Slinker BK & Glantz SA. (1985). Multiple regression for physiological data analysis: the problem of multicollinearity. Am J Physiol. 1985;249(1 Pt 2): R1–12. pmid:4014489
  4. 4. Weaving D, Marshall P, Earle K, Nevill A & Abt G. Combining internal- and external-training-load measures in professional rugby league. Int J Sports Physiol Perform. 2014; 9: 905–912. pmid:24589469
  5. 5. McLaren SJ, Macpherson TW, Coutts AJ, Hurst C, Spears IR & Weston M. The relationships between internal and external measures of training load and intensity in team sports: a meta-analysis. Sports Med. 2018;48(3): 641–658. pmid:29288436
  6. 6. Woods CT, Sinclair W & Robertson S. Explaining match outcome and ladder position in the National Rugby League using team performance indicators. J Sci Med Sport. 2017;20(12): S1440–2440.
  7. 7. Hulin BT, Gabbett TJ, Lawson DW, Caputi P & Sampson JA. The acute:chronic workload ratio predicts injury: high chronic workload may decrease injury risk in elite rugby league players. Br J Sports Med. 2016;50(4): 231–236. pmid:26511006
  8. 8. Oxendale CL, Twist C, Daniels M & Highton J. The relationship between match-play characteristics of elite rugby league and indirect markers of muscle damage. Int J Sports Physiol Perform. 2016;11: 515–521. pmid:26355239
  9. 9. Akubat I, Patel E, Barrett S, & Abt G. Methods of monitoring the training and match load and their relationship to changes in fitness in professional youth soccer players. J Sports Sci. 2012;30: 1473–1480. pmid:22857397
  10. 10. Taylor R, Sanders D, Myers T, Abt G, Taylor CA & Akubat I. The dose-response relationship between training load and aerobic fitness in academy rugby union players. Int J Sports Physiol Perform. 2017;22: 1–22.
  11. 11. Weaving D, Jones B, Marshall P, Till K & Abt G. Multiple measures are needed to quantify training loads in professional rugby league. Int J Sports Med. 2017;38: 735–740. pmid:28783849
  12. 12. Weaving D, Jones B, Till K, Abt G, Beggs C. The case for adopting a multivariate approach to optimise training load quantification in team sports. Front Physiol. 2017;8: 1024. pmid:29311959
  13. 13. Marquardt DW. Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics. 1970;12: 591–256.
  14. 14. Mason RL, Gunst RF & Hess JL. Statistical design and analysis of experiments: applications to engineering and science. New York: Wiley; 1989.
  15. 15. Neter J, Wasserman W & Kutner MH. Applied Linear Regression Models. IL: Irwin; 1989.
  16. 16. Golub GH & Kahan W. Calculating the singular values and pseudo-inverse of a matrix. SIAM J Appl Math Series B, Numerical Analysis.1965;2(2): 205–224.
  17. 17. McIntosh AR, Misic B. Multivariate statistical analyses for neuroimaging data. Annu Rev Psychol. 2013;64: 499–525. pmid:22804773
  18. 18. Till K, Jones BL, Cobley S, Morley D, O'Hara J, Chapman C et al. Identifying Talent in Youth Sport: A Novel Methodology Using Higher-Dimensional Analysis. PLoS One. 2016;11(5):e0155047. pmid:27224653
  19. 19. Krishnan A, Williams LJ, McIntosh AR & Abdi H. Partial Least Squares (PLS) methods for neuroimaging: a tutorial and review. Neuroimage. 2011;56(2): 455–75. pmid:20656037
  20. 20. Levine B, Kovacevic N, Nica EI, Schwartz ML, Gao F & Black SE. Quantified MRI and cognition in TBI with diffuse and focal damage. Neuroimage Clin. 2013;2: 534–541. pmid:24049744
  21. 21. Abdi H & Williams LJ. Partial least squares methods: partial least squares correlation and partial least square regression. Methods Mol Biol. 2013;930: 549–579. pmid:23086857
  22. 22. Beggs CB, Magnano C, Belov P, Krawiecki J, Ramasamy DP Hagemeier J et al. Internal Jugular Vein Cross-Sectional Area and Cerebrospinal Fluid Pulsatility in the Aqueduct of Sylvius: A Comparative Study between Healthy Subjects and Multiple Sclerosis Patients. PLoS One. 2016;11(5): e0153960. pmid:27135831
  23. 23. Louppe G, Wehenkel L, Sutera A & Geurts P. Understanding variable importances in forests of randomized trees. Neural Information Processing Systems Conference: Advances in Neural Information Processing Systems. 2013;26.
  24. 24. Darrall-Jones JD, Jones B & Till K. The effect of body mass on the 30–15 intermittent fitness test in rugby union players. Int J Sports Physiol Perform. 2016;11(3): 400–403. pmid:26217047
  25. 25. Malone JJ, Lovell R, Varley MC & Coutts AJ. Unpacking the black box: applications and considerations for using GPS devices in sport. Int J Sports Physiol Perform. 2017;12(Suppl 2): S218–S226. pmid:27736244
  26. 26. Varley MC, Jaspers A, Helsen WF & Malone JJ. Methodological considerations when quantifying high-intensity efforts in team sport using global positioning system technology. Int J Sports Physiol Perform. 2017;4;1–25.
  27. 27. Boyd LJ, Ball K & Aughey RJ. The reliability of MinimaxX accelerometers for measuring physical activity in Australian football. Int J Sports Physiol Perform. 2011;6(3): 311–21. pmid:21911857
  28. 28. Foster C, Florhaugh JA, Franklin J, Gottschall L, Hrovatin LA, Parker S, et al. A new approach to monitoring exercise training. J Strength Cond Res. 2001;15: 109–115. pmid:11708692
  29. 29. Allison PD. Change Scores as Dependent Variables in Regression Analysis. Sociological Methodology. 1990;20; 93–114
  30. 30. Skinner JS, Jaskólski A, Jaskólski A, Krasnoff J, Gagnon J, Leon AS, et al. Age, sex, race, initial fitness, and response to training: the HERITAGE Family Study. J Appl Physiol. 1985;90:1770–1776.
  31. 31. Astorino TA, Schubert MM. Individual responses to completition of short-term and chronic interval training: a retrospective study. PLoS One. 2014;21:e97638.
  32. 32. Weaving D, Jones B, Till K, Marshall P, Earle K, Abt G. Quantifying the external and internal loads of professional rugby league training modes: consideration for concurrent field-based training prescription. J Strength Cond Res. 2017; pmid:28930869
  33. 33. Minitab blog editor. Enough Is Enough! Handling Multicollinearity in Regression Analysis. 16th April 2013. http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis (accessed 6th May 2018)
  34. 34. James G, Witten D, Hastie T & Tibshirani R. An introduction to statistical learning with applications in R. Springer texts in statistics. Springer, New York; 2013.
  35. 35. Stodden V. Model selection when the number of variables exceeds the number of observations. Stanford University. 2006.