The Jack and Jill Adaptive Working Memory Task: Construction, Calibration and Validation

Elina Tsigeman; Sebastian Silas; Klaus Frieler; Maxim Likhanov; Rebecca Gelding; Yulia Kovas; Daniel Müllensiefen

doi:10.1371/journal.pone.0262200

Abstract

Visuospatial working memory (VSWM) is essential to human cognitive abilities and is associated with important life outcomes such as academic performance. Recently, a number of reliable measures of VSWM have been developed to help understand psychological processes and for practical use in education. We sought to extend this work using Item Response Theory (IRT) and Computerised Adaptive Testing (CAT) frameworks to construct, calibrate and validate a new adaptive, computerised, and open-source VSWM test. We aimed to overcome the limitations of previous instruments and provide researchers with a valid and freely available VSWM measurement tool. The Jack and Jill (JaJ) VSWM task was constructed using explanatory item response modelling of data from a sample of the general adult population (Study 1, N = 244) in the UK and US. Subsequently, a static version of the task was tested for validity and reliability using a sample of adults from the UK and Australia (Study 2, N = 148) and a sample of Russian adolescents (Study 3, N = 263). Finally, the adaptive version of the JaJ task was implemented on the basis of the underlying IRT model and evaluated with another sample of Russian adolescents (Study 4, N = 239). JaJ showed sufficient internal consistency and concurrent validity as indicated by significant and substantial correlations with established measures of working memory, spatial ability, non-verbal intelligence, and academic achievement. The findings suggest that JaJ is an efficient and reliable measure of VSWM from adolescent to adult age.

Citation: Tsigeman E, Silas S, Frieler K, Likhanov M, Gelding R, Kovas Y, et al. (2022) The Jack and Jill Adaptive Working Memory Task: Construction, Calibration and Validation. PLoS ONE 17(1): e0262200. https://doi.org/10.1371/journal.pone.0262200

Editor: David Giofrè, Liverpool John Moores University, UNITED KINGDOM

Received: June 28, 2021; Accepted: December 20, 2021; Published: January 27, 2022

Copyright: © 2022 Tsigeman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The DOI for the data from studies 1 and 2 is: 10.17605/OSF.IO/CN4T9. The corresponding URL is: https://osf.io/cn4t9/ Data from Studies 3 and 4 are still only available upon request to Yulia Kovas (y.kovas@gold.ac.uk) or to Ethics committee of the Tomsk State University (info@ethicom.ru or ethics.committee@mail.tsu.ru).

Funding: Daniel Müllensiefen - There is no grant number - Alexander von Humboldt-Stiftung Foundation Maxim Likhanov - There is no grant number - ITMO Fellowship and Professorship Program Yulia Kovas -There is no grant number - Sirius University The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The term working memory (WM) first appeared in Miller, Galanter, and Pribham [1] where the authors described it as a storage system “where plans can be retained temporarily when they are being formed, or transformed, or executed”. More recently, it has been described as the “brain’s conductor” [2], because of its vital role in human learning. WM has also been described as a “cognitive primitive” [3], which constrains the acquisition and deployment of most intellectual abilities [4].

Several theoretical WM frameworks have been presented in the literature, including the embedded-processes model [5]; the biologically based model [6]; the attention control model [7]; and the time-based resource-sharing model [8]. To date, one of the most influential WM models is the multi-component model [9–11] which describes WM as a limited capacity system that enables the temporary storage and manipulation of information. According to this model, WM contains four components: the phonological loop (a slave subsystem that stores verbal information), the visuospatial sketchpad (visuospatial working memory, VSWM; a slave subsystem that stores and manipulates visuospatial information), the episodic buffer (a mechanism for multimodal information exchange between WM and other types of memory, e.g., long-term memory) and the central executive (which enables attentional focus, manipulation, and decision-making).

WM correlates substantially with measures of intelligence (e.g., [12] and also predicts a range of salient life outcomes such as reading comprehension [13, 14], domain expertise [15] and academic achievement [16, 17]. In particular, VSWM capacity appears to be a predictor across a wide spectrum of achievements in different domains, showing moderate correlations with reading performance [18], achievement in English [19], maths, and related areas such as geometry and informatics [14, 20] as well as science [21] in schoolchildren.

Structure and measurement of visuospatial working memory

VSWM is believed to consist of several components, which can be differentiated by distinct experimental tasks. Logie [22] described VSWM as a tripartite model consisting of a passive visual cache for maintaining information, an active inner scribe for enabling rehearsal, and a central executive system, which in turn integrates and controls the other two facets. Each of the three components showed distinctive developmental patterns in Logie and Pearson [23].

Della Sala, Gray, Baddeley, Allamano, and Wilson [24] suggested another two different distinct components of VSWM. Their model distinguishes between visual and spatial components, with the spatial component also potentially consisting of a spatial-sequential subset. The visual component enables maintaining the appearance of objects (shape, colour, etc. as assessed by e.g., the odd-one-out task) and location (e.g., Corsi block-tapping test; [25]). The spatial component enables identifying changes between stimuli (e.g., The Visual Patterns Test; [26]). A similar model differentiating between visual, spatial (simultaneous and sequential) and visuospatial complex span components was suggested by the data in Mammarella and colleagues [27]. Other research suggests somewhat different structures when considering WM more broadly. For example, one study performed a confirmatory factor analysis on 10 visuospatial and 3 verbal WM tasks, with the best model containing four passive components for storage (verbal, visual, simultaneous-spatial, and sequential-spatial) and one active component for processing (central executive) [28]. Active processes are suggested to involve the manipulation of information in addition to information maintenance, while passive processes include only maintenance of information that is not modified after encoding.

Many individual differences tasks which have been employed to measure aspects of VSWM have some notable limitations. In particular, they lack a close relationship to a cognitive model of WM that would bolster the task’s construct validity (e.g., [29]). They may also share the problem of task impurity–measuring several WM components at once. Some research (e.g., [30]) addressed this issue by assessing performance on multiple WM tasks and then adopting a latent variable approach to extract the common variance among the tasks. Yet tasks may still involve other cognitive abilities, such as long-term memory encoding or retrieval [31]. Additionally, while there are some well-established visuospatial tests (e.g., in the Woodcock-Johnson III (WJ III) Tests of Cognitive Abilities); [32], sometimes they might be too expensive to be used by researchers and educators.

Furthermore, many traditional measures of VSWM do not require concurrent processing and storage of visuospatial information, which is the simultaneous engagement of the central executive and visuospatial sketchpad, according to the Baddeley and Hitch WM model. To address this gap in the test literature, Shah and Miyake [33] proposed a new spatial span task using a dual-task paradigm, as typical for experimental WM research. The task requires participants to mentally rotate sequentially presented visual shapes (i.e., letters), while also encoding the shapes’ spatial positions in the correct order. The number of correctly recalled location sequences represents the participant’s VSWM score. The task showed high reliability (Cronbach’s α = .80) and loaded strongly onto a spatial WM factor in their original study [33]. These results were replicated in a later study [34], where the test was named ‘letter rotation task’ and loaded on a complex storage-plus-processing VSWM factor (standardized factor loading = .64), as opposed to a simple-storage visuospatial short-term memory factor or an executive function factor. In sum, the letter rotation task represents an easy-to-understand dual-task paradigm that requires processing and storage of two different attributes (orientation and location) of the same visual shape. It has good psychometric properties, is easy to score, and task difficulty can be easily manipulated via the length of sequences presented.

However, the letter rotation task requires the recognition of letter shapes and hence is not suitable for use with young children, illiterate individuals, or participants with severe reading impairments. Therefore, Alloway and colleagues created a new version of the task (the ‘Mister X’ task) for use with individuals from early childhood (5 years) to adulthood (69 years), as part of The Automated Working Memory Assessment (AWMA; [35, 36]). The Mister X task is considered visually more appealing (to children) due to its use of two illustrated human figures that are presented simultaneously on the screen, each holding a ball in one hand. One Mister X figure is rotated, and participants are required to judge whether both Mister X figures hold the ball in the same hand. In addition, participants must remember the spatial location of the ball of the right figure and recall all ball locations in the correct order, after the end of the sequence of trials. The Mister X task showed good psychometric benchmarks (test-retest reliability, r = .77, [36]) and a single clear loading on a VSWM factor for children as well as adults in an independent validation and application study [37]. The evidence of the task’s reliability and validity for use with adults and children is robust and made the Mister X task a good candidate for further development into a computerised adaptive task with automated item generation.

The present study

The objective of this study is to further adapt the Mister X task to improve the practicality, quality, and efficiency of VSWM measurement. We emphasise that the Mister X task and its predecessors have already been validated and were considered VSWM tasks in the literature. Consequently, our objective is to further validate the task in different settings. For instance, high WM abilities are often associated with obtaining high levels of expertise in given domains, like music [38–40]. Yet, the Mister X task we reproduced was only validated in populations of young children from the general population. Hence, we wanted to extend the domain further to see whether JaJ could also be validated within the context of high performing/specialist populations (Studies 3 and 4).

Here, we document the development and validation of a new adaptive, computerised and open-source WM task–Jack and Jill—based on modern psychometric techniques.

We expanded on the Mister X task in a number of ways. Firstly, in an attempt to engage younger participants, the style of the graphics was chosen to be more cartoon-like. Further, instead of one single character in the task, we created two characters of different genders (i.e., Jack and Jill). This may make the task more widely relatable, especially to younger participants who may find a Mister X character more abstract than the nursery rhyme characters Jack and Jill.

The second change concerns the design and calibration of the test based on item response theory (IRT, see e.g., [41] for an introduction). IRT provides a principled and flexible measurement framework for the Jack and Jill task. The IRT framework allows the computation of estimates of item difficulties and person abilities, as well as of measurement error, on the basis of a single underlying probabilistic model. Furthermore, the use of explanatory item response models [42, 43], which use theoretically motivated item features to predict the difficulty of individual items, can contribute to the construct validity of a test. Moreover, the problem of task impurity, found in previous approaches, can be mitigated by developing an empirically validated cognitive model of task performance via the IRT approach. Despite the theoretical and practical advantages of IRT-based testing, so far only very few established WM tasks make use of an IRT framework. Moreover, the available tests (e.g., [44]) are not available in an open computerised adaptive testing framework at present.

Using IRT as the basis of the Jack and Jill task also enables the automated generation of test items based on their item difficulties; and its implementation using Computerised Adaptive Testing (CAT, [45]). In CAT, the difficulty of items presented is matched dynamically to the current estimate of a participant’s ability. After each trial, the participant’s ability is re-estimated using the response data from the previous trial as well as the prior ability estimate. Because CAT aims to present only items that are as close as possible to the participant’s ability, it is maximally informative for estimating a participant’s ability with increasing precision. Thus, CAT helps to produce an efficient version of the test, able to capture the wide variability of WM found in heterogeneous samples, as well as mitigate fatigue effects. In particular, it offers the possibility to shorten or lengthen a task, guided by the knowledge of how changing the task’s length affects measurement error and reliability.

Our research was taken forward in multiple different settings for a few reasons. Firstly, it allowed us to recruit from a wider pool of participants, yielding a larger overall sample. Secondly, it allowed us to ensure our sample was more heterogenous and to validate our task in different cultural settings, making it more internationally robust. Thirdly, this required translating the test into several languages which are available for use in our release, and hence, makes the test more accessible. Finally, it allowed research teams to address more specific questions in separate work from that presented here.

The following sections report four studies designed to construct, calibrate, and validate a new IRT-based VSWM task that can be used in a computerised adaptive operation mode. Study 1 describes the construction and empirical calibration of the Jack and Jill task, using explanatory item response modelling of data from a large sample of the general adult population in the UK and the US. Study 2 describes the validation of the adaptive version of the Jack and Jill task in a sample of young adults from the UK and Australia. Study 3 reports the validation of the static version of the Jack and Jill task in a sample of adolescent participants from a talent development centre in Russia. Study 4 provides validation against a battery of established spatial tasks, using data from another adolescent sample from the same centre. Table 1 shows the overall design of our study.

Download:

Table 1. Design of the study.

https://doi.org/10.1371/journal.pone.0262200.t001

Study 1: Construction and calibration of the Jack and Jill task

The aim of Study 1 was to calibrate the Jack and Jill (JaJ) task. Specifically, we tested whether it is possible to predict participants’ performance from item length.

Method

Participants.

The sample consisted of 244 participants (41% females; age range = 18–68, mean age = 31.3; SD = 10.3). The current study utilises a simple IRT model with a constant discrimination parameter, with difficulty (item length) being the only parameter that varies across items. Thus, the current sample size is appropriately justified with respect to the guidelines for the IRT approach [41].

Participants were recruited from the UK (32%) and the US (68%). The socio-demographic background was similar across both countries: overall, 48% of participants were in part-time or full-time work, 29% were students, and 23% were unemployed and/or had small earnings from a low-income job.

Procedure.

Participants were recruited through the Slicethepie (run by market intelligence company SoundOut, Reading, UK) online panel and forwarded to an online test battery, which comprised of study information, consent page, and the JaJ task. Participants were not instructed on what kind of device (e.g., laptop or smartphone) to use during task performance. The entire procedure took around 10 minutes to complete. The study received ethical approval from the ethics committee at Goldsmiths, University of London.

Materials.

The Jack and Jill visuospatial working memory task. The JaJ task presents participants with pictures of a young female (“Jill”) and a young male (“Jack”) on a white background (see Fig 1). The task is divided into trials of different lengths (starting with a sequence of 1 to a maximum of 7 stimuli). Each sequence length is used twice with sequence lengths increasing over time, totalling 14 trials in one testing session. For a detailed task description see Fig 1. Following the scoring procedures of the Letter Rotation task and the Mister X task, participant responses were scored as correct (1) if the whole sequence of a trial was successfully repeated or as incorrect (0) otherwise. ‘Same/different’ judgements of hand positions were not used as the main unit of scoring as this task was used only to additionally load WM processes as in a dual-task paradigm. A participant’s VSWM span is indexed by the length of the sequence of ball positions that a participant is able to recall. The JaJ task can be considered a complex span task because it requires participants to switch between memory encoding (i.e., memorising the ball’s position) and mental rotation with a subsequent decision on visual input, which creates an additional cognitive load.

Download:

Fig 1. Schematic of the four consecutive screens of a JaJ trial of length two.

Note: Jill always stays in the same position holding a blue ball in her right (from the participant’s perspective) hand (Panels 1–4), while Jack rotates around his axis on each stimulus presentation and can hold a ball in either his right or left hand (Panels 1–2). Jack’s ball also moves, randomly taking one of the 6 marked possible positions on the screen (orange dots). On each stimulus presentation, participants are required to perform two tasks: a) indicate whether Jack holds the ball in the same hand as Jill; and b) memorise the current ball position. At the end of each trial, participants are asked to recall the sequence of the ball positions by clicking on the marked positions in the correct order (Panels 3–4). Cursor locations represent correct answers.

https://doi.org/10.1371/journal.pone.0262200.g001

A trial is considered an ‘item’ in the language of item response theory. The main parameter hypothesised to affect item difficulty is item length (i.e., the length of the sequence of ball positions).

All items were prepared in advance using independent random sequences of hand and ball positions. Ball position sequences can contain the same position more than once. No items are presented twice within one test session. In the beginning, participants are presented with step-by-step instructions and subsequently two training items of length 1 and 2 with feedback on the correctness of the responses regarding the hand judgements and ball positions. Participants do not receive feedback regarding their performance during the testing phase or after the task.

The computerised version of the JaJ VSWM task was programmed using the psychTestR [51] and Shiny frameworks [52] for the R programming language [53]. It took participants 5–15 minutes to complete.

Results

Whilst we do not use it for formal scoring or modelling, the percentage of correct responses in the secondary “same/different” judgement task was 93.1%, suggesting the task was taken seriously and would have contributed to loading VSWM.

On the main ball position task, averaged across all 14 trials, participants demonstrated an accuracy of 35.5% correct responses (SD = 28.9). Male participants had an average of 33.0% (SD = 28.5) correct responses while females had 39.2% (SD = 29) responses correct. This difference did not reach the common significance level according to a Welch’s t-test (t(195.74) = 1.59, p = .11).

After excluding items of length 1 the relationship between performance accuracy becomes approximately linear as shown in Fig 2.

Download:

Fig 2. Regression lines representing the decrease in response accuracy with increasing sequence length.

Note: The black line represents proportion of correct responses by length of sequences to be remembered; Error bars represent 95% CI of the proportion based on the standard error. The blue regression line was fit to sequences from length 1 to 7. The red line was only fit to sequences of length 2 to 7. The red regression line fits the empirical average accuracies connected by the dashed line fairly closely, suggesting an approximate linear trend.

https://doi.org/10.1371/journal.pone.0262200.g002

In order to account for the discontinuity in performance accuracy between items of length 1 and the linear trend for items of length 2 to 7, we created a binary dummy variable that codes items of length 1 as ‘1’ and ‘0’ for all other item lengths.

To formally model participant accuracy, we constructed an explanatory item response model in the form of a binomial mixed-effects logistic regression model [42] using item length and the dummy variable for item length = 1 as fixed effects and participant ID as random intercept effect which traditionally corresponds to the participant ability parameter in IRT models. Participant responses at the item level (correct/incorrect) served as a dependent variable. The function glmer from the R package lme4 [54] was employed for modelling.

According to the model, both the dummy variable and item length contributed significantly to the prediction of the correctness of the responses as shown in Table 2. As expected, longer items decreased the probability of responding correctly and an item of length 1 increased this probability. The model has a predictive accuracy of 86.7% when random effects were included. Without the random effect, prediction accuracy dropped to 71.3%.

Download:

Table 2. Logistic regression of item accuracy.

https://doi.org/10.1371/journal.pone.0262200.t002

In sum, the model confirmed our expectations for item length to affect task performance. In addition, the model has good predictive validity and the inclusion of individual differences information (i.e., random participant effect) explains a further considerable amount of variance in the data. Hence, the model was accepted as an explanatory model for the performance of the JaJ task. In a subsequent step, the item difficulty parameters for all item lengths as well as the discrimination parameter were extracted from this model for use with the 1-PL IRT (for technical details of parameter extraction and transformation see [43]).

Simulations using the randomCAT function from the catR package (version 3.16) [55], our derived IRT parameters and true abilities scores in range [-4, 4.15] and test lengths [5. 20] showed that the JaJ task produces scores in the range of [-1.89, 2.86]. The lowest ability value (-1.89) corresponds to the ability of remembering not more than 1 position consistently. This is a meaningful lower bound for a WM test but the assignment of the numerical value -1.89 is rather arbitrary. In the long run, it would be helpful to collect norms from a larger sample drawn from the general population (factoring in age as well), such that values of the IRT scale can be mapped to achievement percentiles in this population sample.

Discussion

Aggregated accuracy scores show that performance on the JaJ task strongly depends on item length with the exception of items of length 1 (see Fig 2). This result may mean that items of length 1 are processed by a different mechanism, possibly visual echoic or short-term memory that does not require the engagement of the central executive and the visuospatial sketchpad, at least not to the same degree as is the case of longer items. In addition, the chance of guessing the correct response for items of length 1 is considerably higher (.167) compared to items with length > 1 (guessing level < .03). This could also have made a (small) contribution to the better performance of length 1 items. In any case, to account for the different behaviour with length 1 items, and a potential change in memory strategy, the explanatory model includes a dummy variable for items of length 1 in addition to the variable that codes item length numerically.

The item difficulty parameters and discrimination parameters extracted from this model form the basis of scoring participant ability in Studies 2 and 3 as well as the computerised adaptive version employed in Study 4. Note that by default participant ability estimates are scaled to have a mean of 0 and a standard deviation of 1 which enables the easy comparison with sample averages from subsequent studies. Item parameters of the 1-PL IRT model are given in Table A1 in the S1 Appendix.

Study 2. Validation of the static version of the Jack and Jill task with an adult sample

Study 2 aims to establish the reliability and validity of the static 14-item version of the Jack and Jill task described in Study 1. We expect the average IRT standard error of measurement for a person’s ability to be acceptably low, indicating the task’s reliability. With regards to the validity, we expect to find significant moderate correlations between performance on the JaJ task and two established measures of WM. Moreover, we seek to conclude Study 2 with a latent variable approach. Since the construct of WM is defined as the ability to simultaneously store and actively transform information across short time spans [9, 56], VSWM can be operationalised as the ability to transiently remember and manipulate visuospatial information across time. The VSWM tasks in Study 2, Jack and Jill (JaJ), Memory Updating Figural (MUF), and Backwards Digit Span (BDS) all share the common uniting element of transiently remembering and manipulating something presented in the visual domain. Hence, we hypothesize that the common variance shared by this task set is predominantly VSWM and operationalise them as loading onto the latent variable VSWM. A good factor analysis solution should indicate the validity of this hypothesis.