Abstract
Statistical comparison of single cases to small samples is a methodology that has been used extensively in, for example, cognitive and clinical neuropsychology. This is most often done to determine changes in cognitive processing after an individual has incurred some type of brain damage. In a clinical setting one often wishes to infer whether a patient exhibits abnormally low performance on some cognitive ability. In cognitive neuropsychology, on the other hand, one often wishes to infer whether a patient exhibits an abnormally large discrepancy in performance between two cognitive abilities. Because cognitive abilities are seldom well represented by a single performance on a single task, one might additionally be interested in the abnormality of a case on several measurements converging on a cognitive ability of interest, or the abnormality of a case in a multivariate space. Several methods to estimate case abnormality have been developed that keep the Type I error rate at its nominal level. However, they have not been available in any standard statistical software environment and their documentation is spread thin across multiple articles and compiled computer programs. This vignette aims to gather and review the most popular methods while presenting them and their usage in the R
package singcar
. Of particular note are the more flexible and useful methods that have not received as much attention as the simpler ones. These include techniques using Bayesian regression to allow for the inclusion of covariates and linear mixed models to handle repeated measures data. Additionally, statistical comparison of single cases to a control population is inherently low powered. To facilitate experimental planning and design, power calculators have been implemented in singcar
and the concept of power for this type of statistical analysis is reviewed.
There are many situations in which one might need to determine, probabilistically, whether a single observation belongs to some population distribution. In some areas, such as neuropsychology, this methodology has been essential for the advancement of the field. This is because brain-damaged patients, with their particular patterns of associated cognitive impairments, are often unique cases. A single patient might then be the only available source of data for a phenomenon of interest. In addition, clinical diagnoses and descriptions of patient impairments are always performed for single cases.
Patients with brain damage are hence often compared to a distribution of cognitive functioning in the healthy population, both in research and clinical contexts. However, we do not always know the population parameters of such a distribution, and when we do not, they must be estimated from a sample – that is, we compare the single case to a control sample. There are other potential application areas for this method, such as studies of uncommon expertise, targeted quality checks in small production industries or some animal studies where rearing large experimental groups might be too costly or unethical.
However, since these methods are most commonly used within the field of neuropsychology, the related nomenclature will be adopted here. An abnormally low score on some variate compared to the control sample will be referred to as a deficit, a concept important both in the clinic and in research. For the latter, another concept is also considered to be of great interest: namely an abnormally large discrepancy between two variates. This is known as a dissociation and provides evidence for independence between cognitive abilities. By mapping such discrepancies it is possible to build theories of the architecture of the human mind (Shallice 1988).
During the last 20 years, methods allowing researchers to estimate abnormality and test for deficits and dissociations with retained control of Type I errors have been developed. This has mainly been done by John Crawford and Paul Garthwaite, e.g. Crawford and Howell (1998), Crawford, Garthwaite, and Ryan (2011), Crawford and Garthwaite (2007), Crawford and Garthwaite (2002) and Crawford and Garthwaite (2005). These methods have been implemented in freely available software, but only as standalone programs that are not open source and only run on Windows. Most of the programs require manual input of summary statistics and cannot be used on raw data. Full documentation of the implemented methods is only available in the original publications.
This was the motivation behind the development of the R
package singcar
(Rittmo and McIntosh 2021). We wanted to encourage and simplify the usage of these
methods by bringing them together in a fully documented package with open source
code that works across platforms. In Section 2 all implemented methods
will be described in detail. This contribution will provide a comprehensive
overview of the methods available together with their advantages and disadvantages.
The development of Crawford and Garthwaite’s methods has been focused on limiting Type I errors. But in recent work we showed the inherent inability to control Type II errors with this experimental design and argued that, for some applications of the methods, this limitation might even be more detrimental than an increased Type I error rate (McIntosh and Rittmo 2020). Therefore we also provide power calculators for each test function in the package.
When we conduct single case comparisons we want to estimate the probability that a random draw from the distribution estimated by the sample is more extreme than the single observation we are interested in. This means that the power of a case comparison hypothesis test is contingent on the standard deviation of the distribution, not on the standard error of the test statistic. Increasing the sample size (which is usually done to increase power) then only leads to a better estimation of the control distribution, not to a reduction of the standard error. Hence, the limit of achievable power is mostly determined by the size of the effect. When investigating unique cases it is not always possible to target larger effects to get around this problem.
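To make the power ceiling concrete, the power of a one-tailed test of a deficit can be approximated directly from the non-central \(t\) distribution. The short sketch below assumes that the case is modelled as a random draw from a population shifted a fixed number of control standard deviations below the control mean (an assumption made here purely for illustration); in practice the power calculators in singcar should be used.

power_td <- function(n, zcc = -2, alpha = 0.05) {
  ncp <- zcc * sqrt(n / (n + 1))      # non-centrality parameter of the test statistic
  pt(qt(alpha, df = n - 1), df = n - 1, ncp = ncp)
}
power_td(n = 10)   # power with 10 controls
power_td(n = 100)  # ten times as many controls adds only a little power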
One approach that often is highlighted for its potential to increase power is multivariate hypothesis testing. This is however only true to a certain extent (Frane 2015) and one should not force a multivariate analysis on a univariate research question. Focus hitherto has been univariate in nature, but for a neuropsychological dissociation to be convincing the optimal experimental design would include multiple tasks converging on the cognitive functions of interest, performed on more than one occasion (Shallice 1988; McIntosh and Rittmo 2020). Integrating such information both across tasks as well as over several performances could be aided by multivariate analyses.
This calls for methods that estimate a case’s abnormality in a
multidimensional space estimated by a sample. The most obvious way to do this
would of course entail the \(p\) value of a Hotelling’s \(T^2\)-score. However,
Elfadaly, Garthwaite, and Crawford (2016) show that this statistic is
substantially biased as a measure of abnormality. They investigate nine other candidates including two point
estimates based on polynomial expansion, which gave rise to the lowest bias.
These point estimates would, however, sometimes yield estimates out of the [0,
1] range and their interpretation is not very intuitive. Because ease of
interpretation is paramount in scientific reporting
Elfadaly, Garthwaite, and Crawford (2016) suggested using an adapted median
estimator, unless bias is of “immediate importance”, in which case the
polynomial estimators should be used. A compiled computer program
for this adapted median estimator is available (Garthwaite, Elfadaly, and Crawford 2017),
but again, no open source software working across platforms.
This relatively novel estimation method has also been implemented in
singcar
.
The following section, Section 2, will review available
methods for single case comparisons, their theory and how they can be
used in R
. Section 2.5.3 gives a brief review of statistical
power in this experimental design and the implementation of power
calculators in singcar
, but for a more thorough and perhaps
more accessible review see McIntosh and Rittmo (2020).
Variates of interest will in a neuropsychological context often be scores on some task assessing a cognitive function. Therefore they will often be referred to as “task scores”. It has not been uncommon in this field to compare patients to a control sample, treating the sample statistics as population parameters and estimating deficits by evaluating the \(p\) value associated with the patient’s \(Z\) score from the estimated distribution. Estimating dissociations was done in a similar way using a method developed by Payne and Jones (1957), where a \(Z\) score is calculated from the distribution of difference scores in the control sample.
This is of course problematic if small samples are used, which is often the case in neuropsychology, since the sampling distribution of the sample variance is right skewed for small sample sizes. If a test statistic is divided by a quantity drawn from such a skewed distribution, the obtained values, and thus the Type I error rate, will tend to be inflated (Crawford and Howell 1998).
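A small simulation illustrates the problem. Here 10,000 cases are drawn from the same population as their \(n = 5\) controls and tested with the naive \(Z\) score method at a nominal one-tailed \(\alpha\) of 0.05 (the setup is purely illustrative):

set.seed(1)
false_positives <- replicate(10000, {
  controls <- rnorm(5)                          # control sample from N(0, 1)
  case <- rnorm(1)                              # case drawn from the same population
  z <- (case - mean(controls)) / sd(controls)   # naive Z score
  pnorm(z) < 0.05                               # "significant" deficit?
})
mean(false_positives)  # empirical Type I error rate, well above the nominal 0.05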
In the following sections I will describe methods that have been devised as replacements for the \(Z\) score methods. Both frequentist and Bayesian significance tests have been developed and are covered in Section 2.2 and Section 2.3 respectively. More complex model designs can be achieved using Bayesian regression techniques, described in Section 2.3.4, or linear mixed models, described in Section 2.4. For estimating multivariate abnormality, see Section 2.5.
In the following sections the theory behind the methods in singcar
is
reviewed. But at the end of each subsection the usage of the implementations
will be exemplified. To aid this the package comes with the dataset
size_weight_illusion
, which is a dataset from a neuropsychological study
investigating the size-weight illusion in the well-known patient “DF”
(Hassan et al. 2020). DF incurred damage to the lateral
occipital complex which gave rise to visual form agnosia, potentially making her less
susceptible to the size-weight illusion (Dijkerman et al. 2004).
This illusion is a perceptual phenomenon whereby smaller objects are
perceived as heavier during lifting than larger objects of equal
weight (Buckingham 2014).
However, due to the location of DF’s brain damage one would only expect reduced susceptibility to the illusion for visual cues, since her other sensory processing should be unaffected by the damage. In the study by Hassan et al. (2020) the illusion is tested given visual cues in one condition and kinaesthetic cues in another. It was predicted that DF would be abnormally less susceptible to the visual size-weight illusion and that there would be an abnormally large discrepancy between her visual and kinaesthetic size-weight illusion. The control sample consisted of 28 participants and their age and sex were collected along with their performance in the two conditions. The size-weight illusion is given in the dataset as a scaled measure of the number of grams of weight difference perceived per cubic cm of volume change. To use this data, start by installing and loading the package:
install.packages("singcar")
library("singcar")
head(size_weight_illusion)
## GROUP PPT SEX YRS V_SWI K_SWI
## 1 SC DF Female 65 0.02814921 0.10012712
## 2 HC E01 Female 70 0.21692634 0.21792930
## 3 HC E02 Male 64 0.17223226 0.26338899
## 4 HC E03 Female 63 0.07138049 0.09331700
## 5 HC E04 Female 65 0.10186453 0.25938045
## 6 HC E05 Female 66 0.15911439 0.08922615
V_SWI
and K_SWI
are the measurements for visual and kinaesthetic
size-weight illusion respectively. Most functions in singcar
can take summary data as input, but for demonstrative purposes the raw
data of this dataset will be used. To illustrate usage
of the methods more generally, the scores of the variates of interest
as well as of the covariate YRS
are extracted from the data and renamed:
caseA <- size_weight_illusion[1, "V_SWI"]
contA <- size_weight_illusion[-1, "V_SWI"]
caseB <- size_weight_illusion[1, "K_SWI"]
contB <- size_weight_illusion[-1, "K_SWI"]
caseAGE <- size_weight_illusion[1, "YRS"]
contAGE <- size_weight_illusion[-1, "YRS"]
Here, caseA
and caseB
refer to the case scores on tasks A
and B respectively. Similarly, contA
and contB
refer to the scores
of the control sample on the same tasks.
When we are comparing a patient to a healthy population on some normally distributed variate that we are estimating from a small sample, we are in fact not estimating abnormality from a normal distribution but from a central \(t\) distribution.
Hence, we must use a \(t\) score rather than a \(Z\) score if we want to estimate deficits without inflating effect sizes. This was noted by Sokal and Rohlf (1981) (p. 227), who devised a test statistic for single cases in the biological sciences, but it was first popularised within neuropsychology by Crawford and Howell (1998). Its common application is as a one-tailed “test of deficit” (TD) that estimates the proportion of the healthy population that would exhibit a lower score than the case on some cognitive task. If this proportion falls below some chosen threshold the patient can be diagnosed with a deficit.
This basic approach is simply a modified two-sample \(t\) test,
\[\begin{equation} t_{n-1} = \frac{y^* - \overline{y}}{s \sqrt{\frac{n + 1}{n}}} \tag{2.1} \end{equation}\]
where one of the samples is treated as having size \(n = 1\). The degrees of freedom then quite naturally become \(n+1-2 = n -1\). \(\overline{y}, \ s\) and \(n\) are the control sample mean, standard deviation and size, and \(y^*\) is the case score. This method allows transparent control over the Type I error rate (Crawford, Garthwaite, and Howell 2009; Crawford et al. 2004; Crawford and Garthwaite 2012) when used for significance testing. In addition, the \(p\) value associated with the obtained \(t\) statistic gives an unbiased point estimate of the abnormality of the case. That is, it provides us with the expected percentage of the normal population that would exhibit a more extreme score than the case. This estimate is often of main interest in neuropsychology. The simple proof of this is demonstrated by Crawford and Garthwaite (2006), as outlined below.
The percentage of a population expected to score lower on some variate \(Y\) than the case \(y^*\) is \[ \mathbb{P}[Y< y^*]. \] Subtracting \(\overline{y}\) from both sides of the inequality and dividing by \(s \sqrt{\frac{n + 1}{n}}\) gives \[ \mathbb{P}[Y< y^*] =\mathbb{P}\left[\frac{Y-\overline{y}}{s \sqrt{\frac{n + 1}{n}}} < \frac{y^* - \overline{y}}{s \sqrt{\frac{n + 1}{n}}}\right]. \] The quantity to the left of the inequality, i.e., \(\frac{Y-\overline{y}}{s \sqrt{\frac{n + 1}{n}}}\), is \(t\) distributed with \(n-1\) degrees of freedom. Hence, \[ \mathbb{P}[Y< y^*] =\mathbb{P}\left[t_{n-1} < \frac{y^* - \overline{y}}{s \sqrt{\frac{n + 1}{n}}}\right]. \] To the right of the inequality we have the test statistic from ((2.1)). Therefore, \(\mathbb{P}[Y< y^*]\) is the \(p\) value from the test of deficit. This simple fact also makes the construction of confidence intervals for the abnormality (i.e. \(p\)) possible.
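As a concrete illustration, the test statistic in ((2.1)) and its associated \(p\) value can be computed by hand from the example data introduced in Section 2.1.1; this should reproduce the output of TD() shown further below.

n <- length(contA)
t_stat <- (caseA - mean(contA)) / (sd(contA) * sqrt((n + 1) / n))  # eq. (2.1)
p_val <- pt(t_stat, df = n - 1)  # proportion of controls expected to score below the case
c(t = t_stat, p = p_val)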
Although the \(Z\) score is not an appropriate statistic to use for significance testing when the control sample is small, it does provide a standardised effect measure of the deficit of interest similar to Cohen’s \(d\), because insensitivity to sample size is a requirement for effect size measures (Cohen 1988; Crawford, Garthwaite, and Porter 2010). So, Crawford, Garthwaite, and Porter (2010) proposed the use of \(Z\) scores as effect size measures within single-case neuropsychology and dubbed this measure \(Z_{CC}\) where the subscript indicates “case-controls” comparison: \[\begin{equation} Z_{CC} = \frac{y^* - \overline{y}}{s}. \tag{2.2} \end{equation}\] If \(p\) is the percentage of the population that would fall below or above a case score, \(Z_{CC}\) can be used to construct \(100(1-\alpha)\)% confidence intervals around \(p\). The method to do so was described by Crawford and Garthwaite (2002).
Given \(y^* \neq \overline{y}\), \(Z_{CC}\sqrt{n}\) follows a non-central \(t\) distribution on \(n-1\) degrees of freedom. We find the non-centrality parameter (NCP) \(\delta_U\) of a non-central \(t\) distribution with \(df=n-1\) such that its \(100\frac{\alpha}{2}\) percentile equals \(Z_{CC}\sqrt{n}\), and correspondingly we find the NCP \(\delta_L\) of a non-central \(t\) distribution such that its \(100(1-\frac{\alpha}{2})\) percentile equals \(Z_{CC}\sqrt{n}\). The easiest approach to finding these quantities is a search algorithm. We can then construct the upper and lower boundaries for \(Z_{CC}\) as: \[ Z_{CC_U} = \frac{\delta_U}{\sqrt{n}}, \ \ Z_{CC_L} = \frac{\delta_L}{\sqrt{n}}. \] When the case falls on the left side of the distribution the boundaries for \(p\) are given by: \[ p_U = \Phi\left(\frac{\delta_U}{\sqrt{n}}\right), \ p_L = \Phi\left(\frac{\delta_L}{\sqrt{n}}\right), \] where \(\Phi\) is the CDF of the standard normal distribution. And when the case falls on the right side of the distribution the boundaries are given by: \[ p_U = 1 - \Phi\left(\frac{\delta_L}{\sqrt{n}}\right), \ p_L = 1 - \Phi\left(\frac{\delta_U}{\sqrt{n}}\right). \]
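The sketch below shows the logic of this search using base R's qt() and uniroot(), with purely illustrative values of \(Z_{CC}\), \(n\) and \(\alpha\) (TD(), presented below, reports these intervals directly).

zcc <- -2; n <- 16; alpha <- 0.05   # illustrative values
target <- zcc * sqrt(n)
delta_U <- uniroot(function(d) qt(alpha / 2, df = n - 1, ncp = d) - target,
                   interval = c(-20, 0))$root   # interval chosen to bracket the root
delta_L <- uniroot(function(d) qt(1 - alpha / 2, df = n - 1, ncp = d) - target,
                   interval = c(-20, 0))$root
c(Z_CC_L = delta_L / sqrt(n), Z_CC_U = delta_U / sqrt(n))          # interval for Z_CC
c(p_L = pnorm(delta_L / sqrt(n)), p_U = pnorm(delta_U / sqrt(n)))  # case on the left side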
Evidently, estimating abnormality on a single variate is trivial. It is equally simple to estimate abnormality on the difference between two variates \(Y_1 - Y_2\) if the normally distributed variates \(Y_1\) and \(Y_2\) do not need standardisation to be comparable. If this is the case then the test function is similar to ((2.1)) but one divides the differences by the standard deviation of the difference scores: \[\begin{equation} t_{n-1} = \frac{(y^*_1 - \overline{y}_1) - (y^*_2 - \overline{y}_2) }{ \sqrt{(s^2_1 +s^2_2 -2s_1 s_2 \rho_{12})(\frac{n+1}{n})}} \tag{2.3} \end{equation}\] where \(\rho_{12}\) is the correlation between the variates. Confidence intervals for this “unstandardised difference test” (UDT) are constructed as for the TD. Applications testing unstandardised differences are however scarce, at least within neuropsychology. Much more common is the need to assess abnormality of differences between variates that require standardisation to be comparable.
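Before turning to the standardised case, note that ((2.3)) is straightforward to compute by hand; the summary statistics below are purely illustrative.

case_a <- 70; case_b <- 80   # case scores on the two tasks
m_a <- 100; m_b <- 90        # control means
s_a <- 15; s_b <- 10         # control standard deviations
r_ab <- 0.6; n <- 20         # correlation between tasks and control sample size
t_stat <- ((case_a - m_a) - (case_b - m_b)) /
  sqrt((s_a^2 + s_b^2 - 2 * s_a * s_b * r_ab) * (n + 1) / n)  # eq. (2.3)
2 * pt(-abs(t_stat), df = n - 1)  # two-sided p value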
If we just standardise the variates in ((2.3)) and do not take sample size into account, we get an effect size measure of the difference (dissociation) similar to \(Z_{CC}\), ((2.2)): \[\begin{equation} Z_{DCC} = \frac{z^*_1 - z^*_2}{\sqrt{2-2\rho_{12}}}. \tag{2.4} \end{equation}\] The two quantities in the numerator are the standardised case scores on variate \(Y_1\) and \(Y_2\) and the subscript indicates “discrepancy-case-controls”. This measure was proposed as a reporting standard in the single-case literature by Crawford, Garthwaite, and Porter (2010), but was first suggested as a statistic for significance testing by Payne and Jones (1957). It is of course not appropriate for such use since we would get an inflated resultant statistic (\(Z_{DCC}\)) because the quantities in the numerator (\(z^*_1\) and \(z^*_2\)) are inflated.
When we only can estimate standardised case scores from small samples and need to test if the difference between them is abnormally large we are in fact estimating the difference between two \(t\) distributed variates. Since linear combinations of correlated \(t\) distributed variates are not \(t\) distributed themselves this is not a trivial problem.
Garthwaite and Crawford (2004) examined the distribution of such difference scores using asymptotic expansion. They searched for quantities that, divided by a function of the sample correlation, would be asymptotically \(t\) distributed. They found:
\[\begin{equation} \psi=\frac{\frac{(y^*_1-\overline{y}_1)}{s_{1}}-\frac{(y^*_2-\overline{y}_2)}{s_{2}}}{ \sqrt{ (\frac{n+1}{n}) \left( (2-2 \rho)+ \frac{2(1-\rho^{2})}{n-1}+ \frac{(5+c^{2})(1-\rho^{2})}{2(n-1)^{2}}+ \frac{\rho(1+c^{2})(1-\rho^{2})}{2(n-1)^{2}}\right) }}, \end{equation}\]
\(\rho\) being the sample correlation and \(c\) a pre-specified critical two-tailed \(t\) value on \(df= n-1\). They showed that \(\mathbb{P}[ \psi > c] \approx\mathbb{P}[t >c]\) and that solving \(\psi = c\) for \(c\) gives a precise probability for \(\psi\) that does not depend on a pre-specified critical value. This is a quadratic equation in \(c^2\); choosing the positive root yields:
\[\begin{align} c & = \sqrt{\frac{ -b + \sqrt{b^2 - 4ad}}{2a}}, \ \text{where} \\ a & = (1+r)(1-r^2), \\ b & = (1-r)[4(n-1)^2+4(1+r)(n-1)+(1+r)(5+r)], \\ d & = - 2\left[\frac{y^*_{1} - \overline{y}_1}{s_1}-\frac{y^*_2 -\overline{y}_2}{s_2}\right]^2\left(\frac{n(n-1)^2}{n+1}\right) \tag{2.5} \end{align}\]
where \(p = \mathbb{P}[t_{n-1}>c]\) is used for significance testing and is the point estimate of the percentage of the control population that is expected to show a more extreme difference score than the case. Note that the test statistic \(c\) will always be positive, no matter the direction of the effect.
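The computation in ((2.5)) is easy to carry out directly. The sketch below uses purely illustrative summary statistics and returns the one-tailed \(p\); the two-sided \(p\) value reported by RSDT() further below is twice this quantity.

rsdt_stat <- function(case_a, case_b, m_a, m_b, s_a, s_b, r, n) {
  a <- (1 + r) * (1 - r^2)
  b <- (1 - r) * (4 * (n - 1)^2 + 4 * (1 + r) * (n - 1) + (1 + r) * (5 + r))
  d <- -2 * ((case_a - m_a) / s_a - (case_b - m_b) / s_b)^2 *
    (n * (n - 1)^2 / (n + 1))
  c_stat <- sqrt((-b + sqrt(b^2 - 4 * a * d)) / (2 * a))  # positive root, eq. (2.5)
  c(c = c_stat, p = pt(c_stat, df = n - 1, lower.tail = FALSE))
}
rsdt_stat(case_a = 70, case_b = 80, m_a = 100, m_b = 90,
          s_a = 15, s_b = 10, r = 0.6, n = 20)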
Crawford and Garthwaite (2005) refer to this test statistic as the “revised standardised difference test” (RSDT) because it was an iteration on a test previously developed by the authors that did not keep control of the Type I error rate (Crawford, Howell, and Garthwaite 1998). Crawford and Garthwaite (2005) show that the RSDT succeeds fairly well in this regard.
Using simulations the RSDT was shown to barely exceed the specified error rate even for very small samples such as \(n = 5\). However, when a case lacks a dissociation but exhibits extreme scores on both variates (in the same direction of course) the error rate increases steeply. According to one simulation study the RSDT starts to lose control of the error rate for scores that are simultaneously more extreme than two standard deviations away from the mean; for task scores as extreme as 8 standard deviations away from the mean, error rates as high as 35% were observed (Crawford and Garthwaite 2007).
Another issue with the RSDT is that, because \(\psi\) is only approximately \(t\) distributed, constructing confidence intervals for estimated abnormality in the same manner as for the TD is not possible. So to remedy these drawbacks of this test statistic Crawford and colleagues looked into Bayesian methodology.
singcar
The test of deficit is called in the singcar
package with the function TD()
.
Testing for a deficit using the example data from Section 2.1.1 we have:
TD(case = caseA, controls = contA, alternative = "less", conf_level = 0.95)
##
## Test of Deficit
##
## data: Case = 0.03, Controls (m = 0.16, sd = 0.08, n = 28)
## t = -1.7243, df = 27, p-value = 0.04804
## alternative hypothesis: true diff. between case and controls is less than 0
## sample estimates:
## Std. case score (Z-CC), 95% CI [-2.34, -1.15]
## -1.754857
## Proportion below case (%), 95% CI [0.95, 12.47]
## 4.804003
The argument alternative specifies the alternative hypothesis, i.e.
whether we are performing a two-sided or a one-sided test and, if the latter, in which direction.
If one instead wants to test for a dissociation the most common alternative hypothesis
is two-sided. Using the RSDT for this test, call:
RSDT(case_a = caseA, case_b = caseB, controls_a = contA, controls_b = contB,
alternative = "two.sided")
##
## Revised Standardised Difference Test
##
## data: Case A: 0.03, B: 0.10, Ctrl. A (m, sd): (0.16, 0.08), B: (0.18, 0.10)
## approx. abs. t = 1.015, df = 27, p-value = 0.3191
## alternative hypothesis: true difference between tasks is not equal to 0
## sample estimates:
## Std. case score, task A (Z-CC) Std. case score, task B (Z-CC)
## -1.7548574 -0.7836956
## Std. task discrepancy (Z-DCC) Proportion below case (%)
## -1.0647889 15.9560625
Notably, this function does not produce confidence intervals. If the two tasks
are directly comparable without the need of standardisation (which in fact is the
case for this specific dataset), call instead UDT()
with the same syntax and
confidence intervals will be given.
The most prominent difference between the frequentist and the Bayesian framework is that in the former parameters are treated as fixed attributes of a population and in the latter they are themselves treated as random variables with associated distributions. To estimate these one can use prior knowledge about the parameter of interest which is specified as a prior distribution. There is often a desire that the data should be the only thing influencing estimations. If so, one can reduce the impact of the prior by, for example, assigning equal probabilities to all possible parameter values. This is an example of what is known as a non-informative prior distribution. This distribution is then updated when new information is obtained from data and as such forms the posterior parameter distribution. This is calculated using Bayes theorem and if we omit the normalising constant in the denominator of Bayes theorem it can be expressed: \[\begin{equation*} posterior \ \propto \ likelihood \times prior. \end{equation*}\]
If we have a hypothesis for a parameter value, what the above is saying is that the posterior density of this hypothesis is proportional (\(\propto\)) to the likelihood of the data under that hypothesis multiplied by the prior probability of the hypothesis.
One of the disadvantages of Bayesian methodology is that it might be impossible, or at least very difficult, to calculate the posterior analytically. Posteriors are, however, often feasible to construct numerically using Monte Carlo methods. One of the reasons Bayesian statistics has seen such an upsurge is the increase in computational power in recent years, which has made many otherwise infeasible numerical strategies possible. Monte Carlo methods are mathematical algorithms that solve numerical problems by repeated random sampling. The algorithms vary depending on the problem at hand, but for Bayesian methodology they all build on rules for drawing random values based on the likelihood and prior – as such “constructing” the posterior from the draws after a large number of iterations. The peak of the constructed distribution is often the parameter estimate of interest and the width is a measure of the certainty of the estimation. If non-informative priors are used, the peak often coincides with the maximum likelihood estimate of the parameter. This means that non-informative priors often yield estimators with frequentist properties, necessary for null hypothesis significance testing.
Because the testing property of the estimation methods presented here is of main interest,
frequentist properties were required when Crawford and Garthwaite (2007) and
Crawford, Garthwaite, and Ryan (2011) devised Bayesian analogues to the test functions
presented in Section 2.2. The procedural details of these methods
will be described in the following sections, both to document
how they were implemented in singcar
and to give an overview of how they operate.
The methods are all based on Monte Carlo simulation and some are computationally
very intensive. They are implemented in singcar
as straight R code but would probably
benefit from being implemented in compiled code. This is an aim
for future updates of the package. The algorithms presented below closely
follow the original articles, with a few slight changes of notation to make
the presentation more coherent.
The Bayesian analogue of the test of deficit (BTD) produces output that is asymptotically identical to that of the frequentist test of deficit. It produces an estimate of \(p\) with accompanying credible intervals, i.e. the percentage of controls that would exhibit a more extreme score than the case. The method is included here for completeness, but for actual analysis there is no advantage to using it over the TD, ((2.1)).
From a sample of size \(n\) we obtain measurements of some normally distributed variate \(Y\) with unknown mean \(\mu\) and unknown variance \(\sigma^2\). Let \(\overline{y}\) and \(s^2\) denote the sample mean and variance and \(y^*\) the case score. The prior distribution of \(\mu\) is conditioned upon the prior distribution of \(\sigma^2\) since both parameters are unknown. The prior considered here is \(f(\mu, \sigma^2) \propto (\sigma^2)^{-1}\), which is the standard non-informative prior for normal data. The marginal posterior distribution of \(\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}\) and the posterior distribution \(\mu|\sigma^2 \sim \mathcal{N}(\overline{y}, \sigma^2/n)\), see e.g., Gelman et al. (2013) (p. 45 and 65).
When we have a posterior distribution for the control population parameters, \(p\) can be estimated by iterating the following steps.
Draw a random observation \(\psi\) from a \(\chi^2_{n-1}\) distribution and let \(\hat{\sigma}^2_{(i)} = \frac{(n-1)s^2}{\psi}\) be the estimate of \(\sigma^2\) for the \(i\)th iteration.
Draw a random observation \(z\) from a standard normal distribution and let \(\hat{\mu}_{(i)} = \overline{y} + z\hat{\sigma}_{(i)}/\sqrt{n}\) be the estimate of \(\mu\).
Compute \(\hat{z}^*_{(i)} = (y^* - \hat{\mu}_{(i)})/\hat{\sigma}_{(i)}\); the estimate \(\hat{p}_{(i)}\) is then the tail area of the standard normal distribution more extreme than \(\hat{z}^*_{(i)}\).
Iterating these steps a large number of times yields a distribution of \(\hat{p}\), the mean of which is the point estimate of \(p\), which can be used for significance testing. The 2.5th and 97.5th percentile of the distribution of \(\hat{p}\) as well as of \(z^*\) gives the boundaries for the 95% credible interval of \(\hat{p}\) and \(\hat{Z}_{CC}\) respectively. The point estimate of the latter is that of ((2.2)).
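For illustration, the whole procedure can be expressed in a few lines of base R using the example data; up to Monte Carlo error this sketch should closely reproduce the output of BTD() shown further below.

set.seed(123)
n <- length(contA)
psi <- rchisq(10000, df = n - 1)                            # chi-square draws
sigma_hat <- sqrt((n - 1) * var(contA) / psi)               # posterior draws of sigma
mu_hat <- mean(contA) + rnorm(10000) * sigma_hat / sqrt(n)  # posterior draws of mu
p_hat <- pnorm((caseA - mu_hat) / sigma_hat)                # proportion below the case
mean(p_hat)                                                 # point estimate of p
quantile(p_hat, c(0.025, 0.975))                            # 95% credible interval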
In contrast to the BTD, the Bayesian analogue of the difference tests does not
produce asymptotically identical results to the RSDT.
For the Bayesian standardised difference test (BSDT) we now obtain
measurements on \(Y_1\) and \(Y_2\) following a bivariate normal distribution
representing two tasks of interest, from a control sample of size \(n\). The goal
is to estimate the percentage \(p\) of the control population that would be
expected to show a greater difference \(Y_1 - Y_2\) than the case.
Let \(\overline{y}_1\) and \(\overline{y}_2\) be the sample means and
\[\begin{equation*}
\pmb{A}=\begin{bmatrix}
s^2_{1} & s_{12} \\
s_{12} & s^2_{2} \end{bmatrix}
\end{equation*}\]
the sample variance-covariance matrix.
Then \(\pmb{S} =\pmb{A}(n-1)\) is the sums of squares and cross products (SSCP) matrix.
The case scores are denoted \(y_1^*\) and \(y_2^*\).
The population parameters \[\begin{equation*} \pmb{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \ \text{and} \ \Sigma=\begin{bmatrix} \sigma^2_{1} & \sigma_{12} \\ \sigma_{12} & \sigma^2_{2} \end{bmatrix} \end{equation*}\] are both unknown. To estimate \(p\) we must first obtain a posterior of these parameters. Since both are unknown the prior distribution of \(\pmb\mu\) is again conditioned upon the prior distribution of \(\Sigma\).
The prior considered for \(f(\pmb\mu, \Sigma^{-1})\) in Crawford and Garthwaite (2007) was \(f(\pmb\mu, \Sigma^{-1}) \propto |\Sigma|\). This is a common non-informative prior; however, it was chosen in favour of the perhaps even more commonly used \(f(\pmb\mu, \Sigma^{-1}) \propto |\Sigma|^{(k+1)/2}\) for multivariate normal data, where \(k\) is the number of variates. This was because the former prior was shown to give interval estimates asymptotically identical to those of the UDT for unstandardised case scores. Even though the latter is perhaps more commonly applied, Crawford and Garthwaite (2007) refer to the former as the “standard theory” prior.
For this prior we have that \(\Sigma^{-1} \sim \mathcal{W}_k(\pmb{S}^{-1}, n)\). That is, the posterior marginal distribution for \(\Sigma^{-1}\) is a Wishart distribution parametrised with scale matrix \(\pmb{S}^{-1}\) and \(n\) degrees of freedom. The Wishart distribution is a multivariate generalisation of the \(\chi^2\) distribution. It is possible to view \(S\sim\mathcal{W}_p(\Sigma, n)\) as a distribution of SSCP-matrices associated with \(\Sigma\). Inversely, one can view the inverse Wishart distribution parametrised with some SSCP matrix and \(n\) degrees of freedom as a distribution of variance-covariance matrices related to that SSCP matrix.
We then have that \(\pmb\mu|\Sigma \sim \mathcal{N}_k(\pmb{\overline{y}}, \Sigma/n)\) where \(\mathcal{N}_k(\pmb{\overline{y}}, \Sigma/n)\) denotes a multivariate normal distribution with mean \(\pmb{\overline{y}}\) and variance \(\Sigma/n\), see e.g., Gelman et al. (2013) (p. 72-73).
The posterior marginal distribution of \(\Sigma\) is then constructed by iterating draws of random observations from \(\mathcal{W}^{-1}(\pmb{S}, n)\), each observation, \(\hat{\Sigma}_{(i)}\), being the estimation of \(\Sigma\) for the \(i\)th iteration.
As previously mentioned, frequentist properties are desired if the estimations are to be used for significance testing. Berger and Sun (2008) investigated convergence to frequentist estimates for bivariate normal distributions using various priors. They showed that the “standard theory” prior produced converging estimates of \(\sigma_1\) and \(\sigma_2\) but that the convergence was not as good for \(\rho\) and differences in means. As a “general purpose” prior they instead recommend using \(f(\pmb\mu, \Sigma^{-1}) \propto \frac{1}{\sigma_1\sigma_2(1-\rho^2)}\). Posteriors with this prior are constructed using rejection sampling. Random draws from an inverse Wishart on \(n-1\) degrees of freedom are generated, but only a subset of the observations is retained.
Crawford, Garthwaite, and Ryan (2011) considered this prior but noticed that it gave rise to too narrow credible intervals, overestimating the certainty of the estimations. However, they showed that if the sample was treated as being of size \(n-1\) when estimating \(\Sigma\), i.e. so that the generated intervals were somewhat more conservative, frequentist properties were retained. This “calibrated” prior is recommended by Crawford, Garthwaite, and Ryan (2011), and they refer to it as such.
To construct the posterior using the calibrated prior the following steps are iterated:
Since we are reducing the sample size we must put \(\pmb{S}^*= (n-2)\pmb{S}/(n-1)\) to retain the unbiased property of \(\pmb{S}/(n-1)\) as an estimate of \(\Sigma\).
Draw a random observation from \(\mathcal{W}^{-1}(\pmb{S}^*, n-2)\). This draw is denoted: \[ \hat{\Sigma} = \begin{bmatrix} \hat{\sigma}^2_{1} & \hat{\sigma}_{12} \\ \hat{\sigma}_{12} & \hat{\sigma}^2_{2} \end{bmatrix} \] and \[ \hat{\rho}= \frac{\hat{\sigma}_{12}}{\sqrt{\hat{\sigma}^2_{1}\hat{\sigma}^2_{2}}}. \]
If \(u\) is a random draw from a uniform distribution \(u \sim U(0, 1)\), \(\hat\Sigma\) is only accepted as the estimation of \(\Sigma\) if \(u^2 \leq 1-\hat{\rho}^2\), if not we iterate the procedure until we have an accepted draw of \(\hat\Sigma\). Once accepted we have an estimation of \(\Sigma\) for this iteration and denote the draw: \[\begin{equation*} \hat{\Sigma}_{(i)}=\begin{bmatrix} \hat{\sigma}^2_{1(i)} & \hat{\sigma}_{12(i)} \\ \hat{\sigma}_{12(i)} & \hat{\sigma}^2_{2(i)} \end{bmatrix}. \end{equation*}\]
Even though this latter calibrated prior is recommended, some might argue that frequentist properties are of no concern and therefore choose the “standard theory” prior. But regardless of the prior chosen, when we have an estimate of \(\Sigma\), the following steps are iterated to obtain an estimate of \(p\), the percentage of controls expected to exhibit a more extreme difference between the variates than the case.
Generate two draws from a standard normal distribution and denote them \(z_{r1}\) and \(z_{r2}\). Find the lower triangular matrix \(\pmb{T}\) such that \(\pmb{T}\pmb{T}' = \hat\Sigma_{(i)}\) using Cholesky decomposition. The estimate of \(\pmb{\mu}\) for this iteration is then: \[\begin{equation*} \pmb{\hat{\mu}}_{(i)} = \begin{pmatrix} \hat{\mu}_{1(i)} \\ \hat{\mu}_{2(i)} \end{pmatrix} = \begin{pmatrix} \overline{y}_1 \\ \overline{y}_2 \end{pmatrix}+ \pmb{T} \begin{pmatrix} z_{r1} \\ z_{r2} \end{pmatrix} / \sqrt{n}. \end{equation*}\] See for example Gelman et al. (2013) (p. 580).
We now have estimates of both \(\pmb{\mu}\) and \(\Sigma\) and calculate
\(p\) as if these were the true population parameters. The method to do so
differs slightly depending on whether a standardised or unstandardised test
is required. For an unstandardised test, the size of the difference score is calculated
as:
\[\begin{equation*}
z_{(i)}^* = \frac{(y_1^* - \hat{\mu}_{1(i)}) - (y^*_2 - \hat{\mu}_{2(i)})}
{\sqrt{\hat{\sigma}^2_{1(i)}+\hat{\sigma}^2_{2(i)}-2\hat{\sigma}_{12(i)}}}.
\end{equation*}\]
More commonly we would want to calculate the difference score from standardised
variates. If this is the case, put:
\[\begin{equation*}
z_{1(i)} = \frac{y_1^* - \hat{\mu}_{1(i)}}{\sqrt{\hat{\sigma}^2_{1(i)}}}, \ z_{2(i)} = \frac{y_2^* -
\hat{\mu}_{2(i)}}{\sqrt{\hat{\sigma}^2_{2(i)}}}, \ \hat{\rho}_{(i)} = \frac{\hat{\sigma}_{12(i)}}{\sqrt{\hat{\sigma}^2_{1(i)}\hat{\sigma}^2_{2(i)}}}
\end{equation*}\]
and
\[\begin{equation*}
\hat{z}^*_{(i)} = \frac{z_{1(i)} - z_{2(i)}}{\sqrt{2-2\hat{\rho}_{(i)}}}.
\end{equation*}\]
The estimate of \(p\), \(\hat{p}_{(i)}\), for this iteration is then the tail area of a standard normal distribution more extreme than \(z^*_{(i)}\).
Iterate the procedure a large number of times and save the estimations for each iteration. This will yield a distribution of \(\hat{p}\) and \(\hat{z}^*\), the quantiles of which again are used to calculate the boundaries of the credible intervals for \(p\) and \(Z_{DCC}\). The point estimate of the former is the mean of the estimated distribution, while the point estimate of the latter is that of ((2.4)).
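As a sketch of the full procedure with the calibrated prior, the loop below uses the example data and obtains inverse Wishart draws by inverting Wishart draws from stats::rWishart(); up to Monte Carlo error it should give results similar to the BSDT() output shown further below.

set.seed(123)
n <- length(contA)
y_bar <- c(mean(contA), mean(contB))
y_star <- c(caseA, caseB)
A <- cov(cbind(contA, contB))      # sample variance-covariance matrix
S_star <- (n - 2) * A              # (n - 2) * S / (n - 1), with S = A * (n - 1)
p_hat <- numeric(10000)
for (i in seq_along(p_hat)) {
  repeat {                         # rejection step for the calibrated prior
    Sigma_hat <- solve(rWishart(1, df = n - 2, Sigma = solve(S_star))[, , 1])
    rho_hat <- Sigma_hat[1, 2] / sqrt(Sigma_hat[1, 1] * Sigma_hat[2, 2])
    if (runif(1)^2 <= 1 - rho_hat^2) break
  }
  T_mat <- t(chol(Sigma_hat))      # lower triangular, T_mat %*% t(T_mat) = Sigma_hat
  mu_hat <- y_bar + T_mat %*% rnorm(2) / sqrt(n)
  z1 <- (y_star[1] - mu_hat[1]) / sqrt(Sigma_hat[1, 1])
  z2 <- (y_star[2] - mu_hat[2]) / sqrt(Sigma_hat[2, 2])
  p_hat[i] <- pnorm((z1 - z2) / sqrt(2 - 2 * rho_hat))
}
mean(p_hat)                        # estimated proportion below the case
quantile(p_hat, c(0.025, 0.975))   # 95% credible interval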
singcar
Testing for a deficit using the Bayesian test of deficit, call BTD()
:
BTD(case = caseA, controls = contA, alternative = "less",
iter = 10000, int_level = 0.95)
##
## Bayesian Test of Deficit
##
## data: Case = 0.03, Controls (m = 0.16, sd = 0.08, n = 28)
## df = 27, p-value = 0.04799
## alternative hypothesis: true diff. between case and controls is less than 0
## sample estimates:
## Std. case score (Z-CC), 95% CI [-2.33, -1.16]
## -1.754857
## Proportion below case (%), 95% CI [0.98, 12.38]
## 4.799212
As can be seen, compared to TD()
this function has the additional argument iter
, which
indicates the number of iterations to be used for the posterior approximation.
This number should be based on the desired accuracy of the analysis.
Testing for a dissociation using the BSDT (recommended over RSDT), call:
BSDT(case_a = caseA, case_b = caseB, controls_a = contA, controls_b = contB,
alternative = "two.sided", iter = 10000, unstandardised = FALSE,
calibrated = TRUE)
##
## Bayesian Standardised Difference Test
##
## data: Case A: 0.03, B: 0.10, Ctrl. A (m, sd): (0.16,0.08), B: (0.18,0.10)
## df = 26, p-value = 0.3212
## alternative hypothesis: true difference between tasks is not equal to 0
## sample estimates:
## Std. case score, task A (Z-CC)
## -1.7548574
## Std. case score, task B (Z-CC)
## -0.7836956
## Std. discrepancy (Z-DCC), 95% CI [-1.69, -0.40]
## -1.0647889
## Proportion below case (%), 95% CI [4.53, 34.43]
## 16.0584420
This function has several additional arguments compared to RSDT()
.
Instead of having a separate function for an unstandardised
Bayesian difference test one can specify whether to standardise
the variates or not within the function. If unstandardised
is
set to TRUE
, the results from BSDT()
will converge to those of
UDT()
. To use the “standard theory” prior instead of the calibrated,
set calibrated
to FALSE
.
It is seldom possible to design perfect experiments. This is especially true in psychological science where causal relationships can be confounded by a great variety of factors. When comparing a case to a control sample it is therefore important to reduce noise by matching the control group to the case as well as possible on variates beyond the ones of interest, such as age or education level, making the collection of control samples cumbersome.
However, this matching on covariates can instead be achieved statistically. Crawford, Garthwaite, and Ryan (2011) describe methods that extend the tests presented in Section 2.2 to allow for the inclusion of covariates using Bayesian regression. These methods thus allow one to compare the case’s score on the task(s) of interest against the control population conditioned on the case’s values on the covariates. This both reduces noise and eases the collection of control data. But because one degree of freedom is lost for each included covariate, it is advisable, as a rule of thumb, to only include covariates with a correlation to the variate(s) of interest larger than \(0.3\) (Crawford, Garthwaite, and Ryan 2011).
We assume that the variates of interest are normally distributed, but no assumption of the distribution of the covariates is made. The procedure to get an estimate of \(p\) in the presence of covariates is presented below for the general case, that is, whether we are interested in a deficit or a dissociation. The main difference between the methods is the recommended prior.
From a sample of size \(n\) we obtain measurements on \(k = 1, 2\) variates of interest and \(m\) covariates, denoted \(\boldsymbol{X} = (X_1, \dots, X_m)\).
We wish to estimate the regression coefficients for each covariate on the variate(s) of interest: \[ \boldsymbol{B} = \begin{bmatrix} \boldsymbol{\beta}_1 & \cdots & \boldsymbol{\beta}_k \end{bmatrix}. \] Here \(\boldsymbol{B}\) is an \((m+1) \times k\) matrix where each column contains the regression coefficients for one of the variates of interest, the first row being the intercept(s). Further, we wish to estimate the covariance matrix or variance of the variate(s) of interest conditioned upon the covariates; these statistics are assumed not to vary with the covariates. \[ \Sigma \ | \ \boldsymbol{X} = \begin{bmatrix} \sigma^2_1 \ | \ \boldsymbol{X} & \rho\sigma_1\sigma_2 \ | \ \boldsymbol{X} \\ \rho\sigma_1\sigma_2 \ | \ \boldsymbol{X} & \sigma^2_2 \ | \ \boldsymbol{X} \end{bmatrix} \ \text{or} \ \left[\sigma^2 | \boldsymbol{X} \right]. \]
If \(\boldsymbol{X}\) is the \(n \times (m+1)\) design matrix and \(\boldsymbol{Y}\), the \(n \times k\) response matrix \[ \boldsymbol{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{bmatrix},\ \ \boldsymbol{Y} = \begin{bmatrix} y_{11} & \cdots & y_{1k} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nk} \end{bmatrix} \] the data estimates of \(\boldsymbol{B}\) and \(\Sigma\) are \[ \boldsymbol{B}^* = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y} \ \text{and} \ \Sigma^* = \frac{1}{n-m-1}(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{B}^*)'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{B}^*). \]
When \(k=1\) or the “standard theory” prior is used for \(k=2\) the posterior of \(\Sigma\) is an inverse Wishart with scale matrix \((n-m-1)\Sigma^*\) on \(n-m-1\) or \(n-m\) degrees of freedom for \(k=1\) and \(k=2\) respectively (Crawford, Garthwaite, and Ryan 2011; Tiao and Zellner 1964). Hence, to get an estimate \(\hat{\Sigma}_{(i)}\) of \(\Sigma\) we generate a draw from these distributions.
For the “calibrated” prior we instead follow the procedure outlined in Section 2.3.2 and use \(\pmb{S}^* = (n-m-2)\pmb{S}/(n-m-1)\) as scale matrix to sample from an inverse Wishart on \(n-m-2\) degrees of freedom, \(\pmb{S} = (n-m-1)\Sigma^*\). The estimates are then:
\[\begin{equation}
\hat{\Sigma}_{(i)}=[\hat{s}^2_{(i)}] \ \text{when} \ k=1 \ \text{and} \
\hat{\Sigma}_{(i)} =
\begin{bmatrix}
\hat{s}^2_{1(i)} & \hat{s}_{12(i)} \\
\hat{s}_{12(i)} & \hat{s}^2_{2(i)}
\end{bmatrix} \ \text{when} \ k=2.
\tag{2.6}
\end{equation}\]
We turn the \((m+1) \times k\) matrix \(\boldsymbol{B}^*\) into a \(k(m + 1) \times 1\) vector by concatenating the columns in \(\boldsymbol{B}^*\). \(\boldsymbol{B}^*_{vec} = (\boldsymbol{\beta}^{*'}_1, ...,\boldsymbol{\beta}^{*'}_k)'\) is the estimate of \(\boldsymbol{B}_{vec}\). The posterior of \(\boldsymbol{B}_{vec}| \hat{\Sigma}_{(i)}\) is a multivariate normal distribution with mean \(\boldsymbol{B}^*_{vec}\) and covariance matrix \(\boldsymbol{\Lambda}_{(i)} =\boldsymbol{\hat\Sigma}_{(i)} \otimes (\boldsymbol{X'X})^{-1}\), where \(\otimes\) denotes the Kronecker product. To get an estimate of \(\hat{\boldsymbol{B}}_{(i)}\) we generate a random observation from this distribution where the \(k(m+1) \times 1\) vector of obtained values \(\hat{\boldsymbol{B}}^*_{\text{vec(i)}} =(\hat{\boldsymbol{\beta}}'_{1(i)}, ...,\hat{\boldsymbol{\beta}}'_{k(i)})'\) is turned back into the \(m+1 \times k\) matrix \(\hat{\boldsymbol{B}}_{(i)} = (\hat{\boldsymbol{\beta}}_{1(i)}, ...,\hat{\boldsymbol{\beta}}_{k(i)})\). If \(\boldsymbol{x}^*\) is a vector of the case values on the covariates we obtain estimates of the conditional case scores on the task of interest as so: \[ \hat{\boldsymbol{\mu}}_{(i)} = \hat{\boldsymbol{B}}_{(i)}\boldsymbol{x}^*. \]
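A minimal sketch of this coefficient-sampling step is given below, with a small simulated data set standing in for real controls (one covariate, two tasks); for simplicity the data estimate of the conditional covariance matrix is used in place of a posterior draw \(\hat{\Sigma}_{(i)}\), and all numbers are illustrative.

library(MASS)  # for mvrnorm()
set.seed(1)
n <- 28; m <- 1; k <- 2
X <- cbind(1, rnorm(n, 60, 10))                  # intercept + one covariate (e.g. age)
Y <- cbind(rnorm(n, 100, 15), rnorm(n, 50, 10))  # two tasks of interest
B_star <- solve(t(X) %*% X) %*% t(X) %*% Y       # least squares coefficients
Sigma_hat <- crossprod(Y - X %*% B_star) / (n - m - 1)  # stand-in for a posterior draw

Lambda <- kronecker(Sigma_hat, solve(t(X) %*% X))       # covariance of vec(B)
B_hat <- matrix(mvrnorm(1, mu = c(B_star), Sigma = Lambda), ncol = k)
x_star <- c(1, 70)                 # case's covariate value, preceded by the intercept
crossprod(B_hat, x_star)           # conditional means given the case's covariate values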
With these values we can now estimate the conditional effect size of the case. Crawford, Garthwaite, and Ryan (2011) denote these effect measures \(Z_{CCC}\) and \(Z_{DCCC}\) depending on whether \(k=1\) or \(k=2\). As the names indicate they are related to \(Z_{CC}\) ((2.2)) and \(Z_{DCC}\) ((2.4)), but calculated from statistics conditioned upon the covariates, hence the extra C in the subscript.
If \(y^*_1\) is the case score on a variate \(Y_1\), we would calculate the conditional deficit (\(Z_{CCC}\)) of the case on variate \(Y_1\) like so: \[\begin{equation} \hat{Z}_{CCC(i)} = \frac{y^*_1-\hat{\mu}_{(i)}}{\hat{s}_{(i)}}. \tag{2.7} \end{equation}\] If \(y^*_2\) is the case score on variate \(Y_2\) and we are interested in a dissociation between \(Y_1\) and \(Y_2\), we denote the two conditional means on these variates \(\hat{\mu}_{1(i)}\) and \(\hat{\mu}_{2(i)}\). \(Z_{DCCC}\) is then calculated like so: \[\begin{equation} \hat{Z}_{DCCC(i)} = \frac{\frac{y^*_1-\hat{\mu}_{1(i)}}{\hat{s}_{1(i)}}-\frac{y^*_2-\hat{\mu}_{2(i)}} {\hat{s}_{2(i)}}}{\sqrt{2-2\hat{\rho}_{12(i)}}}. \tag{2.8} \end{equation}\] The conditional standard deviations \(\hat{s}_{1(i)}\) and \(\hat{s}_{2(i)}\) are obtained from \(\hat{\Sigma}_{(i)}\) in ((2.6)); the conditional correlation between the variates (\(\hat{\rho}_{12(i)}\)) is given by \[\hat{\rho}_{12(i)} = \frac{\hat{s}_{12(i)}}{\sqrt{\hat{s}^2_{1(i)}\hat{s}^2_{2(i)}}}.\]
To get estimates of \(p\) we find the tail area under the standard normal distribution with more extreme values than \(\hat{Z}_{CCC(i)}\) or \(\hat{Z}_{DCCC(i)}\), depending on the problem and alternative hypothesis specified. If testing for a deficit we would for example have: \[ \hat{p}_{(i)}= \Phi(\hat{Z}_{CCC(i)}). \]
Iterate the procedure a large number of times and save the estimations for each iteration. The mean of the resultant distribution of \(\hat{p}\) is taken as the point estimate of \(p\). Point estimates of \(Z_{CCC}\) and \(Z_{DCCC}\) are obtained by using ((2.7)) and ((2.8)) with the conditional means, standard deviations and correlation calculated directly from the control sample. The \(1-\alpha\) credible interval boundaries for each estimate are obtained from the quantiles of the estimated distributions as described for the previous methods.
singcar
To specify a Bayesian test of deficit with one or more covariates, call:
BTD_cov(case_task = caseA, case_covar = caseAGE, control_task = contA,
control_covar = contAGE, alternative = "less", int_level = 0.95,
iter = 10000, use_sumstats = FALSE)
##
## Bayesian Test of Deficit with Covariates
##
## data: Case = 0.03, Controls (m = 0.16, sd = 0.08, n = 28)
## df = 26, p-value = 0.05201
## alternative hypothesis: true diff. between case and controls is less than 0
## sample estimates:
## Std. case score (Z-CCC), 95% CI [-2.30, -1.10]
## -1.749556
## Proportion below case (%), 95% CI [1.08, 13.60]
## 5.200838
For the case, the covariates should be specified as a single value or a vector containing the case’s score on each covariate included. For the controls, the covariates should be specified as a vector or a matrix where each column represents one of the covariates.
To specify a Bayesian test for a dissociation with one or more covariates present, call:
BSDT_cov(case_tasks = c(caseA, caseB), case_covar = caseAGE,
control_tasks = cbind(contA, contB), control_covar = contAGE,
alternative = "two.sided", int_level = 0.95, iter = 10000,
calibrated = TRUE, use_sumstats = FALSE)
##
## Bayesian Standardised Difference Test with Covariates
##
## data: Case A = 0.03, B = 0.10, Ctrl. (m, sd) A: (0.16,0.08), B: (0.18,0.10)
## df = 25, p-value = 0.3299
## alternative hypothesis: true difference between tasks is not equal to 0
## sample estimates:
## Std. case score, task A (Z-CC)
## -1.754857
## Std. case score, task B (Z-CC)
## -0.783696
## Std. discrepancy (Z-DCCC), 95% CI [-1.66, -0.41]
## -1.064152
## Proportion below case (%), 95% CI [4.84, 34.10]
## 16.500000
Here, the case scores on the two variates of interest are combined into a vector
and the control scores into a two-column matrix. The covariates are specified
in the same manner as for BTD_cov()
. The prior used is again specified with
the argument calibrated
.
If using summary statistics for these functions, the argument use_sumstats
must be set to TRUE
. For information on how to specify the data in such a case,
consult the documentation.
It should be noted that the TD of course is nothing other than a linear regression model with a dummy coded variable for belonging to the patient group. This equivalence might seem trivial but is included here for completeness. If \(\boldsymbol{y} = (y^*_1, y_2,...,y_n)\) is the vector of scores for the case \(y^*_1\) and the control sample \((y_2,...,y_n)\) on the variate of interest, then \(\boldsymbol{x} = (1, 0, ..., 0)\) is a dummy coded variable indicating inclusion in the patient group. The equation for a regression model with these variables is then: \[ y_i = \beta_0+\beta_1x_i+\epsilon_i, \] where the error term \(\epsilon_i\sim\mathcal{N}(0, \sigma^2)\). The least squares estimate of \(\beta_1\), \(\hat{\beta}_1\), equals the numerator in the equation for the TD, ((2.1)). We have: \[\begin{align} \hat{\beta}_1 & = \frac{\frac{1}{n}\sum^n_{i =1}(x_i-\overline{x})(y_i-\overline{y})}{\frac{1}{n}\sum^n_{i = 1}(x_i-\overline{x})^2} =\frac{\sum^n_{i =1}y_ix_i-\frac{1}{n}\sum^n_{i =1}y_i\sum^n_{i =1}x_i}{\sum^n_{i=1}(x_i-\overline{x})^2} \\ & =\frac{\frac{ny^*_1-\sum^n_{i=1}y_i}{n}}{\left(1-\frac{1}{n}\right)^2+(n-1)\left(-\frac{1}{n}\right)^2} =\frac{\frac{ny^*_1-(y^*_1+\sum^n_{i=2}y_i)}{n}}{\frac{n-1}{n}} =\frac{(n-1)y^*_1-\sum^n_{i=2}y_i}{n-1} \\ & =\frac{(n-1)y^*_1-\sum^n_{i=2}y_i}{n-1} =y^*_1-\frac{\sum^n_{i=2}y_i}{n-1} =y^*_1-\overline{y}_c. \end{align}\] Here \(\overline{y}_c\) denotes the sample mean of the controls, which is also the least squares estimate of the intercept, \(\hat{\beta}_0\): \[\begin{align} \hat{\beta}_0 &= \overline{y}-\hat{\beta}_1\overline{x} = \frac{\sum^n_{i=1}y_i}{n} -\frac{ny^*_1-\sum^n_{i=1}y_i}{(n-1)}\frac{1}{n} \\ &= \frac{(n-1)\sum^n_{i=1}y_i-ny^*_1+\sum^n_{i=1}y_i}{n(n-1)}\\ &= \frac{n\sum^n_{i=1}y_i-ny^*_1}{n(n-1)} = \frac{\sum^n_{i=2}y_i}{n-1} = \overline{y}_c. \end{align}\]
To test the slope in a simple regression model one uses the statistic \[\begin{equation} t_{n-2} = \frac{\hat{\beta}_1}{SE\left(\hat{\beta}_1\right)} \tag{2.9} \end{equation}\] which follows a \(t\) distribution on \(n-2\) degrees of freedom, given that the null hypothesis is true. These degrees of freedom are of course the same as \(n_c-1\) degrees of freedom if \(n_c\) is the size of the control sample. Remember that the denominator in the TD was \(s_c \sqrt{\frac{n_c + 1}{n_c}}\), where \(s_c\) is the standard deviation of the control sample. We know that the standard error for the slope coefficient is: \[ SE\left(\hat{\beta}_1\right) = \sqrt{\frac{\frac{1}{n-2}\sum^n_{i =1}\hat{\epsilon}^2_i}{\sum^n_{i=1}(x_i-\overline{x})^2} } = \sqrt{\frac{\frac{1}{n-2}\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\frac{n-1}{n}}}. \] Furthermore, \(\hat{y}_i = \hat{\beta}_0+\hat{\beta}_1x_i\) and \(\hat{y}_1 = \overline{y}_c+(y^*_1-\overline{y}_c) = y^*_1\), hence: \[\begin{align} SE\left(\hat{\beta}_1\right) &= \sqrt{\frac{\frac{1}{n-2}\left(\sum^n_{i=2}(y_i-\hat{y}_i)^2+\cancel{(y^*_1-\hat{y}_1)^2}\right)}{\frac{n-1}{n}}} \\ &= \sqrt{\frac{\sum^n_{i=2}(y_i-\hat{y}_i)^2}{n-2}}\sqrt{\frac{n}{n-1}} = s_c\sqrt{\frac{n_c+1}{n_c}}. \tag{2.10} \end{align}\] As such the \(t\) test for the slope in the regression equation and the test of deficit are identical. To get \(Z_{CC}\) using the regression approach, simply divide \(\hat{\beta}_1\) by the standard deviation of the control sample, \(Z_{CC} = \hat{\beta}_1/s_c\).
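This equivalence is easy to verify numerically with the example data: regressing the scores on a case dummy reproduces the \(t\) value from TD() (with a two-sided \(p\) value).

dummy <- c(1, rep(0, length(contA)))            # 1 = case, 0 = controls
y <- c(caseA, contA)
summary(lm(y ~ dummy))$coefficients["dummy", ]  # same t value as TD() above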
The fact that we can use simple linear regression instead of the TD to model case abnormality is in itself not very interesting. However, one can include covariates without going through the trouble of performing Bayesian regression for the methods as described in Section 2.3.4. To compute the conditional effect size \(Z_{CCC}\) in this case one must first calculate the conditional mean and standard deviation from the control sample. The conditional mean can be predicted from the fitted model given the case values on the covariates and the conditional variance can be computed like so: \[ \sigma^2|X = \frac{\sum^n_{i=2}\epsilon_i^2}{n-2}, \] where \(X\) denotes the covariates and \(\epsilon_i\) the residuals from the fitted model. Note also that the index starts at 2 to leave out the case. \(Z_{CCC}\) is then: \[ Z_{CCC} = \frac{y_1^*-\mu|X}{\sqrt{\sigma^2|X}}. \]
More importantly though, we can generalise the linear regression to linear mixed effect models (West, Welch, and Galecki 2014). With linear mixed effect models (LMM) one can model data with dependency between observations, opening up the possibility of using repeated measures data in single case comparisons and other, more complex, designs. This is because LMM can model both fixed and random effects (hence “mixed”). Fixed effects are, in this case, constant across participants, while random effects apply to observations that are somehow more related to each other (e.g. several measurements from the same participant or hierarchically clustered observations). The concept of applying LMM to single case comparisons was first investigated by Huber et al. (2015).
Since impairments on tasks converging on a cognitive construct of interest provide stronger evidence of a potential deficit, the possibility to model random effects successfully is of great importance. Such a model would take the form: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_i + v_{0i}+\epsilon_{ij}, \ v_{0i} \sim \mathcal{N}(0, \sigma^2_v) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2). \tag{2.11} \end{equation}\] Here \(i\) indicates participant and \(j\) indicates task or item. So \(y_{ij}\) is the response of participant \(i\) for task/item \(j\). \(\beta_0\) and \(\beta_1\) are still the intercept of the sample and raw effect of the case while \(v_{0i}\) denotes the random intercept for each participant. As such the term allows for participants to vary from the average intercept of the sample. However, while the model in ((2.11)) accounts for participant variability we must include a further term to account for the task/item variability: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_i + v_{0i} + w_{0j} +\epsilon_{ij}, \\ v_{0i} \sim \mathcal{N}(0, \sigma^2_v), \ w_{0j} \sim \mathcal{N}(0, \sigma^2_w) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \tag{2.12} \end{equation}\] where \(w_{0j}\) denotes the random intercept for tasks/items, which should be included to avoid inflation of Type I errors (Baayen, Davidson, and Bates 2008).
A common method to obtain \(p\) values for LMMs is to use likelihood ratio tests. However, Huber et al. (2015) show that the Type I error rate for this method is only at an acceptable level when the size of the control sample is larger than 50. In contrast, using parametric bootstrapping or one-tailed \(t\) tests approximating the degrees of freedom with the Satterthwaite method kept the error rate at its nominal level (or rather somewhat above, implying that a slightly more conservative \(\alpha\) value should be considered) even for small control sample sizes. As a reference they compared error rate and statistical power of the mixed model to the TD using aggregated data, i.e. taking the mean of all tasks/items for each participant, which is what Crawford and Howell (1998) suggest for this type of data. The \(p\) value of \(\beta_1\) in ((2.12)) using the Satterthwaite method will be approximately equal to the two-sided \(p\) value from the TD if averaging the scores for each participant over the tasks/items, given a large enough sample size.
Furthermore, because we can model data with dependency between observations with LMM it is also possible to model effects within participants, i.e. between conditions. The model equation would then be: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i}+ v_{0i} + w_{0j} +\epsilon_{ij}, \\ v_{0i} \sim \mathcal{N}(0, \sigma^2_v), \ w_{0j} \sim \mathcal{N}(0, \sigma^2_w) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \tag{2.13} \end{equation}\] where \(\beta_2\) is the raw effect of the condition indicated by the categorical variable \(x_2\). By introducing an interaction term between \(x_1\) and \(x_2\) in this model we can even test for the presence of a dissociation. This model would take the form: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i}+\beta_3(x_{1i}x_{2i})+ v_{0i} + v_{1i}x_{2i} + w_{0j} +\epsilon_{ij}, \\ \begin{pmatrix} v_{0i} \\ v_{1i} \end{pmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \Sigma_v = \begin{bmatrix} \sigma^2_{v_0} & \sigma_{v_0v_1} \\ \sigma_{v_0v_1} & \sigma^2_{v_1} \end{bmatrix} \right), \ w_{0j} \sim \mathcal{N}(0, \sigma^2_w) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \tag{2.14} \end{equation}\] where \(\beta_3\) is the raw effect of the dissociation and \(v_{1i}\) is the random slope of the participants. That is, \(v_{1i}\) accounts for the variability of the effect of the condition, between participants. A caveat of using LMMs is that the model requires more observations than specified parameters to be identifiable. When specifying random slopes and random intercepts for each participant one must therefore have more than \(2n\) observations, implying that it would not be a good model for testing a dissociation between two tasks without repeated measurements within each task.
There are already several implementations of LMMs in various open source
software which can be used directly in the context of single case comparisons
as described above. Therefore this methodology has not been implemented in
singcar
. However, due to the usefulness of the concept and the lack
of methodology in singcar
to handle repeated measures data I will give
a brief description of how one can specify an LMM in R
using the package
lme4
(Bates et al. 2015) and
lmerTest
(Kuznetsova, Brockhoff, and Christensen 2017) (the latter is used to
obtain \(p\) values from \(t\) tests with Satterthwaite approximation of
the degrees of freedom).
R
Begin by installing and loading required packages:
install.packages(c("lme4", "lmerTest", "MASS"))
library("lme4")
library("lmerTest")
library("MASS") # For multivariate simulation
Because the example data is not suited for repeated measures analysis a simulated dataset will be used. Say that we have 30 measurements from four normally distributed variates \(Y_i\), \(i = 1,2,3,4\). Call these variates items. They are thought to converge on some cognitive ability of interest and have the distribution: \[ \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 100 \\ 80 \\ 50 \\ 20 \end{bmatrix}, \begin{bmatrix} 15^2 & 108 & 78 & 30 \\ 108 & 10^2 & 12 & 15 \\ 78 & 12 & 10^2 & 25 \\ 30 & 15 & 25 & 5^2 \end{bmatrix} \right). \] The case has incurred a deficit of two standard deviations on all four variates. These data could then be simulated as follows:
simdata <- expand.grid(case = c(1, rep(0, 29)), item = factor(1:4))
simdata$ID <- factor(rep(1:30, 4))

set.seed(123) # For replicability
mu <- c(100, 80, 50, 20)
sigma <- matrix(c(15^2, 108, 78, 30,
                  108, 10^2, 12, 15,
                  78, 12, 10^2, 25,
                  30, 15, 25, 5^2), nrow = 4, byrow = TRUE)
dv <- mvrnorm(30, mu = mu, Sigma = sigma) # Dependent variable
dv[1, ] <- dv[1, ] + c(-30, -20, -20, -10) # Case scores
simdata$dv <- c(dv)
The linear mixed model to test the deficit of the case would then be
model1 <- lmer(dv ~ case + (1|ID) + (1|item), data = simdata)
summary(model1)[["coefficients"]]
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 62.15587 17.512303 3.030078 3.549269 0.037506018
## case -25.07172 7.793911 27.999869 -3.216834 0.003263006
Here (1|ID) and (1|item) specify the random effects structure, in this case the random intercepts for participant and item.
This is approximately the same \(t\) statistic that would be produced
by the test of deficit if averaging the scores for each participant:
agg_data <- rowMeans(dv)
TD(agg_data[1], agg_data[-1], alternative = "two.sided")[["statistic"]]
## t
## -3.21684
Now, say that we want to model a dissociation using an LMM. We have collected 30 measurements from two items during two different conditions. The two conditions are thought to tap two independent cognitive abilities, while the items collected within each condition are thought to converge on the same ability. Hence we have four variates \(Y_{ij}\), \(i=1,2\), \(j=1,2\), distributed according to the multivariate distribution: \[\begin{equation} \begin{pmatrix} Y_{11} \\ Y_{21} \\ Y_{12} \\ Y_{22} \end{pmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 100 \\ 80 \\ 50 \\ 20 \end{bmatrix}, \begin{bmatrix} 15^2 & 120 & 45 & 7 \\ 120 & 10^2 & 11 & 15 \\ 45 & 11 & 10^2 & 40 \\ 7 & 15 & 40 & 5^2 \end{bmatrix} \right), \tag{2.15} \end{equation}\] where e.g. \(Y_{12}\) is the random variable corresponding to the first item in the second condition. Given that the two conditions in fact tap different cognitive structures, the incurred impairment of the case could affect performance in one of the conditions more than the other. Hence, we impose a deficit of two standard deviations on the case scores in the first condition only. The data can then be simulated as follows:
simdata <- expand.grid(case = c(1, rep(0, 29)),
                       item = factor(1:2), condition = factor(1:2))
simdata$ID <- factor(rep(1:30, 4))
contrasts(simdata$condition) <- contr.sum(2) # Effect coding

set.seed(123) # For replicability
mu <- c(100, 80, 50, 20)
sigma <- matrix(c(15^2, 120, 45, 7,
                  120, 10^2, 11, 15,
                  45, 11, 10^2, 40,
                  7, 15, 40, 5^2), nrow = 4, byrow = TRUE)
dv <- mvrnorm(30, mu = mu, Sigma = sigma) # Dependent variable
dv[1, ] <- dv[1, ] + c(-30, -20, -0, -0) # Case scores
simdata$dv <- c(dv)
The fully crossed linear mixed model used to test a dissociation for these data would then be specified like so:
model2 <- lmer(dv ~ case*condition + (condition|ID) + (1|item), data = simdata)
round(summary(model2)[["coefficients"]], 5)
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 61.87447 12.18101 1.02152 5.07959 0.11994
## case -17.24035 7.66811 27.99986 -2.24832 0.03261
## condition1 28.08855 0.98493 28.00058 28.51834 0.00000
## case:condition1 -13.82757 5.39468 28.00058 -2.56319 0.01603
Comparing the \(p\) value of the interaction effect to the \(p\) value obtained from RSDT(), we see that they are far from equivalent.
agg_data1 <- rowMeans(dv[, 1:2])
agg_data2 <- rowMeans(dv[, 3:4])
RSDT(agg_data1[1], agg_data2[1], agg_data1[-1], agg_data2[-1])[["p.value"]]
## [1] 0.06737687
However, a small simulation study conducted with the data described above allows the RSDT and the LMM to be compared with respect to power and Type I error rate. The simulation function (powerLMM(), given at the end of this vignette) takes the specified impairments of the case on each variate and returns the proportion of significant interaction effects from ((2.14)) out of the total number of simulations run, together with the corresponding proportion obtained from either the RSDT or the BSDT. The data in this small simulation study follow the same distribution as ((2.15)).
R> powerLMM(nsim = 1000, case_impairment = c(-30, -20, -0, -0),
+ compare_against = "RSDT", alpha = 0.05)
RSDT LMM
1 0.353 0.462
It seems that the method based on the linear mixed model has quite a substantial power advantage over the RSDT for data simulated from this particular distribution. To compare the error rates of the two tests we can use the same ratio as for power, computed when there is no true effect present. In this case that would be either when the case has no incurred impairment at all or when the impairments are equal in both conditions.
R> powerLMM(nsim = 1000, case_impairment = c(-0, -0, -0, -0),
+ compare_against = "RSDT", alpha = 0.05)
RSDT LMM
1 0.032 0.039
Even though a power advantage of the mixed model was observed, it also seems to produce an error rate below the nominal level of \(0.05\). If this were consistent behaviour it would eliminate the usefulness of the RSDT. However, if the case exhibits extreme impairments in both conditions we have another null scenario. This was discussed previously in Section 2.2, where it was noted that one of the reasons Crawford and Garthwaite (2007) developed a Bayesian approach to test for dissociations was the sharply increasing error rate of the RSDT as the deficits on both tasks grow while the discrepancy between them stays at zero. Say that the case has impairments of six standard deviations on both items in both conditions, and let us compare the LMM to the BSDT instead of the RSDT:
R> powerLMM(nsim = 1000, case_impairment = c(-90, -60, -60, -30),
+ compare_against = "BSDT", alpha = 0.05)
BSDT LMM
1 0.053 0.61
The BSDT seems to keep the error rate almost at its nominal level while it is enormous for the LMM. Even though a more extensive simulation study with various types of data structures would be needed to solidify these findings, the preliminary results shown here indicate some limitations of using LMMs for dissociations and reiterate the recommendation of Crawford and Garthwaite (2007) to use the BSDT to test for dissociations, even when using aggregated repeated measures data. That is, because the effects of brain damage can be severe, I would recommend against using LMMs to model dissociations.
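To follow this recommendation for the simulated dissociation data above, the aggregated vectors can be passed straight to BSDT(), mirroring the earlier RSDT() call. This is shown only as a usage sketch; agg_data1 and agg_data2 are the vectors computed above.
BSDT(agg_data1[1], agg_data2[1], agg_data1[-1], agg_data2[-1])[["p.value"]]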
Up until now I have discussed univariate testing methods. However, as discussed in the introduction, Section 1, stronger evidence for either deficits or dissociations is provided by using multiple measures (tasks) for a cognitive function of interest. Conceptually, what we want to estimate or test in this scenario is the distance between a vector of observations for a single case and a vector of population means. A common way to do this is by using the Mahalanobis distance measure. If \(\boldsymbol{y}^*\) is a vector of task observations from a single case and \(\boldsymbol{\mu}\) is the population mean vector, the Mahalanobis distance, \(\delta\), is a measure of the distance between the case and the population mean, given by: \[ \delta= \sqrt{(\boldsymbol{y}^*-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}^*-\boldsymbol{\mu})}. \]
Given that the population follows a multivariate normal distribution we have that \(\delta^2 \sim \chi^2_{\nu_1}\), where the degrees of freedom \(\nu_1\) equal the number of dimensions of the multivariate distribution. This squared distance is sometimes referred to as the Mahalanobis index. Because the purpose of case-comparison methodology is to estimate abnormality we are interested in \(p\), that is, the proportion of the control population that would exhibit a larger \(\delta\) than the single case. If we need to estimate \(\boldsymbol{\mu}\) and \(\Sigma\) the sample Mahalanobis distance is given by \[ \hat\delta = \sqrt{(\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})}, \] where \(\overline{\boldsymbol{y}}\) is the sample mean vector and \(\boldsymbol{S}\) the sample covariance matrix. Using the Mahalanobis index between a case and the mean of a normative sample, the abnormality is estimated with \[ \hat{p} = \mathbb{P}[(\boldsymbol{y}-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}-\overline{\boldsymbol{y}}) > (\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})]. \]
An obvious way to estimate \(p\) would be to use the Hotelling’s \(T^2\) test. The \(T^2\) statistic is proportional to the \(F\) distribution and hence its related \(p\) value seems like an intuitive choice as an estimate of \(p\). This was in fact proposed by Huizenga et al. (2007), but Elfadaly, Garthwaite, and Crawford (2016) show that this estimate is biased and devise instead several other candidates. None of these was shown to be perfectly unbiased but several of the estimates performed better than the estimate based on the \(T^2\) statistic.
If we let \(\lambda = \delta^2 = (\boldsymbol{y}^*-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}^*-\boldsymbol{\mu})\) be the case's Mahalanobis index and \(\chi^2_{\nu_1}= (\boldsymbol{y}-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}-\boldsymbol{\mu})\) the Mahalanobis index of a random observation \(\boldsymbol{y}\) from the control population, what we want to estimate is: \[ p = \mathbb{P}[\chi^2_{\nu_1}> \lambda]. \] If \(\overline{\boldsymbol{y}}\) and \(\hat\Sigma = \frac{n-1}{n}\boldsymbol{S}\) denote the maximum likelihood estimates of \(\boldsymbol{\mu}\) and \(\Sigma\), the Hotelling's \(T^2\) based estimate of \(p\), \(\hat{p}_F\), is \[ \hat{p}_F = \mathbb{P}\left[(\boldsymbol{y}-\overline{\boldsymbol{y}})'\hat\Sigma^{-1}(\boldsymbol{y}-\overline{\boldsymbol{y}})>(\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\hat\Sigma^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})\right]. \] If we let \(\lambda_0 = (\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})\) and \(\nu_2 = n - \nu_1\), we have that \[ T^2=\frac{n\lambda_0}{n+1} \ \text{and} \ \hat{p}_F = \mathbb{P}\left[F_{\nu_1, \nu_2}>\frac{\nu_2}{\nu_1(n-1)}T^2\right], \] i.e., this estimate is the \(p\) value associated with testing the hypothesis that the case comes from the control population, using a Hotelling's \(T^2\) test. As such it is also often used as a point estimate of \(p\). However, we can also use the fact that \(\lambda \sim \chi^2_{\nu_1}\) and take the \(p\) value from the \(\chi^2\) distribution, using the sample estimates only on the right hand side of the inequality: \[ \hat{p}_{\chi^2} = \mathbb{P}\left[(\boldsymbol{y}-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}-\boldsymbol{\mu})>(\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\hat\Sigma^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})\right], \] which gives \[ \hat{p}_{\chi^2} = \mathbb{P}\left[\chi^2_{\nu_1}>\frac{n}{n-1}\lambda_0\right]. \] This is an intuitive estimator; it was shown to exhibit fairly low bias (lower than \(\hat{p}_F\)) for low true values of \(p\) and it had the smallest mean square error of all estimators considered by Elfadaly, Garthwaite, and Crawford (2016). Unfortunately, as \(p\) increases \(\hat{p}_{\chi^2}\) starts to show substantial bias.
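As a concrete illustration of these two estimators, a minimal sketch in base R is given below. The helper name naive_mahal_estimates() is hypothetical (it is not a singcar function); it simply plugs a case vector and a control matrix into the formulas above.
naive_mahal_estimates <- function(case, controls) {
  n <- nrow(controls)                    # control sample size
  k <- ncol(controls)                    # nu_1, the number of variates
  ybar <- colMeans(controls)
  S <- cov(controls)                     # unbiased sample covariance matrix
  lambda0 <- mahalanobis(case, ybar, S)  # squared sample Mahalanobis distance
  T2 <- n * lambda0 / (n + 1)            # Hotelling's T^2
  nu2 <- n - k
  p_F <- pf(nu2 / (k * (n - 1)) * T2, k, nu2, lower.tail = FALSE)
  p_chi2 <- pchisq(n / (n - 1) * lambda0, df = k, lower.tail = FALSE)
  c(p_F = p_F, p_chi2 = p_chi2)
}
Applied to the example data used later with MTD(), this would be called as naive_mahal_estimates(c(caseA, caseB), cbind(contA, contB)).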
Both of these estimates are intuitive; however, since they are both biased, Elfadaly, Garthwaite, and Crawford (2016) recommend using other estimators, chosen according to the needs of the analysis. One of these is based on a modified confidence interval for the Mahalanobis distance and two are based on quadrature polynomial approximation of the \(\chi^2\) distribution. In Section 2.5.1 I describe two estimators based on confidence intervals of \(\delta\), which are both implemented in singcar along with \(\hat{p}_F\) and \(\hat{p}_{\chi^2}\).
Reiser (2001) suggested a method to construct confidence intervals on \(\delta^2\). Using this method Elfadaly, Garthwaite, and Crawford (2016) propose a way to construct confidence intervals on \(p\) as well. We have: \[ F_0 = \frac{n\nu_2}{(n-1)\nu_1}\lambda_0 \sim F_{\nu_1, \nu_2}(n\lambda), \] that is, the sample statistic \(F_0\) has a non-central \(F\) distribution on \(\nu_1\) and \(\nu_2\) degrees of freedom and non-centrality parameter \(n\lambda\). We call the value of \(n\lambda\), where \(F_0\) is the \(\alpha\)-quantile of this distribution, \(L_{\alpha}\). We can construct the lower and upper boundaries of a confidence interval for \(p\) using the CDF for the \(\chi^2\)-distribution on \(\nu_1\) degrees of freedom, denoted \(G(.)\), as such: \[\begin{equation} \left[1-G\left(\frac{L_{\alpha/2}}{n}\right), \ 1-G\left(\frac{L_{1-\alpha/2}}{n}\right)\right]. \tag{2.16} \end{equation}\]
Similarly, they suggest a point estimator of \(p\) based on the median of \(F_{\nu_1, \nu_2}(n\lambda)\). So that if we let \(L_{0.5}\) be the value of \(n\lambda\) where \(F_0\) is the median of \(F_{\nu_1, \nu_2}(n\lambda)\) the estimator is: \[\begin{equation} \hat{p}_D= 1-G\left(\frac{L_{0.5}}{n}\right). \tag{2.17} \end{equation}\]
Similarly we would construct a confidence interval around \(\lambda\) by choosing \(\lambda_u\) and \(\lambda_l\) such that \[\begin{equation} \mathbb{P}\left[F_{\nu_1,n- \nu_1}(n\lambda_l)\leq F_0\right] = \frac{\alpha}{2}, \ \text{and} \ \mathbb{P}\left[F_{\nu_1,n- \nu_1}(n\lambda_u)\leq F_0\right] =1- \frac{\alpha}{2}. \tag{2.18} \end{equation}\]
This method does, however, have a drawback. Its coverage exactly matches the nominal level, i.e. the 95% intervals (assuming \(\alpha= 0.05\)) will in fact contain the true value of \(\delta\) 95% of the time, as proved by Wunderlich et al. (2015). But when \(F_0\) is small enough that \(\mathbb{P}[F_{\nu_1,n- \nu_1}(0)\leq F_0] \leq 1- \frac{\alpha}{2}\), we cannot find a \(\lambda_u\) such that \(\mathbb{P}\left[F_{\nu_1,n- \nu_1}(n\lambda_u)\leq F_0\right] =1- \frac{\alpha}{2}\), and it is instead set to 0. Similarly, we set \(\lambda_l\) to 0 when \(\mathbb{P}[F_{\nu_1,n- \nu_1}(0)\leq F_0] \leq \frac{\alpha}{2}\). But we know with full certainty that an interval of [0, 0] will not contain \(\lambda\), since \(\boldsymbol{y}^*\) is continuous with \(\mathbb{P}[\boldsymbol{y}^* = \boldsymbol{\mu}] = 0\).
This problem carries over to the construction of confidence intervals for \(p\) ((2.16)) as well as to the estimator \(\hat{p}_D\) ((2.17)). The median of the distribution \(F_{\nu_1, \nu_2}(n\lambda)\) of course decreases as the non-centrality parameter \(n\lambda\) decreases. Since \(\lambda \geq 0\), the lower limit of the median of \(F_{\nu_1,\nu_2}(n\lambda)\) is the median of this distribution when the non-centrality parameter \(n\lambda = 0\), i.e. the median of a central \(F\) distribution on \(\nu_1\) and \(\nu_2\) degrees of freedom. So, using the approach of Reiser (2001), if \(F_0\) is smaller than this limit one sets \(n\lambda\) to zero.
A bias in this estimator is, however, also seen when \(F_0\) is small but not smaller than this lower limit. Elfadaly, Garthwaite, and Crawford (2016) give as an example a scenario where \(\nu_1 =4, \ \nu_2 = 20\) and \(\lambda_0 = 0.4\), hence \(n =24\) and \(F_0 = \frac{24 \times 20}{23 \times 4}\times 0.4 \approx 2.09\). The value of the non-centrality parameter \(n\lambda\) at which the median of \(F_{\nu_1, \nu_2}(n\lambda)\) equals 2.09 is approximately 5.015, so \(\hat{p}_D = 1-G_{4}\left(5.015/24\right) = 0.9949\), or 99.49%. Using the estimated Mahalanobis index of \(0.4\) we would thus infer that only 0.51% of the population exhibit a smaller Mahalanobis index than the case. If we instead calculate the true proportion of the population that would exhibit a larger Mahalanobis index when 0.4 is the true case value, we get \(p = 1-G_4(0.4) = 0.9825\), or 98.25% (i.e. 1.75% would exhibit a smaller index), which is a considerable discrepancy.
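To make the construction of \(\hat{p}_D\) concrete, the sketch below (the helper p_D_estimate() is hypothetical, not the singcar implementation) searches for the non-centrality parameter at which \(F_0\) is the median of the non-central \(F\) distribution and then applies the \(\chi^2\) CDF, reproducing the worked example above.
p_D_estimate <- function(lambda0, k, n) {
  nu2 <- n - k
  F0 <- n * nu2 / ((n - 1) * k) * lambda0
  # If F0 falls below the median of the central F distribution,
  # n*lambda is set to 0 and p_D = 1 - G(0) = 1
  if (F0 <= qf(0.5, k, nu2)) return(1)
  # Find L_0.5 such that F0 is the median of F(k, nu2, ncp = L_0.5)
  L05 <- uniroot(function(L) qf(0.5, k, nu2, ncp = L) - F0,
                 interval = c(0, 100), extendInt = "upX")$root
  1 - pchisq(L05 / n, df = k)
}
p_D_estimate(0.4, k = 4, n = 24) # approximately 0.995, as in the example above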
Garthwaite, Elfadaly, and Crawford (2017) proposed a solution to this for the construction of confidence intervals of \(\delta\) by modifying the method of Reiser (2001). They suggested forming a Bayesian credible interval instead of the frequentist confidence interval whenever \(F_0\) is small. Credible intervals have the more intuitive interpretation that they contain the true value of interest with a probability of \(1-\alpha\). As such, this interval cannot give rise to unreasonable values such as [0, 0]. To form a credible interval, however, a prior must be specified.
Let \(\boldsymbol{y}\) be a vector of values from an observation of the control population, not necessarily the case. We denote the prior for \(\delta\) when \(\delta\) is the Mahalanobis distance of the case as \(\pi(\delta)\), and when \(\delta\) is the distance between the mean of the controls and a randomly selected observation as \(\psi(\delta)\). A priori, \(\delta\) should be larger when it is calculated from the observation of the case than when it is calculated from a randomly selected control observation. Garthwaite, Elfadaly, and Crawford (2017) therefore assume that, for any \(\hat{\delta}^2\) and any specified quantile, \(\pi_q(\delta | \hat{\delta}) \geq \psi_q(\delta | \hat{\delta})\), where \(\pi_q(\cdot | \hat{\delta})\) and \(\psi_q(\cdot | \hat{\delta})\) are the \(q\)th quantiles of the distributions of \(\delta\) given \(\hat{\delta}\) under the respective priors.
Note that they do not aim to specify \(\pi(\delta)\), instead they want to put limits on the quantiles of the resultant posterior through \(\psi(\delta)\). This is because we know neither \(\pi(\delta)\) nor \(\pi_q(\delta | \hat{\delta})\) but it is possible to establish \(\psi(\delta|\hat{\delta})\) and the \(\alpha/2\) and \(1-\alpha/2\) quantiles of this distribution are then the upper and lower boundaries of a \(1-\alpha\) credible interval of \(\delta\).
The logic behind this modification is intuitive: treat the case like a randomly drawn observation from the control distribution whenever the case observation is more similar to the average of the controls than the majority of the control observations are. In practice, we take, for each boundary, the lower of the confidence and credible interval values.
Again, we set \(\lambda = \delta^2\) and \(\hat{\lambda} = \hat{\delta}^2\) and let \(\psi^*(\lambda)\) be the prior corresponding to \(\psi(\delta)\). This is a \(\chi^2\) distribution on \(\nu_1\) degrees of freedom. \[\begin{equation} \psi^*(\lambda) = \frac{1}{2^{\nu_1/2}\Gamma(\nu_1/2)}\lambda^{\nu_1/2-1}e^{-\lambda/2}, \ \lambda \geq 0. \tag{2.19} \end{equation}\] If we set \(m = n/(n-1)\) the likelihood of \(\lambda\), \(L(\lambda; \hat{\lambda})\), is given by \[\begin{equation} L(\lambda; \hat{\lambda}) = \sum_{r=0}^{\infty}\frac{(n\lambda/2)^re^{-n\lambda}}{B[(n-\nu_1)/2, \ \nu_1/2+r]r!}m^{\nu_1/2+r}\left(\frac{1}{1+m\hat{\lambda}}\right)^{n/2+r}\hat{\lambda}^{\nu_1/2+r-1}, \ \lambda \geq 0. \tag{2.20} \end{equation}\] Combining the prior with the likelihood, \(\psi^*(\lambda)L(\lambda; \hat{\lambda})\) yields the posterior \(\psi^*(\lambda | \hat{\lambda})\): \[\begin{equation} \psi^*(\lambda | \hat{\lambda}) = \frac{1}{c} \sum_{r=0}^{\infty}\frac{(n^r/2)(\lambda/2)^{\nu_1/2+r-1} e^{-(n+1)\lambda/2}}{B[(n-\nu_1)/2, \ \nu_1/2+r]\Gamma(\nu_1/2)r!}m^{\nu_1/2+r} \left(\frac{1}{1+m\hat{\lambda}}\right)^{n/2+r}\hat{\lambda}^{\nu_1/2+r-1}, \ \lambda \geq 0. \tag{2.21} \end{equation}\]
Here \(B(.,\ .)\) and \(\Gamma(.)\) are the beta and gamma functions, respectively, and \(c\) is the normalising constant, obtained through numerical integration of ((2.21)) over \(0 < \lambda < \infty\).
Sums over infinite ranges are problematic for practical computation. One way to deal with this
is to set reasonable endpoints instead of infinity. For the implementation in singcar
the endpoint of the
infinite sum was set by noting that the denominator in ((2.21))
goes to infinity faster than the numerator. In fact, in R
and many other languages
\(r! = \infty\) for \(r > 170\), so a reasonable boundary for the infinite sum would then be \(170\),
because anything divided by infinity is 0. However, for some parameter values infinity will be
approached in the numerator before it is approached in the denominator. Infinity divided
by any real value is still infinity but infinity divided by infinity is undefined.
Therefore, to be able to sum, we remove any infinite or undefined value if present.
As such, the integral over the sum in ((2.21)) will always be convergent and the normalising constant
\(c\) can be calculated.
singcar
uses the function integrate
in the base R
package stats
to do this. This
numerical integration procedure, based on routines from the QUADPACK package (Piessens et al. 2012),
generally yields accurate results. However, it works by subdividing the interval to be integrated,
and when integrating to infinity the region of interest may be undersampled. Hence,
in singcar
we set:
\[
c = \int_0^{0.5\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda + \int_{0.5\sqrt{\hat\lambda}}^{\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda +
\int_{\sqrt{\hat\lambda}}^{1.5\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda +
\int_{1.5\sqrt{\hat\lambda}}^{2\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda +
\int_{2\sqrt{\hat\lambda}}^{\infty}\psi^*(\lambda | \hat{\lambda}) \ d\lambda.
\]
This ensures heavier sampling around \(\sqrt{\hat\lambda}\), which of course is the area we are most
interested in. Further, for large values of \(\hat\lambda\) some discontinuities will be introduced
in the posterior which can be problematic for some numerical integration techniques as well but
the above subdivision will alleviate this problem somewhat.
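To make this concrete, a rough sketch of the procedure is given below. The helper names post_unnorm() and norm_const() are hypothetical and this is not the actual singcar code; it evaluates the unnormalised posterior ((2.21)) with the sum truncated at \(r = 170\), drops any non-finite terms, and obtains \(c\) with the subdivided integration shown above.
post_unnorm <- function(lambda, lambda_hat, n, k) {
  m <- n / (n - 1)
  r <- 0:170 # truncation point of the infinite sum
  sapply(lambda, function(l) {
    terms <- (n^r / 2) * (l / 2)^(k / 2 + r - 1) * exp(-(n + 1) * l / 2) /
      (beta((n - k) / 2, k / 2 + r) * gamma(k / 2) * factorial(r)) *
      m^(k / 2 + r) * (1 / (1 + m * lambda_hat))^(n / 2 + r) *
      lambda_hat^(k / 2 + r - 1)
    sum(terms[is.finite(terms)]) # drop infinite or undefined terms
  })
}
norm_const <- function(lambda_hat, n, k) {
  # Subdivision points ensure heavier sampling around sqrt(lambda_hat)
  cuts <- c(0, 0.5, 1, 1.5, 2) * sqrt(lambda_hat)
  sum(mapply(function(lo, hi)
    integrate(post_unnorm, lo, hi,
              lambda_hat = lambda_hat, n = n, k = k)$value,
    cuts, c(cuts[-1], Inf)))
}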
The q:th quantile of the posterior \(\psi^*(\lambda | \hat{\lambda})\), \(\lambda_q\) is then
found by an optimisation search procedure (nlminb
in stats
) that aims to minimise
\[
\left| \int_{\lambda_0}^{\lambda_q} \psi^*(\lambda | \hat{\lambda})d\lambda - q \right|.
\]
To get an estimate of \(p\) we use the same procedure to find the median of the posterior, \(\lambda_{0.5}\).
The modified estimator of \(p\) from Elfadaly, Garthwaite, and Crawford (2016), \(\hat{p}_{MD}\), is then:
\[
\hat{p}_{MD} = \min\left[\hat{p}_D, \ 1-G\left(\lambda_{0.5}\right) \right].
\]
Note, however, that because \(c\) will sometimes be very small, and hence \(1/c\)
very large, computations involving the normalised posterior can be very slow, but
this mostly happens when \(\hat\delta\) is so large that \(\hat{p}_D\) would
be greater than \(1-G(\lambda_{0.5})\) anyway. In singcar
it is therefore recommended
to use \(\hat{p}_{D}\) but to calculate \(\hat{p}_{MD}\) if the Mahalanobis distance
of the case seems suspiciously small.
The difference between the estimators \(\hat{p}_D\) and \(\hat{p}_{MD}\) is small, and Elfadaly, Garthwaite, and Crawford (2016) show that their bias and mean square error are almost identical, but they recommend \(\hat{p}_{MD}\). This is because, as \(\hat{\delta} \rightarrow 0\), \(\hat{p}_D \rightarrow 1\) much faster than it reasonably should.
It should be noted that the implementation of \(\hat{p}_{MD}\) in singcar
is
unstable when compared to the implementation by
Garthwaite, Elfadaly, and Crawford (2017) written in C
. I have not been able
to find the cause of this discrepancy, but it only happens for relatively large
\(\hat{\delta}\). It is likely that the numerical integration procedures differ
and that the procedure used in the implementation by
Garthwaite, Elfadaly, and Crawford (2017) better handles discontinuous
functions, which we see in the posterior ((2.21)) for high values of
\(\hat{\delta}\). Alternatively, the discontinuities cause problems for the
optimisation routine. It is therefore recommended that both estimators are
calculated and the smaller of the two is chosen; if that is
\(\hat{p}_{MD}\), then compare the result to the implementation by
Garthwaite, Elfadaly, and Crawford (2017).
singcar
Say that we want to know the case abnormality in the example data on both tasks simultaneously. To estimate this abnormality on a multivariate space, call:
MTD(case = c(caseA, caseB), controls = cbind(contA, contB),
conf_level = 0.95, method = c("pd", "pchi", "pf", "pmd"),
mahalanobis_dist = NULL, k = NULL, n = NULL)
As can be seen, \(\hat{p}_D\) is the default method to estimate the abnormality.
This is because the difference in results between \(\hat{p}_D\) and \(\hat{p}_{MD}\)
is very small, but the difference in efficiency is huge. In addition, "pmd"
in
singcar
is actually \(1-G(\lambda_{0.5})\) rather than \(\min[\hat{p}_D, \ 1-G(\lambda_{0.5})]\), so taking the minimum is left to the user. The reason
for this was to minimise confusion. However, I realise that this could yield the
opposite effect and might be changed in future updates.
The \(p\) value in the output of MTD()
is the \(p\) value associated
with the Hotelling’s \(T^2\) statistic and the same as the abnormality
estimate if the method pf
is chosen. The three last arguments are
only required if summary statistics are used.
A further capacity of singcar
is that it can be used to calculate
statistical power of the tests. The notion of power when comparing cases to
control samples has been somewhat overlooked for this class of statistical tests.
In recent work (McIntosh and Rittmo 2020), we argued that,
even though power is inherently limited in this paradigm, a priori calculations are
still useful for study design and interpretation in neuropsychological and other applications.
Calculating power for the test of deficit is similar to calculating power for
any \(t\) test and can be done analytically.
\[\begin{equation} power = 1 - \beta =
T_{n-1}\left(t_{\alpha, \ n-1} \Bigg\rvert \frac{y^* - \overline{y}}{\sigma
\sqrt{\frac{n+1}{n}}}\right)
\tag{2.22}
\label{eq:TDpower}
\end{equation}\]
where \(T_{n-1}(.\rvert \theta)\) is the cumulative distribution function for the
non-central \(t\) distribution with \(n-1\) degrees of freedom and non-centrality
parameter \(\frac{y^* - \overline{y}}{\sigma \sqrt{\frac{n+1}{n}}}\) (i.e., the TD, Equation
(2.1)), and \(t_{\alpha, \ n-1}\) is the \(\alpha\) quantile of the
\(t\) distribution on \(n-1\) degrees of freedom (note that this is for a one-sided
test). For the unstandardised difference test power is calculated in an
analogous way by putting Equation (2.3) as the non-centrality parameter. Deriving
power for the other functions in an analytic manner is however not possible (the
RSDT is only approximately \(t\) distributed) and a Monte Carlo approach has been
used for these tests. To call any power calculator in the package one simply
uses the function names with _power
added as a suffix.
So, for example, to calculate power for the test of deficit we call TD_power()
.
The expected case score and either sample size or desired power must
be supplied. The mean and standard deviation of the control sample
can also be specified with the arguments mean
and sd
.
If not, they take the default values of 0 and 1 respectively so
that the case score is interpreted as distance from the mean
in standard deviations. A conventional \(\alpha\)-level of
\(0.05\) is assumed if nothing else is supplied. The alternative
hypothesis can also be specified by the argument alternative
:
specify "less"
(default) or "greater"
for a one-tailed test, specify
"two.sided"
for a two-tailed test.
TD_power(case = 70,
mean = 100,
sd = 15,
sample_size = 16,
power = NULL,
alternative = "less",
alpha = 0.05,
spec = 0.005)
## [1] 0.5819579
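The same value can be obtained directly from ((2.22)) with a few lines of base R; td_power_analytic() below is a hypothetical helper included only to illustrate the analytic calculation for a one-sided (lower-tailed) test of deficit.
td_power_analytic <- function(case, mean, sd, n, alpha = 0.05) {
  ncp <- (case - mean) / (sd * sqrt((n + 1) / n)) # non-centrality parameter
  tcrit <- qt(alpha, df = n - 1)                  # one-sided critical value
  pt(tcrit, df = n - 1, ncp = ncp)                # P(t < t_crit | ncp)
}
td_power_analytic(case = 70, mean = 100, sd = 15, n = 16)
# approximately 0.582, in line with the TD_power() output above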
TD_power()
can also be used to calculate required sample size for a
desired level of power. For example, if we specify a desired power level of 0.6,
leave sample_size
to its default and let the rest of the arguments
be as in the previous example, we see from the output of the function that
power will not increase more than 0.5% for any additional participant after a sample
size of 15. That is, the algorithm stops searching
when this level of specificity has been reached and we are nearing the
asymptotic maximum power for this effect size. We can increase the specificity
by lowering the spec
argument.
TD_power(case = 70,
mean = 100,
sd = 15,
sample_size = NULL,
power = 0.6,
alternative = "less",
alpha = 0.05,
spec = 0.005)
## Power (0.578) will not increase more than 0.5% per participant for n > 15
## n power
## 1 15 0.5780555
Power calculators for the Bayesian tests of deficit cannot calculate
required sample size. This is because they rely on simulation methods
to estimate approximate power and deploying a search algorithm to find the required sample
size for a given level of power would be computationally too intense. The syntax
is otherwise relatively similar to that of TD_power()
. For BTD_power()
we have the two extra arguments nsim
and iter
, indicating the number
of simulations used in the power function and by BTD()
, respectively.
BTD_power(case = 70,
mean = 100,
sd = 15,
sample_size = 15,
alternative = "less",
alpha = 0.05,
nsim = 1000,
iter = 1000)
## [1] 0.574
The only difference in syntax of BTD_cov_power()
is due to the inclusion of covariates.
The variate of interest must be specified as a vector where the first element
gives the mean and the second the standard deviation in the argument control_task
.
The covariates can be specified similarly or as an \(m \times 2\) matrix where the first
column gives the means of each covariate and the second column gives the standard
deviations. The correlation matrix of the variates must be given as well. In the example
below, power is evaluated for a test taking two covariates into account, both with a mean
of 0 and a standard deviation of 1. The correlation is specified as a \(3\times 3\)
matrix with pairwise correlations of \(0.3\). The default settings include only one
covariate having a \(0.3\) correlation with the variate of interest.
This function is computationally intense and hence, the number
of simulations has, for the example below, been decreased.
covars <- matrix(c(0, 1,
                   0, 1), ncol = 2, byrow = TRUE)
BTD_cov_power(case = -2,
case_cov = c(0.2, -0.6),
control_task = c(0, 1),
control_covar = covars,
cor_mat = diag(3) + 0.3 - diag(c(0.3, 0.3, 0.3)),
sample_size = 15,
alternative = "less",
alpha = 0.05,
nsim = 100,
iter = 100)
## [1] 0.49
For the difference tests one must supply the expected case scores on both
variates as well as sample size. The means and standard deviations of the
control sample can also be specified. If unspecified, they take on the default values of
0 and 1 respectively, so that the expected case scores are interpreted as
distances from the means in standard deviations. RSDT_power()
,
BSDT_power()
and UDT_power()
additionally require an estimate of
the sample correlation between the variates of interest, r_ab
. If this is
not specified a correlation of 0.5 is assumed by default. For
BSDT_cov_power()
the correlation matrix between the variates of interest
and the covariates must instead be supplied (i.e., at least a \(3\times3\) matrix
where the first correlation is the correlation between the variates of
interest).
The alternative hypothesis is by default assumed to be
"two.sided"
since the direction of the effect is dependent on the
order of the inputs, but can be specified to be "less"
or
"greater"
as well. The syntax is similar for all three functions but with small
differences. For UDT_power()
one can request required sample size for a
desired power, as for TD_power()
. Calculators for the Bayesian tests
have the extra argument calibrated, which allows the prior to be specified.
BSDT_cov_power()
requires input in the same format as BTD_cov_power()
for both control_tasks
and control_covar
. The two examples
below demonstrate usage for RSDT_power()
and BSDT_cov_power()
.
RSDT_power(case_a = 70,
case_b = 55,
mean_a = 100,
mean_b = 50,
sd_a = 15,
sd_b = 10,
r_ab = 0.5,
sample_size = 15,
alternative = "two.sided",
alpha = 0.05,
nsim = 1000)
## [1] 0.604
cor_mat <- matrix(c(1, 0.5, 0.6,
                    0.5, 1, 0.3,
                    0.6, 0.3, 1), ncol = 3, byrow = TRUE)
BSDT_cov_power(case_tasks = c(70, 55),
case_cov = 65,
control_tasks = matrix(c(100, 15,
50, 10), ncol = 2, byrow = TRUE),
control_covar = c(50, 25),
cor_mat = cor_mat,
sample_size = 15,
alternative = "two.sided",
alpha = 0.05,
nsim = 100,
iter = 100,
calibrated = TRUE)
## [1] 0.76
This vignette has reviewed several statistical methods to compare a single case to
a small control sample. It has exemplified practical usage of these methods with
implementations in singcar,
where possible, and in lme4
where not. Using repeated measures data and linear
mixed models has been shown to potentially produce higher power when testing for
a discrepancy between two variates than the more standard RSDT, in a very small
simulation study. However, when no discrepancy is present but the case exhibits
severe impairments on both variates the mixed model yielded an extremely high
error rate, for the specific scenario investigated. So when estimating dissociations
it is recommended to use one of the Bayesian tests on aggregated data.
Since more complex model structures can be created with linear mixed models compared to the more standard procedures it is unfortunate that their usage for finding dissociations seems limited. This is on the other hand somewhat made up for by the implementation of the Bayesian regression techniques in Section 2.3.4. In addition, implementation of (relatively) unbiased techniques to estimate case abnormality in multidimensional space, Section 2.5, also offers the possibility of increased complexity.
The methods described have been developed for keeping transparent control over Type I errors, but power calculators have been implemented in the package as well. Consideration of power can assist researchers in study design and in setting realistic expectations for what these types of statistical hypothesis tests can achieve (McIntosh and Rittmo 2020).
Code for the function powerLMM()
which evaluates power for the linear
mixed model used to test for the presence of a dissociation and
compares it to the power for either the RSDT or the BSDT, with data
distributed as described in Section 2.4.1.
powerLMM <- function(nsim, case_impairment = c(0, 0, 0, 0),
                     compare_against = c("RSDT", "BSDT"), alpha = 0.05) {
  require(lme4)
  require(lmerTest)
  require(MASS)
  comptest <- match.arg(compare_against)
  pdt <- vector(length = nsim)
  plm <- vector(length = nsim)
  simdata <- expand.grid(case = c(1, rep(0, 29)),
                         item = factor(1:2), condition = factor(1:2))
  simdata$ID <- factor(rep(1:30, 4))
  contrasts(simdata$condition) <- contr.sum(2)
  mu <- c(100, 80, 50, 20)
  sigma <- matrix(c(15^2, 120, 45, 7,
                    120, 10^2, 11, 15,
                    45, 11, 10^2, 40,
                    7, 15, 40, 5^2), nrow = 4, byrow = TRUE)
  for (i in 1:nsim) {
    dv <- mvrnorm(30, mu = mu, Sigma = sigma)
    dv[1, ] <- dv[1, ] + case_impairment
    simdata$dv <- c(dv)
    model2 <- lmer(dv ~ case*condition + (condition|ID) + (1|item), data = simdata)

    # Difference test (RSDT or BSDT) on data aggregated over items
    agg_data1 <- rowMeans(dv[, 1:2])
    agg_data2 <- rowMeans(dv[, 3:4])
    if (comptest == "RSDT") {
      pdt[i] <- RSDT(agg_data1[1], agg_data2[1],
                     agg_data1[-1], agg_data2[-1])[["p.value"]]
    } else {
      pdt[i] <- BSDT(agg_data1[1], agg_data2[1],
                     agg_data1[-1], agg_data2[-1])[["p.value"]]
    }
    # p value of the interaction (dissociation) term in the LMM
    plm[i] <- summary(model2)[["coefficients"]][4, 5]
  }
  if (comptest == "RSDT") {
    data.frame(RSDT = sum(pdt < alpha)/nsim, LMM = sum(plm < alpha)/nsim)
  } else {
    data.frame(BSDT = sum(pdt < alpha)/nsim, LMM = sum(plm < alpha)/nsim)
  }
}