Abstract
Statistical comparison of single cases to small samples is a methodology that has been used extensively in, for example, cognitive and clinical neuropsychology. This is most often done to determine changes in cognitive processing after an individual has incurred some type of brain damage. In a clinical setting one often wishes to infer whether a patient exhibits abnormally low performance on some cognitive ability. In cognitive neuropsychology, on the other hand, one often wishes to infer whether a patient exhibits an abnormally large discrepancy in performance between two cognitive abilities. Because cognitive abilities are seldom well represented by a single performance on a single task, one might additionally be interested in the abnormality of a case on several measurements converging on a cognitive ability of interest, or the abnormality of a case in a multivariate space. Several methods to estimate case abnormality have been developed that keep the Type I error rate at its nominal level. However, they have not been available in any standard statistical software environment and their documentation is spread thin across multiple articles and compiled computer programs. This vignette aims to gather and review the most popular methods while presenting them and their usage in the R
package singcar
. Of particular note are the more flexible and useful methods that have not received as much attention as the simpler ones. These include techniques using Bayesian regression to allow for the inclusion of covariates and linear mixed models to handle repeated measures data. Additionally, statistical comparison of single cases to a control population is inherently low powered. To facilitate experimental planning and design, power calculators have been implemented in singcar
and the concept of power for this type of statistical analysis is reviewed.
There are many situations in which one might need to determine, probabilistically, whether a single observation belongs to some population distribution. In some areas, such as neuropsychology, this methodology has been essential for the advancement of the field. This is because brain-damaged patients, with their particular patterns of associated cognitive impairments, are often unique cases. A single patient might then be the only available source of data for a phenomenon of interest. In addition, clinical diagnoses and descriptions of patient impairments are always performed for single cases.
Patients with brain damage are hence often compared to a distribution of cognitive functioning in the healthy population, both in research and clinical contexts. However, we do not always know the population parameters of such a distribution, and when we do not, they must be estimated from a sample – that is, we compare the single case to a control sample. There are other potential application areas for this method, such as studies of uncommon expertise, targeted quality checks in small production industries or some animal studies where rearing large experimental groups might be too costly or unethical.
However, since these methods are most commonly used within the field of neuropsychology, the related nomenclature will be adopted here. An abnormally low score on some variate compared to the control sample will be referred to as a deficit, a concept important both in the clinic and in research. For the latter, another concept is also considered to be of great interest: namely an abnormally large discrepancy between two variates. This is known as a dissociation and provides evidence for independence between cognitive abilities. By mapping such discrepancies it is possible to build theories of the architecture of the human mind (Shallice 1988).
During the last 20 years, methods allowing researchers to estimate abnormality and test for deficits and dissociations with retained control of Type I errors have been developed. This has mainly been done by John Crawford and Paul Garthwaite, e.g. Crawford and Howell (1998), Crawford, Garthwaite, and Ryan (2011), Crawford and Garthwaite (2007), Crawford and Garthwaite (2002) and Crawford and Garthwaite (2005). These methods have been implemented in freely available software, but only as standalone programs that are not open source and only run on Windows. Most of the programs require manual input of summary statistics and cannot be used on raw data. Full documentation of the implemented methods is only available in the original publications.
This was the motivation behind the development of the R
package singcar
(Rittmo and McIntosh 2021). We wanted to encourage and simplify the usage of these
methods by bringing them together in a fully documented package with open source
code that works across platforms. In Section 2 all implemented methods
will be described in detail. This contribution will provide a comprehensive
overview of the methods available together with their advantages and disadvantages.
The development of Crawford and Garthwaite’s methods has been focused on limiting Type I errors. But in recent work we showed the inherent inability to control Type II errors with this experimental design and argued that, for some applications of the methods, this limitation might even be more detrimental than an increased Type I error rate (McIntosh and Rittmo 2020). Therefore we also provide power calculators for each test function in the package.
When we conduct single case comparisons we want to estimate the probability that a random draw from the distribution estimated by the sample is more extreme than the single observation we are interested in. This means that the power of a case comparison hypothesis test is contingent on the standard deviation of the distribution, not on the standard error of the test statistic. Increasing the sample size (which is usually done to increase power) then only leads to a better estimation of the control distribution, not to a reduction of the standard error. Hence, the limit of achievable power is mostly determined by the size of the effect. When investigating unique cases it is not always possible to target larger effects to get around this problem.
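To make the power ceiling concrete, the power of a one-tailed test of a deficit can be approximated directly from the non-central \(t\) distribution. The short sketch below assumes that the case is modelled as a random draw from a population shifted a fixed number of control standard deviations below the control mean (an assumption made here purely for illustration); in practice the power calculators in singcar should be used.

power_td <- function(n, zcc = -2, alpha = 0.05) {
  ncp <- zcc * sqrt(n / (n + 1))      # non-centrality parameter of the test statistic
  pt(qt(alpha, df = n - 1), df = n - 1, ncp = ncp)
}
power_td(n = 10)   # power with 10 controls
power_td(n = 100)  # ten times as many controls adds only a little power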
One approach that often is highlighted for its potential to increase power is multivariate hypothesis testing. This is however only true to a certain extent (Frane 2015) and one should not force a multivariate analysis on a univariate research question. Focus hitherto has been univariate in nature, but for a neuropsychological dissociation to be convincing the optimal experimental design would include multiple tasks converging on the cognitive functions of interest, performed on more than one occasion (Shallice 1988; McIntosh and Rittmo 2020). Integrating such information both across tasks as well as over several performances could be aided by multivariate analyses.
This calls for methods that estimate a case’s abnormality in a
multidimensional space estimated by a sample. The most obvious way to do this
would of course entail the \(p\) value of a Hotelling’s \(T^2\)-score. However,
Elfadaly, Garthwaite, and Crawford (2016) show that this statistic is
substantially biased as a measure of abnormality. They investigate nine other candidates including two point
estimates based on polynomial expansion, which gave rise to the lowest bias.
These point estimates would, however, sometimes yield estimates out of the [0,
1] range and their interpretation is not very intuitive. Because ease of
interpretation is paramount in scientific reporting
Elfadaly, Garthwaite, and Crawford (2016) suggested using an adapted median
estimator, unless bias is of “immediate importance”, in which case the
polynomial estimators should be used. A compiled computer program
for this adapted median estimator is available (Garthwaite, Elfadaly, and Crawford 2017),
but again, no open source software working across platforms.
This relatively novel estimation method has also been implemented in
singcar
.
The following section, Section 2, will review available
methods for single case comparisons, their theory and how they can be
used in R
. Section 2.5.3 gives a brief review of statistical
power in this experimental design and the implementation of power
calculators in singcar
, but for a more thorough and perhaps
more accessible review see McIntosh and Rittmo (2020).
Variates of interest will in a neuropsychological context often be scores on some task assessing a cognitive function. Therefore they will often be referred to as “task scores”. It has not been uncommon in this field to compare patients to a control sample, treating the sample statistics as population parameters and estimating deficits by evaluating the \(p\) value associated with the patient’s \(Z\) score from the estimated distribution. Estimating dissociations was done in a similar way using a method developed by Payne and Jones (1957), where a \(Z\) score is calculated from the distribution of difference scores in the control sample.
This is of course problematic if small samples are used, which is often the case in neuropsychology, since the sampling distribution of the sample variance is right skewed for small sample sizes. If a test statistic is divided by a quantity drawn from such a skewed distribution, the obtained values, and thus the Type I error rate, will tend to be inflated (Crawford and Howell 1998).
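A small simulation illustrates the problem. Here 10,000 cases are drawn from the same population as their \(n = 5\) controls and tested with the naive \(Z\) score method at a nominal one-tailed \(\alpha\) of 0.05 (the setup is purely illustrative):

set.seed(1)
false_positives <- replicate(10000, {
  controls <- rnorm(5)                          # control sample from N(0, 1)
  case <- rnorm(1)                              # case drawn from the same population
  z <- (case - mean(controls)) / sd(controls)   # naive Z score
  pnorm(z) < 0.05                               # "significant" deficit?
})
mean(false_positives)  # empirical Type I error rate, well above the nominal 0.05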
In the following sections I will describe methods that have been devised as replacements for the \(Z\) score methods. Both frequentist and Bayesian significance tests have been developed and are covered in Section 2.2 and Section 2.3 respectively. More complex model designs can be achieved using Bayesian regression techniques, described in Section 2.3.4, or linear mixed models, described in Section 2.4. For estimating multivariate abnormality, see Section 2.5.
In the following sections the theory behind the methods in singcar
is
reviewed. But at the end of each subsection the usage of the implementations
will be exemplified. To aid this the package comes with the dataset
size_weight_illusion
, which is a dataset from a neuropsychological study
investigating the size-weight illusion in the well-known patient “DF”
(Hassan et al. 2020). DF incurred damage to the lateral
occipital complex which gave rise to visual form agnosia, potentially making her less
susceptible to the size-weight illusion (Dijkerman et al. 2004).
This illusion is a perceptual phenomenon whereby smaller objects are
perceived as heavier during lifting than larger objects of equal
weight (Buckingham 2014).
However, due to the location of DF’s brain damage one would only expect reduced susceptibility to the illusion for visual cues, since her other sensory processing should be unaffected by the damage. In the study by Hassan et al. (2020) the illusion is tested given visual cues in one condition and kinaesthetic cues in another. It was predicted that DF would be abnormally less susceptible to the visual size-weight illusion and that there would be an abnormally large discrepancy between her visual and kinaesthetic size-weight illusion. The control sample consisted of 28 participants and their age and sex were collected along with their performance in the two conditions. The size-weight illusion is given in the dataset as a scaled measure of the number of grams of weight difference perceived per cubic cm of volume change. To use this data, start by installing and loading the package:
install.packages("singcar")
library("singcar")
head(size_weight_illusion)
## GROUP PPT SEX YRS V_SWI K_SWI
## 1 SC DF Female 65 0.02814921 0.10012712
## 2 HC E01 Female 70 0.21692634 0.21792930
## 3 HC E02 Male 64 0.17223226 0.26338899
## 4 HC E03 Female 63 0.07138049 0.09331700
## 5 HC E04 Female 65 0.10186453 0.25938045
## 6 HC E05 Female 66 0.15911439 0.08922615
V_SWI
and K_SWI
are the measurements for visual and kinaesthetic
size-weight illusion respectively. Most functions in singcar
can take summary data as input, but for demonstrative purposes the raw
data of this dataset will be used. To illustrate usage
of the methods more generally, the scores of the variates of interest
as well as of the covariate YRS
are extracted from the data and renamed:
caseA <- size_weight_illusion[1, "V_SWI"]
contA <- size_weight_illusion[-1, "V_SWI"]
caseB <- size_weight_illusion[1, "K_SWI"]
contB <- size_weight_illusion[-1, "K_SWI"]
caseAGE <- size_weight_illusion[1, "YRS"]
contAGE <- size_weight_illusion[-1, "YRS"]
Here, caseA
and caseB
refer to the case scores on tasks A
and B respectively. Similarly, contA
and contB
refer to the scores
of the control sample on the same tasks.
When we are comparing a patient to a healthy population on some normally distributed variate that we are estimating from a small sample, we are in fact not estimating abnormality from a normal distribution but from a central \(t\) distribution.
Hence, we must use a \(t\) score rather than a \(Z\) score if we want to estimate deficits without inflating effect sizes. This was noted by Sokal and Rohlf (1981) (p. 227), who devised a test statistic for single cases in the biological sciences, but it was first popularised within neuropsychology by Crawford and Howell (1998). Its common application is as a one-tailed “test of deficit” (TD) that estimates the proportion of the healthy population that would exhibit a lower score than the case on some cognitive task. If this proportion falls below some chosen threshold the patient can be diagnosed with a deficit.
This basic approach is simply a modified two-sample \(t\) test,
\[\begin{equation} t_{n-1} = \frac{y^* - \overline{y}}{s \sqrt{\frac{n + 1}{n}}} \tag{2.1} \end{equation}\]
where one of the samples is treated as having size \(n = 1\). The degrees of freedom then quite naturally become \(n+1-2 = n -1\). \(\overline{y}, \ s\) and \(n\) are the control sample mean, standard deviation and size, and \(y^*\) is the case score. This method allows transparent control over the Type I error rate (Crawford, Garthwaite, and Howell 2009; Crawford et al. 2004; Crawford and Garthwaite 2012) when used for significance testing. In addition, the \(p\) value associated with the obtained \(t\) statistic gives an unbiased point estimate of the abnormality of the case. That is, it provides us with the expected percentage of the normal population that would exhibit a more extreme score than the case. This estimate is often of main interest in neuropsychology. The simple proof of this is demonstrated by Crawford and Garthwaite (2006), as outlined below.
The percentage of a population expected to score lower on some variate \(Y\) than the case \(y^*\) is \[ \mathbb{P}[Y< y^*]. \] Subtracting \(\overline{y}\) from both sides of the inequality and dividing by \(s \sqrt{\frac{n + 1}{n}}\) gives \[ \mathbb{P}[Y< y^*] =\mathbb{P}\left[\frac{Y-\overline{y}}{s \sqrt{\frac{n + 1}{n}}} < \frac{y^* - \overline{y}}{s \sqrt{\frac{n + 1}{n}}}\right]. \] The quantity to the left of the inequality, i.e., \(\frac{Y-\overline{y}}{s \sqrt{\frac{n + 1}{n}}}\), is \(t\) distributed with \(n-1\) degrees of freedom. Hence, \[ \mathbb{P}[Y< y^*] =\mathbb{P}\left[t_{n-1} < \frac{y^* - \overline{y}}{s \sqrt{\frac{n + 1}{n}}}\right]. \] To the right of the inequality we have the test statistic from ((2.1)). Therefore, \(\mathbb{P}[Y< y^*]\) is the \(p\) value from the test of deficit. This simple fact also makes the construction of confidence intervals for the abnormality (i.e. \(p\)) possible.
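As a concrete illustration, the test statistic in ((2.1)) and its associated \(p\) value can be computed by hand from the example data introduced in Section 2.1.1; this should reproduce the output of TD() shown further below.

n <- length(contA)
t_stat <- (caseA - mean(contA)) / (sd(contA) * sqrt((n + 1) / n))  # eq. (2.1)
p_val <- pt(t_stat, df = n - 1)  # proportion of controls expected to score below the case
c(t = t_stat, p = p_val)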
Although the \(Z\) score is not an appropriate statistic to use for significance testing when the control sample is small, it does provide a standardised effect measure of the deficit of interest similar to Cohen’s \(d\), because insensitivity to sample size is a requirement for effect size measures (Cohen 1988; Crawford, Garthwaite, and Porter 2010). So, Crawford, Garthwaite, and Porter (2010) proposed the use of \(Z\) scores as effect size measures within single-case neuropsychology and dubbed this measure \(Z_{CC}\) where the subscript indicates “case-controls” comparison: \[\begin{equation} Z_{CC} = \frac{y^* - \overline{y}}{s}. \tag{2.2} \end{equation}\] If \(p\) is the percentage of the population that would fall below or above a case score, \(Z_{CC}\) can be used to construct \(100(1-\alpha)\)% confidence intervals around \(p\). The method to do so was described by Crawford and Garthwaite (2002).
Given \(y^* \neq \overline{y}\), \(Z_{CC}\sqrt{n}\) follows a non-central \(t\) distribution on \(n-1\) degrees of freedom. We find the non-centrality parameter (NCP) \(\delta_U\) of a non-central \(t\) distribution with \(df=n-1\) such that its \(100\frac{\alpha}{2}\) percentile equals \(Z_{CC}\sqrt{n}\), and correspondingly we find the NCP \(\delta_L\) of a non-central \(t\) distribution such that its \(100(1-\frac{\alpha}{2})\) percentile equals \(Z_{CC}\sqrt{n}\). The easiest approach to finding these quantities is a search algorithm. We can then construct the upper and lower boundaries for \(Z_{CC}\) as: \[ Z_{CC_U} = \frac{\delta_U}{\sqrt{n}}, \ \ Z_{CC_L} = \frac{\delta_L}{\sqrt{n}}. \] When the case falls on the left side of the distribution the boundaries for \(p\) are given by: \[ p_U = \Phi\left(\frac{\delta_U}{\sqrt{n}}\right), \ p_L = \Phi\left(\frac{\delta_L}{\sqrt{n}}\right), \] where \(\Phi\) is the CDF of the standard normal distribution. And when the case falls on the right side of the distribution the boundaries are given by: \[ p_U = 1 - \Phi\left(\frac{\delta_L}{\sqrt{n}}\right), \ p_L = 1 - \Phi\left(\frac{\delta_U}{\sqrt{n}}\right). \]
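The sketch below shows the logic of this search using base R's qt() and uniroot(), with purely illustrative values of \(Z_{CC}\), \(n\) and \(\alpha\) (TD(), presented below, reports these intervals directly).

zcc <- -2; n <- 16; alpha <- 0.05   # illustrative values
target <- zcc * sqrt(n)
delta_U <- uniroot(function(d) qt(alpha / 2, df = n - 1, ncp = d) - target,
                   interval = c(-20, 0))$root   # interval chosen to bracket the root
delta_L <- uniroot(function(d) qt(1 - alpha / 2, df = n - 1, ncp = d) - target,
                   interval = c(-20, 0))$root
c(Z_CC_L = delta_L / sqrt(n), Z_CC_U = delta_U / sqrt(n))          # interval for Z_CC
c(p_L = pnorm(delta_L / sqrt(n)), p_U = pnorm(delta_U / sqrt(n)))  # case on the left side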
Evidently, estimating abnormality on a single variate is trivial. It is equally simple to estimate abnormality on the difference between two variates \(Y_1 - Y_2\) if the normally distributed variates \(Y_1\) and \(Y_2\) do not need standardisation to be comparable. If this is the case then the test function is similar to ((2.1)) but one divides the differences by the standard deviation of the difference scores: \[\begin{equation} t_{n-1} = \frac{(y^*_1 - \overline{y}_1) - (y^*_2 - \overline{y}_2) }{ \sqrt{(s^2_1 +s^2_2 -2s_1 s_2 \rho_{12})(\frac{n+1}{n})}} \tag{2.3} \end{equation}\] where \(\rho_{12}\) is the correlation between the variates. Confidence intervals for this “unstandardised difference test” (UDT) are constructed as for the TD. Applications testing unstandardised differences are however scarce, at least within neuropsychology. Much more common is the need to assess abnormality of differences between variates that require standardisation to be comparable.
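Before turning to the standardised case, note that ((2.3)) is straightforward to compute by hand; the summary statistics below are purely illustrative.

case_a <- 70; case_b <- 80   # case scores on the two tasks
m_a <- 100; m_b <- 90        # control means
s_a <- 15; s_b <- 10         # control standard deviations
r_ab <- 0.6; n <- 20         # correlation between tasks and control sample size
t_stat <- ((case_a - m_a) - (case_b - m_b)) /
  sqrt((s_a^2 + s_b^2 - 2 * s_a * s_b * r_ab) * (n + 1) / n)  # eq. (2.3)
2 * pt(-abs(t_stat), df = n - 1)  # two-sided p value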
If we just standardise the variates in ((2.3)) and do not take sample size into account, we get an effect size measure of the difference (dissociation) similar to \(Z_{CC}\), ((2.2)): \[\begin{equation} Z_{DCC} = \frac{z^*_1 - z^*_2}{\sqrt{2-2\rho_{12}}}. \tag{2.4} \end{equation}\] The two quantities in the numerator are the standardised case scores on variate \(Y_1\) and \(Y_2\) and the subscript indicates “discrepancy-case-controls”. This measure was proposed as a reporting standard in the single-case literature by Crawford, Garthwaite, and Porter (2010), but was first suggested as a statistic for significance testing by Payne and Jones (1957). It is of course not appropriate for such use since we would get an inflated resultant statistic (\(Z_{DCC}\)) because the quantities in the numerator (\(z^*_1\) and \(z^*_2\)) are inflated.
When we only can estimate standardised case scores from small samples and need to test if the difference between them is abnormally large we are in fact estimating the difference between two \(t\) distributed variates. Since linear combinations of correlated \(t\) distributed variates are not \(t\) distributed themselves this is not a trivial problem.
Garthwaite and Crawford (2004) examined the distribution of such difference scores using asymptotic expansion. They searched for quantities that, divided by a function of the sample correlation, would be asymptotically \(t\) distributed. They found:
\[\begin{equation} \psi=\frac{\frac{(y^*_1-\overline{y}_1)}{s_{1}}-\frac{(y^*_2-\overline{y}_2)}{s_{2}}}{ \sqrt{ (\frac{n+1}{n}) \left( (2-2 \rho)+ \frac{2(1-\rho^{2})}{n-1}+ \frac{(5+c^{2})(1-\rho^{2})}{2(n-1)^{2}}+ \frac{\rho(1+c^{2})(1-\rho^{2})}{2(n-1)^{2}}\right) }}, \end{equation}\]
\(\rho\) being the sample correlation and \(c\) a pre-specified critical two-tailed \(t\) value on \(df= n-1\). They showed that \(\mathbb{P}[ \psi > c] \approx\mathbb{P}[t >c]\) and that solving \(\psi = c\) for \(c\) gives a precise probability for \(\psi\) that does not depend on a pre-specified critical value. This is a quadratic equation in \(c^2\); choosing the positive root yields:
\[\begin{align} c & = \sqrt{\frac{ -b + \sqrt{b^2 - 4ad}}{2a}}, \ \text{where} \\ a & = (1+r)(1-r^2), \\ b & = (1-r)[4(n-1)^2+4(1+r)(n-1)+(1+r)(5+r)], \\ d & = - 2\left[\frac{y^*_{1} - \overline{y}_1}{s_1}-\frac{y^*_2 -\overline{y}_2}{s_2}\right]^2\left(\frac{n(n-1)^2}{n+1}\right) \tag{2.5} \end{align}\]
where \(p = \mathbb{P}[t_{n-1}>c]\) is used for significance testing and is the point estimate of the percentage of the control population that is expected to show a more extreme difference score than the case. Note that the test statistic \(c\) will always be positive, no matter the direction of the effect.
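The computation in ((2.5)) is easy to carry out directly. The sketch below uses purely illustrative summary statistics and returns the one-tailed \(p\); the two-sided \(p\) value reported by RSDT() further below is twice this quantity.

rsdt_stat <- function(case_a, case_b, m_a, m_b, s_a, s_b, r, n) {
  a <- (1 + r) * (1 - r^2)
  b <- (1 - r) * (4 * (n - 1)^2 + 4 * (1 + r) * (n - 1) + (1 + r) * (5 + r))
  d <- -2 * ((case_a - m_a) / s_a - (case_b - m_b) / s_b)^2 *
    (n * (n - 1)^2 / (n + 1))
  c_stat <- sqrt((-b + sqrt(b^2 - 4 * a * d)) / (2 * a))  # positive root, eq. (2.5)
  c(c = c_stat, p = pt(c_stat, df = n - 1, lower.tail = FALSE))
}
rsdt_stat(case_a = 70, case_b = 80, m_a = 100, m_b = 90,
          s_a = 15, s_b = 10, r = 0.6, n = 20)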
Crawford and Garthwaite (2005) refer to this test statistic as the “revised standardised difference test” (RSDT) because it was an iteration on a test previously developed by the authors that did not keep control of the Type I error rate (Crawford, Howell, and Garthwaite 1998). Crawford and Garthwaite (2005) show that the RSDT succeeds fairly well in this regard.
Using simulations the RSDT was shown to barely exceed the specified error rate even for very small samples such as \(n = 5\). However, when a case lacks a dissociation but exhibits extreme scores on both variates (in the same direction of course) the error rate increases steeply. According to one simulation study the RSDT starts to lose control of the error rate for scores that are simultaneously more extreme than two standard deviations away from the mean; for task scores as extreme as 8 standard deviations away from the mean, error rates as high as 35% were observed (Crawford and Garthwaite 2007).
Another issue with the RSDT is that, because \(\psi\) is only approximately \(t\) distributed, constructing confidence intervals for estimated abnormality in the same manner as for the TD is not possible. So to remedy these drawbacks of this test statistic Crawford and colleagues looked into Bayesian methodology.
singcar
The test of deficit is called in the singcar
package with the function TD()
.
Testing for a deficit using the example data from Section 2.1.1 we have:
TD(case = caseA, controls = contA, alternative = "less", conf_level = 0.95)
##
## Test of Deficit
##
## data: Case = 0.03, Controls (m = 0.16, sd = 0.08, n = 28)
## t = -1.7243, df = 27, p-value = 0.04804
## alternative hypothesis: true diff. between case and controls is less than 0
## sample estimates:
## Std. case score (Z-CC), 95% CI [-2.34, -1.15]
## -1.754857
## Proportion below case (%), 95% CI [0.95, 12.47]
## 4.804003
The argument alternative specifies the alternative hypothesis, i.e.
whether we are performing a two-sided or a one-sided test and, if the latter, in which direction.
If one instead wants to test for a dissociation the most common alternative hypothesis
is two-sided. Using the RSDT for this test, call:
RSDT(case_a = caseA, case_b = caseB, controls_a = contA, controls_b = contB,
alternative = "two.sided")
##
## Revised Standardised Difference Test
##
## data: Case A: 0.03, B: 0.10, Ctrl. A (m, sd): (0.16, 0.08), B: (0.18, 0.10)
## approx. abs. t = 1.015, df = 27, p-value = 0.3191
## alternative hypothesis: true difference between tasks is not equal to 0
## sample estimates:
## Std. case score, task A (Z-CC) Std. case score, task B (Z-CC)
## -1.7548574 -0.7836956
## Std. task discrepancy (Z-DCC) Proportion below case (%)
## -1.0647889 15.9560625
Notably, this function does not produce confidence intervals. If the two tasks
are directly comparable without the need of standardisation (which in fact is the
case for this specific dataset), call instead UDT()
with the same syntax and
confidence intervals will be given.
The most prominent difference between the frequentist and the Bayesian framework is that in the former parameters are treated as fixed attributes of a population and in the latter they are themselves treated as random variables with associated distributions. To estimate these one can use prior knowledge about the parameter of interest which is specified as a prior distribution. There is often a desire that the data should be the only thing influencing estimations. If so, one can reduce the impact of the prior by, for example, assigning equal probabilities to all possible parameter values. This is an example of what is known as a non-informative prior distribution. This distribution is then updated when new information is obtained from data and as such forms the posterior parameter distribution. This is calculated using Bayes theorem and if we omit the normalising constant in the denominator of Bayes theorem it can be expressed: \[\begin{equation*} posterior \ \propto \ likelihood \times prior. \end{equation*}\]
If we have a hypothesis for a parameter value, what the above is saying is that the posterior density of this hypothesis is proportional (\(\propto\)) to the likelihood of the data under that hypothesis multiplied by the prior probability of the hypothesis.
One of the disadvantages of Bayesian methodology is that it might be impossible, or at least very difficult, to calculate the posterior analytically. Posteriors are, however, often feasible to construct numerically using Monte Carlo methods. One of the reasons Bayesian statistics has seen such an upsurge is the increase in computational power in recent years, which has made many otherwise infeasible numerical strategies possible. Monte Carlo methods are mathematical algorithms that solve numerical problems by repeated random sampling. The algorithms vary depending on the problem at hand, but for Bayesian methodology they all build on rules for drawing random values based on the likelihood and prior – as such “constructing” the posterior from the draws after a large number of iterations. The peak of the constructed distribution is often the parameter estimate of interest and the width is a measure of the certainty of the estimation. If non-informative priors are used, the peak often coincides with the maximum likelihood estimate of the parameter. This means that non-informative priors often yield estimators with frequentist properties, necessary for null hypothesis significance testing.
Because the testing property of the estimation methods presented here is of main interest,
frequentist properties were required when Crawford and Garthwaite (2007) and
Crawford, Garthwaite, and Ryan (2011) devised Bayesian analogues to the test functions
presented in Section 2.2. The procedural details of these methods
will be described in the following sections, both to document
how they were implemented in singcar
and to give an overview of how they operate.
The methods are all based on Monte Carlo simulation and some are computationally
very intensive. They are implemented in singcar
as straight R code but would probably
benefit from being implemented in compiled code. This is an aim
for future updates of the package. The algorithms presented below closely
follow the original articles, with a few slight changes of notation to make
the presentation more coherent.
The Bayesian analogue of the test of deficit (BTD) produces output that is asymptotically identical to that of the frequentist test of deficit. It produces an estimate of \(p\) with accompanying credible intervals, i.e. the percentage of controls that would exhibit a more extreme score than the case. The method is included here for completeness, but for actual analysis there is no advantage to using it over the TD, ((2.1)).
From a sample of size \(n\) we obtain measurements of some normally distributed variate \(Y\) with unknown mean \(\mu\) and unknown variance \(\sigma^2\). Let \(\overline{y}\) and \(s^2\) denote the sample mean and variance and \(y^*\) the case score. The prior distribution of \(\mu\) is conditioned upon the prior distribution of \(\sigma^2\) since both parameters are unknown. The prior considered here is \(f(\mu, \sigma^2) \propto (\sigma^2)^{-1}\), which is the standard non-informative prior for normal data. The marginal posterior distribution of \(\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}\) and the posterior distribution \(\mu|\sigma^2 \sim \mathcal{N}(\overline{y}, \sigma^2/n)\), see e.g., Gelman et al. (2013) (p. 45 and 65).
When we have a posterior distribution for the control population parameters, \(p\) can be estimated by iterating the following steps.
Draw a random observation \(\psi\) from a \(\chi^2_{n-1}\) distribution and let \(\hat{\sigma}^2_{(i)} = \frac{(n-1)s^2}{\psi}\) be the estimate of \(\sigma^2\) for the \(i\)th iteration.
Draw a random observation \(z\) from a standard normal distribution and let \(\hat{\mu}_{(i)} = \overline{y} + z\hat{\sigma}_{(i)}/\sqrt{n}\) be the estimate of \(\mu\).
Compute \(\hat{z}^*_{(i)} = (y^* - \hat{\mu}_{(i)})/\hat{\sigma}_{(i)}\); the estimate \(\hat{p}_{(i)}\) is then the tail area of the standard normal distribution more extreme than \(\hat{z}^*_{(i)}\).
Iterating these steps a large number of times yields a distribution of \(\hat{p}\), the mean of which is the point estimate of \(p\), which can be used for significance testing. The 2.5th and 97.5th percentile of the distribution of \(\hat{p}\) as well as of \(z^*\) gives the boundaries for the 95% credible interval of \(\hat{p}\) and \(\hat{Z}_{CC}\) respectively. The point estimate of the latter is that of ((2.2)).
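For illustration, the whole procedure can be expressed in a few lines of base R using the example data; up to Monte Carlo error this sketch should closely reproduce the output of BTD() shown further below.

set.seed(123)
n <- length(contA)
psi <- rchisq(10000, df = n - 1)                            # chi-square draws
sigma_hat <- sqrt((n - 1) * var(contA) / psi)               # posterior draws of sigma
mu_hat <- mean(contA) + rnorm(10000) * sigma_hat / sqrt(n)  # posterior draws of mu
p_hat <- pnorm((caseA - mu_hat) / sigma_hat)                # proportion below the case
mean(p_hat)                                                 # point estimate of p
quantile(p_hat, c(0.025, 0.975))                            # 95% credible interval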
In contrast to the BTD, the Bayesian analogue of the difference tests does not
produce asymptotically identical results to the RSDT.
For the Bayesian standardised difference test (BSDT) we now obtain
measurements on \(Y_1\) and \(Y_2\) following a bivariate normal distribution
representing two tasks of interest, from a control sample of size \(n\). The goal
is to estimate the percentage \(p\) of the control population that would be
expected to show a greater difference \(Y_1 - Y_2\) than the case.
Let \(\overline{y}_1\) and \(\overline{y}_2\) be the sample means and
\[\begin{equation*}
\pmb{A}=\begin{bmatrix}
s^2_{1} & s_{12} \\
s_{12} & s^2_{2} \end{bmatrix}
\end{equation*}\]
the sample variance-covariance matrix.
Then \(\pmb{S} =\pmb{A}(n-1)\) is the sums of squares and cross products (SSCP) matrix.
The case scores are denoted \(y_1^*\) and \(y_2^*\).
The population parameters \[\begin{equation*} \pmb{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \ \text{and} \ \Sigma=\begin{bmatrix} \sigma^2_{1} & \sigma_{12} \\ \sigma_{12} & \sigma^2_{2} \end{bmatrix} \end{equation*}\] are both unknown. To estimate \(p\) we must first obtain a posterior of these parameters. Since both are unknown the prior distribution of \(\pmb\mu\) is again conditioned upon the prior distribution of \(\Sigma\).
The prior considered for \(f(\pmb\mu, \Sigma^{-1})\) in Crawford and Garthwaite (2007) was \(f(\pmb\mu, \Sigma^{-1}) \propto |\Sigma|\). This is a common non-informative prior; however, it was chosen in favour of the perhaps even more commonly used \(f(\pmb\mu, \Sigma^{-1}) \propto |\Sigma|^{(k+1)/2}\) for multivariate normal data, where \(k\) is the number of variates. This was because the former prior was shown to give interval estimates asymptotically identical to those of the UDT for unstandardised case scores. Even though the latter is perhaps more commonly applied, Crawford and Garthwaite (2007) refer to the former as the “standard theory” prior.
For this prior we have that \(\Sigma^{-1} \sim \mathcal{W}_k(\pmb{S}^{-1}, n)\). That is, the posterior marginal distribution for \(\Sigma^{-1}\) is a Wishart distribution parametrised with scale matrix \(\pmb{S}^{-1}\) and \(n\) degrees of freedom. The Wishart distribution is a multivariate generalisation of the \(\chi^2\) distribution. It is possible to view \(S\sim\mathcal{W}_p(\Sigma, n)\) as a distribution of SSCP-matrices associated with \(\Sigma\). Inversely, one can view the inverse Wishart distribution parametrised with some SSCP matrix and \(n\) degrees of freedom as a distribution of variance-covariance matrices related to that SSCP matrix.
We then have that \(\pmb\mu|\Sigma \sim \mathcal{N}_k(\pmb{\overline{y}}, \Sigma/n)\) where \(\mathcal{N}_k(\pmb{\overline{y}}, \Sigma/n)\) denotes a multivariate normal distribution with mean \(\pmb{\overline{y}}\) and variance \(\Sigma/n\), see e.g., Gelman et al. (2013) (p. 72-73).
The posterior marginal distribution of \(\Sigma\) is then constructed by iterating draws of random observations from \(\mathcal{W}^{-1}(\pmb{S}, n)\), each observation, \(\hat{\Sigma}_{(i)}\), being the estimation of \(\Sigma\) for the \(i\)th iteration.
As previously mentioned, frequentist properties are desired if the estimations are to be used for significance testing. Berger and Sun (2008) investigated convergence to frequentist estimates for bivariate normal distributions using various priors. They showed that the “standard theory” prior produced converging estimates of \(\sigma_1\) and \(\sigma_2\) but that the convergence was not as good for \(\rho\) and differences in means. As a “general purpose” prior they instead recommend using \(f(\pmb\mu, \Sigma^{-1}) \propto \frac{1}{\sigma_1\sigma_2(1-\rho^2)}\). Posteriors with this prior are constructed using rejection sampling. Random draws from an inverse Wishart on \(n-1\) degrees of freedom are generated, but only a subset of the observations is retained.
Crawford, Garthwaite, and Ryan (2011) considered this prior but noticed that it gave rise to too narrow credible intervals, overestimating the certainty of the estimations. However, they showed that if the sample was treated as being of size \(n-1\) when estimating \(\Sigma\), i.e. so that the generated intervals were somewhat more conservative, frequentist properties were retained. This “calibrated” prior is recommended by Crawford, Garthwaite, and Ryan (2011), and they refer to it as such.
To construct the posterior using the calibrated prior the following steps are iterated:
Since we are reducing the sample size we must put \(\pmb{S}^*= (n-2)\pmb{S}/(n-1)\) to retain the unbiased property of \(\pmb{S}/(n-1)\) as an estimate of \(\Sigma\).
Draw a random observation from \(\mathcal{W}^{-1}(\pmb{S}^*, n-2)\). This draw is denoted: \[ \hat{\Sigma} = \begin{bmatrix} \hat{\sigma}^2_{1} & \hat{\sigma}_{12} \\ \hat{\sigma}_{12} & \hat{\sigma}^2_{2} \end{bmatrix} \] and \[ \hat{\rho}= \frac{\hat{\sigma}_{12}}{\sqrt{\hat{\sigma}^2_{1}\hat{\sigma}^2_{2}}}. \]
If \(u\) is a random draw from a uniform distribution \(u \sim U(0, 1)\), \(\hat\Sigma\) is only accepted as the estimation of \(\Sigma\) if \(u^2 \leq 1-\hat{\rho}^2\), if not we iterate the procedure until we have an accepted draw of \(\hat\Sigma\). Once accepted we have an estimation of \(\Sigma\) for this iteration and denote the draw: \[\begin{equation*} \hat{\Sigma}_{(i)}=\begin{bmatrix} \hat{\sigma}^2_{1(i)} & \hat{\sigma}_{12(i)} \\ \hat{\sigma}_{12(i)} & \hat{\sigma}^2_{2(i)} \end{bmatrix}. \end{equation*}\]
Even though this latter calibrated prior is recommended, some might argue that frequentist properties are of no concern and therefore choose the “standard theory” prior. But regardless of the prior chosen, when we have an estimate of \(\Sigma\), the following steps are iterated to obtain an estimate of \(p\), the percentage of controls expected to exhibit a more extreme difference between the variates than the case.
Generate two draws from a standard normal distribution and denote them \(z_{r1}\) and \(z_{r2}\). Find the lower triangular matrix \(\pmb{T}\) such that \(\pmb{T}\pmb{T}' = \hat\Sigma_{(i)}\) using Cholesky decomposition. The estimate of \(\pmb{\mu}\) for this iteration is then: \[\begin{equation*} \pmb{\hat{\mu}}_{(i)} = \begin{pmatrix} \hat{\mu}_{1(i)} \\ \hat{\mu}_{2(i)} \end{pmatrix} = \begin{pmatrix} \overline{y}_1 \\ \overline{y}_2 \end{pmatrix}+ \pmb{T} \begin{pmatrix} z_{r1} \\ z_{r2} \end{pmatrix} / \sqrt{n}. \end{equation*}\] See for example Gelman et al. (2013) (p. 580).
We now have estimates of both \(\pmb{\mu}\) and \(\Sigma\) and calculate
\(p\) as if these were the true population parameters. The method to do so
differs slightly depending on whether a standardised or unstandardised test
is required. For an unstandardised test, the size of the difference score is calculated
as:
\[\begin{equation*}
z_{(i)}^* = \frac{(y_1^* - \hat{\mu}_{1(i)}) - (y^*_2 - \hat{\mu}_{2(i)})}
{\sqrt{\hat{\sigma}^2_{1(i)}+\hat{\sigma}^2_{2(i)}-2\hat{\sigma}_{12(i)}}}.
\end{equation*}\]
More commonly we would want to calculate the difference score from standardised
variates. If this is the case, put:
\[\begin{equation*}
z_{1(i)} = \frac{y_1^* - \hat{\mu}_{1(i)}}{\sqrt{\hat{\sigma}^2_{1(i)}}}, \ z_{2(i)} = \frac{y_2^* -
\hat{\mu}_{2(i)}}{\sqrt{\hat{\sigma}^2_{2(i)}}}, \ \hat{\rho}_{(i)} = \frac{\hat{\sigma}_{12(i)}}{\sqrt{\hat{\sigma}^2_{1(i)}\hat{\sigma}^2_{2(i)}}}
\end{equation*}\]
and
\[\begin{equation*}
\hat{z}^*_{(i)} = \frac{z_{1(i)} - z_{2(i)}}{\sqrt{2-2\hat{\rho}_{(i)}}}.
\end{equation*}\]
The estimate of \(p\), \(\hat{p}_{(i)}\), for this iteration is then the tail area of a standard normal distribution more extreme than \(z^*_{(i)}\).
Iterate the procedure a large number of times and save the estimations for each iteration. This will yield a distribution of \(\hat{p}\) and \(\hat{z}^*\), the quantiles of which again are used to calculate the boundaries of the credible intervals for \(p\) and \(Z_{DCC}\). The point estimate of the former is the mean of the estimated distribution, while the point estimate of the latter is that of ((2.4)).
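As a sketch of the full procedure with the calibrated prior, the loop below uses the example data and obtains inverse Wishart draws by inverting Wishart draws from stats::rWishart(); up to Monte Carlo error it should give results similar to the BSDT() output shown further below.

set.seed(123)
n <- length(contA)
y_bar <- c(mean(contA), mean(contB))
y_star <- c(caseA, caseB)
A <- cov(cbind(contA, contB))      # sample variance-covariance matrix
S_star <- (n - 2) * A              # (n - 2) * S / (n - 1), with S = A * (n - 1)
p_hat <- numeric(10000)
for (i in seq_along(p_hat)) {
  repeat {                         # rejection step for the calibrated prior
    Sigma_hat <- solve(rWishart(1, df = n - 2, Sigma = solve(S_star))[, , 1])
    rho_hat <- Sigma_hat[1, 2] / sqrt(Sigma_hat[1, 1] * Sigma_hat[2, 2])
    if (runif(1)^2 <= 1 - rho_hat^2) break
  }
  T_mat <- t(chol(Sigma_hat))      # lower triangular, T_mat %*% t(T_mat) = Sigma_hat
  mu_hat <- y_bar + T_mat %*% rnorm(2) / sqrt(n)
  z1 <- (y_star[1] - mu_hat[1]) / sqrt(Sigma_hat[1, 1])
  z2 <- (y_star[2] - mu_hat[2]) / sqrt(Sigma_hat[2, 2])
  p_hat[i] <- pnorm((z1 - z2) / sqrt(2 - 2 * rho_hat))
}
mean(p_hat)                        # estimated proportion below the case
quantile(p_hat, c(0.025, 0.975))   # 95% credible interval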
singcar
Testing for a deficit using the Bayesian test of deficit, call BTD()
:
BTD(case = caseA, controls = contA, alternative = "less",
iter = 10000, int_level = 0.95)
##
## Bayesian Test of Deficit
##
## data: Case = 0.03, Controls (m = 0.16, sd = 0.08, n = 28)
## df = 27, p-value = 0.04799
## alternative hypothesis: true diff. between case and controls is less than 0
## sample estimates:
## Std. case score (Z-CC), 95% CI [-2.33, -1.16]
## -1.754857
## Proportion below case (%), 95% CI [0.98, 12.38]
## 4.799212
As can be seen, compared to TD()
this function has the additional argument iter
, which
indicates the number of iterations to be used for the posterior approximation.
This number should be based on the desired accuracy of the analysis.
Testing for a dissociation using the BSDT (recommended over RSDT), call:
BSDT(case_a = caseA, case_b = caseB, controls_a = contA, controls_b = contB,
alternative = "two.sided", iter = 10000, unstandardised = FALSE,
calibrated = TRUE)
##
## Bayesian Standardised Difference Test
##
## data: Case A: 0.03, B: 0.10, Ctrl. A (m, sd): (0.16,0.08), B: (0.18,0.10)
## df = 26, p-value = 0.3212
## alternative hypothesis: true difference between tasks is not equal to 0
## sample estimates:
## Std. case score, task A (Z-CC)
## -1.7548574
## Std. case score, task B (Z-CC)
## -0.7836956
## Std. discrepancy (Z-DCC), 95% CI [-1.69, -0.40]
## -1.0647889
## Proportion below case (%), 95% CI [4.53, 34.43]
## 16.0584420
This function has several additional arguments compared to RSDT()
.
Instead of having a separate function for an unstandardised
Bayesian difference test one can specify whether to standardise
the variates or not within the function. If unstandardised
is
set to TRUE
, the results from BSDT()
will converge to those of
UDT()
. To use the “standard theory” prior instead of the calibrated,
set calibrated
to FALSE
.
It is seldom possible to design perfect experiments. This is especially true in psychological science where causal relationships can be confounded by a great variety of factors. When comparing a case to a control sample it is therefore important to reduce noise by matching the control group to the case as well as possible on variates beyond the ones of interest, such as age or education level, making the collection of control samples cumbersome.
However, this matching on covariates can instead be achieved statistically. Crawford, Garthwaite, and Ryan (2011) describe methods that extend the tests presented in Section 2.2 to allow for the inclusion of covariates using Bayesian regression. These methods thus allow one to compare the case’s score on the task(s) of interest against the control population conditioned on the case’s values on the covariates. This both reduces noise and eases the collection of control data. But because one degree of freedom is lost for each included covariate, it is advisable, as a rule of thumb, to only include covariates with a correlation to the variate(s) of interest larger than \(0.3\) (Crawford, Garthwaite, and Ryan 2011).
We assume that the variates of interest are normally distributed, but no assumption of the distribution of the covariates is made. The procedure to get an estimate of \(p\) in the presence of covariates is presented below for the general case, that is, whether we are interested in a deficit or a dissociation. The main difference between the methods is the recommended prior.
From a sample of size \(n\) we obtain measurements on \(k = 1, 2\) variates of interest and \(m\) covariates, denoted \(\boldsymbol{X} = (X_1, \dots, X_m)\).
We wish to estimate the regression coefficients for each covariate on the variate(s) of interest: \[ \boldsymbol{B} = \begin{bmatrix} \boldsymbol{\beta}_1 & \cdots & \boldsymbol{\beta}_k \end{bmatrix}. \] Here \(\boldsymbol{B}\) is an \((m+1) \times k\) matrix where each column contains the regression coefficients for one of the variates of interest, the first row being the intercept(s). Further, we wish to estimate the covariance matrix or variance of the variate(s) of interest conditioned upon the covariates; these statistics are assumed not to vary with the covariates. \[ \Sigma \ | \ \boldsymbol{X} = \begin{bmatrix} \sigma^2_1 \ | \ \boldsymbol{X} & \rho\sigma_1\sigma_2 \ | \ \boldsymbol{X} \\ \rho\sigma_1\sigma_2 \ | \ \boldsymbol{X} & \sigma^2_2 \ | \ \boldsymbol{X} \end{bmatrix} \ \text{or} \ \left[\sigma^2 | \boldsymbol{X} \right]. \]
If \(\boldsymbol{X}\) is the \(n \times (m+1)\) design matrix and \(\boldsymbol{Y}\), the \(n \times k\) response matrix \[ \boldsymbol{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{bmatrix},\ \ \boldsymbol{Y} = \begin{bmatrix} y_{11} & \cdots & y_{1k} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nk} \end{bmatrix} \] the data estimates of \(\boldsymbol{B}\) and \(\Sigma\) are \[ \boldsymbol{B}^* = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y} \ \text{and} \ \Sigma^* = \frac{1}{n-m-1}(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{B}^*)'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{B}^*). \]
When \(k=1\) or the “standard theory” prior is used for \(k=2\) the posterior of \(\Sigma\) is an inverse Wishart with scale matrix \((n-m-1)\Sigma^*\) on \(n-m-1\) or \(n-m\) degrees of freedom for \(k=1\) and \(k=2\) respectively (Crawford, Garthwaite, and Ryan 2011; Tiao and Zellner 1964). Hence, to get an estimate \(\hat{\Sigma}_{(i)}\) of \(\Sigma\) we generate a draw from these distributions.
For the “calibrated” prior we instead follow the procedure outlined in Section 2.3.2 and use \(\pmb{S}^* = (n-m-2)\pmb{S}/(n-m-1)\) as scale matrix to sample from an inverse Wishart on \(n-m-2\) degrees of freedom, \(\pmb{S} = (n-m-1)\Sigma^*\). The estimates are then:
\[\begin{equation}
\hat{\Sigma}_{(i)}=[\hat{s}^2_{(i)}] \ \text{when} \ k=1 \ \text{and} \
\hat{\Sigma}_{(i)} =
\begin{bmatrix}
\hat{s}^2_{1(i)} & \hat{s}_{12(i)} \\
\hat{s}_{12(i)} & \hat{s}^2_{2(i)}
\end{bmatrix} \ \text{when} \ k=2.
\tag{2.6}
\end{equation}\]
We turn the \((m+1) \times k\) matrix \(\boldsymbol{B}^*\) into a \(k(m + 1) \times 1\) vector by concatenating the columns in \(\boldsymbol{B}^*\). \(\boldsymbol{B}^*_{vec} = (\boldsymbol{\beta}^{*'}_1, ...,\boldsymbol{\beta}^{*'}_k)'\) is the estimate of \(\boldsymbol{B}_{vec}\). The posterior of \(\boldsymbol{B}_{vec}| \hat{\Sigma}_{(i)}\) is a multivariate normal distribution with mean \(\boldsymbol{B}^*_{vec}\) and covariance matrix \(\boldsymbol{\Lambda}_{(i)} =\boldsymbol{\hat\Sigma}_{(i)} \otimes (\boldsymbol{X'X})^{-1}\), where \(\otimes\) denotes the Kronecker product. To get an estimate of \(\hat{\boldsymbol{B}}_{(i)}\) we generate a random observation from this distribution where the \(k(m+1) \times 1\) vector of obtained values \(\hat{\boldsymbol{B}}^*_{\text{vec(i)}} =(\hat{\boldsymbol{\beta}}'_{1(i)}, ...,\hat{\boldsymbol{\beta}}'_{k(i)})'\) is turned back into the \(m+1 \times k\) matrix \(\hat{\boldsymbol{B}}_{(i)} = (\hat{\boldsymbol{\beta}}_{1(i)}, ...,\hat{\boldsymbol{\beta}}_{k(i)})\). If \(\boldsymbol{x}^*\) is a vector of the case values on the covariates we obtain estimates of the conditional case scores on the task of interest as so: \[ \hat{\boldsymbol{\mu}}_{(i)} = \hat{\boldsymbol{B}}_{(i)}\boldsymbol{x}^*. \]
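A minimal sketch of this coefficient-sampling step is given below, with a small simulated data set standing in for real controls (one covariate, two tasks); for simplicity the data estimate of the conditional covariance matrix is used in place of a posterior draw \(\hat{\Sigma}_{(i)}\), and all numbers are illustrative.

library(MASS)  # for mvrnorm()
set.seed(1)
n <- 28; m <- 1; k <- 2
X <- cbind(1, rnorm(n, 60, 10))                  # intercept + one covariate (e.g. age)
Y <- cbind(rnorm(n, 100, 15), rnorm(n, 50, 10))  # two tasks of interest
B_star <- solve(t(X) %*% X) %*% t(X) %*% Y       # least squares coefficients
Sigma_hat <- crossprod(Y - X %*% B_star) / (n - m - 1)  # stand-in for a posterior draw

Lambda <- kronecker(Sigma_hat, solve(t(X) %*% X))       # covariance of vec(B)
B_hat <- matrix(mvrnorm(1, mu = c(B_star), Sigma = Lambda), ncol = k)
x_star <- c(1, 70)                 # case's covariate value, preceded by the intercept
crossprod(B_hat, x_star)           # conditional means given the case's covariate values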
With these values we can now estimate the conditional effect size of the case. Crawford, Garthwaite, and Ryan (2011) denote these effect measures \(Z_{CCC}\) and \(Z_{DCCC}\) depending on whether \(k=1\) or \(k=2\). As the names indicate they are related to \(Z_{CC}\) ((2.2)) and \(Z_{DCC}\) ((2.4)), but calculated from statistics conditioned upon the covariates, hence the extra C in the subscript.
If \(y^*_1\) is the case score on a variate \(Y_1\), we would calculate the conditional deficit (\(Z_{CCC}\)) of the case on variate \(Y_1\) like so: \[\begin{equation} \hat{Z}_{CCC(i)} = \frac{y^*_1-\hat{\mu}_{(i)}}{\hat{s}_{(i)}}. \tag{2.7} \end{equation}\] If \(y^*_2\) is the case score on variate \(Y_2\) and we are interested in a dissociation between \(Y_1\) and \(Y_2\), we denote the two conditional means on these variates \(\hat{\mu}_{1(i)}\) and \(\hat{\mu}_{2(i)}\). \(Z_{DCCC}\) is then calculated like so: \[\begin{equation} \hat{Z}_{DCCC(i)} = \frac{\frac{y^*_1-\hat{\mu}_{1(i)}}{\hat{s}_{1(i)}}-\frac{y^*_2-\hat{\mu}_{2(i)}} {\hat{s}_{2(i)}}}{\sqrt{2-2\hat{\rho}_{12(i)}}}. \tag{2.8} \end{equation}\] The conditional standard deviations \(\hat{s}_{1(i)}\) and \(\hat{s}_{2(i)}\) are obtained from \(\hat{\Sigma}_{(i)}\) in ((2.6)); the conditional correlation between the variates (\(\hat{\rho}_{12(i)}\)) is given by \[\hat{\rho}_{12(i)} = \frac{\hat{s}_{12(i)}}{\sqrt{\hat{s}^2_{1(i)}\hat{s}^2_{2(i)}}}.\]
To get estimates of \(p\) we find the tail area under the standard normal distribution with more extreme values than \(\hat{Z}_{CCC(i)}\) or \(\hat{Z}_{DCCC(i)}\), depending on the problem and alternative hypothesis specified. If testing for a deficit we would for example have: \[ \hat{p}_{(i)}= \Phi(\hat{Z}_{CCC(i)}). \]
Iterate the procedure a large number of times and save the estimations for each iteration. The mean of the resultant distribution of \(\hat{p}\) is taken as the point estimate of \(p\). Point estimates of \(Z_{CCC}\) and \(Z_{DCCC}\) are obtained by using ((2.7)) and ((2.8)) with the conditional means, standard deviations and correlation calculated directly from the control sample. The \(1-\alpha\) credible interval boundaries for each estimate are obtained from the quantiles of the estimated distributions as described for the previous methods.
singcar
To specify a Bayesian test of deficit with one or more covariates, call:
BTD_cov(case_task = caseA, case_covar = caseAGE, control_task = contA,
control_covar = contAGE, alternative = "less", int_level = 0.95,
iter = 10000, use_sumstats = FALSE)
##
## Bayesian Test of Deficit with Covariates
##
## data: Case = 0.03, Controls (m = 0.16, sd = 0.08, n = 28)
## df = 26, p-value = 0.05201
## alternative hypothesis: true diff. between case and controls is less than 0
## sample estimates:
## Std. case score (Z-CCC), 95% CI [-2.30, -1.10]
## -1.749556
## Proportion below case (%), 95% CI [1.08, 13.60]
## 5.200838
For the case, the covariates should be specified as a single value or a vector containing the case’s score on each covariate included. For the controls, the covariates should be specified as a vector or a matrix where each column represents one of the covariates.
To specify a Bayesian test for a dissociation with one or more covariates present, call:
BSDT_cov(case_tasks = c(caseA, caseB), case_covar = caseAGE,
control_tasks = cbind(contA, contB), control_covar = contAGE,
alternative = "two.sided", int_level = 0.95, iter = 10000,
calibrated = TRUE, use_sumstats = FALSE)
##
## Bayesian Standardised Difference Test with Covariates
##
## data: Case A = 0.03, B = 0.10, Ctrl. (m, sd) A: (0.16,0.08), B: (0.18,0.10)
## df = 25, p-value = 0.3299
## alternative hypothesis: true difference between tasks is not equal to 0
## sample estimates:
## Std. case score, task A (Z-CC)
## -1.754857
## Std. case score, task B (Z-CC)
## -0.783696
## Std. discrepancy (Z-DCCC), 95% CI [-1.66, -0.41]
## -1.064152
## Proportion below case (%), 95% CI [4.84, 34.10]
## 16.500000
Here, the case scores on the two variates of interest are combined into a vector
and the control scores into a two-column matrix. The covariates are specified
in the same manner as for BTD_cov()
. The prior used is again specified with
the argument calibrated
.
If using summary statistics for these functions, the argument use_sumstats
must be set to TRUE
. For information on how to specify the data in such a case,
consult the documentation.
It should be noted that the TD of course is nothing other than a linear regression model with a dummy coded variable for belonging to the patient group. This equivalence might seem trivial but is included here for completeness. If \(\boldsymbol{y} = (y^*_1, y_2,...,y_n)\) is the vector of scores for the case \(y^*_1\) and the control sample \((y_2,...,y_n)\) on the variate of interest, then \(\boldsymbol{x} = (1, 0, ..., 0)\) is a dummy coded variable indicating inclusion in the patient group. The equation for a regression model with these variables is then: \[ y_i = \beta_0+\beta_1x_i+\epsilon_i, \] where the error term \(\epsilon_i\sim\mathcal{N}(0, \sigma^2)\). The least squares estimate of \(\beta_1\), \(\hat{\beta}_1\), equals the numerator in the equation for the TD, ((2.1)). We have: \[\begin{align} \hat{\beta}_1 & = \frac{\frac{1}{n}\sum^n_{i =1}(x_i-\overline{x})(y_i-\overline{y})}{\frac{1}{n}\sum^n_{i = 1}(x_i-\overline{x})^2} =\frac{\sum^n_{i =1}y_ix_i-\frac{1}{n}\sum^n_{i =1}y_i\sum^n_{i =1}x_i}{\sum^n_{i=1}(x_i-\overline{x})^2} \\ & =\frac{\frac{ny^*_1-\sum^n_{i=1}y_i}{n}}{\left(1-\frac{1}{n}\right)^2+(n-1)\left(-\frac{1}{n}\right)^2} =\frac{\frac{ny^*_1-(y^*_1+\sum^n_{i=2}y_i)}{n}}{\frac{n-1}{n}} =\frac{(n-1)y^*_1-\sum^n_{i=2}y_i}{n-1} \\ & =\frac{(n-1)y^*_1-\sum^n_{i=2}y_i}{n-1} =y^*_1-\frac{\sum^n_{i=2}y_i}{n-1} =y^*_1-\overline{y}_c. \end{align}\] Here \(\overline{y}_c\) denotes the sample mean of the controls, which is also the least squares estimate of the intercept, \(\hat{\beta}_0\): \[\begin{align} \hat{\beta}_0 &= \overline{y}-\hat{\beta}_1\overline{x} = \frac{\sum^n_{i=1}y_i}{n} -\frac{ny^*_1-\sum^n_{i=1}y_i}{(n-1)}\frac{1}{n} \\ &= \frac{(n-1)\sum^n_{i=1}y_i-ny^*_1+\sum^n_{i=1}y_i}{n(n-1)}\\ &= \frac{n\sum^n_{i=1}y_i-ny^*_1}{n(n-1)} = \frac{\sum^n_{i=2}y_i}{n-1} = \overline{y}_c. \end{align}\]
To test the slope in a simple regression model one uses the statistic \[\begin{equation} t_{n-2} = \frac{\hat{\beta}_1}{SE\left(\hat{\beta}_1\right)} \tag{2.9} \end{equation}\] which follows a \(t\) distribution on \(n-2\) degrees of freedom, given that the null hypothesis is true. These degrees of freedom are of course the same as \(n_c-1\) degrees of freedom if \(n_c\) is the size of the control sample. Remember that the denominator in the TD was \(s_c \sqrt{\frac{n_c + 1}{n_c}}\), where \(s_c\) is the standard deviation of the control sample. We know that the standard error for the slope coefficient is: \[ SE\left(\hat{\beta}_1\right) = \sqrt{\frac{\frac{1}{n-2}\sum^n_{i =1}\hat{\epsilon}^2_i}{\sum^n_{i=1}(x_i-\overline{x})^2} } = \sqrt{\frac{\frac{1}{n-2}\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\frac{n-1}{n}}}. \] Furthermore, \(\hat{y}_i = \hat{\beta}_0+\hat{\beta}_1x_i\) and \(\hat{y}_1 = \overline{y}_c+(y^*_1-\overline{y}_c) = y^*_1\), hence: \[\begin{align} SE\left(\hat{\beta}_1\right) &= \sqrt{\frac{\frac{1}{n-2}\left(\sum^n_{i=2}(y_i-\hat{y}_i)^2+\cancel{(y^*_1-\hat{y}_1)^2}\right)}{\frac{n-1}{n}}} \\ &= \sqrt{\frac{\sum^n_{i=2}(y_i-\hat{y}_i)^2}{n-2}}\sqrt{\frac{n}{n-1}} = s_c\sqrt{\frac{n_c+1}{n_c}}. \tag{2.10} \end{align}\] As such the \(t\) test for the slope in the regression equation and the test of deficit are identical. To get \(Z_{CC}\) using the regression approach, simply divide \(\hat{\beta}_1\) by the standard deviation of the control sample, \(Z_{CC} = \hat{\beta}_1/s_c\).
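This equivalence is easy to verify numerically with the example data: regressing the scores on a case dummy reproduces the \(t\) value from TD() (with a two-sided \(p\) value).

dummy <- c(1, rep(0, length(contA)))            # 1 = case, 0 = controls
y <- c(caseA, contA)
summary(lm(y ~ dummy))$coefficients["dummy", ]  # same t value as TD() above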
The fact that we can use simple linear regression instead of the TD to model case abnormality is in itself not very interesting. However, one can include covariates without going through the trouble of performing Bayesian regression for the methods as described in Section 2.3.4. To compute the conditional effect size \(Z_{CCC}\) in this case one must first calculate the conditional mean and standard deviation from the control sample. The conditional mean can be predicted from the fitted model given the case values on the covariates and the conditional variance can be computed like so: \[ \sigma^2|X = \frac{\sum^n_{i=2}\epsilon_i^2}{n-2}, \] where \(X\) denotes the covariates and \(\epsilon_i\) the residuals from the fitted model. Note also that the index starts at 2 to leave out the case. \(Z_{CCC}\) is then: \[ Z_{CCC} = \frac{y_1^*-\mu|X}{\sqrt{\sigma^2|X}}. \]
More importantly though, we can generalise the linear regression to linear mixed effect models (West, Welch, and Galecki 2014). With linear mixed effect models (LMM) one can model data with dependency between observations, opening up the possibility of using repeated measures data in single case comparisons and other, more complex, designs. This is because LMM can model both fixed and random effects (hence “mixed”). Fixed effects are, in this case, constant across participants, while random effects apply to observations that are somehow more related to each other (e.g. several measurements from the same participant or hierarchically clustered observations). The concept of applying LMM to single case comparisons was first investigated by Huber et al. (2015).
Since impairments on tasks converging on a cognitive construct of interest provide stronger evidence of a potential deficit, the possibility to model random effects successfully is of great importance. Such a model would take the form: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_i + v_{0i}+\epsilon_{ij}, \ v_{0i} \sim \mathcal{N}(0, \sigma^2_v) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2). \tag{2.11} \end{equation}\] Here \(i\) indicates participant and \(j\) indicates task or item. So \(y_{ij}\) is the response of participant \(i\) for task/item \(j\). \(\beta_0\) and \(\beta_1\) are still the intercept of the sample and raw effect of the case while \(v_{0i}\) denotes the random intercept for each participant. As such the term allows for participants to vary from the average intercept of the sample. However, while the model in ((2.11)) accounts for participant variability we must include a further term to account for the task/item variability: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_i + v_{0i} + w_{0j} +\epsilon_{ij}, \\ v_{0i} \sim \mathcal{N}(0, \sigma^2_v), \ w_{0j} \sim \mathcal{N}(0, \sigma^2_w) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \tag{2.12} \end{equation}\] where \(w_{0j}\) denotes the random intercept for tasks/items, which should be included to avoid inflation of Type I errors (Baayen, Davidson, and Bates 2008).
A common method to obtain \(p\) values for LMMs is to use likelihood ratio tests. However, Huber et al. (2015) show that the Type I error rate for this method is only at an acceptable level when the size of the control sample is larger than 50. In contrast, using parametric bootstrapping or one-tailed \(t\) tests approximating the degrees of freedom with the Satterthwaite method kept the error rate at its nominal level (or rather somewhat above, implying that a slightly more conservative \(\alpha\) value should be considered) even for small control sample sizes. As a reference they compared error rate and statistical power of the mixed model to the TD using aggregated data, i.e. taking the mean of all tasks/items for each participant, which is what Crawford and Howell (1998) suggest for this type of data. The \(p\) value of \(\beta_1\) in ((2.12)) using the Satterthwaite method will be approximately equal to the two-sided \(p\) value from the TD if averaging the scores for each participant over the tasks/items, given a large enough sample size.
Furthermore, because we can model data with dependency between observations with LMM it is also possible to model effects within participants, i.e. between conditions. The model equation would then be: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i}+ v_{0i} + w_{0j} +\epsilon_{ij}, \\ v_{0i} \sim \mathcal{N}(0, \sigma^2_v), \ w_{0j} \sim \mathcal{N}(0, \sigma^2_w) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \tag{2.13} \end{equation}\] where \(\beta_2\) is the raw effect of the condition indicated by the categorical variable \(x_2\). By introducing an interaction term between \(x_1\) and \(x_2\) in this model we can even test for the presence of a dissociation. This model would take the form: \[\begin{equation} y_{ij} = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i}+\beta_3(x_{1i}x_{2i})+ v_{0i} + v_{1i}x_{2i} + w_{0j} +\epsilon_{ij}, \\ \begin{pmatrix} v_{0i} \\ v_{1i} \end{pmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \Sigma_v = \begin{bmatrix} \sigma^2_{v_0} & \sigma_{v_0v_1} \\ \sigma_{v_0v_1} & \sigma^2_{v_1} \end{bmatrix} \right), \ w_{0j} \sim \mathcal{N}(0, \sigma^2_w) \ \text{and} \ \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \tag{2.14} \end{equation}\] where \(\beta_3\) is the raw effect of the dissociation and \(v_{1i}\) is the random slope of the participants. That is, \(v_{1i}\) accounts for the variability of the effect of the condition, between participants. A caveat of using LMMs is that the model requires more observations than specified parameters to be identifiable. When specifying random slopes and random intercepts for each participant one must therefore have more than \(2n\) observations, implying that it would not be a good model for testing a dissociation between two tasks without repeated measurements within each task.
There are already several implementations of LMMs in various open source
software which can be used directly in the context of single case comparisons
as described above. Therefore this methodology has not been implemented in
singcar
. However, due to the usefulness of the concept and the lack
of methodology in singcar
to handle repeated measures data I will give
a brief description of how one can specify an LMM in R
using the package
lme4
(Bates et al. 2015) and
lmerTest
(Kuznetsova, Brockhoff, and Christensen 2017) (the latter is used to
obtain \(p\) values from \(t\) tests with Satterthwaite approximation of
the degrees of freedom).
R
Begin by installing and loading required packages:
install.packages(c("lme4", "lmerTest", "MASS"))
library("lme4")
library("lmerTest")
library("MASS") # For multivariate simulation
Because the example data is not suited for repeated measures analysis a simulated dataset will be used. Say that we have 30 measurements from four normally distributed variates \(Y_i\), \(i = 1,2,3,4\). Call these variates items. They are thought to converge on some cognitive ability of interest and have the distribution: \[ \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 100 \\ 80 \\ 50 \\ 20 \end{bmatrix}, \begin{bmatrix} 15^2 & 108 & 78 & 30 \\ 108 & 10^2 & 12 & 15 \\ 78 & 12 & 10^2 & 25 \\ 30 & 15 & 25 & 5^2 \end{bmatrix} \right). \] The case has incurred a deficit of two standard deviations on all four variates. These data could then be simulated as follows:
simdata <- expand.grid(case = c(1, rep(0, 29)), item = factor(1:4))
simdata$ID <- factor(rep(1:30, 4))

set.seed(123) # For replicability
mu <- c(100, 80, 50, 20)
sigma <- matrix(c(15^2, 108, 78, 30,
                  108, 10^2, 12, 15,
                  78, 12, 10^2, 25,
                  30, 15, 25, 5^2), nrow = 4, byrow = TRUE)
dv <- mvrnorm(30, mu = mu, Sigma = sigma) # Dependent variable
dv[1, ] <- dv[1, ] + c(-30, -20, -20, -10) # Case scores
simdata$dv <- c(dv)
The linear mixed model to test the deficit of the case would then be
model1 <- lmer(dv ~ case + (1|ID) + (1|item), data = simdata)
summary(model1)[["coefficients"]]
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 62.15587 17.512303 3.030078 3.549269 0.037506018
## case -25.07172 7.793911 27.999869 -3.216834 0.003263006
Here (1|ID) and (1|item) specify the random effects structure, in this case the random intercepts for participant and item.
This is approximately the same \(t\) statistic that would be produced
by the test of deficit if averaging the scores for each participant:
agg_data <- rowMeans(dv)
TD(agg_data[1], agg_data[-1], alternative = "two.sided")[["statistic"]]
## t
## -3.21684
Now, say that we want to model a dissociation using an LMM. We have collected 30 measurements from two items during two different conditions. The two conditions are thought to tap two independent cognitive abilities, while the items collected within each condition are thought to converge on the same ability. Hence we have four variates \(Y_{ij}\), \(i=1,2\), \(j=1,2\), distributed according to the multivariate distribution: \[\begin{equation} \begin{pmatrix} Y_{11} \\ Y_{21} \\ Y_{12} \\ Y_{22} \end{pmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 100 \\ 80 \\ 50 \\ 20 \end{bmatrix}, \begin{bmatrix} 15^2 & 120 & 45 & 7 \\ 120 & 10^2 & 11 & 15 \\ 45 & 11 & 10^2 & 40 \\ 7 & 15 & 40 & 5^2 \end{bmatrix} \right), \tag{2.15} \end{equation}\] where e.g. \(Y_{12}\) is the random variable corresponding to the first item in the second condition. Given that the two conditions in fact tap different cognitive structures, the incurred impairment of the case could affect performance in one of the conditions more than the other. Hence, we impose a deficit of two standard deviations on the case scores in the first condition only. The data can then be simulated as follows:
simdata <- expand.grid(case = c(1, rep(0, 29)),
                       item = factor(1:2), condition = factor(1:2))
simdata$ID <- factor(rep(1:30, 4))
contrasts(simdata$condition) <- contr.sum(2) # Effect coding

set.seed(123) # For replicability
mu <- c(100, 80, 50, 20)
sigma <- matrix(c(15^2, 120, 45, 7,
                  120, 10^2, 11, 15,
                  45, 11, 10^2, 40,
                  7, 15, 40, 5^2), nrow = 4, byrow = TRUE)
dv <- mvrnorm(30, mu = mu, Sigma = sigma) # Dependent variable
dv[1, ] <- dv[1, ] + c(-30, -20, -0, -0) # Case scores
simdata$dv <- c(dv)
The fully crossed linear mixed model used to test a dissociation for these data would then be specified like so:
model2 <- lmer(dv ~ case*condition + (condition|ID) + (1|item), data = simdata)
round(summary(model2)[["coefficients"]], 5)
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 61.87447 12.18101 1.02152 5.07959 0.11994
## case -17.24035 7.66811 27.99986 -2.24832 0.03261
## condition1 28.08855 0.98493 28.00058 28.51834 0.00000
## case:condition1 -13.82757 5.39468 28.00058 -2.56319 0.01603
Comparing the \(p\) value of the interaction effect to the \(p\) value obtained from RSDT(), we see that they are far from equivalent.
agg_data1 <- rowMeans(dv[, 1:2])
agg_data2 <- rowMeans(dv[, 3:4])
RSDT(agg_data1[1], agg_data2[1], agg_data1[-1], agg_data2[-1])[["p.value"]]
## [1] 0.06737687
However, a small simulation study conducted with the data described above allows the RSDT and the LMM to be compared with respect to power and Type I error rate. The simulation function (powerLMM(), given at the end of this vignette) takes the specified impairments of the case on each variate and returns the proportion of significant interaction effects from ((2.14)) out of the total number of simulations run, together with the corresponding proportion obtained from either the RSDT or the BSDT. The data in this small simulation study follow the same distribution as ((2.15)).
R> powerLMM(nsim = 1000, case_impairment = c(-30, -20, -0, -0),
+ compare_against = "RSDT", alpha = 0.05)
RSDT LMM
1 0.353 0.462
It seems that the method based on the linear mixed model has quite a substantial power advantage over the RSDT for data simulated from this particular distribution. To compare the error rates of the two tests we can use the same ratio as for power, computed when there is no true effect present. In this case that would be either when the case has no incurred impairment at all or when the impairments are equal in both conditions.
R> powerLMM(nsim = 1000, case_impairment = c(-0, -0, -0, -0),
+ compare_against = "RSDT", alpha = 0.05)
RSDT LMM
1 0.032 0.039
Even though a power advantage of the mixed model was observed, it also seems to produce an error rate below the nominal level of \(0.05\). If this were consistent behaviour it would eliminate the usefulness of the RSDT. However, if the case exhibits extreme impairments in both conditions we have another null scenario. This was discussed previously in Section 2.2, where it was noted that one of the reasons Crawford and Garthwaite (2007) developed a Bayesian approach to test for dissociations was the sharply increasing error rate of the RSDT as the deficits on both tasks grow while the discrepancy between them stays at zero. Say that the case has impairments of six standard deviations on both items in both conditions, and let us compare the LMM to the BSDT instead of the RSDT:
R> powerLMM(nsim = 1000, case_impairment = c(-90, -60, -60, -30),
+ compare_against = "BSDT", alpha = 0.05)
BSDT LMM
1 0.053 0.61
The BSDT seems to keep the error rate almost at its nominal level while it is enormous for the LMM. Even though a more extensive simulation study with various types of data structures would be needed to solidify these findings, the preliminary results shown here indicate some limitations of using LMMs for dissociations and reiterate the recommendation of Crawford and Garthwaite (2007) to use the BSDT to test for dissociations, even when using aggregated repeated measures data. That is, because the effects of brain damage can be severe, I would recommend against using LMMs to model dissociations.
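To follow this recommendation for the simulated dissociation data above, the aggregated vectors can be passed straight to BSDT(), mirroring the earlier RSDT() call. This is shown only as a usage sketch; agg_data1 and agg_data2 are the vectors computed above.
BSDT(agg_data1[1], agg_data2[1], agg_data1[-1], agg_data2[-1])[["p.value"]]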
Up until now I have discussed univariate testing methods. However, as discussed in the introduction, Section 1, stronger evidence for either deficits or dissociations is provided by using multiple measures (tasks) for a cognitive function of interest. Conceptually, what we want to estimate or test in this scenario is the distance between a vector of observations for a single case and a vector of population means. A common way to do this is by using the Mahalanobis distance measure. If \(\boldsymbol{y}^*\) is a vector of task observations from a single case and \(\boldsymbol{\mu}\) is the population mean vector, the Mahalanobis distance, \(\delta\), is a measure of the distance between the case and the population mean, given by: \[ \delta= \sqrt{(\boldsymbol{y}^*-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}^*-\boldsymbol{\mu})}. \]
Given that the population follows a multivariate normal distribution we have that \(\delta^2 \sim \chi^2_{\nu_1}\), where the degrees of freedom \(\nu_1\) equal the number of dimensions of the multivariate distribution. This squared distance is sometimes referred to as the Mahalanobis index. Because the purpose of case-comparison methodology is to estimate abnormality we are interested in \(p\), that is, the proportion of the control population that would exhibit a larger \(\delta\) than the single case. If we need to estimate \(\boldsymbol{\mu}\) and \(\Sigma\) the sample Mahalanobis distance is given by \[ \hat\delta = \sqrt{(\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})}, \] where \(\overline{\boldsymbol{y}}\) is the sample mean vector and \(\boldsymbol{S}\) the sample covariance matrix. Using the Mahalanobis index between a case and the mean of a normative sample, the abnormality is estimated with \[ \hat{p} = \mathbb{P}[(\boldsymbol{y}-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}-\overline{\boldsymbol{y}}) > (\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})]. \]
An obvious way to estimate \(p\) would be to use the Hotelling’s \(T^2\) test. The \(T^2\) statistic is proportional to the \(F\) distribution and hence its related \(p\) value seems like an intuitive choice as an estimate of \(p\). This was in fact proposed by Huizenga et al. (2007), but Elfadaly, Garthwaite, and Crawford (2016) show that this estimate is biased and devise instead several other candidates. None of these was shown to be perfectly unbiased but several of the estimates performed better than the estimate based on the \(T^2\) statistic.
If we let \(\lambda = \delta^2 = (\boldsymbol{y}^*-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}^*-\boldsymbol{\mu})\) be the case's Mahalanobis index and \(\chi^2_{\nu_1}= (\boldsymbol{y}-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}-\boldsymbol{\mu})\) the Mahalanobis index of a random observation \(\boldsymbol{y}\) from the control population, what we want to estimate is: \[ p = \mathbb{P}[\chi^2_{\nu_1}> \lambda]. \] If \(\overline{\boldsymbol{y}}\) and \(\hat\Sigma = \frac{n-1}{n}\boldsymbol{S}\) denote the maximum likelihood estimates of \(\boldsymbol{\mu}\) and \(\Sigma\), the Hotelling's \(T^2\) based estimate of \(p\), \(\hat{p}_F\), is \[ \hat{p}_F = \mathbb{P}\left[(\boldsymbol{y}-\overline{\boldsymbol{y}})'\hat\Sigma^{-1}(\boldsymbol{y}-\overline{\boldsymbol{y}})>(\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\hat\Sigma^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})\right]. \] If we let \(\lambda_0 = (\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\boldsymbol{S}^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})\) and \(\nu_2 = n - \nu_1\), we have that \[ T^2=\frac{n\lambda_0}{n+1} \ \text{and} \ \hat{p}_F = \mathbb{P}\left[F_{\nu_1, \nu_2}>\frac{\nu_2}{\nu_1(n-1)}T^2\right], \] i.e., this estimate is the \(p\) value associated with testing the hypothesis that the case comes from the control population, using a Hotelling's \(T^2\) test. As such it is also often used as a point estimate of \(p\). However, we can also use the fact that \(\lambda \sim \chi^2_{\nu_1}\) and take the \(p\) value from the \(\chi^2\) distribution, using the sample estimates only on the right hand side of the inequality: \[ \hat{p}_{\chi^2} = \mathbb{P}\left[(\boldsymbol{y}-\boldsymbol{\mu})'\Sigma^{-1}(\boldsymbol{y}-\boldsymbol{\mu})>(\boldsymbol{y}^*-\overline{\boldsymbol{y}})'\hat\Sigma^{-1}(\boldsymbol{y}^*-\overline{\boldsymbol{y}})\right], \] which gives \[ \hat{p}_{\chi^2} = \mathbb{P}\left[\chi^2_{\nu_1}>\frac{n}{n-1}\lambda_0\right]. \] This is an intuitive estimator; it was shown to exhibit fairly low bias (lower than \(\hat{p}_F\)) for low true values of \(p\) and it had the smallest mean square error of all estimators considered by Elfadaly, Garthwaite, and Crawford (2016). Unfortunately, as \(p\) increases \(\hat{p}_{\chi^2}\) starts to show substantial bias.
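As a concrete illustration of these two estimators, a minimal sketch in base R is given below. The helper name naive_mahal_estimates() is hypothetical (it is not a singcar function); it simply plugs a case vector and a control matrix into the formulas above.
naive_mahal_estimates <- function(case, controls) {
  n <- nrow(controls)                    # control sample size
  k <- ncol(controls)                    # nu_1, the number of variates
  ybar <- colMeans(controls)
  S <- cov(controls)                     # unbiased sample covariance matrix
  lambda0 <- mahalanobis(case, ybar, S)  # squared sample Mahalanobis distance
  T2 <- n * lambda0 / (n + 1)            # Hotelling's T^2
  nu2 <- n - k
  p_F <- pf(nu2 / (k * (n - 1)) * T2, k, nu2, lower.tail = FALSE)
  p_chi2 <- pchisq(n / (n - 1) * lambda0, df = k, lower.tail = FALSE)
  c(p_F = p_F, p_chi2 = p_chi2)
}
Applied to the example data used later with MTD(), this would be called as naive_mahal_estimates(c(caseA, caseB), cbind(contA, contB)).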
Both of these estimates are intuitive; however, since they are both biased, Elfadaly, Garthwaite, and Crawford (2016) recommend using other estimators, chosen according to the needs of the analysis. One of these is based on a modified confidence interval for the Mahalanobis distance and two are based on quadrature polynomial approximation of the \(\chi^2\) distribution. In Section 2.5.1 I describe two estimators based on confidence intervals of \(\delta\), which are both implemented in singcar along with \(\hat{p}_F\) and \(\hat{p}_{\chi^2}\).
Reiser (2001) suggested a method to construct confidence intervals on \(\delta^2\). Using this method Elfadaly, Garthwaite, and Crawford (2016) propose a way to construct confidence intervals on \(p\) as well. We have: \[ F_0 = \frac{n\nu_2}{(n-1)\nu_1}\lambda_0 \sim F_{\nu_1, \nu_2}(n\lambda), \] that is, the sample statistic \(F_0\) has a non-central \(F\) distribution on \(\nu_1\) and \(\nu_2\) degrees of freedom and non-centrality parameter \(n\lambda\). We call the value of \(n\lambda\), where \(F_0\) is the \(\alpha\)-quantile of this distribution, \(L_{\alpha}\). We can construct the lower and upper boundaries of a confidence interval for \(p\) using the CDF for the \(\chi^2\)-distribution on \(\nu_1\) degrees of freedom, denoted \(G(.)\), as such: \[\begin{equation} \left[1-G\left(\frac{L_{\alpha/2}}{n}\right), \ 1-G\left(\frac{L_{1-\alpha/2}}{n}\right)\right]. \tag{2.16} \end{equation}\]
Similarly, they suggest a point estimator of \(p\) based on the median of \(F_{\nu_1, \nu_2}(n\lambda)\). So that if we let \(L_{0.5}\) be the value of \(n\lambda\) where \(F_0\) is the median of \(F_{\nu_1, \nu_2}(n\lambda)\) the estimator is: \[\begin{equation} \hat{p}_D= 1-G\left(\frac{L_{0.5}}{n}\right). \tag{2.17} \end{equation}\]
Similarly we would construct a confidence interval around \(\lambda\) by choosing \(\lambda_u\) and \(\lambda_l\) such that \[\begin{equation} \mathbb{P}\left[F_{\nu_1,n- \nu_1}(n\lambda_l)\leq F_0\right] = \frac{\alpha}{2}, \ \text{and} \ \mathbb{P}\left[F_{\nu_1,n- \nu_1}(n\lambda_u)\leq F_0\right] =1- \frac{\alpha}{2}. \tag{2.18} \end{equation}\]
This method does, however, have a drawback. Its coverage exactly matches the nominal level, i.e. the 95% intervals (assuming \(\alpha= 0.05\)) will in fact contain the true value of \(\delta\) 95% of the time, as proved by Wunderlich et al. (2015). But when \(F_0\) is small enough that \(\mathbb{P}[F_{\nu_1,n- \nu_1}(0)\leq F_0] \leq 1- \frac{\alpha}{2}\), we cannot find a \(\lambda_u\) such that \(\mathbb{P}\left[F_{\nu_1,n- \nu_1}(n\lambda_u)\leq F_0\right] =1- \frac{\alpha}{2}\), and it is instead set to 0. Similarly, we set \(\lambda_l\) to 0 when \(\mathbb{P}[F_{\nu_1,n- \nu_1}(0)\leq F_0] \leq \frac{\alpha}{2}\). But we know with full certainty that an interval of [0, 0] will not contain \(\lambda\), since \(\boldsymbol{y}^*\) is continuous with \(\mathbb{P}[\boldsymbol{y}^* = \boldsymbol{\mu}] = 0\).
This problem carries over to the construction of confidence intervals for \(p\) ((2.16)) as well as to the estimator \(\hat{p}_D\) ((2.17)). The median of the distribution \(F_{\nu_1, \nu_2}(n\lambda)\) of course decreases as the non-centrality parameter \(n\lambda\) decreases. Since \(\lambda \geq 0\), the lower limit of the median of \(F_{\nu_1,\nu_2}(n\lambda)\) is the median of this distribution when the non-centrality parameter \(n\lambda = 0\), i.e. the median of a central \(F\) distribution on \(\nu_1\) and \(\nu_2\) degrees of freedom. So, using the approach of Reiser (2001), if \(F_0\) is smaller than this limit one sets \(n\lambda\) to zero.
A bias in this estimator is, however, also seen when \(F_0\) is small but not smaller than this lower limit. Elfadaly, Garthwaite, and Crawford (2016) give as an example a scenario where \(\nu_1 =4, \ \nu_2 = 20\) and \(\lambda_0 = 0.4\), hence \(n =24\) and \(F_0 = \frac{24 \times 20}{23 \times 4}\times 0.4 \approx 2.09\). The value of the non-centrality parameter \(n\lambda\) at which the median of \(F_{\nu_1, \nu_2}(n\lambda)\) equals 2.09 is approximately 5.015, so \(\hat{p}_D = 1-G_{4}\left(5.015/24\right) = 0.9949\), or 99.49%. Using the estimated Mahalanobis index of \(0.4\) we would thus infer that only 0.51% of the population exhibit a smaller Mahalanobis index than the case. If we instead calculate the true proportion of the population that would exhibit a larger Mahalanobis index when 0.4 is the true case value, we get \(p = 1-G_4(0.4) = 0.9825\), or 98.25% (i.e. 1.75% would exhibit a smaller index), which is a considerable discrepancy.
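To make the construction of \(\hat{p}_D\) concrete, the sketch below (the helper p_D_estimate() is hypothetical, not the singcar implementation) searches for the non-centrality parameter at which \(F_0\) is the median of the non-central \(F\) distribution and then applies the \(\chi^2\) CDF, reproducing the worked example above.
p_D_estimate <- function(lambda0, k, n) {
  nu2 <- n - k
  F0 <- n * nu2 / ((n - 1) * k) * lambda0
  # If F0 falls below the median of the central F distribution,
  # n*lambda is set to 0 and p_D = 1 - G(0) = 1
  if (F0 <= qf(0.5, k, nu2)) return(1)
  # Find L_0.5 such that F0 is the median of F(k, nu2, ncp = L_0.5)
  L05 <- uniroot(function(L) qf(0.5, k, nu2, ncp = L) - F0,
                 interval = c(0, 100), extendInt = "upX")$root
  1 - pchisq(L05 / n, df = k)
}
p_D_estimate(0.4, k = 4, n = 24) # approximately 0.995, as in the example above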
Garthwaite, Elfadaly, and Crawford (2017) proposed a solution to this for the construction of confidence intervals of \(\delta\) by modifying the method of Reiser (2001). They suggested forming a Bayesian credible interval instead of the frequentist confidence interval whenever \(F_0\) is small. Credible intervals have the more intuitive interpretation that they contain the true value of interest with a probability of \(1-\alpha\). As such, this interval cannot give rise to unreasonable values such as [0, 0]. To form a credible interval, however, a prior must be specified.
Let \(\boldsymbol{y}\) be a vector of values from an observation of the control population, not necessarily the case. We denote the prior for \(\delta\) when \(\delta\) is the Mahalanobis distance of the case as \(\pi(\delta)\), and when \(\delta\) is the distance between the mean of the controls and a randomly selected observation as \(\psi(\delta)\). A priori, \(\delta\) should be larger when it is calculated from the observation of the case than when it is calculated from a randomly selected control observation. Garthwaite, Elfadaly, and Crawford (2017) therefore assume that, for any \(\hat{\delta}^2\) and any specified quantile, \(\pi_q(\delta | \hat{\delta}) \geq \psi_q(\delta | \hat{\delta})\), where \(\pi_q(\cdot | \hat{\delta})\) and \(\psi_q(\cdot | \hat{\delta})\) are the \(q\)th quantiles of the distributions of \(\delta\) given \(\hat{\delta}\) under the respective priors.
Note that they do not aim to specify \(\pi(\delta)\), instead they want to put limits on the quantiles of the resultant posterior through \(\psi(\delta)\). This is because we know neither \(\pi(\delta)\) nor \(\pi_q(\delta | \hat{\delta})\) but it is possible to establish \(\psi(\delta|\hat{\delta})\) and the \(\alpha/2\) and \(1-\alpha/2\) quantiles of this distribution are then the upper and lower boundaries of a \(1-\alpha\) credible interval of \(\delta\).
The logic behind this modification is intuitive: treat the case like a randomly drawn observation from the control distribution whenever the case observation is more similar to the average of the controls than the majority of the control observations are. In practice, we take, for each boundary, the lower of the confidence and credible interval values.
Again, we set \(\lambda = \delta^2\) and \(\hat{\lambda} = \hat{\delta}^2\) and let \(\psi^*(\lambda)\) be the prior corresponding to \(\psi(\delta)\). This is a \(\chi^2\) distribution on \(\nu_1\) degrees of freedom. \[\begin{equation} \psi^*(\lambda) = \frac{1}{2^{\nu_1/2}\Gamma(\nu_1/2)}\lambda^{\nu_1/2-1}e^{-\lambda/2}, \ \lambda \geq 0. \tag{2.19} \end{equation}\] If we set \(m = n/(n-1)\) the likelihood of \(\lambda\), \(L(\lambda; \hat{\lambda})\), is given by \[\begin{equation} L(\lambda; \hat{\lambda}) = \sum_{r=0}^{\infty}\frac{(n\lambda/2)^re^{-n\lambda}}{B[(n-\nu_1)/2, \ \nu_1/2+r]r!}m^{\nu_1/2+r}\left(\frac{1}{1+m\hat{\lambda}}\right)^{n/2+r}\hat{\lambda}^{\nu_1/2+r-1}, \ \lambda \geq 0. \tag{2.20} \end{equation}\] Combining the prior with the likelihood, \(\psi^*(\lambda)L(\lambda; \hat{\lambda})\) yields the posterior \(\psi^*(\lambda | \hat{\lambda})\): \[\begin{equation} \psi^*(\lambda | \hat{\lambda}) = \frac{1}{c} \sum_{r=0}^{\infty}\frac{(n^r/2)(\lambda/2)^{\nu_1/2+r-1} e^{-(n+1)\lambda/2}}{B[(n-\nu_1)/2, \ \nu_1/2+r]\Gamma(\nu_1/2)r!}m^{\nu_1/2+r} \left(\frac{1}{1+m\hat{\lambda}}\right)^{n/2+r}\hat{\lambda}^{\nu_1/2+r-1}, \ \lambda \geq 0. \tag{2.21} \end{equation}\]
Here \(B(.,\ .)\) and \(\Gamma(.)\) are the beta and gamma functions, respectively, and \(c\) is the normalising constant, obtained through numerical integration of ((2.21)) over \(0 < \lambda < \infty\).
Sums over infinite ranges are problematic for practical computation. One way to deal with this
is to set reasonable endpoints instead of infinity. For the implementation in singcar
the endpoint of the
infinite sum was set by noting that the denominator in ((2.21))
goes to infinity faster than the numerator. In fact, in R
and many other languages
\(r! = \infty\) for \(r > 170\), so a reasonable boundary for the infinite sum would then be \(170\),
because anything divided by infinity is 0. However, for some parameter values infinity will be
approached in the numerator before it is approached in the denominator. Infinity divided
by any real value is still infinity but infinity divided by infinity is undefined.
Therefore, to be able to sum, we remove any infinite or undefined value if present.
As such, the integral over the sum in ((2.21)) will always be convergent and the normalising constant
\(c\) can be calculated.
singcar
uses the function integrate
in the base R
package stats
to do this. This
numerical integration procedure, based on routines from the QUADPACK package (Piessens et al. 2012),
generally yields accurate results. However, it works by subdividing the interval to be integrated,
and when integrating to infinity the region of interest may be undersampled. Hence,
in singcar
we set:
\[
c = \int_0^{0.5\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda + \int_{0.5\sqrt{\hat\lambda}}^{\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda +
\int_{\sqrt{\hat\lambda}}^{1.5\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda +
\int_{1.5\sqrt{\hat\lambda}}^{2\sqrt{\hat\lambda}}\psi^*(\lambda | \hat{\lambda}) d\lambda +
\int_{2\sqrt{\hat\lambda}}^{\infty}\psi^*(\lambda | \hat{\lambda}) \ d\lambda.
\]
This ensures heavier sampling around \(\sqrt{\hat\lambda}\), which of course is the area we are most
interested in. Further, for large values of \(\hat\lambda\) some discontinuities will be introduced
in the posterior which can be problematic for some numerical integration techniques as well but
the above subdivision will alleviate this problem somewhat.
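To make this concrete, a rough sketch of the procedure is given below. The helper names post_unnorm() and norm_const() are hypothetical and this is not the actual singcar code; it evaluates the unnormalised posterior ((2.21)) with the sum truncated at \(r = 170\), drops any non-finite terms, and obtains \(c\) with the subdivided integration shown above.
post_unnorm <- function(lambda, lambda_hat, n, k) {
  m <- n / (n - 1)
  r <- 0:170 # truncation point of the infinite sum
  sapply(lambda, function(l) {
    terms <- (n^r / 2) * (l / 2)^(k / 2 + r - 1) * exp(-(n + 1) * l / 2) /
      (beta((n - k) / 2, k / 2 + r) * gamma(k / 2) * factorial(r)) *
      m^(k / 2 + r) * (1 / (1 + m * lambda_hat))^(n / 2 + r) *
      lambda_hat^(k / 2 + r - 1)
    sum(terms[is.finite(terms)]) # drop infinite or undefined terms
  })
}
norm_const <- function(lambda_hat, n, k) {
  # Subdivision points ensure heavier sampling around sqrt(lambda_hat)
  cuts <- c(0, 0.5, 1, 1.5, 2) * sqrt(lambda_hat)
  sum(mapply(function(lo, hi)
    integrate(post_unnorm, lo, hi,
              lambda_hat = lambda_hat, n = n, k = k)$value,
    cuts, c(cuts[-1], Inf)))
}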
The q:th quantile of the posterior \(\psi^*(\lambda | \hat{\lambda})\), \(\lambda_q\) is then
found by an optimisation search procedure (nlminb
in stats
) that aims to minimise
\[
\left| \int_{\lambda_0}^{\lambda_q} \psi^*(\lambda | \hat{\lambda})d\lambda - q \right|.
\]
To get an estimate of \(p\) we use the same procedure to find the median of the posterior, \(\lambda_{0.5}\).
The modified estimator of \(p\) from Elfadaly, Garthwaite, and Crawford (2016), \(\hat{p}_{MD}\), is then:
\[
\hat{p}_{MD} = \min\left[\hat{p}_D, \ 1-G\left(\lambda_{0.5}\right) \right].
\]
Note, however, that because \(c\) will sometimes be very small, and hence \(1/c\)
very large, computations involving the normalised posterior can be very slow, but
this mostly happens when \(\hat\delta\) is so large that \(\hat{p}_D\) would
be greater than \(1-G(\lambda_{0.5})\) anyway. In singcar
it is therefore recommended
to use \(\hat{p}_{D}\) but to calculate \(\hat{p}_{MD}\) if the Mahalanobis distance
of the case seems suspiciously small.
The difference between the estimators \(\hat{p}_D\) and \(\hat{p}_{MD}\) is small, and Elfadaly, Garthwaite, and Crawford (2016) show that their bias and mean square error are almost identical, but they recommend \(\hat{p}_{MD}\). This is because, as \(\hat{\delta} \rightarrow 0\), \(\hat{p}_D \rightarrow 1\) much faster than it reasonably should.
It should be noted that the implementation of \(\hat{p}_{MD}\) in singcar
is
unstable when compared to the implementation by
Garthwaite, Elfadaly, and Crawford (2017) written in C
. I have not been able
to find the cause of this discrepancy, but it only happens for relatively large
\(\hat{\delta}\). It is likely that the numerical integration procedures differ
and that the procedure used in the implementation by
Garthwaite, Elfadaly, and Crawford (2017) better handles discontinuous
functions, which we see in the posterior ((2.21)) for high values of
\(\hat{\delta}\). Alternatively, the discontinuities cause problems for the
optimisation routine. It is therefore recommended that both estimators are
calculated and the smaller of the two is chosen; if that is
\(\hat{p}_{MD}\), then compare the result to the implementation by
Garthwaite, Elfadaly, and Crawford (2017).
singcar
Say that we want to know the case abnormality in the example data on both tasks simultaneously. To estimate this abnormality on a multivariate space, call:
MTD(case = c(caseA, caseB), controls = cbind(contA, contB),
conf_level = 0.95, method = c("pd", "pchi", "pf", "pmd"),
mahalanobis_dist = NULL, k = NULL, n = NULL)
As can be seen, \(\hat{p}_D\) is the default method to estimate the abnormality.
This is because the difference in results between \(\hat{p}_D\) and \(\hat{p}_{MD}\)
is very small, but the difference in efficiency is huge. In addition, "pmd"
in
singcar
is actually \(1-G(\lambda_{0.5})\) rather than \(\min[\hat{p}_D, \ 1-G(\lambda_{0.5})]\), so taking the minimum is left to the user. The reason
for this was to minimise confusion. However, I realise that this could yield the
opposite effect and might be changed in future updates.
The \(p\) value in the output of MTD()
is the \(p\) value associated
with the Hotelling’s \(T^2\) statistic and the same as the abnormality
estimate if the method pf
is chosen. The three last arguments are
only required if summary statistics are used.
A further capacity of singcar
is that it can be used to calculate
statistical power of the tests. The notion of power when comparing cases to
control samples has been somewhat overlooked for this class of statistical tests.
In recent work (McIntosh and Rittmo 2020), we argued that,
even though power is inherently limited in this paradigm, a priori calculations are
still useful for study design and interpretation in neuropsychological and other applications.
Calculating power for the test of deficit is similar to calculating power for
any \(t\) test and can be done analytically.
\[\begin{equation} power = 1 - \beta =
T_{n-1}\left(t_{\alpha, \ n-1} \Bigg\rvert \frac{y^* - \overline{y}}{\sigma
\sqrt{\frac{n+1}{n}}}\right)
\tag{2.22}
\label{eq:TDpower}
\end{equation}\]
where \(T_{n-1}(.\rvert \theta)\) is the cumulative distribution function for the
non-central \(t\) distribution with \(n-1\) degrees of freedom and non-centrality
parameter \(\frac{y^* - \overline{y}}{\sigma \sqrt{\frac{n+1}{n}}}\) (i.e., the TD, Equation
(2.1)), and \(t_{\alpha, \ n-1}\) is the \(\alpha\) quantile of the
\(t\) distribution on \(n-1\) degrees of freedom (note that this is for a one-sided
test). For the unstandardised difference test power is calculated in an
analogous way by putting Equation (2.3) as the non-centrality parameter. Deriving
power for the other functions in an analytic manner is however not possible (the
RSDT is only approximately \(t\) distributed) and a Monte Carlo approach has been
used for these tests. To call any power calculator in the package one simply
uses the function names with _power
added as a suffix.
So, for example, to calculate power for the test of deficit we call TD_power()
.
The expected case score and either sample size or desired power must
be supplied. The mean and standard deviation of the control sample
can also be specified with the arguments mean
and sd
.
If not, they take the default values of 0 and 1 respectively so
that the case score is interpreted as distance from the mean
in standard deviations. A conventional \(\alpha\)-level of
\(0.05\) is assumed if nothing else is supplied. The alternative
hypothesis can also be specified by the argument alternative
:
specify "less"
(default) or "greater"
for a one-tailed test, specify
"two.sided"
for a two-tailed test.
TD_power(case = 70,
mean = 100,
sd = 15,
sample_size = 16,
power = NULL,
alternative = "less",
alpha = 0.05,
spec = 0.005)
## [1] 0.5819579
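The same value can be obtained directly from ((2.22)) with a few lines of base R; td_power_analytic() below is a hypothetical helper included only to illustrate the analytic calculation for a one-sided (lower-tailed) test of deficit.
td_power_analytic <- function(case, mean, sd, n, alpha = 0.05) {
  ncp <- (case - mean) / (sd * sqrt((n + 1) / n)) # non-centrality parameter
  tcrit <- qt(alpha, df = n - 1)                  # one-sided critical value
  pt(tcrit, df = n - 1, ncp = ncp)                # P(t < t_crit | ncp)
}
td_power_analytic(case = 70, mean = 100, sd = 15, n = 16)
# approximately 0.582, in line with the TD_power() output above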
TD_power()
can also be used to calculate required sample size for a
desired level of power. For example, if we specify a desired power level of 0.6,
leave sample_size
to its default and let the rest of the arguments
be as in the previous example, we see from the output of the function that
power will not increase more than 0.5% for any additional participant after a sample
size of 15. That is, the algorithm stops searching
when this level of specificity has been reached and we are nearing the
asymptotic maximum power for this effect size. We can increase the specificity
by lowering the spec
argument.
TD_power(case = 70,
mean = 100,
sd = 15,
sample_size = NULL,
power = 0.6,
alternative = "less",
alpha = 0.05,
spec = 0.005)
## Power (0.578) will not increase more than 0.5% per participant for n > 15
## n power
## 1 15 0.5780555
Power calculators for the Bayesian tests of deficit cannot calculate
required sample size. This is because they rely on simulation methods
to estimate approximate power and deploying a search algorithm to find the required sample
size for a given level of power would be computationally too intense. The syntax
is otherwise relatively similar to that of TD_power()
. For BTD_power()
we have the two extra arguments nsim
and iter
, indicating the number
of simulations used in the power function and by BTD()
, respectively.
BTD_power(case = 70,
mean = 100,
sd = 15,
sample_size = 15,
alternative = "less",
alpha = 0.05,
nsim = 1000,
iter = 1000)
## [1] 0.574
The only difference in syntax of BTD_cov_power()
is due to the inclusion of covariates.
The variate of interest must be specified as a vector where the first element
gives the mean and the second the standard deviation in the argument control_task
.
The covariates can be specified similarly or as an \(m \times 2\) matrix where the first
column gives the means of each covariate and the second column gives the standard
deviations. The correlation matrix of the variates must be given as well. In the example
below, power is evaluated for a test taking two covariates into account, both with a mean
of 0 and a standard deviation of 1. The correlation is specified as a \(3\times 3\)
matrix with pairwise correlations of \(0.3\). The default settings include only one
covariate having a \(0.3\) correlation with the variate of interest.
This function is computationally intense and hence, the number
of simulations has, for the example below, been decreased.
covars <- matrix(c(0, 1,
                   0, 1), ncol = 2, byrow = TRUE)
BTD_cov_power(case = -2,
case_cov = c(0.2, -0.6),
control_task = c(0, 1),
control_covar = covars,
cor_mat = diag(3) + 0.3 - diag(c(0.3, 0.3, 0.3)),
sample_size = 15,
alternative = "less",
alpha = 0.05,
nsim = 100,
iter = 100)
## [1] 0.49
For the difference tests one must supply the expected case scores on both
variates as well as sample size. The means and standard deviations of the
control sample can also be specified. If unspecified, they take on the default values of
0 and 1 respectively, so that the expected case scores are interpreted as
distances from the means in standard deviations. RSDT_power()
,
BSDT_power()
and UDT_power()
additionally require an estimate of
the sample correlation between the variates of interest, r_ab
. If this is
not specified a correlation of 0.5 is assumed by default. For
BSDT_cov_power()
the correlation matrix between the variates of interest
and the covariates must instead be supplied (i.e., at least a \(3\times3\) matrix
where the first correlation is the correlation between the variates of
interest).
The alternative hypothesis is by default assumed to be
"two.sided"
since the direction of the effect is dependent on the
order of the inputs, but can be specified to be "less"
or
"greater"
as well. The syntax is similar for all three functions but with small
differences. For UDT_power()
one can request required sample size for a
desired power, as for TD_power()
. Calculators for the Bayesian tests
have the extra argument calibrated, which allows the prior to be specified.
BSDT_cov_power()
requires input in the same format as BTD_cov_power()
for both control_tasks
and control_covar
. The two examples
below demonstrate usage for RSDT_power()
and BSDT_cov_power()
.
RSDT_power(case_a = 70,
case_b = 55,
mean_a = 100,
mean_b = 50,
sd_a = 15,
sd_b = 10,
r_ab = 0.5,
sample_size = 15,
alternative = "two.sided",
alpha = 0.05,
nsim = 1000)
## [1] 0.604
cor_mat <- matrix(c(1, 0.5, 0.6,
                    0.5, 1, 0.3,
                    0.6, 0.3, 1), ncol = 3, byrow = TRUE)
BSDT_cov_power(case_tasks = c(70, 55),
case_cov = 65,
control_tasks = matrix(c(100, 15,
50, 10), ncol = 2, byrow = TRUE),
control_covar = c(50, 25),
cor_mat = cor_mat,
sample_size = 15,
alternative = "two.sided",
alpha = 0.05,
nsim = 100,
iter = 100,
calibrated = TRUE)
## [1] 0.76
This vignette has reviewed several statistical methods to compare a single case to
a small control sample. It has exemplified practical usage of these methods with
implementations in singcar,
where possible, and in lme4
where not. Using repeated measures data and linear
mixed models has been shown to potentially produce higher power when testing for
a discrepancy between two variates than the more standard RSDT, in a very small
simulation study. However, when no discrepancy is present but the case exhibits
severe impairments on both variates the mixed model yielded an extremely high
error rate, for the specific scenario investigated. So when estimating dissociations
it is recommended to use one of the Bayesian tests on aggregated data.
Since more complex model structures can be created with linear mixed models compared to the more standard procedures it is unfortunate that their usage for finding dissociations seems limited. This is on the other hand somewhat made up for by the implementation of the Bayesian regression techniques in Section 2.3.4. In addition, implementation of (relatively) unbiased techniques to estimate case abnormality in multidimensional space, Section 2.5, also offers the possibility of increased complexity.
The methods described have been developed for keeping transparent control over Type I errors, but power calculators have been implemented in the package as well. Consideration of power can assist researchers in study design and in setting realistic expectations for what these types of statistical hypothesis tests can achieve (McIntosh and Rittmo 2020).
Code for the function powerLMM()
which evaluates power for the linear
mixed model used to test for the presence of a dissociation and
compares it to the power for either the RSDT or the BSDT, with data
distributed as described in Section 2.4.1.
powerLMM <- function(nsim, case_impairment = c(0, 0, 0, 0),
                     compare_against = c("RSDT", "BSDT"), alpha = 0.05) {
  require(lme4)
  require(lmerTest)
  require(MASS)
  comptest <- match.arg(compare_against)
  pdt <- vector(length = nsim)
  plm <- vector(length = nsim)
  simdata <- expand.grid(case = c(1, rep(0, 29)),
                         item = factor(1:2), condition = factor(1:2))
  simdata$ID <- factor(rep(1:30, 4))
  contrasts(simdata$condition) <- contr.sum(2)
  mu <- c(100, 80, 50, 20)
  sigma <- matrix(c(15^2, 120, 45, 7,
                    120, 10^2, 11, 15,
                    45, 11, 10^2, 40,
                    7, 15, 40, 5^2), nrow = 4, byrow = TRUE)
  for (i in 1:nsim) {
    dv <- mvrnorm(30, mu = mu, Sigma = sigma)
    dv[1, ] <- dv[1, ] + case_impairment
    simdata$dv <- c(dv)
    model2 <- lmer(dv ~ case*condition + (condition|ID) + (1|item), data = simdata)

    # Difference test (RSDT or BSDT) on data aggregated over items
    agg_data1 <- rowMeans(dv[, 1:2])
    agg_data2 <- rowMeans(dv[, 3:4])
    if (comptest == "RSDT") {
      pdt[i] <- RSDT(agg_data1[1], agg_data2[1],
                     agg_data1[-1], agg_data2[-1])[["p.value"]]
    } else {
      pdt[i] <- BSDT(agg_data1[1], agg_data2[1],
                     agg_data1[-1], agg_data2[-1])[["p.value"]]
    }
    # p value of the interaction (dissociation) term in the LMM
    plm[i] <- summary(model2)[["coefficients"]][4, 5]
  }
  if (comptest == "RSDT") {
    data.frame(RSDT = sum(pdt < alpha)/nsim, LMM = sum(plm < alpha)/nsim)
  } else {
    data.frame(BSDT = sum(pdt < alpha)/nsim, LMM = sum(plm < alpha)/nsim)
  }
}