Interest in the featured study centres on the LifeSkills Training curriculum rather than on the family programme. The curriculum was common to all the intervention schools and (since in this report no advantage was gained by adding the family programme) appears to have been mainly responsible for curbing the growth of substance use relative to education-as-usual. In a Findings review and in other commentaries (1 2 3), its impressive research record has however been criticised for methodological and other features which weaken confidence in the practical significance of the findings. Reports from the featured study itself have been criticised (1 2 3) on methodological grounds, and defences have been mounted by the authors (1 2).
Criticisms which might to a degree be levelled at the featured report include testing for many outcomes, an approach justified partly by the range of outcomes expected from the interventions, but one which at the same time increases the risk of finding some results which appear statistically significant purely by chance. The authors note (personal communication from Dr Spoth, July 2009) "our exception to this statement, based on the two intervention conditions [having been] tested separately, along with the typical convention for intervention studies expecting effects on multiple outcomes. Also keep in mind the implication of adjustment for Type I error rates (increasing the Type II error rate) and the trade offs that are important to consider in this context". The last point refers to the fact that raising the bar for finding significant differences increases the risk that real differences will be judged insignificant. The authors also point out that the tested outcomes were all derived on theoretical grounds rather than chosen at random.

By the study's own yardstick, there were 13 significant findings out of a presumed 44 tests, many more than the roughly two which would have been expected purely by chance. The total of 44 is arrived at as follows. There were 20 tests of initiation of substance use (table 2). Tests for more serious forms of substance use (frequency of substance use, monthly poly-substance use, and the advanced poly-substance use index) were not tabulated for the full samples; all were non-significant. Similar tests were tabulated (table 3) for the higher risk pupils, for whom 24 tests were specified. On the assumption that the full sample was subject to the same tests, tests totalled 44, of which 13 were described as significant in the study to at least the conventional '1 in 20 by chance' level.

However, this was on the basis of so-called 'one-tailed' statistical tests, as opposed to 'two-tailed' tests, which effectively assume that a negative finding (in this case, that the interventions made things worse relative to the control schools) must be a meaningless fluke; this kind of test roughly doubles the chance of finding a significant positive effect. Sometimes one-tailed tests are justified by appeal to the study's aim (to test whether the intervention is better than the alternative, not whether it is better or worse) or by virtue of the expectation that the intervention will be better. But "a one-tailed test is only well justified if in addition to the existence of a strong directional hypothesis, it can be convincingly argued that an outcome in the wrong tail [referred to here as a negative finding] is meaningless and might as well be dismissed as a chance occurrence" (Abelson R.P. Statistics as principled argument. Lawrence Erlbaum Associates, 1995). In the featured study, the argument for this assumption rests on the past record of the interventions. However, school drug education in the USA is evolving, and some of the comparison schools may also have implemented effective drug education programmes. Since the researchers did not know what the control schools were doing, they cannot have known that it was bound to be inferior to the lessons implemented in the intervention schools.

Without this assumption, fewer of the tests of whether across the whole samples the interventions retarded the uptake of substance use might have been significant, though still more than would have been expected by chance. Of the 13 of the presumed 44 tests described as significant to at least the conventional '1 in 20 by chance' level, six did not also meet the more stringent '1 in 100' criterion. Roughly doubling these probabilities by applying two-tailed tests might have rendered those six findings non-significant. It is impossible to be sure, because the findings at issue were reported as equal to or bettering the conventional '1 in 20 by chance' standard, but not by how much.

The lead author (personal communication from Dr Spoth, June 2009) reports that "although there was a large number of null findings, the positive findings well exceeded the number likely to be observed by chance alone, even when two-tailed tests are applied". He also argues (personal communication from Dr Spoth, July 2009) that one-tailed tests were justified by the direction of findings in "all the preceding randomised controlled studies evaluating the interventions in question, including multiple types of outcomes, multiple types of informants, across numerous waves of data within primary outcomes studies, across decades (different time periods). From this evidence-based perspective, the picture on negative outcomes is very clear, certainly with respect to all we have studied or critically reviewed". On the other hand, though many tests did not show the interventions were superior to education-as-usual, none indicated that they were inferior (personal communication from Dr Spoth, July 2009); the only significant findings favoured the interventions.
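The arithmetic behind the 'roughly two expected by chance' figure can be sketched as follows. This is a minimal illustration using the report's figures of 44 tests and the 1 in 20 threshold; treating the tests as independent is an assumption made purely for the sketch, since in reality the outcomes were related.

```python
from math import comb

n_tests = 44   # presumed total number of significance tests
alpha = 0.05   # the conventional '1 in 20 by chance' level
observed = 13  # findings reported as significant

# Expected number of spuriously 'significant' results if no real effects exist
expected_by_chance = n_tests * alpha  # 44 * 0.05 = 2.2, i.e. 'roughly two'

# Probability of 13 or more 'significant' results arising purely by chance,
# treating the tests (unrealistically) as independent
p_13_or_more = sum(
    comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)
    for k in range(observed, n_tests + 1)
)
print(expected_by_chance, p_13_or_more)
```

On these assumptions the chance of 13 or more significant results is vanishingly small, which is the basis for saying the positive findings "well exceeded the number likely to be observed by chance alone".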
Additionally, to cater for the risk of finding some significant differences purely by chance, studies which report many related outcomes are advised to consider (1 2) adjusting the significance bar upwards. Concerns arise partly because the outcomes are in practice and in logic related (for example, getting drunk requires initiation of drinking), and because the programmes concerned present themselves as substance use prevention programmes, so each test for each substance can be seen as a partial test of this global hypothesis. For a counterargument from the authors, see their comments above on testing for many outcomes. Additionally, the lead author (personal communication from Dr Spoth, July 2009) argues that such an adjustment "is appropriate primarily when there are interrelated outcomes under one hypothesis (multivariate outcome modelling would be a case in point)". If the bar had been raised in this way, some of the positive findings might well have been rendered non-significant.
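A Bonferroni-style correction is the simplest form of the adjustment at issue: the conventional threshold is divided by the number of tests. A minimal sketch follows; the p-values here are invented for illustration only, since the report gave significance in '1 in 20' and '1 in 100' bands rather than exact values.

```python
# Bonferroni: divide the conventional threshold by the number of tests
alpha = 0.05
n_tests = 44
adjusted_alpha = alpha / n_tests  # roughly 0.0011

# Hypothetical p-values, all below the conventional '1 in 20' bar
hypothetical_p_values = [0.001, 0.004, 0.012, 0.03, 0.045]

significant_conventional = [p for p in hypothetical_p_values if p < alpha]
significant_adjusted = [p for p in hypothetical_p_values if p < adjusted_alpha]

print(len(significant_conventional))  # all 5 pass the conventional bar
print(len(significant_adjusted))      # only 1 survives the adjustment
```

This illustrates why raising the bar (lowering the threshold each individual test must meet) could have rendered some of the positive findings non-significant, at the cost, as the authors point out, of a greater risk of missing real differences.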
Results for the high risk sample were more robust and consistent: 18 of the 24 tests were described as significant in the study to at least the conventional '1 in 20 by chance' level using one-tailed tests, and 16 remained significant at this level when two-tailed tests were used (personal communication from Dr Spoth, July 2009), though there remains the issue of whether the significance bar should have been raised to cater for multiple tests. But here there were other problems. First, selecting pupils this way robs the analysis of the reassurance offered by randomisation that like is being compared with like. This is inevitable when any universally applied intervention is probed to see if it affects some subsamples more than others; it is not a reason for avoiding such an analysis, but (as the authors acknowledged) it demands a cautious interpretation of the results.

Further caution is warranted because in this case the decision on how to subdivide the sample was not made in advance, but on the basis of earlier results from the same study. Post-hoc subsample analyses of this kind are best seen as generating hypotheses for testing in a study specially designed for the purpose. The main problems are that they rob the results of the reassurance of the level playing field created by randomising patients to different treatments, they build on what may be chance variation in the effectiveness of the intervention between different subsamples, they test effects not derived from the theory of how the intervention is supposed to work, and (there is no implication that this was a problem in this case) they can capitalise on the fact that samples can be sub-sampled in any number of ways until one (perhaps purely by chance) results in a significant finding. As a result, "any conclusion of treatment efficacy (or lack thereof) or safety based solely on exploratory subgroup analyses are unlikely to be accepted" (Lewis J.A. "Statistical principles for clinical trials (ICH E9): an introductory note on an international guideline." Statistics in Medicine: 1999, 18, p. 1903–1904. http://www3.interscience.wiley.com/journal/63000985/abstract?). These risks are eliminated or reduced by specifying the subsamples in advance at the time the trial is designed, but often this is not the case (Al-Marzouki S., Roberts I. "Selective reporting in clinical trials: analysis of trial protocols accepted by The Lancet." The Lancet: 2008, 372, 19 July, p. 201).

Also, there were substantially and significantly more higher risk pupils in intervention schools than in control schools (22% versus 15%), and pre-intervention experience of each of the three substances included in the index used to segment the samples was more common in both sets of intervention schools than in control schools. Since 'higher risk' use was relatively common in the intervention schools, it might also have been less of a real marker of future more serious forms of substance use, possibly explaining why this risk came to fruition less often in intervention schools. The authors contest this point (personal communication from Dr Spoth, July 2009): " ... our study design entails randomisation of those schools. Thus, the factors related to 'Risks' had been addressed at the stage of the randomisation. Therefore, the level of differences in risk over 36 schools should be a 'random noise'; the way we evaluated the risk level is the correct way to address that noise. Also, the control group students overall showed a lower level of risk relative to the two intervention condition students overall, with a subset within each school demonstrating higher risk." Impacts on the forms of drug use of greatest concern emerged solely from this analysis, meaning that the interventions' ability to reduce these cannot be considered to have been demonstrated.
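The point about sub-sampling 'in any number of ways' can be put in figures: if independent subgroup tests are each run at the conventional 1 in 20 level, the chance of at least one spuriously significant finding grows quickly with the number of looks taken. A sketch, assuming independence (which real subgroup tests only approximate):

```python
# Chance of at least one spuriously 'significant' finding among k
# independent subgroup tests, each run at the conventional 0.05 level
alpha = 0.05
for k in (1, 5, 10, 20):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(k, round(p_at_least_one, 2))
# with 20 looks the chance of a fluke 'finding' is nearly two in three
```

This is why exploratory subgroup results are treated as hypothesis-generating rather than confirmatory.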
Finally, the reported analysis left open the possibility that among the lower risk pupils there were significant findings in the opposite direction, with intervention schools doing worse than control schools. However, there were in fact no such findings (personal communication from Dr Spoth, June 2009).
Though the study did relatively well in maintaining contact with the pupils, the fact that over a quarter were missing at the final follow-up could undermine the findings, especially in respect of rare behaviours measured at that point. Analyses indicating that, on the measured variables, drop-out was similar across all the groups offer some reassurance, but cannot eliminate this risk.