• The context set by the moderator was for the panel to imagine that they had been called on to advise a government official who had the legal responsibility to make a hazard determination for a particular exposure, such as whether a substance should be designated a known carcinogenic hazard, or something lesser, such as a possible hazard, which would then be publicized by the government and possibly used for regulatory purposes. For this project, the panel's review of the "London Principles" was organized around the topic of whether the body of studies forming the materials for the conference indicated a causal relationship between induced abortion and breast cancer.
  • The points summarized below indicate views expressed by one or more of the discussants; they do not necessarily represent consensus views unless so indicated.

A. Consideration of Biological Information

The Panel's discussion was opened by the two non-epidemiologists with reproductive toxicology experience, who commented on the biological basis for judging whether a causal relationship was at work.

First Discussant

  • In the most relevant animal studies, one portion of a group of pregnant rats, of an inbred strain known to be susceptible to 7,12-dimethylbenz(a)anthracene ("DMBA"), a known animal carcinogen, was given hysterectomies while the other portion was allowed to go to full term and give birth. Subsequently, both portions of the group were administered DMBA. Those whose pregnancies had been interrupted responded the same as virgin rats -- i.e., the DMBA produced a higher incidence of breast cancer in both the virgins and the pregnancy-interrupted group. Those that had been allowed to go to full term were protected from the breast cancer that might have resulted from the DMBA.
  • Studies have also been carried out on human breast development which indicate that pregnancy allows the breast to mature, and therefore the assumption is made that such maturation confers a degree of protection from carcinogens because further cell differentiation, which contributes to carcinogenesis, is limited.
  • The assumption has been made that the animal experiments are relevant to the human situation. There are significant flaws in this reasoning. The first is the assumption that the human breast is similarly susceptible to DMBA. The second is that the DMBA was administered after pregnancy rather than before; and if it had been administered before, the reasoning would seem to lead to the conclusion that the cell differentiation that occurs with full-term pregnancy would enhance the carcinogenic effect of the DMBA. In addition, the experiments were carried out only with nulliparous rats.
  • To sum up, the reviewer thought that the biological information was significantly incomplete, and that it would not strongly support a causal link with induced abortion, although it could provide some support. The animal data neither strongly supported nor strongly detracted from a hypothesis that induced abortion increases the risk of breast cancer. The toxicologist also thought it important that DMBA would be characterized as an initiator, whereas estrogen would more likely be characterized as a promoter. (There was a discussion over the accuracy of this terminology, and one of the epidemiologists noted that in their field they were usually pleased just to know whether an agent acts at an early or later stage, as opposed to understanding the mechanism involved.)

Second Discussant

  • The Second Discussant spoke more to the human biological data. Many reviewers of the issue seem to have assumed
    an estrogen-mediated event. In reality, the endocrine biology is far more complex, involving many other hormones, such as progesterone, cortisol, thyroid, and growth hormones. Also, proliferative effects on the breast, including swelling, occur during menstruation. Further complicating the situation, there are psychopharmacologic changes that accompany childbearing, particularly in the hypothalamic-pituitary-adrenal axis, and women modify their behavior in connection with pregnancy -- abstinence from alcohol and tobacco being examples. Still a further consideration is what happens with the immune system during pregnancy. Since the immune system is suppressed during pregnancy, one might suppose that a full-term pregnancy could enhance the opportunity for carcinogenesis. Thus, it seemed naive to regard the biology as supporting a conclusion that induced abortion causes breast cancer based on a focus on estrogen. The changes that occur in early pregnancy, and therefore the effects that could occur from interference with those changes, are not minimal; but pregnancy involves a complex cascade of biological events, and those changes, while significant, are not very different from many other hormonal excursions that occur during other aspects of life.

  • With regard to spontaneous vs. induced abortion, there are some differences in hormone levels in a normal pregnancy vs. one destined to have problems, but they usually do not occur until after seven or eight weeks, whereas most induced abortions occur before that point.
  • It would be difficult to assess whether an abortion during the second trimester would have more impact than one during the first trimester. The biological events that occur as pregnancy progresses involve many factors other than estrogen, such as insulin levels, so that what happens in the second trimester would be different from what happens in the first trimester; but whether it would involve more risk from a biological perspective probably could not be answered.
  • Another member of the panel noted that this seemed to mean that it would be very difficult for an epidemiologist to try to figure out what dose-response would mean in terms of this issue.

Panel Discussion

  • Both Reviewers on the issue of biology agreed that there were substantial gaps in the biological understanding, and that the data did not indicate that there were dose-response effects that would be likely as a result of increasing numbers of induced abortions.
  • The reaction of the epidemiologists to the above comments on the relevant biological information was generally that it was useful in the sense that it let them know that there could be biological plausibility, but that the data were equivocal and did not give them good information from which to make decisions on study design, such as how to look at dose-response or timing, or which biomarkers were of particular importance. They considered it significant that the biological data did not argue against a possible causal relationship. Electromagnetic fields (EMF) were discussed as an example of an exposure for which non-epidemiologists had raised strong arguments against the biological plausibility of a causal relationship.

B. Comments on Individual Assigned Epidemiologic Studies

1. Nishiyama et al. (1982, RR of approximately 2.5)

First Reviewer

  • Using the existing Principles for this exercise was "torture", because, while much of the content of the Principles was good, the subquestions were filled with compound questions that supposedly required a single Yes or No response.
  • The study report was fairly crude by modern standards in this country, and he had to respond to many of the subquestions with a simple "do not know".
  • This led the reviewer to comment on the important distinction between study reporting and study quality. This study report did not give sufficient information to answer many questions the reviewer had, but that would not lead him to infer that the study was of poor quality. He would have to contact the investigator(s) and ask about many things. It was noted by the reviewer and others on the panel that some of the most prestigious journals, in which authors/investigators most aim to publish, put strict page limits on articles, so that the description of the study is necessarily truncated. In reviewing a body of literature on an issue, this reviewer would give less weight to a less informative article, not because it was of poorer quality, but because it had a higher degree of uncertainty as published.
  • Subquestion A-1(b) was particularly in need of revision. It represented an outdated conception of subject selection. It does not matter whether exposed and unexposed subjects are "comparable at baseline" if the proper adjustments are made. The appropriate question is whether exposed and unexposed subjects are expected to be at comparable risk at baseline with consideration of possible confounding factors.

Second Reviewer

  • There were clearly a lot of gaps in the study report, such as not making clear how the control group was constituted, but the reviewer would give some weight to it because there was a large sample size and it considered most of the other risk factors.
  • Use of the London Principles showed many of the gaps, but many of the questions in the Principles did not fit well with case-control studies, which this was, and which many of the other studies were.


Panel Discussion

  • All the controls or adjustments in the world will not make things right if the subject selection process is inherently biased.
  • There should be a list of considerations for subject selection that touches on those several areas where big errors are possible. One must also frame the questions in terms of whether there was a problem with a specific aspect of the study, not simply whether something was considered.
  • It was noted that, from experience in medical practice, women who choose to have a child vs. those who choose to have an abortion have discernibly different behavioral/lifestyle patterns (such as drug use), and this is a significant confounding factor.
  • There was discussion about adjustment for confounding. The issue raised was the validity of many studies stating that they had statistically adjusted for confounding and had found that it made little difference in the outcome, and even giving a very precise figure for the amount of adjustment. One panel member commented that the flaw in this is that the reason adjustment for confounding often does not seem to make much difference is because the investigators have not measured well for key confounders. Another commented that often the search for confounding is like "the drunk looking under the lamp post". Another related this to the current topic by noting that we know there are differences between those who get abortions and those who give birth, but we do not have a handle on those differences as possible sources of bias.
  • There was discussion about the possible need to expand on Principle A-5, which addresses the need to adjust for confounding and bias. Several panelists noted that they thought the Principle was heartening because it made a start on an important issue, but that they thought it
    should be expanded, for example to say that the investigators should explain how they adjusted.

  • One panelist expressed consternation over the common practice of simply using a standard statistical routine, such as Mantel-Haenszel adjustment, to adjust, characterizing it as "alchemy" and expressing the opinion that what was needed was more sensitivity analysis, because the standard statistical analyses do not begin to touch on the real problems. Sensitivity analysis is what you do when you do not have the data, the measurements. The current issue is a good example of having many possible confounders and biases for which there is no good information that would permit confident adjustments, and therefore the uncertainties would be large, and the sensitivity analysis would reflect those uncertainties.
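The kind of sensitivity analysis the panelist has in mind can be illustrated with the classic external-adjustment formula for a single unmeasured binary confounder. The sketch below is only illustrative: the observed risk ratio and all confounder scenarios are hypothetical values, not figures from any of the studies under review.

```python
# Sketch of a simple external-adjustment sensitivity analysis for one
# unmeasured binary confounder. All inputs are hypothetical illustrations,
# not values taken from the studies discussed in this report.

def bias_factor(rr_cu: float, p1: float, p0: float) -> float:
    """Multiplicative bias in the observed risk ratio from the confounder.

    rr_cu: assumed risk ratio linking the confounder to the disease
    p1:    assumed prevalence of the confounder among the exposed
    p0:    assumed prevalence of the confounder among the unexposed
    """
    return (rr_cu * p1 + (1 - p1)) / (rr_cu * p0 + (1 - p0))

def adjusted_rr(rr_obs: float, rr_cu: float, p1: float, p0: float) -> float:
    """Observed risk ratio corrected under the assumed confounder scenario."""
    return rr_obs / bias_factor(rr_cu, p1, p0)

if __name__ == "__main__":
    rr_obs = 1.5  # hypothetical observed risk ratio
    # Sweep a grid of plausible confounder scenarios and report the spread
    # of corrected estimates, rather than a single "adjusted" figure.
    for rr_cu in (1.5, 2.0, 3.0):
        for p1, p0 in ((0.4, 0.2), (0.5, 0.2), (0.6, 0.1)):
            print(rr_cu, p1, p0, round(adjusted_rr(rr_obs, rr_cu, p1, p0), 2))
```

Sweeping a grid of assumed scenarios in this way is what makes the resulting uncertainty visible, which is the point of the panelist's remark.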

2. Howe et al. (1989, overall RR of 1.9)

Both Reviewers

The study had some positive aspects, but some substantial weaknesses: It lacked information on most potential confounding factors, and the method of exposure ascertainment was confusing and possibly badly flawed in using fetal death certificates, which would not give good information on spontaneous abortions. As a result, the reviewers would give the study little weight.

Another possible weakness was that the study did not seem to have uniform criteria for inclusion/exclusion of cases and controls.

Although the Principles were helpful on the better studies, on this one, because the reviewers felt the significant flaws were so obvious, the Principles did not add much. For someone relatively new to the field, the Principles might cause points such as the ones relevant to this study to be brought to attention; but for an experienced epidemiologist, such significant flaws are conspicuous.


Panel Discussion

There were comments that the Principles might be trying to do too much in a condensed form. The Principles themselves seemed fine; but it seemed that many of the subquestions did not fit depending on the type of study (cohort or case-control) or the type of exposure, and perhaps what was needed was different layers of principles -- a general set, and then more targeted sets for different types of studies and types of exposures. For example, with regard to the study being reviewed, the subquestion that seemed most pertinent to the significant issues was A-3(g) (on accuracy of exposure measurement or estimation), but it did not capture the issues very well. There seemed to be too much in the subquestions about whether something or other was reported in the study, when the real question was whether there was likely to be bias, such as false negative or false positive attribution of exposure. Another question should be whether the exposure measurement was reflective of the relevant time period. The bottom-line question should be whether the exposure measurements were accurate and relevant.

3. Adami et al. (1990, overall RR of 0.9)

First Reviewer

That it was a multi-center study was a strength, and it was basically sound methodologically. However, there appeared to be heterogeneity in the case-control sampling and response rates; data collection methods could have influenced results; and there was no stratified analysis by center.

The relative homogeneity of the populations in terms of racial makeup and cultural factors seemed to be a distinct advantage for this study, as opposed to one conducted in the United States, where one would expect large differences among the study populations with
regard to such factors. The Principles did not seem to capture this point well.

The main Principles seemed to be generally sound, but, again, the reviewer had difficulty with the subquestions, in large part because they often contained compound questions, so that it was difficult to give a single Yes or No answer.

For a number of Principles subquestions, such as those on blinding and quality control, one could not answer the question, but one would assume that that aspect of the study was done right if one knew the study team was reputable. The quality control aspects are important, but no journal is going to publish that much detail. There is a question of what is realistic to expect, particularly if it is a routine matter, as opposed to, say, a formal validation study. One would like to know there is a more detailed write-up somewhere that documents everything, but one cannot expect such detail in the journal.

Second Reviewer

  • Under Principle A-6, the reviewer thought that subquestions (e) (explaining contradictory or implausible results) and (f) (exploring alternative explanations) were largely irrelevant, because an experienced reviewer could do this if the data from the study were reported correctly. It could earn the authors a pat on the back, but really what was important was the particular findings of their study. If they do what is suggested in the Principles, it does not change the quality of their data. However, it could help another reviewer in thinking about a particular aspect of their study and how much weight to give the study.
  • Under the same Principle, subquestions (g) and (h) (interpretation of results and discussing public health implications) should actually be discouraged in the reporting of individual studies, whereas the Principles seemed to be encouraging it. The presence of this type of
    discussion in an individual study would raise questions regarding the motives and biases of the study author(s).

Panel Discussion

  • This seemed to be the only study they had ever seen which did not find a protective effect with pregnancy, which raised concerns. In addition, one could not tell from the article what the latency was. Also, there were concerns about the range of the confidence intervals, and it seemed that the study was not designed to assess the issue under consideration.
  • It would be gratifying to have the opportunity to really refine the Principles through several iterations.

4. La Vecchia et al. (1993, RR of 0.9)

Both Reviewers

  • There were problems with selection of controls because it was a hospital-based study with controls drawn out of other patients in the hospital. The use of cases of all ages without looking at age-specific risks probably also biased results.
  • Additionally, the study, like the previous one, was not really designed to address the issue under consideration.
  • In general, the study was not very useful, and the London Principles were simply in line with how an experienced epidemiologist would analyze it.

Panel Discussion

  • There is some schizophrenia inherent in the Principles at present. On the one hand, they appear to be informative to non-epidemiologists;
    but it would be very dangerous to give non-epidemiologists the idea that they could apply the Principles and make accurate appraisals. On the other hand, the Principles as they stand, while they should be applied by expert epidemiologists, seem to largely reflect basic precepts that all expert epidemiologists should be familiar with. If the Principles are to further the state of the art, or bring more consistency to individual epidemiologic studies, evaluation of studies, and evaluation of bodies of studies, they should be refined. It was emphasized that the Principles should not be employed by non-epidemiologists in an official capacity. This would be like trying to have epidemiologists evaluate toxicology studies. Assuming the Principles are to be used by epidemiologists to advise non-epidemiologists, the aim should be to make evaluations more systematic and consistent; but the Principles should be advisory, not dogmatic.

  • Doubt was expressed about the statement in Principle A-6 that a study should be of sufficiently high quality to be publishable in a peer-reviewed journal. Publication is not much of a quality control these days; and much that is not published may be of higher quality than that which is published.
  • Epidemiologists like to publish their best work in journals that will give them sufficient space to provide the details of their analysis, which are usually the more specialized journals rather than the ones intended for a wider audience.
  • The Panel seemed to agree that what was really needed was three separate sets of principles or questions: one consisting of principles or questions for epidemiologists to use in evaluating individual studies; a second for epidemiologists evaluating a body of studies; and a third for lead risk assessors and risk managers to employ in consultation with the epidemiologists and other risk assessors.


5. Laing et al. (1993, overall RR of 3.1)

First Reviewer

  • The use of histologically confirmed cases and large study size were strengths.
  • But there were what could be called fatal flaws that made it of minimal or no utility for risk assessment purposes. The controls were selected in a way that put them at the extreme of non-comparability; and the questioning strategy opened up the likelihood of response bias and bias due to confounding from use of oral contraceptives (cases were probably asked about oral contraceptive use, while the controls probably were not) -- in other words, problems with exposure ascertainment.
  • With regard to the London Principles, they added little that was not already apparent to an experienced epidemiologist on first reading.

Second Reviewer

  • Agreed with the First Reviewer's points.

Panel Discussion

  • The London Principles do not allow for gradations of quality; everything is just yes, no, or don't know/not applicable. There should be a way to allow for at least a scale of three or four points on a particular aspect of study quality, such as control of confounding, exposure measurement, and suitability of the control group. This is not the same as giving an overall quality score for a study; it would allow for integration across diverse dimensions. This would be for the purpose of assisting the epidemiologist in evaluating the individual studies, and then he/she would make a judgment as to utility for risk assessment based on the issue-by-issue evaluation of the study.


6. Daling et al. (1994, overall RR of 1.36)

First Reviewer

  • In general, this appeared to be a careful study; however, there were some minor weaknesses or uncertainties.
  • It was not clear from the study report how soon after diagnosis the cases were interviewed. It appeared that there might have been a substantial time lapse, so that they would have had more time to speculate on the cause of their disease, which could have introduced more potential for recall bias.
  • It would have been useful to have information on risk factors for non-respondents vs. respondents.

Second Reviewer

  • The efforts to obtain self-reported data on induced abortion seemed to be as good as they could get -- the investigators were up against the limits of what could be done.
  • It was a good example of the state-of-the art in case-control studies; but nevertheless the mundane, familiar, annoying potential sources of error remained. Non-response is not likely to be random, and the reasons may well not be the same for cases as for controls. But there does not seem to be a way to fix that. There is also still the potential for lack of truthfulness or inaccurate recall in response to questions about having had an induced abortion, and that also appears unresolvable.

Panel Discussion

  • There could be an issue of statistical bias in this study of a type that is not generally recognized: Adjustment for multiple confounding
    factors in a small subgroup can cause the risk ratio to inflate. This artifact needs to be better understood by investigators.

7. Lipworth et al. (1995, overall RR of 1.5)

Both Reviewers

  • It was difficult to tell how much weight to give this study. One reviewer would give it a fair amount of weight; the other would be faced with a lot of uncertainty. The analysis in the study report seemed to be fairly good; but it was difficult to tell how the controls were selected, and it appeared that the method of selection was peculiar. The controls did not seem to be well-related to the population from which the cases came. This goes to the suitability of controls and the possibility of selection bias. And yet, the way the study was presented, and the fact that the usual risk factors seemed concordant between the cases and controls, made the study appear better than it was entitled to be regarded. Control selection is probably the most important aspect of a case-control study; and if it is flawed, the whole study will be flawed.
  • In addition, the study was not designed to examine the issue under consideration.
  • The Principles tended to confirm the problem with controls that was noted at the outset. If one had to genuinely confront and respond to the Principles, it could force one to be more rational about the control selection issue and give the study less weight than one might otherwise be inclined to give it after forming a subjective impression from a simple reading.

Panel Discussion

  • On the matter of control selection, this again is an area where the Principles need improvement. The way it is expressed is outmoded and flawed. The issue is not whether the cases and controls are comparable; it is whether the controls were selected from the same population as the cases in some identifiable fashion, or by a random method.

8. Daling et al. (1996, RR in range of 1.2 to 2.0)

Both Reviewers

  • A strength was the use of multiple centers to increase sample size and the heterogeneity of the population.
  • The main problem with the study was that the control sampling method resulted in an unusual number of poor matches -- differences in matching cases and controls across the different geographic sites -- but the investigators recognized this issue and appeared to have made a valid statistical adjustment. Control selection was by random digit dialing, and it appeared that making the calls cross-country could have been a problem.
  • The study was well-conducted and would be entitled to equal weight with other well-conducted retrospective studies.

Panel Discussion

  • It was meritorious that the problem with controls was noted in the article.
  • But it is hard to correct for a control selection problem and be assured of having eliminated bias. It is much better to do it right in the first place.
  • This is a good example of a study that could get a reduced quality score because it identified a problem, whereas another report might actually be of lower quality but perceived as of higher quality because the investigators did not flag a problem and, using the Principles, one would mark down "Not Known".
  • One advantage in using the Principles is that the "Not Known"s could lead one to go back and check with the investigator(s), which is something that should be encouraged, particularly when one is doing a meta-analysis.

9. Lindefors-Harris et al. (1991, response bias study comparing two studies with range of overall RR's of 1.1 to 2.0)

First Reviewer

  • The reviewer had recently done a study on fat intake and breast cancer in which the controls apparently did not under-report fat intake.
  • The subject study, which attempted to look for response bias, appeared to be fairly solid because it compared responses with abortion registry information.
  • However, the investigators calculated a relatively crude response rate, and they did not attempt to consider potential confounders.

Second Reviewer

  • A possible weakness was treating false positives similarly to false negatives, when the psychological factors appear to be dissimilar -- in other words, it would be odd for someone to report that they had an induced abortion if they had not.

Panel Discussion

  • Much of the response difference found in the study came from cases that reported an abortion that did not show up in the registry. There
    was also a comment that it had been heard that the reporting to this registry was poor.

  • Age is an important factor in recall, of course. Many older (post-menopausal) women cannot remember a spontaneous abortion that is recorded in their medical records. In the subject study, women were being asked to recall events from quite a long time ago and to remember fairly accurately when an induced abortion occurred.
  • In the case of teratogenesis studies, the discussants thought that the literature showed that both cases and controls did a poor job of recalling, and it was mostly non-differential, although the cases did slightly less under-reporting. There was little in the way of false positives. There is both random bias and differential bias involved; but the bias seems to be towards the null.
  • But the subject studies introduced unusual factors such as religion and social desirability, and those factors seem to depend on the cultural milieu, so that there is a question over the generalizability of the results from one culture or region to another.
  • Perhaps the only way to handle the issue of potential response/recall bias as a general proposition is to do sensitivity analysis.
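One concrete form such a sensitivity analysis could take is to back-correct the observed exposure counts for assumed reporting accuracy, letting the assumptions differ between cases and controls (differential misclassification). The sketch below is purely illustrative: the cell counts and the sensitivity/specificity values are hypothetical, chosen only to show the mechanics of the calculation, not taken from any study discussed here.

```python
# Sketch of a recall-bias sensitivity analysis for a case-control study:
# back-correct observed exposure counts for assumed reporting sensitivity
# (Se) and specificity (Sp), allowing the values to differ between cases
# and controls. All counts and Se/Sp values below are hypothetical.

def corrected_exposed(obs_exposed: float, total: float, se: float, sp: float) -> float:
    """Estimated true number exposed, given observed count and assumed Se/Sp."""
    return (obs_exposed - (1 - sp) * total) / (se - (1 - sp))

def corrected_odds_ratio(a: float, n_cases: float, b: float, n_controls: float,
                         se_cases: float, sp_cases: float,
                         se_controls: float, sp_controls: float) -> float:
    """Odds ratio after correcting each group's exposure count separately."""
    a_t = corrected_exposed(a, n_cases, se_cases, sp_cases)
    b_t = corrected_exposed(b, n_controls, se_controls, sp_controls)
    return (a_t / (n_cases - a_t)) / (b_t / (n_controls - b_t))

if __name__ == "__main__":
    # Hypothetical data: 120/500 cases and 90/500 controls report the exposure.
    # Suppose controls under-report more than cases (Se 0.80 vs. 0.95).
    naive = (120 / 380) / (90 / 410)
    adj = corrected_odds_ratio(120, 500, 90, 500, 0.95, 1.0, 0.80, 1.0)
    print(round(naive, 2), round(adj, 2))
```

Repeating the correction over a range of assumed sensitivities shows how far differential recall alone could move the odds ratio, which is what a general-purpose sensitivity analysis of response bias amounts to.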

10. Rookus et al. (1996 study in which authors concluded that regional reporting bias could account for an RR of 1.9)

First Reviewer

  • The study was not entitled to much confidence. The power of the study was very low; the response rate was low; and the comparison between reporting of induced abortions and reporting of duration of oral contraceptive use (rather than any use of contraceptives) was dubious, because women seem to have a difficult time recalling the latter accurately. If they were going to mis-report on oral
    contraceptive use, one would expect that they would deny any use, rather than distort the duration of use.

Second Reviewer

  • Agreed with the above observations regarding weaknesses, but thought the concept of the study -- comparing responses from populations with different religions or cultures -- was interesting and promising.

11. Melbye et al. (1997 registry study with overall RR of 1.0, and RR of 1.39 at >12 wks.)

First Reviewer

  • There was a great potential for exposure misclassification that would tend to bias towards the null. The study used subjects of all ages, and the incidence of breast cancer increases with age, and many of the older women would have had induced abortions 25 or 30 years before, but the data only went back to 1974, and they assumed that there were no induced abortions for subjects before then. So there was bias in the data, even though it purported to be unbiased because it was registry data.
  • It was not clear how the crude risk of 1.4 had been adjusted down to 1.0.
  • The study included nulliparous women without analyzing them separately, and there is a lot of evidence that such women are at higher risk. It also included nulligravid women, who would be likely to lack comparability to gravid women for a number of reasons. It would have been useful to see more subgroup analysis.


Second Reviewer

  • Agreed with the main points made by the First Reviewer, and noted that this study appeared to be a mirror image of the other studies they had reviewed, in that it shared none of their weaknesses and none of their strengths.

Panel Discussion

  • The main issue here was a truncated exposure history. The London Principles contain a lot of questions about exposure under Principle A-3, but none appear to address the issue of exposure data that are incomplete in a temporal sense. There should be a question about whether there were substantial missing data, and, if so, how that was handled. This is often a critical issue. There are textbooks on how to handle missing data; but some investigators do bizarre things when faced with missing data.
  • Some interested parties (outside the panel) had apparently focused on this study as settling the response bias issue; but at the same time it created other problems. So in epidemiology things are often not as simple as they seem.

12. Brind et al. (1996 meta-analysis with synthetic RR of 1.3)

First Reviewer

  • Statistical principles underlying a sound meta-analysis are no different from those for an individual study. Many reviewers conducting meta-analyses have ignored standard statistical principles and invented, or utilized, statistical approaches which would not withstand criticism if used in an individual study.
  • Meta-analysis is not about combining evidence to arrive at a synthesis -- i.e., a single highly precise number. One should expect a good meta-analysis to examine for disparities, contradictions, etc.
  • Selecting for a meta-analysis the "best" studies based on apparent quality is a distortion -- it is like deciding to select study subjects based on whether you like their appearance. The reviewer presented a quotation from Kelsey et al.: "If the data show a marked heterogeneity, then attempts to summarize on a single rate ratio can obscure important features of the underlying rates and should be avoided."
  • Heterogeneity means differences in study results beyond what would be expected by chance, whether the studies are case-control or cohort. When heterogeneity appears in the results, one has to go back, look for heterogeneity in the study designs and methodology, and attempt to discover why it is present.
  • The subject meta-analysis did not attempt to look for heterogeneity in the results in a meaningful way. Meta-analysis must be an analysis. There was almost no statistical analysis in this paper, and the reviewer would rate its quality as very poor.
  • An initial serious flaw, which emerged when the data were re-analyzed and replicated, was that the authors had used a fixed effects summary rather than a random effects summary. The study concluded that the included studies showed consistent significant positive associations, but it did not examine the statistical agreement among them. For example, it did not analyze for differences in confounding.
  • The results of the meta-analysis look precise, but if just one subsequent study is added -- the Melbye et al. study (see above) -- the synthetic number drops from 1.3 to 1.1. In other words, the results, even as they are, are extremely fragile.


  • If an analysis for global heterogeneity is done (i.e., taking the studies as all perfect and just using their results), the heterogeneity is extreme -- that is, the analysis indicates that the variation among the results is very unlikely to be due to chance. Although the mean is about the same as the study's synthetic number, the distribution from the random effects analysis is extremely broad -- from 0.67 to 2.9. The 95 percent confidence interval from a random effects analysis is usually misinterpreted: it is a confidence interval only for the mean of the random effects distribution, which is of virtually no scientific interest.
  • The bottom line was that there was so much heterogeneity apparent from just going no further than taking individual study results at face value that a summary number was not justified. The meta-analysis was fatally flawed.
  • A summary chart showing the confidence intervals of the various studies is a necessary starting point, but it is only a starting point.
  • If one examines the various studies for heterogeneity just by geographical location, the amount of heterogeneity is extremely large, unaccounted for, and cannot be due to chance.
  • The authors did some sub-analysis and stratification, but they completely misinterpreted it. They found that the results changed little in the sub-analyses, and from that they concluded that the sub-analyses supported a finding of a real effect. Just the opposite should be the case. If there is great heterogeneity in results to start off with, the fact that the sub-analyses show little variation means that they do not account for the great initial heterogeneity. The homogeneity in the sub-analyses implies that there is tremendous, significant, unexplained heterogeneity. Additionally, if the biological hypothesis is correct, one would expect significant heterogeneity in the sub-analyses, but it is not present.


  • It is probably less deleterious simply to do a narrative analysis rather than a synthetic meta-analysis when there is substantial initial heterogeneity in study results, unless one is prepared to do a great deal of hard work in analyzing the data to see if they can explain the heterogeneity. Using a random effects model to summarize study results is not as misleading as using a fixed effects model, but it still sidesteps the main issues. One of the problems with using a random effects model is that it downweights the larger studies and is more sensitive to the smaller ones.
  • The problem with unpublished studies (sometimes referred to as the file drawer problem) is very real. Usually, if you are doing a meta-analysis and you beat the bushes you will come up with unpublished work. It usually tends to be null, but it could go either way.
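The reviewer's contrast between fixed effects and random effects summaries, the heterogeneity test, and the fragility point can be sketched in a few lines of Python. The study results below are purely hypothetical illustrations, not the data from the studies under review; the formulas are the standard inverse-variance (fixed effects) and DerSimonian-Laird (random effects) estimators.

```python
import math

# Hypothetical per-study results (log relative risks and standard errors);
# these numbers are illustrative only, not the reviewed studies' data.
log_rr = [0.26, 0.10, -0.05, 0.41, 0.18]
se     = [0.12, 0.15, 0.20, 0.10, 0.25]

# Fixed effects (inverse-variance) pooling: assumes one common true effect.
w_fixed = [1 / s**2 for s in se]
mu_fixed = sum(w * y for w, y in zip(w_fixed, log_rr)) / sum(w_fixed)

# Cochran's Q tests for heterogeneity beyond chance
# (compare to a chi-square with k - 1 degrees of freedom).
q = sum(w * (y - mu_fixed)**2 for w, y in zip(w_fixed, log_rr))
k = len(log_rr)
c = sum(w_fixed) - sum(w**2 for w in w_fixed) / sum(w_fixed)
tau2 = max(0.0, (q - (k - 1)) / c)  # DerSimonian-Laird between-study variance

# Random effects pooling: each study's true effect is drawn from a
# distribution with variance tau2, so smaller studies get more weight.
w_rand = [1 / (s**2 + tau2) for s in se]
mu_rand = sum(w * y for w, y in zip(w_rand, log_rr)) / sum(w_rand)
se_rand = math.sqrt(1 / sum(w_rand))

# The usual 95% interval covers only the MEAN of the random effects
# distribution; the spread of true effects themselves is roughly
# +/- 1.96 * sqrt(tau2) wider -- the reviewer's misinterpretation point.
print(f"fixed RR  = {math.exp(mu_fixed):.2f}")
print(f"random RR = {math.exp(mu_rand):.2f} "
      f"(CI for the mean: {math.exp(mu_rand - 1.96 * se_rand):.2f}"
      f"-{math.exp(mu_rand + 1.96 * se_rand):.2f})")
print(f"Q = {q:.2f} on {k - 1} df, tau^2 = {tau2:.3f}")

# Fragility: adding one hypothetical large null study pulls the fixed
# summary sharply toward 1.0, echoing the reviewer's point.
log_rr2 = log_rr + [0.0]
se2 = se + [0.05]  # a very large study has a small standard error
w2 = [1 / s**2 for s in se2]
mu2 = sum(w * y for w, y in zip(w2, log_rr2)) / sum(w2)
print(f"fixed RR after adding one large null study = {math.exp(mu2):.2f}")
```

With these made-up inputs the fixed summary is about 1.29, yet one added null study drops it to about 1.10, and Q is elevated relative to its degrees of freedom; none of that machinery, however, substitutes for the design-level heterogeneity analysis the reviewer calls for.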

Second Reviewer

  • Had thought the analysis was objective and unbiased and looked reasonable, particularly with regard to inclusion of all the relevant studies. Also had been under the impression that a random effects analysis had been used.
  • Although the summary statistical synthesis was followed by a qualitative review, the qualitative review was really only helpful to the extent it gave clues as to what should be looked at in a statistical analysis of heterogeneity. The qualitative analysis was not systematic because it did not look at the same factors across all the studies.
  • Study attributes should be organized systematically and then compared, contrasted, and analyzed. A meta-analysis should be conducted like a single study, with the individual studies treated as the individual subjects: one analyzes the important likely confounders and effect modifiers and possible biases, and performs subgroup analysis and stratification.


Panel Discussion

  • It should be taught that meta-analysis is a comparison, not a combination, of results of different studies -- that is, detection and description of systematic variation among study results.
  • If the studies are all fairly homogeneous with tight confidence intervals, then you would not even bother with meta-analysis; on the other hand, it should be kept in mind that the consistency could be due to a consistent bias.
  • One panel member brought a copy of a 1997 article by Wingo et al. on abortion and breast cancer that was not originally included in the workshop materials. It was characterized as a systematic narrative review. It considered the body of literature to be too inconsistent to draw conclusions from, and it suggested that future studies focus on certain identified issues. It was commented that this identification of gaps that should be addressed in future studies was particularly useful and should be a function of review articles.
  • It was noted that it is surprising on a controversial issue to see reviewers come to such firm conclusions based on such a weak (synthesized) relative risk, in this case 1.29, when some reviewers will take a weak relative risk as a reassurance that there is no risk.

C. Discussion of Overall Views and Desired Work Product

  • It is important to bear in mind that the majority of the studies were not designed to address the issue.
  • Some of the epidemiologists commented that some aspects of Hill's factors are considered outdated or faulty, such as endpoint specificity and dose-response. The toxicologist on the panel disagreed, contending that dose-response and specificity should be considered very important, if not essential, factors in considering causality.10

  • Done well and adhered to, the Principles could make IARC-type hazard determinations more transparent, rather than simply the result of a majority show of hands by a group of experts.
  • There is a large body of often-overlooked literature, which has been successfully utilized, for rationally integrating expert views, such as those arising from a meta-analysis. Reference was made (without specifics) to a classic example involving the question of whether a drug was causing an outbreak of disease: various experts expressed their views, and when those views were rationally analyzed using such decision principles, it was found that there was much more uncertainty than had previously been realized.

  • As epidemiologists, they would be comfortable in converting the existing hazard Principles into two separate sets: One for evaluation of individual studies; and one for evaluation of a body of studies. Then there could be a third set of principles or questions designed to help integrate the first two sets of information for purposes of making a policy decision.
  • Evaluating a body of studies goes beyond the formal training of many epidemiologists. Their training focuses mainly on conducting and evaluating individual studies.
  • Much of what has been written on establishing a causal relationship or statistically analyzing a body of studies is flawed. Some of the people expounding on this subject are not trained in the appropriate body of literature and methods. Most epidemiologists should undergo further training on how to conduct a proper meta-analysis for the purpose of examining whether there is a causal relationship.
  • There is a body of decision analysis literature that addresses how one maintains objectivity and quantifies uncertainty. There are experts in this type of analysis who are neither epidemiologists nor statisticians. The job of such decision analysts is to draw out the specialists -- the epidemiologists in this case -- without the specialists necessarily knowing what the analysts are up to or what principles and methods they are utilizing. Drawing up a set of principles for utilizing such a process would probably not be an "off the shelf" effort; it would require convening a mixed group of epidemiologists and other experts with this particular type of integrative decision logic expertise.
  • There was some disagreement over whether epidemiologic expertise should be regarded as ending with the analysis of individual studies.


  • The broad goal of the Principles should be to assist in avoiding significant mistakes, not to dictate exactly how to do the analysis.
  • The London meeting, Principles, and report were a good first step; this panel was now willing to try to refine that work; and then a larger group should conduct a further review.
  • The panel (the four epidemiologists remaining at the very end of the meeting) agreed to allocate assignments so that two would divide up work on draft revisions of the six principles and subquestions for evaluating individual studies, and the other two would initially work on the meta-analysis principle (B-6). The non-epidemiologists and Federal Focus staff would draft a third set of principles or questions aimed at decision integration for hazard identification and risk assessment. The drafts would then be circulated for review by the full panel.


10There were further comments exchanged on this subject following the Denver meeting; however, it is not possible to say whether the issues were resolved completely. Some of the epidemiologists clearly attach less significance to some of Hill's factors, such as dose-response and specificity, than this toxicologist. Several comments by the moderator seem in order here: First, some of the epidemiologists seemed to view "Hill's factors" (they dislike the term "criteria" as connoting hard and fast rules) as referring to dose-response as a simple monotonic upward gradient; whereas the toxicologist, and probably most or all toxicologists, are of the view that this factor pertains to some recognizable pattern of dose-response, which might in different cases be, as examples, a "J", "hockey-stick", threshold, or saturation pattern -- but in any event not a random or zig-zag response. Second, some of the epidemiologists seemed to view Hill's factor of specificity as pertaining to a single organ or disease endpoint; whereas the toxicologist is again looking for a recognizable and plausible consistency, which might involve multiple sites with one or more sites predominating -- but again with a pattern rather than a random response. (It seemed to be agreed that exposures to heterogeneous mixtures are a different matter.) There appears to be considerable overlap among "Hill's factors" of dose-response, specificity, consistency, coherence, and biological plausibility, which can lead to difficulties in discussing individual factors. It also appears that there is a need for epidemiologists and toxicologists/pharmacologists to communicate better on these issues, and that the London panel's recommendation to convene multi-disciplinary panels to evaluate the evidence was well-advised. A somewhat detailed discussion of Hill's factors appears in Appendix B of the London report; and there is less specific reference to some of those factors in Appendices D, E, G, and H.