Briefing on Data Access
February 26, 1999
The Proposed OMB Revision and the Federal Agencies
Wendy Baldwin
National Institutes of Health
In the spirit of full disclosure, I must say that I come from
a field in demography that has grappled with issues of data sharing. I am
a big advocate of data sharing for research purposes within the scientific
community. If one looks at the complexity of the National Institutes of Health
(NIH), one can see fields where data sharing has been very well established,
other fields where it is more nascent, and fields where it is very, very difficult.
That complexity poses problems when one confronts language that does not reflect
the nuances in scientific data and other consequences that might be associated
with data sharing. It is very important then that discussions of these issues
take place. The NIH is endeavoring to define these issues so that the community
can identify solutions that will work and those that will not. This way unintended
consequences can be anticipated.
The law refers to the Freedom of Information Act (FOIA), which
has some considerable strengths, but also some gaps. It is important to look
at just what FOIA will do and what it will not do. First of all, though, it
is important to remember that this application of FOIA is very different from
the present application of FOIA. Currently, when someone makes an FOIA request
of me, he is asking me to provide him data documents that are currently in
my possession. Under the new application, I would have to go get the data
from my grantee and then make it available. It is important not to underestimate
the difference between those two strategies.
The strengths of FOIA are that it does protect confidential information. It
protects private information, identifying information, and proprietary data,
and there are some statutory protections for CRADAs (cooperative research
and development agreements) in particular. On the other hand, concerns have
already been raised about the particular aspects the FOIA exemption would
protect in those cases. When some of the difficult cases are posed to the
FOIA offices, they respond that those questions must be settled in the courts.
Bear in mind that there are some simple applications of FOIA that are very,
very good and must not be underestimated. There are other areas, however,
in which the answers are not clear, and situations arise which are dependent
on a legal process to clarify them.
The first area on which the scientific community should weigh in
is what, exactly, the definition of "data" is. At the NIH, there is a tremendous
variability in what constitutes scientific data. There are demographic studies,
many of which have already dealt with issues of data sharing. There are x-ray
crystallography studies, which have also dealt with the issues of data sharing.
And there are studies where the underlying data would be the individual investigator's
laboratory notebooks. It is hard to construe how these would be used in a
data-sharing setting. There are also videotapes of family interactions which
provide the underlying data for research questions about family interaction
and child development. The FOIA protections would probably dictate that those
data could not be released because there is no way to protect the privacy
of the subjects.
But, as someone has already pointed out, that is not a determination
for the investigator to make; it is a determination for the scientific community
to make. So, we would require the data to make that determination. When one
looks at protection in that case, it is important to understand how that protection
would be effected. And it would be very, very cumbersome. We must think about
what the nature of data is and what is actually intended.
Certainly, this community is very concerned about when data
would have to be released. Some studies are very straightforward. Do the study,
publish the results, and release the data. Other studies involve a sequential
release of data. Part of the data is released; more data is analyzed; there
is another data release; more data may be collected; more data is released.
Would this require a sequential data release to go with each publication?
Does the publication of any of the data from an underlying data set imply
the release of all of the data? It is important to raise these questions because
different constituencies and different research communities will have different
concerns and needs.
Longitudinal data present very special cases. If data are released
early in a longitudinal study, the possibility is raised that people can violate
the confidentiality of that data set. I think the FOIA protections would tell
us that we can take away identifying information - names, social security
numbers, telephone numbers, ID numbers - but, in fact, in a complex data set,
or in a data set that refers to a very small geographic area, it may not be
possible to truly protect the confidentiality of those individuals. So, I
think that in this case the FOIA takes us part of the way there, but it is
not clear to me that it takes us all of the way there. The research community
must weigh in on how different types of data might be affected by this.
We have already commented on the protection for an individual's
privacy under FOIA, but the FOIA protection for privacy is for individuals,
not for what I will call entities. In a research project that is doing a study
of six clinics, the identity of those clinics is not protected under FOIA.
So, we have to remember that, although we may have a very general view of
privacy protection, FOIA has a very narrow view of it. It is an individual's
privacy. In addition, privacy is only extended to living individuals, which,
in some cases, one might argue is not sufficient.
The problem is that making the determination to eliminate certain
data because they would identify individuals is not something an investigator
would simply do. That is a decision we make; the FOIA office would receive
the full document and the redacted document and then make the determination
as to whether the redaction was appropriate, bearing in mind how FOIA operates
now for records we already hold. We are then obliged to keep those records
for six years. I do not know how we would handle the administrative burden
of doing this if we had any volume of requests. As I am sure most of you are
aware, accessing data without an understanding of the accompanying documentation
is not terribly useful. So, there are many areas that must be clarified, and
one is certainly what the documentation would be.
There has been some discussion about what would happen if data
were totally in the private sector, but the way the legislation is written,
it refers to data that is supported by federal funds, regardless of the level
of federal funding. We, of course, have many studies where the federal funding
is only a portion of the funding. Other funding may come from the private
sector, from pharmaceutical manufacturers, or from foundations. It may come
from a medical research council in another country. It may come from other
entities such as State governments or managed care organizations. Some of
those organizations, in fact, have incredibly rich data that can be very valuable
for the research community, but they are not willing to make it available
to just anyone. They are willing to make it available under certain conditions:
terms of reference as to what it will be used for, whether it is used for
a peer-reviewed grant, whether the investigator will make an attempt to identify
individuals, whether it will be used only for research purposes, not marketing
purposes or publicity. FOIA does not allow one to put conditions on what the
use of the data will be. In managed care, for example, or even in State agencies
where the State medical care data would be very valuable, they are very selective
about the circumstances under which they would make data available, and very
wary of a process that would make data available to anyone regardless of his
or her purpose.
The language so far is not clear about how long this access
would remain in effect. Certainly, in areas that have established data sharing
policies, these policies are in effect in perpetuity. Circular A-110 requires
that data be maintained for three years following the termination of the grant.
We must have clarification of the intent before determining how long one could
maintain a reach-through to data.
Now, I must address the issues about the cost of compliance.
When I think about how long the access would be, it becomes very complicated,
because the grantee is the institution, not the individual. If I have a grant,
two years after its completion I presumably still have the right to access
the data, and I must access them if I receive an FOIA request currently. But
if an investigator were no longer at an institution, what would be the obligation
of the grantee institution to maintain the data, maintain the documentation,
and be able to fulfill an agency request? I am not presenting unsolvable problems.
I am presenting issues that the scientific community, the agencies, Congressional
committees and the OMB need to grapple with so that we can understand what
is on the table, what we are agreeing to and what we will now be doing.
Cost of compliance - there is very supportive language in the
bill, in the OMB Circular, and in the Federal Register notice that acknowledges
that this could be a costly process, and that, in fact, if it incurs a cost
to the agency or the grantee, this is chargeable to the requestor. There is
a problem with the current FOIA in that I can charge, for example, if someone
makes a request and it costs me $1,000 for photocopying to fulfill the request,
but I will not receive the money. The money goes to the Treasury. One thousand
dollars worth of photocopying may not be an issue; however, the administrative
burdens of this application of FOIA are qualitatively different from the administrative
burdens of the current FOIA, which are already significant. The question is
not only whether I can charge, but how I can actually implement a billing
strategy so that both the agency and the grantee, each bearing a considerable
burden, can be compensated. Again, is
this insolvable? Probably not, but it must be solved before we have a very
difficult case involving very large expenses and we are unable to compensate
people.
I believe there are appropriate ways to share data. There are
many different fields in which the scientific community has come together
and said, "Yes, it is appropriate to share data. This is how the integrity
of the data is ensured. Here is how the documentation of the data is ensured,
and here is how it is made available." Currently, that exists to some degree
through data archives. These are very valuable. However, archives may place
conditions on data sets they receive which I call "value-added." Consider
data funded by a private entity willing to share its data, to put it in a
public archive, and to make it available for a modest cost. This is very good.
They may also require that the person who accesses it guarantee it will only
be used for research purposes. Currently, I think there is a potential risk
that FOIA does not allow conditions on what that use might be. I would hate
to see anything have a perverse effect on the very valuable data archives
that I think are one of the most constructive ways we have of sharing data.
The OMB notice, the NPRM (notice of proposed rulemaking), focuses
attention on published data and on data used for federal regulations and policies.
I think many in the scientific community feel that this is a constructive
step toward shaping how this might work. But, it is not enough. For example,
I do not know what is covered by the word "published." "Published in a peer-reviewed
journal" I understand. I understand that the data has been vetted. I understand
what rules have been applied before those data can be used for publication.
But "published" by itself I do not understand. Is a PowerPoint demonstration
a publication? Is a poster of very preliminary findings a publication? I do
not know the answer, and I think it needs to be clarified.
Let me give you a quintessential NIH example: Imagine a clinical
trial of two drugs. There is a data safety monitoring board that is going
to review interim data, and they discover that Drug A is so spectacularly
successful that it becomes unethical to continue the trial. The trial is stopped,
and a clinical alert is issued to physicians across the country saying, "If
you have patients who present with these circumstances, Drug A is the drug
of choice." This becomes a standard of care. Is that a publication? The investigators
have not even analyzed their data fully yet. Only the data safety monitoring
board has seen it. The NPRM right now refers to published data as well as use
in policy and regulation, but these are terms that need to be refined, or
there will be an unintended consequence, which must be addressed right now.
Do we mean by "policy" or "regulation" those that are published in the Federal
Register through a normal policy-making procedure? Or do we mean statements
that come out from federal agencies recommending a certain course of action?
I do not know. I would suggest that during this sixty-day period we definitely
need to understand the answers to those questions. I have focused on the scientific
aspects of data sharing, but I suspect that we need to focus on the process
of rulemaking, which is really not an NIH activity. However, since it might
be our data that are implicated in rules, we are focusing on the scientific
aspects of data sharing. These are some of the issues about which we in the
NIH community are most concerned and are hoping to clarify during this comment
period. Thank you.
Q [April Burke, Association of Independent Research Institutes]:
Ms. Casey characterized agencies as having the right to access data. That
is actually not correct. What A-110 says is that an agency can obtain data
for a federal purpose, not for a private purpose.
A [Kathy Casey]: I do not think I made that distinction.
I simply said that the federal agencies have the ability to obtain the data.
Q [April Burke]: When you were asked the question earlier
about whether it would be appropriate for private information that was not
funded by the federal government to be subject to FOIA, you seemed to think
that it would be. I think there is a point here being made about public versus
private. Dr. Baldwin, what would have been a federal purpose under A-110 for
an agency to obtain data and not to use the waiver that is currently in A-110
which would be eliminated in the NPRM?
A [Wendy Baldwin]: We do not generally obtain data.
We have that right. I can imagine circumstances under which we would obtain
data for a federal purpose; for example, if we were involved in a fraud investigation.
However, we make a very clear distinction between data that are collected
internally by us, directly, in our intramural program or by contract as an
extension, and those collected from grants. Grants are an assistance mechanism,
and there really is quite a distinction there. While we put expectations out
there for our grantees, we would not reach through to get those data for our
purposes.
A [Kathy Casey]: I did not mean to suggest that it was
somehow currently available for a private purpose. What I did say earlier
about data that might not be federally funded is that if data were privately
funded and used as the basis for a federal rule or policy, then I think our
expectation is that there should be access to that data because they are being
used as the basis for a federal rule. I hope that clarifies my point.
Q [April Burke]: I do not see how the federal government
could reach through that activity and do that.
A [Kathy Casey]: The tangible circumstance that I would
cite is a situation in which there were studies used to underlie particular
EPA (Environmental Protection Agency) regulations. The viewpoint was that
if this data were used as the basis for federal policy, then there should
be public access to that information.
Q [April Burke]: Would you feel, for example, that if
the Federal Reserve Board were going to make a federal policy with respect
to financial issues and they wanted to rely on stock exchange information,
those companies that had information about their ownership or their own financial
data should make it available to the federal government and the public through
FOIA?
A [Kathy Casey]: I do not want to comment on Federal
Reserve policy, necessarily, but if the information is being used for a federal
rule or a federal policy that affects millions of people, there should be
reasonable access to the underlying data that are used to support it.
A [Jean Fruci]: I want to clarify something for a moment.
Let's take the example you have raised in the EPA regulation. The data that
underlie the EPA regulation were a whole series of studies. But what they
relied on for crafting a policy was the peer-reviewed published paper. That
is what they had in hand, and that is what they used as the basis of their
rulemaking. That, in fact, is publicly available because it is published in
a series of journals that anyone can obtain. But EPA did not go through all
of the underlying data that went into producing the peer-reviewed published
paper. They used the summary data that were provided in the peer-reviewed
paper. You raised a second issue that I think is also very important. That
is, if there are a couple of studies that are extremely critical to a rulemaking
or a federal policymaking, does the public have a right to a higher level
of scrutiny over particular studies, something beyond the peer-review process,
to ensure that the data are really high quality? Again, I think that is a
very legitimate point, and we would agree. But we question whether FOIA is
the way that one would go about doing that.
A [Kathy Casey]: I agree with you that in this case we
were trying to get to the underlying data, rather than just the peer-reviewed
study that was provided. And I think the disagreement here is the mechanism
that is being used to provide this information. Again, I would like to make
clear that I think the idea behind getting access to the data is to allow
people to duplicate it, verify it, and validate it.
Q [Bill Gardner, University of Pittsburgh, member of the
National Conference of Lawyers and Scientists, which is a joint AAAS/American
Bar Association committee]: Ms. Casey, how do you feel about the following
danger: Not only might a vested interest take exception to work that someone
is doing and try to harass it, but also there are ideologically motivated
groups that resist work in certain fields of science. What would stop anyone
from doing this? Dr. Baldwin hinted that there are areas of research where
data sharing is abysmally bad. It is not just nascent. When people try to
obtain data in some fields of clinical research, it simply is not being made
available. If FOIA is not the right way to do this, exactly why is FOIA not
the right way to do this? And if it is not FOIA, what is the way to go about
this?
A [Kathy Casey]: You raise some very legitimate concerns,
which should and can still be addressed. Our view was that FOIA was the only
mechanism that we knew of at that time that would necessarily be able to make
data publicly accessible while also having sufficient exceptions to address
the concerns that we had. It was, at least, a very good starting point from
our perspective.
A [Jean Fruci]: You raise some very good points. Having
formerly been a researcher who worked on synthesizing different databases
from different existing studies, I did not encounter many problems with people
sharing their data with me, but I have heard of instances where they do occur.
Again, when one uses something like FOIA, one is taking a one-size-fits-all
approach. I think what may work very, very well for a data sharing mechanism
and maintenance of a database that could be accessible to everyone for physics
might be a terrible approach to use in the medical research community. One
must look at what are the contexts of the types of data that we have available,
and what are some good, cost-effective mechanisms to make the most data possible
available to the widest audience. I do not know that I have for you a simple
answer, or that I am able to say, for example, "Well, it is not FOIA, but
it is this." I do not know that there is one "this," because of the many different
ways in which we study science, and the many different ways in which it would
be most useful to make data available for sharing. Let us look at what is
already out there that works well. Let us look at specific areas where things
are working very badly, as you suggest. Let us get some creative solutions,
because I agree that if we do not make data available - and in some cases
it is definitely needed, whatever goes beyond the peer-reviewed published
paper - for sharing among the community, we do not make progress. It is the
data that we pass on to the next group of scientists that allow us to make
progress in these fields. If we do not do that efficiently and effectively,
we are not going to make progress.
A [Wendy Baldwin]: If I could reinforce that, I think
different scientific fields must evaluate the risks, problems, and issues
specific to their fields. Those different constituencies must come up with
strategies as to what actually makes sense, is workable, and can be built
in from the beginning. There are many, many issues there, and I encourage
the scientific community to step up to that challenge, but it is not simple.