Office of Shared Solutions and Performance Improvement (OSSPI); Chief Data Officers Council (CDO); Request for Information-Synthetic Data Generation, 783-786 [2024-00036]
[Federal Register Volume 89, Number 4 (Friday, January 5, 2024)]
[Notices]
[Pages 783-786]
From the Federal Register Online via the Government Publishing Office [www.gpo.gov]
[FR Doc No: 2024-00036]
-----------------------------------------------------------------------
GENERAL SERVICES ADMINISTRATION
[Notice-MY-2023-03; Docket No. 2023-0002; Sequence No. 37]
Office of Shared Solutions and Performance Improvement (OSSPI);
Chief Data Officers Council (CDO); Request for Information--Synthetic
Data Generation
AGENCY: Federal Chief Data Officers (CDO) Council; General Services
Administration (GSA).
ACTION: Notice.
-----------------------------------------------------------------------
SUMMARY: The Federal CDO Council was established by the Foundations for
Evidence-Based Policymaking Act. The Council's vision is to improve
government mission achievement and increase benefits to the nation
through improving the management, use, protection, dissemination, and
generation of data in government decision-making and operations. The
CDO Council is publishing this Request for Information (RFI) for the
public to provide input on key questions concerning synthetic data
generation. Responses to this RFI will inform the CDO Council's work to
establish best practices for synthetic data generation.
DATES: We will consider comments received by February 5, 2024.
Targeted Audience
This RFI is intended for Chief Data Officers, data scientists,
technologists, data stewards, and subject matter experts in data and
evidence building from the public, private, and academic sectors.
ADDRESSES: Respondents should submit comments identified by Notice-MY-
2023-03 via the Federal eRulemaking Portal at https://www.regulations.gov and follow the instructions for submitting
comments. All public comments received are subject to the Freedom of
Information Act and will be posted in their entirety at
regulations.gov, including any personal and/or business confidential
information provided. Do not include any information you would not like
to be made publicly available.
Written responses should not exceed six pages, inclusive of a one-
page cover page as described below. Please respond concisely, in plain
language, and specify which question(s) you are responding to. You may
also include links to online materials or interactive presentations,
but please ensure all links are publicly available. Each response
should include:
• The name of the individual(s) and/or organization responding.
• A brief description of the responding individual(s) or
organization's mission and/or areas of expertise.
• The section(s) that your submission and materials are related to.
• A contact for questions or other follow-up on your response.
By responding to the RFI, each participant (individual, team, or
legal entity) warrants that they are the sole author or owner of, or
have the right to use, any copyrightable works that the submission
comprises; that the works are wholly original (or are improved versions
of existing works that the participant has sufficient rights to use and
improve); and that the submission does not infringe any copyright or
any other rights of any third party of which the participant is aware.
By responding to the RFI, each participant (individual, team, or
legal entity) consents to the contents of their submission being made
available to all Federal agencies and their employees on an internal-
to-government website accessible only to agency staff persons.
Participants will not be required to transfer their intellectual
property rights to the CDO Council, but participants must grant to the
Federal Government a nonexclusive license to apply, share, and use the
materials that are included in the submission. To participate in the
RFI, each participant must warrant that there are no legal obstacles to
providing the above-referenced nonexclusive licenses of participant
rights to the Federal Government. Interested parties who respond to
this RFI may be contacted for follow-on questions or discussion.
FOR FURTHER INFORMATION CONTACT: Questions or issues regarding
submission may be sent to Ken Ambrose and Ashley Jackson, Senior
Advisors, Office of Shared Solutions and Performance Improvement,
General Services Administration, at 202-215-7330 (Kenneth Ambrose) and
202-538-2897 (Ashley Jackson), or cdocstaff@gsa.gov.
SUPPLEMENTARY INFORMATION:
Background
Pursuant to the Foundations for Evidence-Based Policymaking Act of
2018,\1\ the CDO Council is charged with establishing best practices
for the use, protection, dissemination, and generation of data in the
Federal Government. In reviewing existing activities and literature
from across the Federal Government, the CDO Council has determined
that:
---------------------------------------------------------------------------
\1\ H.R. 4174--115th Congress (2017-2018): Foundations for
Evidence-Based Policymaking Act of 2018 | Congress.gov | Library of
Congress: https://www.congress.gov/bill/115th-congress/house-bill/4174/text.
---------------------------------------------------------------------------
• the Federal Government would benefit from developing consensus on
a more formalized definition of synthetic data generation,
• synthetic data generation has wide-ranging applications, and
• there are challenges and limitations with synthetic data
generation.
The CDO Council is interested in consolidating feedback and input
from qualified experts to gain additional insight and to assist with
establishing a best practice guide for synthetic data generation.
The CDO Council has preliminarily drafted a working definition of
synthetic data generation and several key questions to better inform
its work.
Information and Key Questions
Section 1: Defining Synthetic Data Generation
Synthetic data generation is an important part of modern data
science work. In the broadest sense, synthetic data generation involves
the creation of a new synthetic or artificial dataset using
computational methods. Synthetic data generation can be contrasted with
real-world data collection. Real-world data collection involves
gathering data from a first-hand source, such as through surveys,
observations, interviews, forms, and other methods. Synthetic data
generation is a broad field that employs varied techniques and can be
applied to many different kinds of problems. Data may be fully or
partially synthetic. A fully synthetic dataset wholly consists of
points created using computational methods, whereas a partially
synthetic dataset may involve a mix of first-hand and computationally
generated synthetic data.
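To make the fully versus partially synthetic distinction concrete, the
following minimal Python sketch is offered as an illustration only; the
column names, sample sizes, and distributions are assumptions and are
not part of the RFI. It builds a notional first-hand table, a fully
synthetic counterpart generated from fitted statistics, and a partially
synthetic variant in which only one sensitive column is replaced.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=0)

    # Notional "real-world" data collected first-hand (assumed columns).
    collected = pd.DataFrame({
        "age": rng.integers(18, 90, size=100),
        "income": rng.lognormal(mean=10.5, sigma=0.5, size=100).round(2),
    })

    # Fully synthetic: every value is created computationally from
    # statistics estimated on the collected data.
    fully_synthetic = pd.DataFrame({
        "age": rng.normal(collected["age"].mean(), collected["age"].std(),
                          size=100).clip(18, 90).round().astype(int),
        "income": rng.lognormal(np.log(collected["income"]).mean(),
                                np.log(collected["income"]).std(),
                                size=100).round(2),
    })

    # Partially synthetic: keep the first-hand "age" column and replace
    # only the sensitive "income" column with generated values.
    partially_synthetic = collected.copy()
    partially_synthetic["income"] = fully_synthetic["income"].values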
Throughout this RFI, we use the following definitions:
• data--recorded information, regardless of form or the media on
which the data is recorded; \2\
---------------------------------------------------------------------------
\2\ 44 U.S.C. 3502(16).
---------------------------------------------------------------------------
• data asset--a collection of data elements or data sets that may
be grouped together; \3\
---------------------------------------------------------------------------
\3\ 44 U.S.C. 3502(17).
---------------------------------------------------------------------------
• open government data asset--a public data asset that is (A)
machine-readable; (B) available (or could be made available) in an open
format; (C) not encumbered by restrictions, other than intellectual
property rights, including under titles 17 and 35, that would impede
the use or reuse of such asset; and (D) based on an underlying open
standard that is maintained by a standards organization.\4\
---------------------------------------------------------------------------
\4\ 44 U.S.C. 3502(20).
---------------------------------------------------------------------------
The National Institute of Standards and Technology (NIST) defines
synthetic data generation as ``a process in which seed data is used to
create artificial data that has some of the statistical characteristics
as the seed data''.\5\
---------------------------------------------------------------------------
\5\ https://csrc.nist.gov/glossary/term/synthetic_data_generation.
---------------------------------------------------------------------------
The CDO Council believes that this definition of synthetic data
generation includes techniques such as using statistics to create data
from a known distribution, generative adversarial networks (GANs),\6\
variational autoencoding (VAE),\7\ building test data for use in
software development,\8\ privacy-preserving synthetic data generation
\9\ and others.
---------------------------------------------------------------------------
\6\ 15 U.S.C. 9204.
\7\ A useful definition of this technique is available in the
abstract of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8774760/.
\8\ This technique is described in the Department of Defense
DevSecOps Fundamentals Guidebook https://dodcio.defense.gov/Portals/0/Documents/Library/DevSecOpsTools-ActivitiesGuidebook.pdf, page 23.
\9\ NIST Special Publication 800-188, Section 4.4 https://doi.org/10.6028/NIST.SP.800-188.
---------------------------------------------------------------------------
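As a minimal sketch of the NIST-style process described above (offered
for illustration; the seed data, its dimensions, and the choice of a
multivariate normal model are assumptions), statistical characteristics
are estimated from seed data and then used to create artificial
records. Production approaches such as GANs or VAEs use richer models
but follow the same seed-data-to-artificial-data flow.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Assumed seed data: 500 records with two correlated numeric attributes.
    seed = rng.multivariate_normal(mean=[50.0, 100.0],
                                   cov=[[25.0, 12.0], [12.0, 36.0]],
                                   size=500)

    # Estimate statistical characteristics of the seed data.
    mu = seed.mean(axis=0)
    sigma = np.cov(seed, rowvar=False)

    # Create artificial data that shares those characteristics.
    synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=500)

    print("seed means:     ", np.round(mu, 2))
    print("synthetic means:", np.round(synthetic.mean(axis=0), 2))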
The CDO Council also believes that it is important to draw
contrasts between synthetic data generation and other activities. For
example, synthetic data generation does not include collection of data
without any inference. Synthetic data generation does not include
signal processing, such as automated differential translations of
global positioning satellite data. Synthetic data generation also does
not include enriching data during data analysis, that is, intermediate
steps that augment or enhance existing data but do not create
artificial data.
Other analysis techniques, such as distribution fitting and
parametric modeling, are closely related to synthetic data generation.
The CDO Council believes the key difference, however, is the purpose of
the computational methods. Synthetic data generation seeks to create
wholly new data points based on the statistical properties of a
dataset, whereas distribution fitting seeks to `fill in' a dataset
based on a known distribution. Notably, the fitted distribution can be
used to generate points that are not part of the original dataset--
which is an application of synthetic data generation.
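The purpose-based contrast drawn above can be shown with a small sketch
(an illustration under assumed data, not a prescribed method): the same
fitted normal distribution is used first to `fill in' missing entries
of an existing dataset, and then to create wholly new points, which is
an application of synthetic data generation.

    import numpy as np

    rng = np.random.default_rng(seed=2)

    # Assumed observed data with some missing values (NaN).
    observed = rng.normal(loc=10.0, scale=2.0, size=200)
    observed[rng.choice(200, size=20, replace=False)] = np.nan

    # Fit a distribution to the non-missing values.
    mu, sd = np.nanmean(observed), np.nanstd(observed)

    # Distribution fitting used to 'fill in' the existing dataset.
    filled = observed.copy()
    missing = np.isnan(filled)
    filled[missing] = rng.normal(mu, sd, size=missing.sum())

    # The same fitted distribution used to create wholly new points.
    new_points = rng.normal(mu, sd, size=500)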
Questions
• Are there any limitations to relying on the NIST definition to
describe the field of synthetic data generation? How should it be
improved?
• How well does the CDO Council's list of examples and contrasts
improve understanding? How should these be improved?
Section 2: Applying Synthetic Data Generation
Synthetic data generation can enable the creation of larger and
more diverse datasets, enhance model performance, and protect
individual privacy. The CDO Council's review of potential applications
of synthetic data generation found examples in:
• Data augmentation.\10\ This application involves creating new
data points or datasets from existing data. This application can be
particularly useful in developing training datasets for machine
learning and advanced analytics (a minimal illustrative sketch follows
this list).
---------------------------------------------------------------------------
\10\ This application is briefly described at https://frederick.cancer.gov/initiatives/scientific-standards-hub/ai-and-data-science, Section 4.
---------------------------------------------------------------------------
• Data synthesis.\11\ This application involves using an existing
dataset to create a new dataset that shares similar statistical
properties with the original dataset, in order to protect individual
privacy. Generating such datasets has wide-ranging applications
including, but not limited to, facilitating reproducible investigation
of clinical data while preserving individual privacy.
---------------------------------------------------------------------------
\11\ A definition of this technique is available in the abstract
of this paper https://par.nsf.gov/servlets/purl/10187206.
---------------------------------------------------------------------------
• Modeling and simulation.\12\ This application involves setting
assumptions, parameters, and rules to develop data for further
analysis. The synthetic dataset can be used for developing insights,
testing hypotheses, and/or understanding a model's behavior. This
application supports the conduct of controlled experiments, predicting
potential future outcomes from current conditions, generating scenarios
for rare or extreme events, and validating or calibrating a model.
---------------------------------------------------------------------------
\12\ A definition of a computer simulation is proposed at https://builtin.com/hardware/computer-simulation.
---------------------------------------------------------------------------
• Software development.\13\ This application involves using
existing database schemas to simulate real-world scenarios and ensure
that a software application can handle different types of data and
errors effectively. This application assists in the creation of
representative data, makes it easier to generate edge cases, protects
individual privacy, and improves testing efficiency.
---------------------------------------------------------------------------
\13\ DoD DevSecOps Fundamentals, ibid.
---------------------------------------------------------------------------
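The following minimal sketch illustrates the data augmentation
application noted in the list above (the feature matrix, labels, and
noise level are assumptions chosen for illustration): existing training
examples are perturbed with small random noise to enlarge a machine
learning training set.

    import numpy as np

    rng = np.random.default_rng(seed=3)

    # Assumed existing training data: 100 examples with 4 numeric features.
    X = rng.normal(size=(100, 4))
    y = rng.integers(0, 2, size=100)

    def augment(X, y, copies=3, noise_scale=0.05):
        """Create additional training examples by jittering existing ones."""
        X_aug, y_aug = [X], [y]
        for _ in range(copies):
            X_aug.append(X + rng.normal(scale=noise_scale, size=X.shape))
            y_aug.append(y)  # labels are reused; only features are perturbed
        return np.vstack(X_aug), np.concatenate(y_aug)

    X_train, y_train = augment(X, y)
    print(X_train.shape, y_train.shape)  # (400, 4) (400,)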
Notably, the CDO Council believes that not all applications of
modeling and simulation would meet the definition of synthetic data
generation. For example, weather forecasting applies numerical models
and a complex mix of data analysis, meteorological science, and
computational methods, but does not involve the creation of synthetic
or artificial data points. Instead, the purpose of these models is to
predict future conditions.
Questions
• How are these examples representative of synthetic data
generation? How should they be revised?
• What other examples of synthetic data generation should the CDO
Council know about?
• What are the key advantages of using synthetic data generation?
Section 3: Challenges and Limitations in Synthetic Data Generation
The CDO Council recognizes that synthetic data generation can be a
valuable technique. However, the technique has challenges and
limitations. For example, it can be difficult to generate data that
realistically simulates the real world and captures the diversity of
real data. Additionally, evaluating the quality of a synthetic dataset
can be extremely challenging.
Synthetic data generation is also subject to challenges commonly
facing statistical methods, such as overfitting and imbalances in the
source data. These challenges reduce the utility of the generated
synthetic data because the data may not be properly representative; for
example, it may fail to represent rare classes.
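As a simple, assumed diagnostic for the rare-class concern (the labels
and generator below are notional), class frequencies in the source and
synthetic datasets can be compared before the synthetic data is relied
upon.

    import numpy as np

    rng = np.random.default_rng(seed=4)

    # Assumed source labels containing a rare class "1" (about 2 percent),
    # and assumed labels from a synthetic generator under evaluation.
    source_labels = rng.choice([0, 1], size=5000, p=[0.98, 0.02])
    synthetic_labels = rng.choice([0, 1], size=5000, p=[0.995, 0.005])

    def class_rates(labels):
        values, counts = np.unique(labels, return_counts=True)
        return dict(zip(values.tolist(),
                        (counts / counts.sum()).round(4).tolist()))

    print("source:   ", class_rates(source_labels))
    print("synthetic:", class_rates(synthetic_labels))
    # A large gap in the rare-class rate signals that the synthetic data
    # under-represents that class and may not be fit for downstream use.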
Questions
• What other challenges and limitations are important to consider
in synthetic data generation?
• What tools or techniques are available for effectively
communicating the limitations of generated synthetic data?
• What are best practices for CDOs to coordinate with statistical
officials on synthetic data?
• What approaches can CDOs consider to help address these
challenges?
Section 4: Ethics and Equity Considerations in Synthetic Data
Generation
Synthetic data generation techniques hold great promise, but also
introduce questions of ethics and equity. Consistent with Federal
privacy practices,\14\ any data generation technique involving
individuals must respect their privacy rights and obtain informed
consent before using real-world data to generate synthetic data. As
noted in Section 3, synthetic data generation is also subject to
challenges commonly facing statistical methods and has the potential to
introduce and encode errors or bias, potentially leading to
discriminatory outcomes.
---------------------------------------------------------------------------
\14\ OMB Circular A-130, Appendix II https://www.whitehouse.gov/wp-content/uploads/legacy_drupal_files/omb/circulars/A130/a130revised.pdf.
---------------------------------------------------------------------------
Uses of generated synthetic data must also be carefully considered.
The context and quality of the generated synthetic data will determine
its practical utility and impact. Assessing and understanding the
fitness of a generated synthetic dataset is essential. For instance, a
generated synthetic dataset may not sufficiently represent the
diversity of the source dataset. In addition, a generated synthetic
dataset may not contain sufficient variables to fully represent the
system and the drivers of differences in the phenomenon it is meant to
represent.
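One basic fitness check, sketched below under assumed source and
synthetic data (the column names, distributions, and choice of test are
illustrative assumptions, not a Council recommendation), compares each
numeric column's marginal distribution in the source and synthetic
datasets with a two-sample Kolmogorov-Smirnov test; columns with large
divergence warrant closer review before the synthetic data is used.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=5)

    # Assumed source and synthetic datasets with two numeric columns each.
    source = {"age": rng.normal(45, 12, size=2000),
              "income": rng.lognormal(10.5, 0.6, size=2000)}
    synthetic = {"age": rng.normal(45, 12, size=2000),
                 "income": rng.lognormal(10.2, 0.4, size=2000)}

    # Per-column two-sample Kolmogorov-Smirnov test: a large statistic
    # (small p-value) flags a column whose synthetic marginal diverges
    # from the source.
    for column in source:
        result = stats.ks_2samp(source[column], synthetic[column])
        print(f"{column}: KS statistic={result.statistic:.3f}, "
              f"p-value={result.pvalue:.3g}")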
Questions
• What techniques are available to facilitate transparency around
generated synthetic data?
• What are best practices for CDOs to coordinate with privacy
officials on ethics and equity matters related to synthetic data
generation?
• How can we apply the Federal Data Ethics Framework \15\ to
address these ethics and equity concerns?
---------------------------------------------------------------------------
\15\ https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf.
---------------------------------------------------------------------------
Section 5: Synthetic Data Generation and Evidence-Building
Synthetic data generation can enable the production of evidence for
use in policymaking. Applications such as simulation or modeling can
help policymakers explore scenarios and their potential impacts.
Likewise, policymakers can conduct controlled experiments of potential
policy interventions to better understand their impacts. Data synthesis
may help policymakers make more data publicly available to spur
research and other foundational fact-finding activities that can inform
policymaking.
Questions
• What other applications of synthetic data generation support
evidence-based policymaking? \16\
---------------------------------------------------------------------------
\16\ OMB Memorandum M-19-23.
---------------------------------------------------------------------------
• What is the relationship between synthetic data generation and
open government data? \17\
---------------------------------------------------------------------------
\17\ 44 U.S.C. 3502(20).
---------------------------------------------------------------------------
• How can CDOs and Evaluation Officers best collaborate on
synthetic data generation to support evidence-building? \18\ What about
other evidence officials? \19\
---------------------------------------------------------------------------
\18\ OMB Memorandum M-19-23, Appendix A.
\19\ OMB Memorandum M-19-23, Section II (Key Senior Officials).
Kenneth Ambrose,
Senior Advisor, CDO Council, Office of Shared Solutions and Performance
Improvement, General Services Administration.
[FR Doc. 2024-00036 Filed 1-4-24; 8:45 am]
BILLING CODE 6820-69-P