

[Federal Register Volume 89, Number 4 (Friday, January 5, 2024)]
[Notices]
[Pages 783-786]
From the Federal Register Online via the Government Publishing Office [www.gpo.gov]
[FR Doc No: 2024-00036]


-----------------------------------------------------------------------

GENERAL SERVICES ADMINISTRATION

[Notice-MY-2023-03; Docket No. 2023-0002; Sequence No. 37]


Office of Shared Solutions and Performance Improvement (OSSPI); 
Chief Data Officers Council (CDO); Request for Information--Synthetic 
Data Generation

AGENCY: Federal Chief Data Officers (CDO) Council; General Services 
Administration (GSA).

ACTION: Notice.

-----------------------------------------------------------------------

SUMMARY: The Federal CDO Council was established by the Foundations for 
Evidence-Based Policymaking Act. The Council's vision is to improve 
government mission achievement and increase benefits to the nation by 
improving the management, use, protection, dissemination, and 
generation of data in government decision-making and operations. The 
CDO Council is publishing this Request for Information (RFI) for the 
public to provide input on key questions concerning synthetic data 
generation. Responses to this RFI will inform the CDO Council's work to 
establish best practices for synthetic data generation.

DATES: We will consider comments received by February 5, 2024.

[[Page 784]]

Targeted Audience

    This RFI is intended for Chief Data Officers, data scientists, 
technologists, data stewards, and data- and evidence-building subject 
matter experts from the public, private, and academic sectors.

ADDRESSES: Respondents should submit comments identified by Notice-MY-
2023-03 via the Federal eRulemaking Portal at https://www.regulations.gov and follow the instructions for submitting 
comments. All public comments received are subject to the Freedom of 
Information Act and will be posted in their entirety at 
regulations.gov, including any personal and/or business confidential 
information provided. Do not include any information you would not like 
to be made publicly available.
    Written responses should not exceed six pages, inclusive of a one-
page cover page as described below. Please respond concisely, in plain 
language, and specify which question(s) you are responding to. You may 
also include links to online materials or interactive presentations, 
but please ensure all links are publicly available. Each response 
should include:
    • The name of the individual(s) and/or organization responding.
    • A brief description of the responding individual(s) or 
organization's mission and/or areas of expertise.
    • The section(s) that your submission and materials are related to.
    • A contact for questions or other follow-up on your response.
    By responding to the RFI, each participant (individual, team, or 
legal entity) warrants that they are the sole author or owner of, or 
have the right to use, any copyrightable works that the submission 
comprises; that the works are wholly original (or are improved versions 
of existing works that the participant has sufficient rights to use and 
improve); and that the submission does not infringe any copyright or 
any other rights of any third party of which the participant is aware.
    By responding to the RFI, each participant (individual, team, or 
legal entity) consents to the contents of their submission being made 
available to all Federal agencies and their employees on an internal-
to-government website accessible only to agency staff persons.
    Participants will not be required to transfer their intellectual 
property rights to the CDO Council, but participants must grant to the 
Federal Government a nonexclusive license to apply, share, and use the 
materials that are included in the submission. To participate in the 
RFI, each participant must warrant that there are no legal obstacles to 
providing the above-referenced nonexclusive licenses of participant 
rights to the Federal Government. Interested parties who respond to 
this RFI may be contacted for follow-on questions or discussion.

FOR FURTHER INFORMATION CONTACT: Questions regarding submission or 
other issues may be directed to Ken Ambrose and Ashley Jackson, Senior 
Advisors, Office of Shared Solutions and Performance Improvement, 
General Services Administration, at 202-215-7330 (Kenneth Ambrose) and 
202-538-2897 (Ashley Jackson), or [email protected].

SUPPLEMENTARY INFORMATION:

Background

    Pursuant to the Foundations for Evidence-Based Policymaking Act of 
2018,\1\ the CDO Council is charged with establishing best practices 
for the use, protection, dissemination, and generation of data in the 
Federal Government. In reviewing existing activities and literature 
from across the Federal Government, the CDO Council has determined 
that:
---------------------------------------------------------------------------

    \1\ H.R. 4174--115th Congress (2017-2018): Foundations for 
Evidence-Based Policymaking Act of 2018 | Congress.gov | Library of 
Congress https://www.congress.gov/bill/115th-congress/house-bill/4174/text.
---------------------------------------------------------------------------

     • the Federal Government would benefit from developing 
consensus on a more formal definition of synthetic data generation,
     • synthetic data generation has wide-ranging applications, 
and
     • there are challenges and limitations with synthetic data 
generation.
    The CDO Council is interested in consolidating feedback from 
qualified experts to gain additional insight and to assist with 
establishing a best practice guide for synthetic data generation. 
The CDO Council has preliminarily drafted a working definition of 
synthetic data generation and several key questions to better inform 
its work.

Information and Key Questions

Section 1: Defining Synthetic Data Generation

    Synthetic data generation is an important part of modern data 
science work. In the broadest sense, synthetic data generation involves 
the creation of a new synthetic or artificial dataset using 
computational methods. Synthetic data generation can be contrasted with 
real-world data collection. Real-world data collection involves 
gathering data from a first-hand source, such as through surveys, 
observations, interviews, forms, and other methods. Synthetic data 
generation is a broad field that employs varied techniques and can be 
applied to many different kinds of problems. Data may be fully or 
partially synthetic. A fully synthetic dataset wholly consists of 
points created using computational methods, whereas a partially 
synthetic dataset may involve a mix of first-hand and computationally 
generated synthetic data.
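The distinction between fully and partially synthetic datasets can be illustrated with a minimal sketch (not part of this notice; all field names and values are hypothetical). A partially synthetic dataset keeps some first-hand fields and replaces a sensitive field with computationally generated values, while a fully synthetic dataset generates every field:

```python
import random
import statistics

random.seed(0)  # reproducible sketch

# Hypothetical first-hand records (e.g., collected via a survey).
real = [{"age": a, "income": i} for a, i in
        [(34, 52000), (45, 61000), (29, 48000), (52, 75000), (41, 58000)]]

def fit_normal(values):
    """Estimate mean and standard deviation from the seed data."""
    return statistics.mean(values), statistics.stdev(values)

# Partially synthetic: keep the real ages, but replace the sensitive
# income column with draws from a distribution fitted to the seed data.
mu, sigma = fit_normal([r["income"] for r in real])
partial = [{"age": r["age"], "income": round(random.gauss(mu, sigma))}
           for r in real]

# Fully synthetic: every field is computationally generated.
age_mu, age_sigma = fit_normal([r["age"] for r in real])
full = [{"age": round(random.gauss(age_mu, age_sigma)),
         "income": round(random.gauss(mu, sigma))}
        for _ in range(5)]
```

In this sketch the partially synthetic records preserve the first-hand `age` values, while the fully synthetic records share only statistical characteristics with the seed data.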
    Throughout this RFI, we use the following definitions:
     • data--recorded information, regardless of form or the 
media on which the data is recorded; \2\
---------------------------------------------------------------------------

    \2\ 44 U.S.C. 3502(16).
---------------------------------------------------------------------------

     • data asset--a collection of data elements or data sets 
that may be grouped together; \3\
---------------------------------------------------------------------------

    \3\ 44 U.S.C. 3502(17).
---------------------------------------------------------------------------

     • open government data asset--a public data asset that is 
(A) machine-readable; (B) available (or could be made available) in an 
open format; (C) not encumbered by restrictions, other than 
intellectual property rights, including under titles 17 and 35, that 
would impede the use or reuse of such asset; and (D) based on an 
underlying open standard that is maintained by a standards 
organization.\4\
---------------------------------------------------------------------------

    \4\ 44 U.S.C. 3502(20).
---------------------------------------------------------------------------

    The National Institute of Standards and Technology (NIST) defines 
synthetic data generation as ``a process in which seed data is used to 
create artificial data that has some of the statistical characteristics 
as the seed data''.\5\
---------------------------------------------------------------------------

    \5\ https://csrc.nist.gov/glossary/term/synthetic_data_generation.
---------------------------------------------------------------------------

    The CDO Council believes that this definition of synthetic data 
generation includes techniques such as using statistics to create data 
from a known distribution, generative adversarial networks (GANs),\6\ 
variational autoencoding (VAE),\7\ building test data for use in 
software development,\8\ privacy-preserving synthetic data generation 
\9\ and others.
---------------------------------------------------------------------------

    \6\ 15 U.S.C. 9204.
    \7\ A useful definition of this technique is available in the 
abstract of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8774760/.
    \8\ This technique is described in the Department of Defense 
DevSecOps Fundamentals Guidebook https://dodcio.defense.gov/Portals/0/Documents/Library/DevSecOpsTools-ActivitiesGuidebook.pdf, page 23.
    \9\ NIST Special Publication 800-188, Section 4.4 https://doi.org/10.6028/NIST.SP.800-188.
---------------------------------------------------------------------------

    The CDO Council also believes that it is important to draw 
contrasts between synthetic data generation and other activities. For 
example, synthetic data generation does not include collection

[[Page 785]]

of data without any inference. Synthetic data generation does not 
include signal processing, such as automated differential translations 
of global positioning satellite data. Synthetic data generation also 
does not include enriching data during data analysis--intermediate 
steps that augment or enhance existing data but do not involve the 
creation of artificial data.
    Other analysis techniques, such as distribution fitting and 
parametric modeling, are closely related to synthetic data generation. 
The CDO Council believes the key difference, however, is the purpose of 
the computational methods. Synthetic data generation seeks to create 
wholly new data points based on the statistical properties of a 
dataset, whereas distribution fitting seeks to `fill in' a dataset 
based on a known distribution. Notably, the fitted distribution can be 
used to generate points that are not part of the original dataset--
which is an application of synthetic data generation.
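The relationship described above can be sketched briefly (an illustrative example, not part of this notice; the data values are hypothetical). Distribution fitting estimates the parameters of a known distribution from seed data; using that fitted distribution to create wholly new points is an application of synthetic data generation:

```python
import random
import statistics

random.seed(2)

# Hypothetical seed data: observed wait times, in minutes.
observed = [1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 1.1, 0.9]

# Distribution fitting: estimate the rate parameter of an
# exponential model from the seed data.
rate = 1.0 / statistics.mean(observed)

# Synthetic data generation: use the fitted distribution to create
# wholly new data points that are not part of the original dataset.
synthetic = [random.expovariate(rate) for _ in range(8)]
```

The fitting step only summarizes the seed data; the generation step produces new artificial points, which is where the activity becomes synthetic data generation in the sense discussed here.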
Questions
     • Are there any limitations to relying on the NIST 
definition to describe the field of synthetic data generation? How 
should it be improved?
     • How well does the CDO Council's list of examples and 
contrasts improve understanding? How should these be improved?

Section 2: Applying Synthetic Data Generation

    Synthetic data generation can enable the creation of larger and 
more diverse datasets, enhance model performance, and protect 
individual privacy. The CDO Council's review of potential applications 
of synthetic data generation found examples in:
     • Data augmentation.\10\ This application involves creating 
new data points or datasets from existing data. This application can be 
particularly useful in developing training datasets for machine 
learning and advanced analytics.
---------------------------------------------------------------------------

    \10\ This application is briefly described at https://frederick.cancer.gov/initiatives/scientific-standards-hub/ai-and-data-science, Section 4.
---------------------------------------------------------------------------

     • Data synthesis.\11\ This application involves using an 
existing dataset to create a new dataset that shares similar 
statistical properties with the original, in order to protect 
individual privacy. 
Generating such datasets has wide-ranging applications including, but 
not limited to, facilitating reproducible investigation of clinical 
data while preserving individual privacy.
---------------------------------------------------------------------------

    \11\ A definition of this technique is available in the abstract 
of this paper https://par.nsf.gov/servlets/purl/10187206.
---------------------------------------------------------------------------

     • Modeling and simulation.\12\ This application involves 
setting assumptions, parameters and rules to develop data for further 
analysis. The synthetic dataset can be used for developing insights, 
testing hypotheses, and/or understanding a model's behavior. This 
application supports the conduct of controlled experiments, predicting 
potential future outcomes from current conditions, generating scenarios 
for rare or extreme events, and validating or calibrating a model.
---------------------------------------------------------------------------

    \12\ A definition of a computer simulation is proposed at https://builtin.com/hardware/computer-simulation.
---------------------------------------------------------------------------

     • Software development.\13\ This application involves using 
existing database schemas to simulate real-world scenarios and ensure 
that a software application can handle different types of data and 
errors effectively. This application assists in the creation of 
representative data, makes it easier to generate edge cases, protects 
individual privacy, and improves testing efficiency.
---------------------------------------------------------------------------

    \13\ DoD DevSecOps Fundamentals, ibid.
---------------------------------------------------------------------------
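The software development application above can be sketched in miniature (an illustrative example, not part of this notice; the schema, field names, and values are hypothetical). Synthetic test records are generated to match a database schema, with deliberate edge cases added to exercise error handling:

```python
import random
import string

random.seed(1)

# Hypothetical schema for a user table: field name -> value generator.
SCHEMA = {
    "user_id": lambda: random.randint(1, 10**6),
    "username": lambda: "".join(random.choices(string.ascii_lowercase, k=8)),
    "email": lambda: "".join(random.choices(string.ascii_lowercase, k=5))
                     + "@example.gov",
    "age": lambda: random.randint(18, 99),
}

def make_row(overrides=None):
    """Generate one synthetic row, optionally forcing edge-case values."""
    row = {field: gen() for field, gen in SCHEMA.items()}
    row.update(overrides or {})
    return row

# Representative rows plus deliberate edge cases for testing.
rows = [make_row() for _ in range(3)]
rows.append(make_row({"username": "", "age": 0}))   # boundary values
rows.append(make_row({"email": "not-an-email"}))    # malformed input
```

Because every value is computationally generated, no real individual's data appears in the test set, which is one way this application protects individual privacy while still producing representative and edge-case data.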

    Notably, the CDO Council believes that not all applications of 
modeling and simulation would meet the definition of synthetic data 
generation. For example, weather forecasting applies numerical models 
and a complex mix of data analysis, meteorological science, and 
computational methods but does not involve the creation of synthetic or 
artificial data points. Instead, the purpose of these models is to 
predict future conditions.
Questions
     • How are these examples representative of synthetic data 
generation? How should they be revised?
     • What other examples of synthetic data generation should 
the CDO Council know about?
     • What are the key advantages of using synthetic data 
generation?

Section 3: Challenges and Limitations in Synthetic Data Generation

    The CDO Council recognizes that synthetic data generation can be a 
valuable technique. However, the technique also has challenges and 
limitations. For example, it can be difficult to generate data that 
realistically simulates the real world and the diversity of real data. 
Evaluating the quality of a synthetic dataset can also be extremely 
challenging.
    Synthetic data generation is also subject to challenges common to 
statistical methods generally, such as overfitting and imbalances in 
the source data. These challenges reduce the utility of the generated 
synthetic data, which may not be properly representative--for example, 
by failing to represent rare classes.
Questions
     • What other challenges and limitations are important to 
consider in synthetic data generation?
     • What tools or techniques are available for effectively 
communicating the limitations of generated synthetic data?
     • What are best practices for CDOs to coordinate with 
statistical officials on synthetic data?
     • What approaches can CDOs consider to help address these 
challenges?

Section 4: Ethics and Equity Considerations in Synthetic Data 
Generation

    Synthetic data generation techniques hold great promise, but also 
introduce questions of ethics and equity. Consistent with Federal 
privacy practices,\14\ any data generation technique involving 
individuals must respect their privacy rights and obtain informed 
consent before using real-world data to generate synthetic data. As 
noted in Section 3, synthetic data generation is also subject to 
challenges common to statistical methods generally and has the 
potential to introduce and encode errors or bias, potentially leading 
to discriminatory outcomes.
---------------------------------------------------------------------------

    \14\ OMB Circular A-130, Appendix II https://www.whitehouse.gov/wp-content/uploads/legacy_drupal_files/omb/circulars/A130/a130revised.pdf.
---------------------------------------------------------------------------

    Uses of generated synthetic data must also be carefully considered. 
The context and quality of the generated synthetic data will determine 
its practical utility and impact. Assessing and understanding the fitness 
of a generated synthetic dataset is essential. For instance, a 
generated synthetic dataset may not sufficiently represent the 
diversity of the source dataset. In addition, a generated synthetic 
dataset may not contain sufficient variables to fully represent the 
system and the drivers of differences in the phenomenon it is meant to 
represent.
Questions
     • What techniques are available to facilitate transparency 
around generated synthetic data?
     • What are best practices for CDOs to coordinate with 
privacy officials on

[[Page 786]]

ethics and equity matters related to synthetic data generation?
     • How can we apply the Federal Data Ethics Framework \15\ to 
address these ethics and equity concerns?
---------------------------------------------------------------------------

    \15\ https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf.
---------------------------------------------------------------------------

Section 5: Synthetic Data Generation and Evidence-Building

    Synthetic data generation can enable the production of evidence for 
use in policymaking. Applications such as simulation or modeling can 
help policymakers explore scenarios and their potential impacts. 
Likewise, policymakers can conduct controlled experiments of potential 
policy interventions to better understand their impacts. Data synthesis 
may help policymakers make more data publicly available to spur 
research and other foundational fact-finding activities that can inform 
policymaking.
Questions
     • What other applications of synthetic data generation 
support evidence-based policymaking? \16\
---------------------------------------------------------------------------

    \16\ OMB Memorandum M-19-23.
---------------------------------------------------------------------------

     • What is the relationship between synthetic data generation 
and open government data? \17\
---------------------------------------------------------------------------

    \17\ 44 U.S.C. 3502(20).
---------------------------------------------------------------------------

     • How can CDOs and Evaluation Officers best collaborate on 
synthetic data generation to support evidence-building? \18\ What about 
other evidence officials? \19\
---------------------------------------------------------------------------

    \18\ OMB Memorandum M-19-23, Appendix A.
    \19\ OMB Memorandum M-19-23, Section II (Key Senior Officials).

Kenneth Ambrose,
Senior Advisor CDO Council, Office of Shared Solutions and Performance 
Improvement, General Services Administration.
[FR Doc. 2024-00036 Filed 1-4-24; 8:45 am]
BILLING CODE 6820-69-P

