As the US government pursues a goal of advanced artificial intelligence (AI) infrastructure by 2025, a clear challenge emerges. How can federal departments and agencies achieve autonomous, self-sufficient data science capabilities that are also secure, scalable, transparent, and free from both bias and technical debt?
Solving this challenge requires overcoming the misconception that data is a prepackaged resource, fully formed and ready to be put to use. In practice, just the opposite is true: data must be cultivated and curated to become useful. The pace and application of data in our increasingly digital world mean that data and data architecture are now strategic, tangible, and measurable assets, creating the need for operations that measure, plan, and optimize for “data readiness.” These operations must strike a balance between standardization and customization—building systems that, while streamlined, also recognize that end-users are individual stakeholders with unique needs and goals.
So, how can the federal government achieve its goal of advanced AI infrastructure by 2025? The answer lies in building modern data architecture: a responsible, end-to-end data pipeline that achieves data readiness by:
- Designing data for specific end-users and use cases
- Democratizing AI
- Providing easy, unfettered access to all stakeholders, including leadership, to support decision-making
Defining Modern Data Architecture
At its core, modern data architecture comprises people, products, and repeatable processes. The objective is to create a microservices-based architecture with specialized modules that “plug and play” into an overall framework. Each individual module is independent, yet facilitates information and capability sharing with all other components through an integrated approach. This overall framework enables collaboration between the administrative and technological elements of data science.
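As a hedged sketch of the plug-and-play concept, the following Python illustration shows independent modules registering into a shared framework and enriching a common payload. The `DataModule` protocol and `Framework` registry are illustrative assumptions, not ECS's actual implementation:

```python
from typing import Any, Dict, Protocol


class DataModule(Protocol):
    """Interface every plug-and-play module implements (hypothetical)."""
    name: str

    def process(self, payload: Dict[str, Any]) -> Dict[str, Any]: ...


class Framework:
    """Integrating framework: modules stay independent but share results."""

    def __init__(self) -> None:
        self._modules: Dict[str, DataModule] = {}

    def register(self, module: DataModule) -> None:
        # "Plug" a module into the overall framework by name.
        self._modules[module.name] = module

    def run(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        # Each module enriches the shared payload in turn, so capabilities
        # and information flow between otherwise independent components.
        for module in self._modules.values():
            payload.update(module.process(payload))
        return payload
```

Because each module only depends on the shared payload contract, a module can be replaced or redeployed without touching the others.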
Modern data architecture centers on three primary areas:
Enterprise Analytics, which uses data analysis and modeling to inform decision-making, is supported by:
- Data Governance – conducting data stewardship and data management
- User Adoption – fostering end-user adoption through training, communications, and user engagement
- Reporting Steering Committee – defining and managing key performance indicators (KPIs), business definitions, applications, and priorities
- Business Process Improvement – enabling data-driven actions mapped to KPIs
Enterprise Solutioning, in which stakeholders define use cases and create a minimum viable product (MVP) based on desired outcomes, consists of:
- Data Lab – subject matter experts (SMEs) are granted unfettered access to data in a heavy support model
- AI Forge – serves as the conduit and workbench for engineers, data scientists, operational SMEs, and stakeholders to coordinate and perform solution development
- Collaborative Analytics Hub – operational SMEs, data scientists, and other stakeholders perform ad hoc data discovery and combine insights around business problems
Enterprise Enablement, which focuses on scaling and expanding the MVP to meet the needs of the entire enterprise, consists of:
- DevSecOps – platform operations and security are developed in an Agile, containerized fashion
- Data Engineering and Extract-Transform-Load (ETL) – larger-scale data processing procedures and systems are created
- Research and Development (R&D) – emerging tools and products are evaluated and incorporated into the platform
- Infrastructure Management – the platform is hardened, secured, and made elastic
- Business Reporting – visualizations, dashboards, and standard reports are created for enterprise reporting and accountability
Achieving Data Readiness
Modern data architecture is designed to achieve stated goals like those embodied by “VAULTIS” (ref: DoD Data Strategy, October 2020)—data that is:
- Visible – consumers can locate the data
- Accessible – consumers can retrieve the data
- Understandable – consumers can recognize the data’s content, context, and applicability
- Linked – consumers can exploit data elements through innate relationships
- Trustworthy – consumers can trust all aspects of data to inform decision-making
- Interoperable – consumers have a common representation and comprehension of data
- Secure – consumers know data is protected from unauthorized use and manipulation
To VAULTIS, we add E for ethical, to assert that data and its algorithms, attributes, and correlations must be accountable, impartial, resilient, transparent, secure, and governed. This EVAULTIS framework ensures that data, algorithms, and processes are not only scrutinized for accuracy and impartiality throughout every stage of the data pipeline, but are also reliable and durable, governed by clear organization and policies, and protected from potential risks and cyber threats.
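One way to make the EVAULTIS framework operational is as a per-data-set assessment checklist. The sketch below is a hypothetical illustration; the class and method names are assumptions, not part of the DoD Data Strategy or an ECS product:

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EvaultisAssessment:
    """One flag per EVAULTIS attribute for a given data set (illustrative)."""
    ethical: bool = False
    visible: bool = False
    accessible: bool = False
    understandable: bool = False
    linked: bool = False
    trustworthy: bool = False
    interoperable: bool = False
    secure: bool = False

    def gaps(self) -> List[str]:
        """Return the attributes this data set has not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A governance body could require `gaps()` to be empty before a data set is approved for decision-making use.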
So, how does an organization leverage this framework to evaluate whether any given data set can be effectively used for its intended purpose? In other words, how do we measure data readiness to better govern its fitness for use in decision-making?
The answer lies in another framework, ARTO. While EVAULTIS describes the desired end-state of data, ARTO provides the “how to” guidelines and associated metrics to achieve this end-state:
- Accuracy – data is complete, relevant, consistent, interpretable, and free from errors
- Repeatability – data is consistent, precise, and interoperable
- Timeliness – data has sufficient volume, throughput, and velocity
- Operations – data is organized and disseminated through regular reporting, steering councils, dictionaries, and standards for documentation
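As a hedged illustration of how ARTO-style metrics might be computed against a tabular data set, the function below scores the two dimensions that can be measured directly from the data itself: completeness as a proxy for Accuracy, and refresh age for Timeliness. The thresholds and function name are assumptions for illustration; Repeatability and Operations are process-level qualities assessed outside the data itself:

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def arto_metrics(rows: List[Dict[str, Any]], required: List[str],
                 last_refresh: datetime,
                 max_age_hours: float = 24.0) -> Dict[str, float]:
    """Score a data set on simplified ARTO dimensions (illustrative only)."""
    # Accuracy proxy: fraction of required cells that are populated.
    total = len(rows) * len(required) or 1
    filled = sum(1 for r in rows for c in required
                 if r.get(c) not in (None, ""))
    accuracy = filled / total

    # Timeliness: was the data refreshed within the allowed window?
    age_hours = (datetime.now(timezone.utc) - last_refresh).total_seconds() / 3600
    timeliness = 1.0 if age_hours <= max_age_hours else 0.0

    return {"accuracy": accuracy, "timeliness": timeliness}
```

Scores like these could feed the maturity model described below, giving review boards a repeatable, quantitative view of readiness rather than a subjective one.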
Based on ECS’ experience implementing the ARTO framework, these data reporting and readiness systems greatly accelerate data’s ability to inform and improve critical decision-making. We apply a maturity model to assess the activities and controls that support data collection and analysis, fostering greater accuracy and consistency on an ongoing basis. The same metrics that improve the curation of data also provide a disciplined approach to articulating the data landscape. This approach enables an organization to streamline the oversight of data use and dissemination.
ARTO provides readily actionable information that empowers data review boards (DRBs), analytics review boards (ARBs), and institutional review boards (IRBs) to quickly assess how data is being used, its ability to support accurate decision-making, and the restrictions that may need to be applied for security and compliance.
This standardized approach, based on activities that jointly benefit the derivation, use, delivery, and governance of data, is key to accelerating an organization’s ability to oversee data use and consumption without hampering the activities of those leveraging the data assets. By operating the consumption and governance of data in tandem, these processes are mutually strengthened and accelerated.
Compliance and Security
Modern data architecture accelerates compliance and security, meeting many stated objectives of the Federal Data Strategy (FDS) Action Plan and broader federal objectives.
While technology provides a wealth of opportunities to apply controls to data and reduce administrative burden, it does not eliminate that burden altogether. Modern data architecture identifies extraneous and burdensome administrative controls, helping organizations streamline their governance process by homing in on the important administrative elements.
The deployment of modern data architecture leverages technical controls to empower ready use of data assets. Key aspects of modern data architecture that facilitate compliance and security include:
- Component- and service-based DevSecOps and infrastructure as a service (IaaS) ensure the development of repeatable and sustainable processes that can be readily reviewed, approved, and redeployed with minimal need for additional administrative oversight.
- Automated interfaces to enterprise-wide data governance platforms enable automatic updates for data use and dissemination, rapid integration with established and emerging data standards, and compliance with open data requirements. Integration of change management processes streamlines the administrative burden by transforming data governance into a shared endeavor.
- Automated data inventory metadata feeds enable the “timeliness, completeness, consistency, accuracy, usefulness, and availability of open Government data assets” (ref: FDS 2020 Action Plan). Coordinating the deployment of effective metadata ensures not only that data consumers can understand the meaning of data, but also that governance has tools to understand the security classification of the data, facilitating rapid assessments of disclosure and use.
- Event logging and monitoring foster the realization of the full DevSecOps methodology, integrating security and operational needs into the development and data analytics processes. The proactive monitoring of data assets enables an organization to respond to data issues, while providing avenues for oversight and governance to understand drifts in the use of data assets.
- Enterprise user management makes data available on a need-to-know basis with appropriate user/role/attribute level controls.
- Multi-mode data source integration offers flexibility to introduce information from different data sources including databases and sensors.
- Enterprise data/model dissemination channels enable the sharing of data, models, and best practices among partners.
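The user/role/attribute-level controls described above might look like the following attribute-based access check. This is a hypothetical sketch; the clearance levels, attribute names, and `may_access` function are assumptions for illustration:

```python
from typing import Dict, List


def may_access(user: Dict[str, object], asset: Dict[str, object]) -> bool:
    """Grant access only on a need-to-know basis (illustrative ABAC check)."""
    clearances = {"public": 0, "sensitive": 1, "restricted": 2}

    # 1. User's clearance must meet or exceed the asset's classification.
    if clearances[user["clearance"]] < clearances[asset["classification"]]:
        return False

    # 2. Role check: user must hold a role the asset permits.
    if not set(user["roles"]) & set(asset["allowed_roles"]):
        return False

    # 3. Need-to-know: user must be assigned to the asset's program.
    return asset["program"] in user["programs"]
```

Layering clearance, role, and program attributes in this way means a single denied condition blocks access, which mirrors the need-to-know posture described above.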
Accelerating Other Initiatives
Data science does not exist in a vacuum; just as modern data architecture paves the way for effective, scalable AI infrastructure, so too will it accelerate other key federal initiatives, including:
Agile Algorithm Development
The ECS AI Forge is a platform that integrates data lab feeds to mobilize models, enabling users to define problems, frame solutions, and perform rapid discovery workshops and other ad hoc analyses to develop and evaluate plans of action. The collaborative nature of the ECS AI Forge’s modern data architecture promotes Agile solutioning that is open, creative, and collaborative, accelerating the iterative processes for creating advanced algorithms. This is a repeatable and sustainable technological process that minimizes the burden of administrative controls.
The ECS AI Forge transforms data access, governance, and management into an easy-to-use user experience. By creating a logical, federated data library, the ECS AI Forge delivers all the advantages of the data lake concept, while enabling users to manage myriad data sources in many different physical locations, each governed by its own access controls and data formats.
The platform provides transparency across multi-partner, distributed data sets through a single lens, improving data auditability, governance, management, and policy for applications deployed at scale and across echelons. Data and models can reside in any physical location—no matter the agency, program, or initiative—and be approved, saved, and audited at different security levels, greatly simplifying partner management in distributed data projects.
Once a plan of action is created, the ECS AI Forge enables communication, collaboration, and experimentation with data to test, train, and recalibrate models, while data engineers evaluate and expand the data to assess data quality, historical relevance, and priorities. As this process proceeds, the enterprise enablement team integrates models into a solution, continually enhancing and calibrating the models’ progress. Through business reporting, models and all associated deployment activities are documented, tracked, and validated.
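The test, train, and recalibrate cycle described above can be sketched as a simple iteration loop. This is a generic illustration of the workflow, not the AI Forge's actual mechanism; the function names and stopping rule are assumptions, and the `history` list stands in for the documentation and tracking performed through business reporting:

```python
from typing import Any, Callable, List, Tuple


def iterate_model(train: Callable, evaluate: Callable, recalibrate: Callable,
                  data: Any, target: float,
                  max_rounds: int = 5) -> Tuple[Any, List[float]]:
    """Hypothetical train/evaluate/recalibrate loop (illustrative only)."""
    model = train(data)
    history = [evaluate(model, data)]  # track every round for reporting
    while history[-1] < target and len(history) <= max_rounds:
        # Data engineers expand or re-weight the data based on the model...
        data = recalibrate(model, data)
        # ...then the model is retrained and re-evaluated.
        model = train(data)
        history.append(evaluate(model, data))
    return model, history
```

Recording a score per round gives the enterprise enablement team an auditable trail of how each model's calibration progressed before it was integrated into a solution.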