In Search of Clean Data for AI Projects
High-performing AI companies are taking steps to check data for skew, bias and protected attributes; Dr. Jeannette Wing says formal methods for AI are needed.
By John P. Desmond, Editor, AI in Business

The search for quality data is a roadblock to developing fair and equitable AI systems. Companies that are high performers in AI systems development are now taking steps to improve the quality of the data they need.
A recent survey of 504 IT executives, released by Appen and conducted by The Harris Poll, finds that business and technology leaders are arriving at a consensus that they need to concentrate on more “responsible” approaches to AI, according to a recent report from ZDNet.
Some 51 percent reported that data accuracy is critical to their AI use case, but only 6 percent reported achieving data accuracy over 90 percent. "Many are facing the challenges of trying to build great AI with poor datasets," stated the survey authors. "To successfully build AI models, organizations need accurate and high-quality data. Unfortunately, business leaders and technologists report a significant gap in the ideal versus reality in achieving data accuracy."
Companies more advanced in the development of AI systems are taking steps to correct for bias in datasets, according to a 2021 report from McKinsey. The results showed:
40 percent were found to have protocols in place to ensure good data quality;
47 percent of high-performing AI companies were scanning training and testing data to detect underrepresentation of protected attributes (a sketch of such a check appears after this list);
47 percent were checking for skewed or biased data during the data ingestion stage;
36 percent of data professionals were actively checking for skewed or biased data in several stages of model development;
In 24 percent of the high-performing AI companies, legal and risk professionals were working with data science teams to help them understand definitions of bias and protected classes;
And in 23 percent of high-performing AI companies, a dedicated governance committee that includes risk and legal professionals was in place.
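The McKinsey report does not prescribe any particular implementation, but a minimal sketch of what scanning for underrepresentation and skew might look like is below, using pandas; the column names, data and threshold are hypothetical.

```python
import pandas as pd

# Hypothetical training data; in practice this would be pulled at ingestion.
df = pd.DataFrame({
    "gender":   ["F", "M", "M", "M", "M", "M", "F", "M", "M", "M"],
    "approved": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})

def check_representation(df, column, min_share=0.3):
    """Warn when any value of a protected attribute falls below min_share."""
    shares = df[column].value_counts(normalize=True)
    for value, share in shares[shares < min_share].items():
        print(f"WARNING: {column}={value!r} is only {share:.0%} of rows "
              f"(threshold {min_share:.0%})")
    return shares

check_representation(df, "gender")
# WARNING: gender='F' is only 20% of rows (threshold 30%)

# A simple skew check: compare outcome rates across groups.
print(df.groupby("gender")["approved"].mean())
```

A real pipeline would run checks like these on every ingested batch and route warnings to the data science team rather than printing them.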
The McKinsey online survey was conducted in mid-2021, based on responses from 1,843 participants, more than 1,000 of whom worked in organizations that had adopted AI in at least one function. A range of regions, industries, company sizes, functional specialties and tenures were represented.
Great Expectations Survey on Data Quality Finds Negative Impact
More findings on concerns about data quality in AI projects recently came from a survey conducted by Great Expectations, a shared, open standard for data quality. The survey was conducted for the company in May 2022 by Pollfish, an independent research platform. More than 500 information services and data professionals in the US responded (57 percent men and 43 percent women, all between the ages of 18 and 54); some 60 percent of respondents were employed at companies with 250 or more employees, according to an account from insideBigData.
The findings showed 71 percent of respondents reporting data quality issues, and 91 percent saying those issues affected their company’s performance.
“Poor data quality and pipeline debt [undocumented, untested, unstable data] create organizational friction between stakeholders, with consequences like degraded confidence,” stated Abe Gong, CEO and Cofounder of Superconductive, the company that makes Great Expectations. “This survey made it clear that data quality issues are prevalent, and they’re harming business outcomes.”
Data practitioners blamed poor data quality on lack of documentation (31 percent), lack of tooling (27 percent), and teams not understanding each other (22 percent).
Fewer than half of respondents expressed high trust in their organization’s data, and 13 percent had low trust in data quality, stemming from broken apps or dashboards; decisions based on unreliable or bad data; teams having no shared understanding of metrics; and siloed or conflicting departments. Friction between teams was an additional issue.
“Data quality is critical to facilitate the making of decisions with confidence across the organization, enabling a singular understanding of what that data means and what it’s being used for,” Gong stated. “That’s why support for data quality efforts should be found at every level of an organization, from data scientists and engineers to the C-suite and board, to have confidence in outcomes for decision-making.”
Ongoing data quality efforts include having a data quality plan scoped out and budgeted (22 percent), using a specific data quality tool (19 percent), checking data manually (14 percent), and building their own systems (15 percent).
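For a sense of what tool-based checking looks like, the sketch below uses Great Expectations' classic pandas-style interface; the file name and column names are hypothetical, and the exact API varies across versions of the library.

```python
import great_expectations as ge

# Load a hypothetical CSV through Great Expectations' pandas wrapper.
df = ge.read_csv("customers.csv")

# Declarative expectations about the data; each returns a validation result.
no_nulls = df.expect_column_values_to_not_be_null("customer_id")
in_range = df.expect_column_values_to_be_between("age", min_value=18, max_value=120)

# Each result reports whether the expectation held (older versions return
# a plain dict; use result["success"] there).
print(no_nulls.success, in_range.success)
```

The appeal of this style is that the expectations double as documentation, addressing the lack-of-documentation complaint the survey respondents raised.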
Prof. Wing Argues for Formal Methods to Define Quality AI Development
One computer scientist makes the case that new formal methods are needed, in particular around data, to make AI trustworthy.
Jeannette Wing, director of the Data Science Institute at Columbia University, wrote an account in Communications of the ACM last October to make the case. “The ultimate purpose of this article is to rally the computing community to support a broad-based, long-term research program on trustworthy AI, drawing on the expertise and sensibilities from multiple research communities and stakeholders,” Dr. Wing stated.
Also, “Trustworthy AI adds new desired properties above and beyond those for trustworthy computing; AI systems require new formal methods techniques, and in particular, the role of data raises brand-new research questions,” she stated.
A tradeoff between accuracy and fairness might need to be considered by AI system developers. “The machine learning community considers accuracy as a gold standard, but trustworthy AI requires us to explore trade-offs among these properties,” Dr. Wing stated. Some accuracy might have to be traded off to make a fairer model, for example.
Moreover, fairness can have different interpretations, including demographic parity, equalized odds and individual fairness, “some of which are incompatible with each other,” she stated.
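To make those definitions concrete, the sketch below computes a demographic parity gap and the true-positive-rate half of an equalized odds gap for a hypothetical set of predictions; the data and group labels are invented for illustration and do not come from Dr. Wing's article.

```python
import numpy as np

# Hypothetical labels, model predictions and a binary protected attribute.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

a, b = group == "A", group == "B"

# Demographic parity: positive-prediction rates should match across groups.
dp_gap = abs(y_pred[a].mean() - y_pred[b].mean())

# Equalized odds (true-positive-rate half): among actual positives,
# prediction rates should match across groups.
tpr_gap = abs(y_pred[a & (y_true == 1)].mean() - y_pred[b & (y_true == 1)].mean())

print(f"demographic parity gap: {dp_gap:.2f}")   # 0.00: parity holds
print(f"equalized odds TPR gap: {tpr_gap:.2f}")  # 0.33: equalized odds does not
```

On this toy data the model satisfies demographic parity exactly while violating the equalized odds condition, illustrating how the definitions can pull in different directions.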
AI systems employing deep neural nets add a dimension of complexity, in that they are based on probabilities, as opposed to a definite outcome fully determined by parameter values. She contrasts what she calls a “probabilistic” approach, which incorporates some randomness, with a “deterministic” approach, in which the outputs can be traced back to the inputs.
“More recent advances in machine learning, which rely on big data, add to their probabilistic nature, as data from the real world are just points in a probability space,” Dr. Wing stated. Making such a probabilistic AI system trustworthy is the challenge.
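The contrast can be seen in a toy comparison, invented for this article: a rule-based check returns the same answer for the same input every time, while a learned model outputs a probability that depends on randomness in how it was trained.

```python
import math
import random

# Deterministic: the output is fully determined by the input and can be
# traced back and re-checked exactly.
def approve_deterministic(income: float) -> bool:
    return income >= 50_000

# Probabilistic: the output is a probability, and randomness in training
# (caricatured here as seeded noise) means two training runs can disagree.
def approve_probability(income: float, seed: int) -> float:
    noise = random.Random(seed).gauss(0, 0.05)  # stands in for training randomness
    score = 1 / (1 + math.exp(-(income - 50_000) / 10_000)) + noise
    return min(max(score, 0.0), 1.0)

print(approve_deterministic(60_000))        # always True
print(approve_probability(60_000, seed=1))  # a probability near 0.73
print(approve_probability(60_000, seed=2))  # slightly different
```

Verifying the first function is straightforward; it is the second kind of system, whose behavior shifts with data and training randomness, that Dr. Wing argues needs new formal methods.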
Dr. Wing has a suggestion. “One approach for building end-user trust in computing systems is formal verification, where properties are proven once and for all over a large domain.” She sees the approach as having some points in its favor.
“Formal verification has the advantage of obviating the need to test individual input values or behaviors one-by-one, which for large (or infinite) state spaces is impossible to achieve completely,” Dr. Wing stated.
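As a toy illustration of that advantage (not an example from Dr. Wing's article), the Lean 4 snippet below proves a bound on a score-clamping function for every natural-number input at once, which no finite test suite could do over an unbounded domain.

```lean
-- A toy post-processing step: clamp a model score to the range [0, 100].
def clamp (x : Nat) : Nat := min x 100

-- The bound is proven once and for all over the whole (infinite) domain,
-- rather than tested input by input.
theorem clamp_le (x : Nat) : clamp x ≤ 100 :=
  Nat.min_le_right x 100
```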
Precedent does exist. Success stories in the use of formal methods for software development have been achieved in efforts to detect device driver bugs, for example, with scalable and practical results. She sees similar approaches in use in the hardware and software industries, at companies including Intel, IBM, Microsoft and Amazon.
Read the source articles and information from ZDNet, from McKinsey, from insideBigData and in Communications of the ACM.