Data Governance

Skill 6: Data Quality, Governance, and Grounding

The foundational discipline that separates toy demos from production-grade agentic AI.


Overview

Skill 6 covers a discipline that earlier frameworks left out: data quality, governance, and grounding. Data quality is widely regarded as one of the most important factors in agent performance. The principle of "garbage in, garbage out" applies with even greater force to agentic systems: an agent with perfect reasoning capabilities will still fail if it is grounded in inaccurate, outdated, or biased data.


The Three Sub-Skills

Sub-Skill | Focus Area | Key Concepts
6.1 Data Quality Assurance | Ensuring data accuracy, consistency, and freshness | Validation, deduplication, canonicalization, staleness management
6.2 Data Governance and Lineage | Traceability, access control, and bias mitigation | Lineage tracking, RBAC, bias detection, compliance
6.3 Grounding and Hallucination Prevention | Ensuring factual correctness and source attribution | Strict grounding, citation, confidence scoring

6.1 Data Quality Assurance

Data Validation and Schema Enforcement

  • Core Principle: All ingested data must conform to defined schemas and quality standards
  • For Structured Data: Enforce database constraints (type checking, foreign key validation, range checks)
  • For Unstructured Data: Detect corrupted PDFs, low-quality OCR, malformed documents
  • Tools: Great Expectations for defining data quality rules
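
The structured-data checks above (type checking and range checks) can be sketched in plain Python. This is a minimal illustration, not the Great Expectations API: the `FieldRule` class, the `INVOICE_SCHEMA` fields, and the `validate_record` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Optional


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical rule: an expected type plus an optional numeric range."""
    expected_type: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None


# Illustrative schema for an invoice record; the field names are assumptions.
INVOICE_SCHEMA = {
    "invoice_id": FieldRule(str),
    "amount": FieldRule(float, minimum=0.0),
    "issued_on": FieldRule(date),
}


def validate_record(record: dict[str, Any], schema: dict[str, FieldRule]) -> list[str]:
    """Return human-readable violations; an empty list means the record passed."""
    errors = []
    for name, rule in schema.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        # Strict type check: an int is not accepted where a float is expected.
        if not isinstance(value, rule.expected_type):
            errors.append(f"{name}: expected {rule.expected_type.__name__}")
            continue
        if rule.minimum is not None and value < rule.minimum:
            errors.append(f"{name}: below minimum {rule.minimum}")
        if rule.maximum is not None and value > rule.maximum:
            errors.append(f"{name}: above maximum {rule.maximum}")
    return errors
```

Records that return a non-empty list would typically be quarantined rather than ingested, so that the agent never retrieves them.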

Deduplication and Canonicalization

  • Problem: "IBM", "International Business Machines", "IBM Corp.", "I.B.M." are all the same entity
  • Solution: Entity resolution algorithms using fuzzy matching, similarity scoring, and ML
  • Techniques: Levenshtein distance, Jaro-Winkler, phonetic algorithms, knowledge graph linking
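
As a rough sketch of the techniques listed above, the following combines name normalization, a hand-written Levenshtein distance, and an alias lookup. The suffix list and the `ALIASES` table are assumptions for illustration; edit distance alone cannot link "IBM" to "International Business Machines", which is where an alias table or knowledge graph would come in.

```python
import re

# Hypothetical corporate suffixes to ignore when comparing names.
SUFFIXES = {"corp", "inc", "ltd", "co"}
# Hypothetical alias table standing in for knowledge graph linking.
ALIASES = {"international business machines": "ibm"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip common corporate suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)


def canonical(name: str) -> str:
    """Map a raw name to its canonical key, using the alias table if present."""
    key = normalize(name)
    return ALIASES.get(key, key)


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between normalized names; 1.0 means identical."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

In practice a similarity threshold would decide which pairs are merged automatically and which are sent for manual review.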

Freshness and Staleness Management

  • Core Principle: Track data currency and refresh outdated information
  • Implementation: Timestamp tracking, TTL policies, scheduled refresh jobs, staleness scoring
  • Use Cases: Financial data, regulatory information, product availability, real-time monitoring
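
One way to combine timestamp tracking, TTL policies, and staleness scoring is to express a record's age as a fraction of its TTL. The categories and TTL values below are assumptions made up for this sketch, not recommended settings.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical TTL policy per data category.
TTL_POLICY = {
    "stock_price": timedelta(minutes=1),
    "product_availability": timedelta(hours=1),
    "regulation": timedelta(days=30),
}


def staleness_score(fetched_at: datetime, category: str, now: datetime) -> float:
    """Age divided by TTL: 0.0 is brand new, 1.0 or more is past its TTL."""
    return (now - fetched_at) / TTL_POLICY[category]


def needs_refresh(fetched_at: datetime, category: str, now: datetime) -> bool:
    """True once the record has reached or exceeded its TTL."""
    return staleness_score(fetched_at, category, now) >= 1.0
```

A scheduled refresh job could call `needs_refresh` over the index and re-fetch only the records that have expired.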

6.2 Data Governance and Lineage

Data Lineage Tracking

  • Core Principle: Every piece of information must be traceable back to its source
  • Audit Trail: Source document → extraction → transformation → storage → retrieval → output
  • Tools: Apache Atlas, OpenLineage, Amundsen
  • Compliance: GDPR (transparency around automated decisions, often called the "right to explanation"), HIPAA, SOC2, financial regulations
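
The audit trail above can be modeled as a record that accumulates one entry per pipeline stage. This is a minimal in-memory sketch; the `LineageRecord` class and the example URI are hypothetical, and it does not use the Apache Atlas or OpenLineage APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Tracks every stage a piece of information passed through."""
    source_uri: str
    steps: list = field(default_factory=list)

    def record(self, stage: str, detail: str) -> "LineageRecord":
        """Append one stage with a UTC timestamp; returns self for chaining."""
        self.steps.append({
            "stage": stage,
            "detail": detail,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self

    def trail(self) -> str:
        """Render the path from source document to the latest stage."""
        return " -> ".join([self.source_uri] + [s["stage"] for s in self.steps])


# Usage: the URI and stage details are made up for illustration.
lineage = LineageRecord("s3://example-bucket/policy.pdf")
lineage.record("extraction", "PDF text extracted").record("transformation", "split into chunks")
print(lineage.trail())
```

Storing this record alongside each retrieved chunk lets an agent's output be traced back to the source document during an audit.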

Access Control and Data Segmentation

  • Implementation: Role-based access (RBAC), attribute-based access (ABAC), data classification
  • Technical Controls: IAM, row-level security, column-level encryption, data masking
  • Compliance: GDPR data minimization, HIPAA PHI protection, PCI-DSS
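
For an agent, access control usually has to be applied at retrieval time, before documents reach the model. The sketch below filters documents by role clearance and masks a sensitive value. The classification levels, role names, and helper functions are assumptions for illustration only.

```python
from dataclasses import dataclass

# Hypothetical classification levels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}
# Hypothetical role-to-clearance mapping (a simple form of RBAC).
ROLE_CLEARANCE = {"support_agent": "internal", "compliance_officer": "restricted"}


@dataclass(frozen=True)
class Document:
    doc_id: str
    classification: str
    text: str


def authorized_documents(role: str, documents: list[Document]) -> list[Document]:
    """Keep only documents at or below the role's clearance level."""
    clearance = LEVELS[ROLE_CLEARANCE[role]]
    return [d for d in documents if LEVELS[d.classification] <= clearance]


def mask_account_number(value: str) -> str:
    """Simple data masking: keep only the last four characters visible."""
    return "*" * max(len(value) - 4, 0) + value[-4:]
```

An unknown role raises `KeyError` here, which fails closed: no clearance means no documents.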

Bias Detection and Mitigation

  • Problem: Training data often contains historical biases (gender, race, age, socioeconomic)
  • Detection: Statistical analysis, fairness metrics (demographic parity, equalized odds)
  • Mitigation: Reweighting, resampling, adversarial debiasing, fairness constraints
  • Use Cases: Hiring AI, lending decisions, healthcare recommendations
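
The two fairness metrics named above can be computed directly from predictions and group labels. This sketch assumes binary labels and predictions (0 or 1) and defines the equalized odds gap as the larger of the true-positive-rate gap and the false-positive-rate gap across groups; the function names are this example's own rather than calls into Fairlearn or AI Fairness 360.

```python
def rate(values: list[int]) -> float:
    """Fraction of positive (1) entries; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def demographic_parity_difference(y_pred: list[int], groups: list[str]) -> float:
    """Largest gap in selection rate P(pred = 1 | group) between groups."""
    rates = [
        rate([p for p, g in zip(y_pred, groups) if g == group])
        for group in set(groups)
    ]
    return max(rates) - min(rates)


def conditional_rate(y_true, y_pred, groups, group, label) -> float:
    """P(pred = 1 | true = label, group): TPR if label is 1, FPR if label is 0."""
    return rate([p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == label])


def equalized_odds_difference(y_true, y_pred, groups) -> float:
    """Larger of the TPR gap and the FPR gap across groups."""
    gaps = []
    for label in (1, 0):
        rates = [conditional_rate(y_true, y_pred, groups, g, label) for g in set(groups)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)
```

A value of 0.0 on either metric means the groups are treated identically by that measure; how large a gap is acceptable is a policy decision, not something the code can settle.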

6.3 Grounding and Hallucination Prevention

Strict Grounding Requirements

  • Core Principle: Agents should base factual claims only on retrieved documents, never on parametric knowledge
  • Implementation: RAG with strict grounding prompts, attribution checking, validation against sources
  • Use Cases: Enterprise knowledge bases, medical AI, legal AI
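
A strict grounding prompt and a basic attribution check might look like the following. The prompt wording, the 0.6 overlap threshold, and the function names are assumptions. Word overlap is only a crude lexical proxy for "supported by the sources"; a production system would more likely use an entailment model or a second LLM pass.

```python
import re

# Hypothetical prompt template; the wording is an assumption, not a standard.
GROUNDING_PROMPT = (
    "Answer using ONLY the numbered sources below. "
    "If the sources do not contain the answer, reply exactly: I don't know.\n\n"
    "{sources}\n\nQuestion: {question}"
)


def build_prompt(question: str, passages: list[str]) -> str:
    """Number the retrieved passages and embed them in the grounding prompt."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return GROUNDING_PROMPT.format(sources=sources, question=question)


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer: str, passages: list[str], min_overlap: float = 0.6) -> list[str]:
    """Flag answer sentences whose words are mostly absent from every passage."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = tokens(sentence)
        if not words:
            continue
        best = max((len(words & tokens(p)) / len(words) for p in passages), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged
```

Any flagged sentence is a candidate hallucination and can be removed, regenerated, or routed to review before the answer is shown.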

Citation and Attribution

  • Core Principle: Every factual claim should include a citation to the source document
  • Implementation: Inline citations, footnote-style references, clickable links to sources
  • Benefits: Builds trust, enables fact-checking, improves reliability
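
Inline citations with a footnote-style reference list can be produced by numbering each distinct source once. The `Source` and `Claim` types and the example URLs are hypothetical, used only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    title: str
    url: str


@dataclass(frozen=True)
class Claim:
    text: str
    source: Source


def render_with_citations(claims: list[Claim]) -> str:
    """Attach [n] to each claim and append a footnote-style source list."""
    numbering: dict[Source, int] = {}
    body = []
    for claim in claims:
        # Reuse the existing number if this source was already cited.
        number = numbering.setdefault(claim.source, len(numbering) + 1)
        body.append(f"{claim.text} [{number}]")
    references = [f"[{n}] {s.title} ({s.url})" for s, n in numbering.items()]
    return " ".join(body) + "\n\n" + "\n".join(references)
```

Because every claim carries its source as a required field, an uncited factual claim cannot be rendered at all.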

Confidence Scoring and Uncertainty

  • Core Principle: Agents should express uncertainty and refuse to answer when confidence is low
  • Implementation: Confidence scores (0-1), uncertainty bands, explicit "I don't know" responses
  • Use Cases: High-stakes decisions, safety-critical systems, customer service escalation
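
A confidence score in the 0-1 range can drive a simple three-way policy: answer, answer with a caveat, or decline and escalate. The thresholds below are illustrative assumptions, and the sketch takes the score as given. How a well-calibrated score is produced is a separate problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAnswer:
    text: str
    confidence: float  # 0.0 to 1.0, produced upstream by some hypothetical scorer


def respond(answer: AgentAnswer, answer_threshold: float = 0.8,
            escalate_threshold: float = 0.5) -> str:
    """Answer, hedge, or decline depending on the confidence score."""
    if not 0.0 <= answer.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if answer.confidence >= answer_threshold:
        return answer.text
    if answer.confidence >= escalate_threshold:
        return f"I'm not certain, but: {answer.text} (confidence {answer.confidence:.2f})"
    return "I don't know. Escalating to a human reviewer."
```

For safety-critical uses, the thresholds would normally be set from evaluation data rather than chosen by hand.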

Real-World Failure Modes

Healthcare AI Hallucination

Scenario: Medical diagnosis agent hallucinates treatment recommendations
Impact: Patient harm, legal liability, loss of trust
Mitigation: Implement RAG with strict grounding, require citations to medical literature

Hiring AI Bias

Scenario: Resume screening agent discriminates against women and minorities
Impact: Legal action, reputational damage, regulatory fines
Mitigation: Bias detection, fairness constraints, diverse training data, regular audits

Success: Financial Services Compliance

Implementation: Apache Atlas for lineage, Great Expectations for validation, strict access control
Outcome: Passed regulatory audits, zero compliance violations, competitive advantage


Transferable Competencies

Mastering Skill 6 requires proficiency in:

  • Data Engineering: ETL pipelines, data validation, quality assurance
  • Data Governance: Lineage tracking, access control, compliance frameworks
  • Information Quality: Accuracy, completeness, consistency, timeliness dimensions
  • Entity Resolution: Fuzzy matching, deduplication, canonicalization
  • Fairness and Ethics: Bias detection, fairness metrics, mitigation techniques
  • Regulatory Compliance: GDPR, HIPAA, SOC2, PCI-DSS, industry standards

Common Pitfalls

  1. No validation: Ingesting data without quality checks leads to garbage outputs
  2. Ignoring duplicates: Multiple representations of entities cause inconsistency
  3. Stale data: Using outdated information leads to incorrect decisions
  4. No lineage tracking: Cannot explain where information came from
  5. Weak access control: Data leakage and privacy violations
  6. Ignoring bias: Perpetuating and amplifying historical biases
  7. Allowing hallucinations: Not enforcing strict grounding requirements
  8. No citations: Users cannot verify factual claims

Key Technologies

Data Quality

  • Great Expectations (data validation framework)
  • Deequ (AWS Labs data quality library built on Apache Spark)
  • Apache Griffin (data quality solution)

Data Governance

  • Apache Atlas (governance and metadata)
  • Amundsen (data discovery)
  • OpenLineage (data lineage standard)

RAG Frameworks

  • LlamaIndex (data framework for ingestion, indexing, and retrieval in RAG)
  • Haystack (pipeline-based RAG)
  • LangChain (retrieval and grounding components)

Bias Detection

  • Fairlearn (Microsoft)
  • AI Fairness 360 (IBM)
  • What-If Tool (Google)

The Bottom Line

Skill 6 is the foundational discipline that separates toy demos from production-grade agentic AI. Data quality, governance, and grounding are not optional—they are the bedrock of trustworthy, compliant, and reliable systems. The most successful AI organizations won't just have the smartest models; they'll have the cleanest data and the most rigorous governance.

