Overview
Data quality validation is a critical aspect of data governance, ensuring that datasets are accurate, complete, and consistent. By leveraging a Large Language Model (LLM) like Meta LLaMA, we can dynamically generate data quality rules tailored to the structure and characteristics of a dataset. This approach enhances automation, reduces manual effort, and ensures compliance with best practices in data validation.
Additionally, Databricks Unity Catalog plays a key role in improving metadata management and governance by providing a unified, secure, and scalable platform to track datasets, manage schema evolution, and enforce access control.
Key Steps in Generating Data Quality Rules
1. Extracting Dataset Profile
The first step in the process is to extract key metadata and statistical insights from the dataset. Databricks Unity Catalog helps in this process by:
- Providing centralized schema management to track column names, data types, and schema evolution.
- Allowing fine-grained access control to ensure only authorized users can extract metadata.
- Offering built-in data lineage tracking, which helps identify how data moves across different tables and transformations.
The dataset profile includes:
- Schema Information: Column names and data types.
- Null Value Analysis: Identifying missing values in each column.
- Distinct Value Count: Understanding uniqueness in the dataset.
- Statistical Summary: Computing minimum, maximum, mean, standard deviation, percentiles, skewness, and kurtosis for numerical columns.
- Most Frequent Values: Identifying common patterns in categorical columns.
These insights form the foundation for generating meaningful validation rules.
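To make this step concrete, below is a minimal PySpark sketch of such a profiling routine (percentiles and top values are omitted for brevity). The table name `main.sales.order_items` and the helper name `profile_table` are illustrative, and `spark` refers to the session Databricks notebooks provide.

```python
from pyspark.sql import functions as F

def profile_table(table_name: str) -> dict:
    """Collect schema, null/distinct counts, and numeric statistics for a table."""
    df = spark.read.table(table_name)   # `spark` is the session Databricks provides
    profile = {
        "Schema": [[f.name, f.dataType.simpleString()] for f in df.schema.fields],
        "Total Rows": df.count(),
        "Null Counts": {f"{c}_null_count": df.filter(F.col(c).isNull()).count()
                        for c in df.columns},
        "Distinct Counts": {f"{c}_distinct_count": df.select(c).distinct().count()
                            for c in df.columns},
    }

    numeric_cols = [f.name for f in df.schema.fields
                    if f.dataType.simpleString() in ("int", "bigint", "float", "double")]
    if numeric_cols:
        profile["Numeric Stats"] = df.select(
            *[F.min(c).alias(f"{c}_min") for c in numeric_cols],
            *[F.max(c).alias(f"{c}_max") for c in numeric_cols],
            *[F.mean(c).alias(f"{c}_mean") for c in numeric_cols],
            *[F.stddev(c).alias(f"{c}_std_dev") for c in numeric_cols],
            *[F.skewness(c).alias(f"{c}_skewness") for c in numeric_cols],
            *[F.kurtosis(c).alias(f"{c}_kurtosis") for c in numeric_cols],
        ).first().asDict()
    return profile

# Example: profile = profile_table("main.sales.order_items")
```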
Figure: Essential Components of Data Profiling
2. Formulating a Dynamic LLM Prompt
To ensure precise and context-aware rule generation, a structured prompt is created for the LLM. The prompt includes:
- Dataset Profile: A structured summary of the extracted metadata and statistics.
- Additional Instructions: Any user-specified constraints or domain-specific rules.
- Rule Format Guidelines: A required output structure, so that every generated rule follows a standard, machine-readable format.
- Regex-based Format Validation: Ensuring pattern compliance using regex rather than relying on built-in format checks.
This structured prompt ensures that the LLM generates high-quality, standardized validation rules.
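As an illustration, the prompt could be assembled from the profile dictionary along these lines; the helper name `build_prompt` is hypothetical, and the wording is abbreviated from the full prompt shown in the Sample Output section below.

```python
import json

def build_prompt(profile: dict, additional_info: str = "") -> str:
    """Assemble the LLM prompt: role, data profile, extra context, and output-format rules."""
    return "\n".join([
        "You are a data quality expert. Your task is to suggest appropriate "
        "validation rules for a dataset.",
        "### Data Profile",
        json.dumps(profile, indent=2, default=str),
        "### Additional Information:",
        additional_info,
        "### Instructions:",
        "- Identify the best validation rules for each column based on type and sample values.",
        "- Use Great Expectations format only (e.g., expect_column_values_to_be_of_type, "
        "expect_column_values_to_match_regex, unique, not_null).",
        "- Express every format check as a regex check with a regex value.",
        "- Return only JSON, one object per column/expectation pair, using the keys "
        "column_name, expectation_type, regex, column_type, min_value, max_value.",
    ])
```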
3. LLM-Driven Rule Generation
The LLM processes the prompt and outputs validation rules in JSON format. Each rule follows a structured schema:
- Column Name: The target column for validation.
- Expectation Type: The validation rule (e.g., uniqueness, non-null constraint, format match).
- Regex Pattern: Where applicable, the pattern used to validate the column's format.
- Min/Max Value: Numeric constraints where relevant.
- Data Type: The expected data type for the column.
The JSON output is structured to ensure that each expectation is clearly defined, allowing for seamless integration into data validation frameworks.
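A sketch of this call is shown below, assuming the model is served behind a Databricks Model Serving chat endpoint that accepts an OpenAI-style `messages` payload; the endpoint name `meta_llama_3_70b_instruct`, the response shape, and the defensive JSON extraction are illustrative assumptions rather than details fixed by the workflow.

```python
import json
import re
import requests

def generate_dq_rules(prompt: str, host: str, token: str,
                      endpoint: str = "meta_llama_3_70b_instruct") -> list:
    """Call a chat-style model serving endpoint and parse the JSON rules it returns."""
    response = requests.post(
        f"{host}/serving-endpoints/{endpoint}/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"messages": [{"role": "user", "content": prompt}],
              "temperature": 0.0, "max_tokens": 2048},
        timeout=120,
    )
    response.raise_for_status()
    content = response.json()["choices"][0]["message"]["content"]  # assumed OpenAI-style chat response

    # The prompt asks for JSON only, but strip any stray text around it defensively.
    match = re.search(r"\[.*\]|\{.*\}", content, re.DOTALL)
    rules = json.loads(match.group(0)) if match else []
    return rules if isinstance(rules, list) else [rules]
```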
4. Storing and Applying DQ Rules in Unity Catalog
Once the data quality rules are generated, Databricks Unity Catalog serves as a central repository to store and manage them:
- Metadata-driven governance: The rules can be linked to specific datasets and tables.
- Secure rule storage: Access to the validation rules can be controlled using Unity Catalog’s permission model.
- Versioning and lineage tracking: Unity Catalog keeps track of rule updates and historical changes.
By storing validation rules in Unity Catalog, organizations can integrate them seamlessly with existing data quality frameworks and reuse them across different data processing pipelines.
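As a sketch, the generated rules could be appended to a governed Delta table such as the illustrative `main.governance.dq_rules` below; the column layout and audit fields are assumptions, not a prescribed schema.

```python
from pyspark.sql import functions as F

RULE_COLUMNS = ["column_name", "expectation_type", "regex",
                "column_type", "min_value", "max_value"]

def store_rules(rules: list, dataset_name: str,
                rules_table: str = "main.governance.dq_rules") -> None:
    """Append generated rules to a Unity Catalog Delta table with basic audit columns."""
    # Normalise each rule so every row carries the same set of (string-typed) columns.
    rows = [tuple(str(r[k]) if k in r else None for k in RULE_COLUMNS) for r in rules]
    schema = ", ".join(f"{c} string" for c in RULE_COLUMNS)
    (spark.createDataFrame(rows, schema)        # `spark` is the Databricks session
         .withColumn("dataset_name", F.lit(dataset_name))
         .withColumn("generated_at", F.current_timestamp())
         .write.mode("append").saveAsTable(rules_table))
```

Storing the rules as a plain Delta table keeps them queryable, versioned through Delta history, and subject to Unity Catalog's normal grants.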
5. Orchestrating Data Quality Validation using Databricks Workflows
To automate and operationalize the entire data quality process, we integrate all the steps into a Databricks Workflow. This ensures seamless execution and scheduling of each step.
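One way to wire these steps together is to register a multi-task job through the Jobs API 2.1, as sketched below; the notebook paths, task names, cluster ID placeholder, and schedule are all illustrative choices rather than the workflow's actual configuration.

```python
import requests

def create_dq_workflow(host: str, token: str) -> int:
    """Register a multi-task Databricks job that chains the DQ steps (Jobs API 2.1)."""
    steps = ["profile_dataset", "generate_dq_rules", "store_dq_rules", "apply_dq_rules"]
    tasks, previous = [], None
    for step in steps:
        task = {
            "task_key": step,
            "notebook_task": {"notebook_path": f"/Workspace/dq/{step}"},  # illustrative paths
            "existing_cluster_id": "<cluster-id>",                        # placeholder
        }
        if previous:
            task["depends_on"] = [{"task_key": previous}]   # run the steps sequentially
        tasks.append(task)
        previous = step

    payload = {
        "name": "llm_dq_rule_generation",
        "tasks": tasks,
        "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
    }
    response = requests.post(f"{host}/api/2.1/jobs/create",
                             headers={"Authorization": f"Bearer {token}"},
                             json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["job_id"]
```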
6. Testing with Existing Data Quality Frameworks
The generated validation rules are then tested with existing data quality frameworks such as:
- Databricks Workflow Pipelines: Applying the rules dynamically to validate datasets in real time.
- Delta Live Tables (DLT): Enforcing quality checks on streaming and batch data ingestion (a sketch follows this list).
- Custom Data Quality Framework: Identifying data trends, drift, or deviations through week-on-week comparisons against predefined threshold limits, which helps surface anomalies and data quality issues early.
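For the DLT integration mentioned above, the sketch below shows how the stored rules could be translated into `@dlt.expect_all` expectations; the table names, the assumed `dataset_name` filter, and the `rule_to_sql` translation are illustrative assumptions layered on top of the documented DLT Python decorators.

```python
import dlt
from pyspark.sql import functions as F

RULES_TABLE = "main.governance.dq_rules"        # assumed rules table from the storage step
SOURCE_TABLE = "main.sales.order_items_raw"     # assumed raw input table

def rule_to_sql(rule: dict) -> str:
    """Translate one generated rule into a SQL boolean expression that DLT can enforce."""
    col, kind = rule["column_name"], rule["expectation_type"]
    if kind == "not_null":
        return f"{col} IS NOT NULL"
    if kind in ("expect_values_to_match_regex", "expect_column_values_to_match_regex"):
        regex = rule["regex"].replace("\\", "\\\\")   # escape for the SQL string literal
        return f"CAST({col} AS STRING) RLIKE '{regex}'"
    if kind == "expect_column_values_to_be_between":
        return f"{col} BETWEEN {rule['min_value']} AND {rule['max_value']}"
    return "true"   # type expectations are already enforced by the table schema

rules = [row.asDict() for row in
         spark.read.table(RULES_TABLE)
              .where(F.col("dataset_name") == "order_items")   # assumed dataset name
              .collect()]
expectations = {f"{r['column_name']}_{r['expectation_type']}": rule_to_sql(r) for r in rules}

@dlt.table(name="order_items_validated")
@dlt.expect_all(expectations)    # flag rows that violate any generated rule
def order_items_validated():
    return spark.read.table(SOURCE_TABLE)
```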
Figure: Data Quality Rule Generation Framework
Sample Output
1. Generated Prompt
You are a data quality expert. Your task is to suggest appropriate validation rules for a dataset.

### Data Profile
{
  "Schema": [["order_id", "string"], ["order_item_id", "bigint"], ["product_id", "string"], ["seller_id", "string"], ["shipping_limit_date", "timestamp"], ["price", "double"], ["freight_value", "double"]],
  "Total Rows": 112650,
  "Null Counts": {"order_id_null_count": 0, "order_item_id_null_count": 0, "product_id_null_count": 0, "seller_id_null_count": 0, "shipping_limit_date_null_count": 0, "price_null_count": 0, "freight_value_null_count": 0},
  "Non-Null Counts": {"order_id": 112650, "order_item_id": 112650, "product_id": 112650, "seller_id": 112650, "shipping_limit_date": 112650, "price": 112650, "freight_value": 112650},
  "Distinct Counts": {"order_id_distinct_count": 99770, "order_item_id_distinct_count": 21, "product_id_distinct_count": 32938, "seller_id_distinct_count": 3080, "shipping_limit_date_distinct_count": 95982, "price_distinct_count": 5988, "freight_value_distinct_count": 7026},
  "Numeric Stats": {"price_min": 0.85, "freight_value_min": 0.0, "price_max": 6735.0, "freight_value_max": 409.68, "price_mean": 120.65373901477311, "freight_value_mean": 19.99031992898562, "price_std_dev": 183.6339280502597, "freight_value_std_dev": 15.806405412296998, "price_q1": 39.9, "freight_value_q1": 13.08, "price_median": 74.99, "freight_value_median": 16.26, "price_q3": 134.9, "freight_value_q3": 21.15, "price_skewness": 7.923102760715804, "freight_value_skewness": 5.639794522020189, "price_kurtosis": 120.82288192024787, "freight_value_kurtosis": 59.785545672106004},
  "Top Values": {"order_id": "8272b63d03f5f79c56e9e4120aec44ef", "order_item_id": 1, "product_id": "aca2eb7d00ea1a7b8ebd4e68314663af", "seller_id": "6560211a19b47992c3666cc44a7e94c0", "shipping_limit_date": "2017-07-21T18:25:23", "price": 59.9, "freight_value": 15.1}
}

### Additional Information:
Each column in the dataset can have multiple expectation types (e.g., unique and not_null).

### Instructions:
- Identify the best validation rules for each column based on type and sample values.
- Use **Great Expectations format** only (e.g., expect_column_values_to_be_of_type, expect_column_values_to_match_regex, unique, not_null).
- Return only JSON.
- If a column has multiple expectation types, emit each one as a separate JSON structure (for example, customer_id can have one structure per expectation type).
- Every format check must be a regex check with a regex value, and its expectation_type must be expect_values_to_match_regex, not expect_column_values_to_match_strftime_format.
- Return results in JSON format exactly as follows, with no extra summary or commentary:
{"column_name": "column_name", "expectation_type": "expectation_type", "regex": "regex_value", "column_type": "data_type", "min_value": "min_value", "max_value": "max_value"}
2. Generated Data Quality Rules
```json
[
  {"column_name": "order_id", "expectation_type": "not_null", "column_type": "string"},
  {"column_name": "order_id", "expectation_type": "expect_values_to_match_regex", "regex": "^[a-f0-9]{32}$", "column_type": "string"},
  {"column_name": "order_item_id", "expectation_type": "not_null", "column_type": "bigint"},
  {"column_name": "order_item_id", "expectation_type": "expect_column_values_to_be_of_type", "column_type": "bigint"},
  {"column_name": "order_item_id", "expectation_type": "expect_column_values_to_be_between", "column_type": "bigint", "min_value": "1", "max_value": "21"},
  {"column_name": "product_id", "expectation_type": "not_null", "column_type": "string"},
  {"column_name": "product_id", "expectation_type": "expect_values_to_match_regex", "regex": "^[a-f0-9]{32}$", "column_type": "string"},
  {"column_name": "seller_id", "expectation_type": "not_null", "column_type": "string"},
  {"column_name": "seller_id", "expectation_type": "expect_values_to_match_regex", "regex": "^[a-f0-9]{32}$", "column_type": "string"},
  {"column_name": "shipping_limit_date", "expectation_type": "not_null", "column_type": "timestamp"},
  {"column_name": "shipping_limit_date", "expectation_type": "expect_column_values_to_match_regex", "regex": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}$", "column_type": "timestamp"},
  {"column_name": "price", "expectation_type": "not_null", "column_type": "double"},
  {"column_name": "price", "expectation_type": "expect_column_values_to_be_of_type", "column_type": "double"},
  {"column_name": "price", "expectation_type": "expect_column_values_to_be_between", "column_type": "double", "min_value": "0.85", "max_value": "6735.0"},
  {"column_name": "freight_value", "expectation_type": "not_null", "column_type": "double"},
  {"column_name": "freight_value", "expectation_type": "expect_column_values_to_be_of_type", "column_type": "double"},
  {"column_name": "freight_value", "expectation_type": "expect_column_values_to_be_between", "column_type": "double", "min_value": "0.0", "max_value": "409.68"}
]
```
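To illustrate how rules in this shape can be applied outside DLT, the sketch below translates them into per-rule violation counts with plain PySpark; the function name and the file path in the usage comment are hypothetical.

```python
from pyspark.sql import functions as F

def validation_report(df, rules: list) -> list:
    """Return (rule_name, violation_count) pairs for the generated rules."""
    report = []
    for r in rules:
        col, kind = r["column_name"], r["expectation_type"]
        if kind == "not_null":
            failed = df.filter(F.col(col).isNull()).count()
        elif kind in ("expect_values_to_match_regex", "expect_column_values_to_match_regex"):
            failed = df.filter(~F.col(col).cast("string").rlike(r["regex"])).count()
        elif kind == "expect_column_values_to_be_between":
            failed = df.filter(~F.col(col).between(float(r["min_value"]),
                                                   float(r["max_value"]))).count()
        else:
            continue   # type expectations are covered by the table schema itself
        report.append((f"{col}_{kind}", failed))
    return report

# Hypothetical usage, assuming the rules above were saved as JSON:
# import json
# rules = json.load(open("/dbfs/tmp/dq_rules.json"))
# df = spark.read.table("main.sales.order_items")
# for name, failed in validation_report(df, rules):
#     print(name, failed)
```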
Benefits of LLM-Driven Data Quality Rule Generation
1. Automation
- Reduces manual effort in defining data quality rules.
- Reduces the risk of human error in rule creation.
2. Scalability
- Can handle large datasets with varying structures.
- Supports multiple data sources and formats.
3. Consistency
- Ensures uniform application of validation rules.
- Avoids discrepancies in data governance practices.
4. Customizability
- Allows for domain-specific rules via additional instructions.
- Adapts dynamically to changes in schema and business requirements.
5. Improved Governance with Unity Catalog
- Centralized rule repository: Ensures all teams follow consistent validation policies.
- Secure access controls: Manages who can read, modify, or apply DQ rules.
- Auditability and lineage tracking: Helps trace rule applications across datasets and transformations.
By integrating LLMs like Meta LLaMA with Databricks Unity Catalog, organizations can achieve a more automated, scalable, and intelligent approach to data governance. This methodology enhances accuracy, reduces human intervention, and ensures that datasets adhere to high-quality standards while benefiting from secure, centralized rule management in Unity Catalog.