Streamlit is an open-source Python library designed to effortlessly create interactive web applications for data science and machine learning endeavors. It transforms data scripts into shareable web apps with minimal coding, empowering users to build dashboards, visualize data, and craft interactive plots without the need for extensive web development expertise.
Snowpark, an innovative technology from Snowflake, a cloud-based data warehousing platform, allows data engineers, scientists, and developers to write code in familiar programming languages such as Java, Scala, and Python. This code can then be executed directly within Snowflake, bridging the gap between data processing and warehousing and enabling more flexible and integrated data pipelines.
Combining the capabilities of Streamlit and Snowpark offers a distinctive advantage in the realm of interactive data applications. This amalgamation allows the creation of visually appealing, highly functional applications that seamlessly integrate real-time data analysis and processing. Such a synergy unlocks new possibilities for data-driven decision-making, facilitating effective collaboration between data professionals and developers.
Data engineering and migration projects often necessitate the creation of multiple tables, including SCD Type-1 and Type-2 tables. Crafting code for each table individually can be laborious and prone to errors, such as data type mismatches or column order issues. We have explored a streamlined approach that simplifies this process through the utilization of a dynamic script generation framework. Leveraging the capabilities of Streamlit and Snowpark, we have developed an application designed to generate SQL scripts facilitating CDC checks, generate Surrogate Keys if necessary, and facilitate data loading into the final target table. This blog post provides comprehensive insights into the usage and development of this innovative solution.
The Challenge
When confronted with numerous tables to migrate, developers encounter:
- Time-consuming repetition in coding similar patterns for all tables to capture data changes and merge data into target tables.
- Increased likelihood of human error and the need for rework due to managing data type mismatches and column order discrepancies.
The Approach
To surmount these challenges, we have devised a framework that automates the script generation process. This framework utilizes Snowpark for Python as the native language and leverages Streamlit for the user interface (UI). By dynamically generating scripts based on user inputs, the framework eliminates the need for manual coding, thereby reducing the time required to prepare scripts.
Solution Overview
The solution consists of a generic notebook that generates the necessary scripts for SCD Type-1 and Type-2 tables. Here’s a high-level overview of the solution’s key components and functionalities:
- DDL Generation: The framework generates Data Definition Language (DDL) scripts for the Stage, CDC, and Surrogate_Key tables.
- Script Generation: Dynamically generates SQL scripts for performing CDC, generating surrogate keys, and merging the final data into the target table, tailored specifically for Snowflake ETL processes.
- Snowflake Integration: Once generated, the scripts can be seamlessly executed in Snowflake to load data from the Stage to the Target table, eliminating the need for manual script preparation and enhancing efficiency.
Getting the Required Inputs
To generate the scripts, the framework requires several inputs:
- SOURCE_DUPLICATE_FLAG: Indicates whether duplicates are expected from the source.
- PRIMARY_KEYS: Specifies the primary keys of the table.
- SURROGATE_KEY: Specifies the surrogate key of the table.
- CDC_START_DATE: Represents the first occurrence of a record, used for SCD Type-2 tables.
- CDC_END_DATE: Represents the latest occurrence of a record, used for SCD Type-2 tables.
- LOAD_DATE: Represents the date/time at which the record was first inserted into the target table.
- UPDATE_DATE: Represents the date/time at which the existing record was updated.
- STAGE_DB: Specifies the DB.SCHEMA details for the stage tables.
- TARGET_DDL: Provides the DDL of the target table.
- SCD-TYPE: Specifies the SCD type of the target table.
- SNOWFLAKE_ACCOUNT: Specifies the Snowflake server account details.
- SNOWFLAKE_USER: Specifies the Snowflake login username.
- SNOWFLAKE_PASSWORD: Specifies the Snowflake login password.
- SNOWFLAKE_DB: Specifies the Target Database Name.
- SNOWFLAKE_SCHEMA: Specifies the Target Schema Name.
Impact
- Increased Developer Efficiency: Automation of script generation allows developers to focus more on business logic and higher-value tasks.
- Drastically Reduced Code Building Time: The time taken to build SQL code for SCD tables is significantly reduced, leading to faster project completion.
Limitations
- Manual Source-to-Stage Script: The framework does not automatically generate the source-to-stage script, as the transformation process may vary from table to table. Developers need to write this script manually.
- The TARGET DDL input should not have any leading or trailing spaces near the column names or table name.
- Avoid having any leading or trailing spaces in the input.
Conclusion
Data engineering and migration projects often involve the development of SCD Type-1 and Type-2 tables, which can be time-consuming and error-prone when writing repetitive code. However, with the dynamic script generation framework discussed in this blog, developers can significantly streamline the process. By automating the generation of scripts for CDC, surrogate key generation, and data merging, this framework reduces the time required to prepare scripts, allowing developers to focus more on business logic. With Snowflake integration and a user-friendly Streamlit UI, this solution empowers data engineers and simplifies the overall data migration journey.