

Data quality, as the name suggests, refers to the quality of our data. Quality should be defined based on your project requirements. It can be as simple as ensuring that a certain column only contains the allowed values or falls within a given range of values, up to more complex cases like requiring a certain column to match a specific regex pattern, fall within a standard deviation range, etc.

The quality of your data will affect the ability of your company to make intelligent and correct business decisions. For example, a business user may want to look at customer attribution using a third-party marketing dataset. If, for whatever reason, the third-party data source is faulty and this goes unnoticed, it will lead to the business user making decisions based on wrong data. Depending on how the data is used, this can cause significant monetary damage to your business.

Another example would be machine learning systems. This is even trickier, because you do not have the intuition of a human in the loop: if there was an issue with a feature (say feature scaling was not done) and the ML model uses this feature, all the predictions will be way off, since your model is using unscaled data. And if you have no ML model monitoring set up, this can cause significant damage (monetary or other metric based) to your business over a long period.

In this tutorial we will build a simple data test scenario using an extremely popular data testing framework called Great Expectations.

Pre-requisites: docker (if you have Windows Home you might need to look here).

You can use great_expectations as you would any other Python library. Open a Python REPL by typing python in your terminal and try a single expectation, such as expect_column_values_to_be_unique.
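Below is a minimal sketch of what that REPL session could look like, assuming the classic (pre-1.0) great_expectations API that exposes expect_* methods directly on a wrapped DataFrame; the sample DataFrame and column names are made up for illustration and are not from the original article.

```python
import pandas as pd
import great_expectations as ge

# hypothetical sample data; the original article uses its own dataset
df = pd.DataFrame({"order_id": [1, 2, 3, 3], "customer_name": ["a", "b", "c", "d"]})

# wrap the DataFrame so it gains the expect_* methods (classic great_expectations API)
ge_df = ge.from_pandas(df)

# run a single expectation; the duplicated order_id value 3 should make it fail
validation = ge_df.expect_column_values_to_be_unique("order_id")
print(validation)  # JSON-like output with a "success" flag and a "result" section
```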
The result section of the response JSON has some metadata information about the column and what percentage of values failed the expectation. But note that here we are testing by loading the data within the application; in the next section we will see how we can create expectations that are run on databases.

Now that we have a basic understanding of what great_expectations does, let's look at how to set it up and write test cases for data from a Postgres database that can be grouped together and run. Start a local Postgres database in Docker from your terminal.

The expectation suite that great_expectations generates contains expectations such as "expect_table_row_count_to_be_between" and "expect_column_value_lengths_to_be_between", along with the following note:

"# This is an _example_ suite

- This suite was made by quickly glancing at 1000 rows of your data.
- This is **not a production suite**. It is meant to show examples of expectations.
- Because this suite was auto-generated using a very basic profiler that does not know your data like you do, many of the expectations may not be meaningful."

As the content says, this is a simple expectation configuration created by great_expectations based on scanning the first 1000 rows of your dataset.
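To make those two expectation types concrete, here is a small illustrative sketch, again assuming the classic great_expectations API; the CSV path, column name, and bounds are hypothetical and are not taken from the article or from an actual generated suite.

```python
import great_expectations as ge

# hypothetical file and column; the article profiles a Postgres table instead
ge_df = ge.read_csv("data/sample_orders.csv")

# the auto-generated suite contains expectations of these types; the bounds here are invented
row_count = ge_df.expect_table_row_count_to_be_between(min_value=1, max_value=100000)
name_length = ge_df.expect_column_value_lengths_to_be_between(
    "customer_name", min_value=1, max_value=50
)

print(row_count)
print(name_length)
```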
Let's view the generated suite in the data docs site. Since these are static websites, you can open them directly in your web browser; let's open the file /great_expectations/uncommitted/data_docs/local_site/index.html. When you click on the expectation suite, you will see the sample expectations shown in human-readable format in the UI.

Now let's create our own expectation file, and call it error.

Now that we have seen how to run tests on our data, we can run our checkpoints from bash or from a Python script (generated using great_expectations checkpoint script first_checkpoint). This lends itself to easy integration with scheduling tools like airflow, cron, prefect, etc.
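As a rough, hand-written sketch (not the script that the great_expectations checkpoint script command actually generates), a Python runner for the checkpoint might look like the following; it assumes the classic v3-style DataContext API, and only the checkpoint name first_checkpoint comes from the article.

```python
import sys

import great_expectations as ge

# load the project configuration from the great_expectations/ directory
context = ge.get_context()

# run the checkpoint we created earlier
result = context.run_checkpoint(checkpoint_name="first_checkpoint")

# rebuild the static data docs so the latest validation results are browsable
context.build_data_docs()

# exit non-zero on failure so a scheduler can alert on it
if not result["success"]:
    sys.exit(1)
```

Because the script exits with a non-zero status when validation fails, a scheduler such as airflow or cron can treat a failed data quality check like any other failed task.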

When deploying in production, you can store any sensitive information (credentials, validation results, etc.) that lives in the uncommitted folder in cloud storage systems, databases, or other data stores, depending on your infrastructure setup. Great Expectations has a lot of options for storage, as shown here.

When not to use a data quality framework

This tool is great and provides a lot of advanced data quality validation functions, but it adds another layer of complexity to your infrastructure that you will have to maintain and troubleshoot in case of errors. It would be wise to use it only when needed. Do not use a data quality framework if you only have a few simple tests, or if simple SQL-based tests run at post-load time work for your use case.

Hope this article gives you an idea of how to use the great_expectations data quality framework, when to use it, and when not to use it.
