Quickly Compare Datasets (Python)

abhinaya rajaram
3 min readSep 25, 2023

Knowing the differences between two datasets

In data analysis and integration tasks, it’s common to encounter the need to compare and reconcile two datasets. These datasets might come from different sources, have varying structures, or represent the same data at different points in time.

Accurately identifying discrepancies, inconsistencies, or matching records across datasets is crucial for data quality assessment, data cleaning, and gaining meaningful insights. The Python package DataCompy provides a valuable solution for this task.

Using DataCompy, data analysts and engineers can efficiently compare datasets, identify discrepancies, and take appropriate actions to enhance data quality and ensure reliable analysis and decision-making.

DataCompy is a Python package that provides functionalities to compare two datasets and identify discrepancies or differences between them.

Below are steps to use DataCompy for comparing datasets:



from datacompy import Compare
import pandas as pd

# Sample data
data1 = {'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']}
data2 = {'ID': [1, 2, 4], 'Name': ['Alice', 'Bobby', 'David']}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)



compare = datacompy.Compare(
df1,
df2,
join_columns= ['ID','ID'], #You can also specify a list of columns eg ['policyID','statecode']
abs_tol=0, #Optional, defaults to 0
rel_tol=0, #Optional, defaults to 0
df1_name='Spark', #Optional, defaults to 'df1'
df2_name='Athena' #Optional, defaults to 'df2'
)
print(compare.report())

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

DataFrame Columns Rows
0 Spark 2 3
1 Athena 2 3

Column Summary
--------------

Number of columns in common: 2
Number of columns in Spark but not in Athena: 0
Number of columns in Athena but not in Spark: 0

Row Summary
-----------

Matched on: id, id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 2
Number of rows in Spark but not in Athena: 1
Number of rows in Athena but not in Spark: 1

Number of rows with some compared columns unequal: 1
Number of rows with all compared columns equal: 1

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 1

Columns with Unequal Values or Types
------------------------------------

Column Spark dtype Athena dtype # Unequal Max Diff # Null Diff
0 name object object 1 0 0

Sample Rows with Unequal Values
-------------------------------

id id name (Spark) name (Athena)
1 2 2 Bob Bobby

Sample Rows Only in Spark (First 10 Columns)
--------------------------------------------

id name
2 3 Charlie

Sample Rows Only in Athena (First 10 Columns)
---------------------------------------------

id name
3 4 David

You can see how we have as an output a comparison summary showing matching records and differences based on the specified key (‘ID’ in this case). You can adjust the parameters and customize the comparison to meet your specific requirements.

Conclusion:

Key advantages include its ease of use and straightforward integration with pandas DataFrames. DataCompy enables users to quickly and accurately identify matching records and discrepancies between datasets, providing essential insights for data validation, data cleaning, and data integration processes. Additionally, the ability to customize comparison parameters, such as join columns and comparison methods, enhances its flexibility and adaptability to various use cases. However, it’s important to note that DataCompy is primarily designed for small to medium-sized datasets and may face performance challenges when dealing with extremely large datasets. Users should be cautious and evaluate its performance based on the size and complexity of their datasets. Despite this limitation, DataCompy remains a valuable tool for efficient and effective dataset comparison and validation in Python.

--

--