How to use the command 'csv-diff' (with examples)
The ‘csv-diff’ command compares two CSV, TSV, or JSON files and prints a human-readable summary of the rows that were added, removed, or changed. This makes it useful for comparing successive versions of a dataset and for checking what an import or export actually altered.
Use case 1: Display a human-readable summary of differences between files using a specific column as a unique identifier
Code:
csv-diff path/to/file1.csv path/to/file2.csv --key=column_name
Motivation:
This use case is handy when comparing two files that share a unique identifier column, such as an ID or a name. Rows are matched by the value in the key column rather than by their position in the file, so a row counts as changed only when its other values differ, and reordering rows does not show up as a difference.
Explanation:
path/to/file1.csv and path/to/file2.csv: These are the paths to the two files that will be compared.
--key=column_name: This argument specifies the column, present in both files, that will be used as a unique identifier for the comparison.
Example output (illustrative values; the exact layout may vary between csv-diff versions):
5 rows changed, 10 rows added, 3 rows removed

5 rows changed

  ID: 1
    Age: "29" => "30"

  ...

10 rows added

  ID: 4
  Name: Alice
  Age: 25

  ...

3 rows removed

  ID: 3
  Name: Charlie
  Age: 45

  ...
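To make the role of the key column concrete, here is a minimal Python sketch of key-based comparison using only the standard library. It is not csv-diff's actual implementation; the function names `load_by_key` and `diff_by_key` and the sample data are hypothetical, and it assumes both files have the same columns.

```python
import csv
import io


def load_by_key(text, key):
    """Parse CSV text into a dict mapping each row's key value to the row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def diff_by_key(old_text, new_text, key):
    """Classify rows as added, removed, or changed by matching on `key`."""
    old, new = load_by_key(old_text, key), load_by_key(new_text, key)
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    changed = {}
    for k in old:
        if k in new and old[k] != new[k]:
            # Record (old value, new value) for each column that differs.
            changed[k] = {
                col: (old[k][col], new[k].get(col))
                for col in old[k]
                if old[k][col] != new[k].get(col)
            }
    return added, removed, changed


# Hypothetical sample data: row 1 changes, row 3 disappears, row 4 appears.
OLD = "ID,Name,Age\n1,Bob,29\n2,Carol,40\n3,Charlie,45\n"
NEW = "ID,Name,Age\n1,Bob,30\n2,Carol,40\n4,Alice,25\n"

added, removed, changed = diff_by_key(OLD, NEW, key="ID")
print(added)    # [{'ID': '4', 'Name': 'Alice', 'Age': '25'}]
print(removed)  # [{'ID': '3', 'Name': 'Charlie', 'Age': '45'}]
print(changed)  # {'1': {'Age': ('29', '30')}}
```

Because rows are looked up by key, the unchanged row with ID 2 is ignored regardless of where it appears in either file.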
Use case 2: Display a human-readable summary of differences between files that includes unchanged values in rows with at least one change
Code:
csv-diff path/to/file1.csv path/to/file2.csv --key=column_name --show-unchanged
Motivation:
Sometimes it is helpful to see not only the values that changed but also the values that stayed the same within each changed row. This gives context for every change, for example the name that belongs to an ID whose age was updated. Rows with no changes at all are still omitted.
Explanation:
--show-unchanged: This argument tells the command to also list the unchanged values of every row that has at least one change.
Example output (illustrative values; the exact layout may vary between csv-diff versions):
5 rows changed, 10 rows added, 3 rows removed

5 rows changed

  ID: 1
    Age: "29" => "30"

    Unchanged:
      Name: "Bob"

  ...

10 rows added

  ID: 4
  Name: Alice
  Age: 25

  ...

3 rows removed

  ID: 3
  Name: Charlie
  Age: 45

  ...
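The idea of reporting unchanged values alongside changes can be sketched in a few lines of Python. This is only an illustration of the concept, not csv-diff's own code; the helper name `split_row` and the sample rows are hypothetical.

```python
def split_row(old_row, new_row, key):
    """Split a changed row's columns into changed and unchanged values."""
    changes, unchanged = {}, {}
    for col, old_value in old_row.items():
        if col == key:
            continue  # the key identifies the row, so it is reported separately
        new_value = new_row.get(col)
        if old_value != new_value:
            changes[col] = (old_value, new_value)
        else:
            unchanged[col] = old_value
    return changes, unchanged


# Hypothetical before/after versions of the same row (matched on "ID").
old_row = {"ID": "1", "Name": "Bob", "Age": "29"}
new_row = {"ID": "1", "Name": "Bob", "Age": "30"}

changes, unchanged = split_row(old_row, new_row, key="ID")
print(changes)    # {'Age': ('29', '30')}
print(unchanged)  # {'Name': 'Bob'}
```

Only rows that produce a non-empty `changes` dict would be reported at all, which is why fully unchanged rows never appear in the output.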
Use case 3: Display a summary of differences between files in JSON format using a specific column as a unique identifier
Code:
csv-diff path/to/file1.csv path/to/file2.csv --key=column_name --json
Motivation:
In some cases, it might be beneficial to have the differences between files in a structured format like JSON. This can be useful for further processing, integration with other systems, or automation purposes.
Explanation:
--json: This argument instructs the command to output the summary of differences in JSON format.
Example output (illustrative values; key names may vary between csv-diff versions):
{
    "added": [
        {
            "ID": "4",
            "Name": "Alice",
            "Age": "25"
        },
        ...
    ],
    "removed": [
        {
            "ID": "3",
            "Name": "Charlie",
            "Age": "45"
        },
        ...
    ],
    "changed": [
        {
            "key": "1",
            "changes": {
                "Age": [
                    "29",
                    "30"
                ]
            }
        },
        ...
    ],
    "columns_added": [],
    "columns_removed": []
}
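For automation, the JSON output can be consumed from a script. The following Python sketch runs the command shown above and counts the entries in each list of the result. The helper names `run_csv_diff` and `count_entries` are hypothetical, and the key names in `SAMPLE` are assumptions about the output shape; check them against what your installed version actually emits.

```python
import json
import subprocess


def run_csv_diff(old_path, new_path, key):
    """Run csv-diff with --json and return the parsed result.

    Assumes csv-diff is installed and on PATH; raises if it exits non-zero.
    """
    result = subprocess.run(
        ["csv-diff", old_path, new_path, f"--key={key}", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def count_entries(diff):
    """Count the entries under each list-valued top-level key."""
    return {
        name: len(value)
        for name, value in diff.items()
        if isinstance(value, list)
    }


# Hypothetical stand-in for real output; the key names are assumptions.
SAMPLE = '{"added": [{"ID": "4"}], "removed": [{"ID": "3"}], "changed": []}'
print(count_entries(json.loads(SAMPLE)))
# {'added': 1, 'removed': 1, 'changed': 0}
```

Since `count_entries` does not hard-code any key names, it keeps working if the tool's JSON gains or renames top-level lists.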
Conclusion:
The ‘csv-diff’ command provides a straightforward way to compare CSV, TSV, or JSON files. Its options let you match rows on a unique identifier column, include the unchanged values of changed rows for context, and emit the result as JSON for further processing. This makes it particularly valuable for data analysis, data integration, and quality assurance.