How to use the command csvstat (with examples)
The csvstat command is a tool included in csvkit that prints descriptive statistics for all columns in a CSV file. It reports information such as the min, max, mean, sum, and number of unique values for each column.
Use case 1: Show all stats for all columns
Code:
csvstat data.csv
Motivation: This use case is helpful when you want to get an overview of all the statistics for each column in a CSV file. It provides a comprehensive summary of the dataset, including minimum and maximum values, mean, standard deviation, and more.
Explanation: The csvstat command is followed by the name of the CSV file to analyze (data.csv in this case). Without any additional options, it reports statistics for every column in the file.
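Example output (condensed to a table for brevity; csvstat itself prints a longer labeled report for each column):
column,mean,sum,min,max
col1,5.5,55,1,10
col2,5.0,50,0,10
col3,5.1,51,0,10
col4,7.2,72,4,10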
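To try this without a dataset of your own, the sketch below writes a small sample file and runs the command on it. The file name sample.csv and its contents are invented for illustration, and the guard simply skips the call when csvkit is not installed.

```shell
# Write a small sample file (hypothetical data, invented for this example).
cat > sample.csv <<'EOF'
col1,col2,col3,col4
1,0,10,4
2,5,3,7
3,10,0,10
EOF

# Print the full statistics report if csvstat is available.
if command -v csvstat >/dev/null 2>&1; then
    csvstat sample.csv
else
    echo "csvstat not found; install csvkit to run this example" >&2
fi
```

The first line of the file is treated as the header row, so the report is labeled with col1 through col4.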
Use case 2: Show all stats for columns 2 and 4
Code:
csvstat -c 2,4 data.csv
Motivation: Sometimes, you may only be interested in analyzing specific columns of a CSV file. This use case allows you to specify the columns you want to include in the analysis, providing statistics for those columns only.
Explanation: The -c option specifies which columns to analyze. It takes a comma-separated list of column numbers (counted from 1) or column names (2,4 in this example). The command then outputs statistics for only those columns, in the order they were given.
Example output (condensed to a table for brevity):
column,mean,sum,min,max
col2,5.0,50,0,10
col4,7.2,72,4,10
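Column numbers are easy to miscount in a wide file, so it can help to list them first. The sketch below assumes a made-up sample.csv and uses the -n option to print column names with their indices before selecting columns by position and then by name.

```shell
# Hypothetical sample file, invented for this example.
cat > sample.csv <<'EOF'
id,name,score,rating
1,alice,10,4
2,bob,5,7
3,carol,0,10
EOF

if command -v csvstat >/dev/null 2>&1; then
    # List column names alongside their 1-based indices.
    csvstat -n sample.csv
    # Select the second and fourth columns by position...
    csvstat -c 2,4 sample.csv
    # ...or by header name; in this file both calls pick the same columns.
    csvstat -c name,rating sample.csv
else
    echo "csvstat not found; install csvkit to run this example" >&2
fi
```

Selecting by name keeps a script working if columns are later reordered, while selecting by number works even when headers are long or awkward to type.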
Use case 3: Show sums for all columns
Code:
csvstat --sum data.csv
Motivation: In some cases, you may only be interested in the sum of values for each column in a CSV file. This use case allows you to generate a concise output consisting of column names and their corresponding sums.
Explanation: The --sum option restricts the output to the sum of each column in the CSV file.
Example output:
  1. col1: 55
  2. col2: 50
  3. col3: 51
  4. col4: 72
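Other statistics can be requested the same way, one flag per call. The sketch below assumes a made-up sample.csv and runs --sum, --mean, and --max in turn; the figures in the comments describe the sample data only.

```shell
# Hypothetical sample file, invented for this example.
cat > sample.csv <<'EOF'
id,name,score,rating
1,alice,10,4
2,bob,5,7
3,carol,0,10
EOF

if command -v csvstat >/dev/null 2>&1; then
    # One statistic per invocation, reported for every column.
    csvstat --sum sample.csv    # the score column (10, 5, 0) sums to 15
    csvstat --mean sample.csv   # the score column averages 5
    csvstat --max sample.csv    # the largest rating is 10
else
    echo "csvstat not found; install csvkit to run this example" >&2
fi
```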
Use case 4: Show the max value length for column 3
Code:
csvstat -c 3 --len data.csv
Motivation: Knowing the length of the longest value in a column is useful when, for example, you are sizing a fixed-width field or looking for unexpectedly long entries.
Explanation: The -c option, followed by a column number or name (3 in this example), selects the column to analyze. The --len option restricts the output to the length of the longest value in that column.
Example output (a single statistic for a single column is printed as a bare value):
2
Use case 5: Show the number of unique values in the “name” column
Code:
csvstat -c name --unique data.csv
Motivation: Counting the distinct values in a column shows how varied it is, for example whether a "name" column contains duplicates.
Explanation: The -c option, followed by a column number or name (name in this example), selects the column to analyze. The --unique option restricts the output to the number of unique values in that column.
Example output (a single statistic for a single column is printed as a bare value):
5
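A unique count is often more informative next to the values behind it. The sketch below assumes a made-up sample.csv with a repeated name and follows --unique with --freq, which lists the most common values in the column.

```shell
# Hypothetical sample file, invented for this example; "alice" appears twice.
cat > sample.csv <<'EOF'
name,city
alice,Paris
bob,Lima
alice,Oslo
carol,Lima
EOF

if command -v csvstat >/dev/null 2>&1; then
    # Count distinct values in the "name" column (alice, bob, carol: 3 here).
    csvstat -c name --unique sample.csv
    # List the most frequent values in the same column.
    csvstat -c name --freq sample.csv
else
    echo "csvstat not found; install csvkit to run this example" >&2
fi
```

If the unique count is lower than the number of rows, as it is here, the frequency list shows which values are repeated.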
Conclusion:
The csvstat command is a versatile tool for analyzing CSV files. It computes a wide range of statistics and lets you restrict the analysis to specific columns or to a single statistic. Whether you need an overview of the whole file or one figure for one column, csvstat is a useful command-line tool.