How to use the command 'datamash' (with examples)
- Linux
- December 25, 2023
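The ‘datamash’ command (GNU Datamash) performs basic numeric, textual, and statistical operations on tabular text data read from standard input. Columns are separated by tabs by default, and each operation is followed by the number of the column it applies to. It is especially useful for quick data analysis and manipulation tasks on the command line.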
Use case 1: Get max, min, mean and median of a single column of numbers
Code:
seq 3 | datamash max 1 min 1 mean 1 median 1
Motivation: This use case is useful when you want a quick summary of a column of numerical data. A single ‘datamash’ invocation can calculate the maximum, minimum, mean, and median of one column.
Explanation:
seq 3: generates a sequence of numbers from 1 to 3, one per line.
max 1: calculates the maximum value in column number 1.
min 1: calculates the minimum value in column number 1.
mean 1: calculates the mean value (average) of column number 1.
median 1: calculates the median value of column number 1 (the middle value when the numbers are sorted in ascending order).
Example output:
3 1 2 2
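As a further illustration, the same four operations can be run on a column with an even number of values, where the median is the average of the two middle values. This is a minimal sketch: the input values are made up for the example, and it assumes GNU Datamash is installed.

```shell
# Four made-up values; with an even count, the median is the
# average of the two middle values (20 and 30).
stats=$(printf '%s\n' 10 20 30 40 | datamash max 1 min 1 mean 1 median 1)

# Results appear in the order the operations were given, separated by tabs.
printf '%s\n' "$stats"
# → 40	10	25	25
```

The results are separated by tabs, which is the default output delimiter of ‘datamash’.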
Use case 2: Get the mean of a single column of float numbers (in locales whose decimal separator is “,”, floats must use “,” and not “.”)
Code:
echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7' | tr '.' ',' | datamash mean 1
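Motivation: ‘datamash’ reads numbers according to the current locale. In locales that use a comma as the decimal separator, input written with a period is rejected as an invalid number, so it must be converted first. This use case is helpful when you work in such a locale and want to calculate the mean of a single column of float numbers. In locales that use a period (such as C or en_US), the ‘tr’ step is unnecessary and would make the input invalid.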
Explanation:
echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7': prints a series of float numbers, each on a new line.
tr '.' ',': replaces all occurrences of ‘.’ with ‘,’ in the input to convert decimal separators.
mean 1: calculates the mean value of column number 1.
Example output (in a locale whose decimal separator is “,”):
3,7
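Whether the ‘tr’ step is needed depends on the active locale. The sketch below shows the alternative approach, assuming ‘datamash’ follows the LC_NUMERIC setting as its manual describes: forcing the C locale for the ‘datamash’ process makes “.” the decimal separator, so the original input can be used without conversion.

```shell
# Force the C locale for datamash only, so "." is the decimal separator
# and the input needs no conversion with tr.
mean=$(printf '%s\n' 1.0 2.5 3.1 4.3 5.6 5.7 | LC_ALL=C datamash mean 1)

# The six values sum to 22.2, so the mean is 22.2 / 6.
printf '%s\n' "$mean"
# → 3.7
```

Setting LC_ALL on the command itself, rather than exporting it, leaves the rest of the shell session in its usual locale.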
Use case 3: Get the mean of a single column of numbers with a given decimal precision
Code:
echo -e '1\n2\n3\n4\n5\n5' | datamash -R number_of_decimals_wanted mean 1
Motivation: When calculating the mean value of a column, you may want to specify the decimal precision of the output. This use case is useful when you have a single column of numbers and you want to round the mean value to a specific number of decimal places.
Explanation:
echo -e '1\n2\n3\n4\n5\n5': prints a series of numbers, each on a new line.
-R number_of_decimals_wanted: rounds numeric output to the given number of decimal places.
mean 1: calculates the mean value of column number 1.
Example output (with number_of_decimals_wanted set to 9):
3.333333333
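For a concrete case, the placeholder can be replaced with an actual number. The following sketch assumes a ‘datamash’ version that supports the -R (--round) option and uses 2 as an arbitrary precision; the C locale is forced only so that the output uses “.” regardless of the system settings.

```shell
# The same six numbers, with the mean rounded to two decimal places.
# The unrounded mean is 20 / 6 = 3.3333...
rounded=$(printf '%s\n' 1 2 3 4 5 5 | LC_ALL=C datamash -R 2 mean 1)

printf '%s\n' "$rounded"
# → 3.33
```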
Use case 4: Get the mean of a single column of numbers ignoring “Na” and “NaN” (literal) strings
Code:
echo -e '1\n2\nNa\n3\nNaN' | datamash --narm mean 1
Motivation: When dealing with data sets, it is common to encounter missing values represented as “Na” or “NaN”. This use case allows you to calculate the mean of a column of numbers while ignoring these missing values.
Explanation:
echo -e '1\n2\nNa\n3\nNaN': prints a series of numbers and the literal strings “Na” and “NaN”, each on a new line.
--narm: instructs the ‘datamash’ command to ignore the literal strings “Na” and “NaN”.
mean 1: calculates the mean value of column number 1.
Example output:
2
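To see what --narm changes, the same input can be run with and without the option. This is a sketch under the assumption that ‘datamash’ exits with a non-zero status when it meets a value it cannot parse as a number, which is the expected behaviour for “Na” here.

```shell
# Without --narm, the non-numeric "Na" is expected to abort the run.
if ! printf '%s\n' 1 2 Na 3 NaN | datamash mean 1 2>/dev/null; then
    echo "datamash failed on the missing-value markers"
fi

# With --narm, the markers are skipped and the mean covers 1, 2 and 3.
clean_mean=$(printf '%s\n' 1 2 Na 3 NaN | datamash --narm mean 1)
printf '%s\n' "$clean_mean"
# → 2
```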
Conclusion:
The ‘datamash’ command provides a convenient way to perform basic numeric, textual, and statistical operations on input textual data files. With its various options and operations, it can be a valuable tool for data analysis and manipulation tasks.