How to Use the Command 'datamash' (with examples)
- Linux
- December 17, 2024
The datamash command is a versatile tool for performing basic numeric, textual, and statistical operations on textual input files. Designed for data selection and aggregation, it offers an efficient way to process and analyze data directly from the command line. datamash can compute statistics such as mean, median, min, and max, count frequencies, and group data by specific identifiers.
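The grouping and counting mentioned above can be sketched as follows. The two-column sample data (a label and a value, separated by tabs) is made up for illustration; -s sorts the input so that -g 1 can group on the first column.

```shell
# Hypothetical sample data: a label and a value per line, separated by tabs
# (the default field delimiter in datamash).
data=$(printf 'a\t1\nb\t2\na\t3\n')

# -s sorts the input before grouping; -g 1 groups rows by column 1.
# For each group, count the rows and sum the values in column 2.
grouped=$(printf '%s\n' "$data" | datamash -s -g 1 count 2 sum 2)
printf '%s\n' "$grouped"
```

With this input, the result has one line per label: a with a count of 2 and a sum of 4, and b with a count of 1 and a sum of 2.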
Use case 1: Get max, min, mean, and median of a single column of numbers
Code:
seq 3 | datamash max 1 min 1 mean 1 median 1
Motivation:
This command is useful for quickly obtaining summary statistics from a numeric dataset, a common requirement in data analysis. With datamash, you can calculate key statistics on the fly without a spreadsheet application or a custom script.
Explanation:
- seq 3: Generates a sequence of numbers from 1 to 3, serving as our dataset for basic statistical operations.
- |: Pipe operator that takes the output of the seq command and uses it as input for datamash.
- datamash: The command-line tool for data processing.
- max 1 min 1 mean 1 median 1: These are the operations and the column number to apply them to. The operations are:
  - max 1: Calculates the maximum value in the first column.
  - min 1: Calculates the minimum value in the first column.
  - mean 1: Computes the mean (average) of the values in the first column.
  - median 1: Finds the median value in the first column.
Example Output:
3 1 2 2
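When several statistics are printed on one line, it can be hard to tell which value belongs to which operation. A small sketch, assuming a datamash version that supports the --header-out option, prints a header line that labels each output column above the values:

```shell
# --header-out adds a first line that labels each output column
# with the operation that produced it.
labeled=$(seq 3 | datamash --header-out max 1 min 1 mean 1 median 1)
printf '%s\n' "$labeled"
```

The second line of the output holds the same values as the plain command; only the header line is new.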
Use case 2: Get the mean of a single column of float numbers (floats must use the locale's decimal separator, here “,” and not “.”)
Code:
echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7' | tr '.' ',' | datamash mean 1
Motivation:
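This command is needed when datamash runs under a locale whose decimal separator is a comma (as in many European locales) while the data uses periods. datamash reads and prints numbers according to the current locale, so in such an environment the periods must be converted to commas before the values are accepted as floats. Under a locale that uses a period, such as C or en_US, the conversion is unnecessary, and comma-separated decimals would be rejected as invalid numbers.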
Explanation:
- echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7': Prints a list of float numbers, each on a new line.
- |: Pipe operator that directs the output of echo to tr.
- tr '.' ',': Translates periods to commas, changing the decimal separator to the one the locale expects.
- |: Pipe operator again, passing the modified data to datamash.
- datamash mean 1: Computes the mean of the numbers in the first column, which now use the comma as the decimal separator.
Example Output:
3,7
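Whether the conversion is needed depends on the locale datamash runs under. As a sketch, assuming the standard C locale (which uses a period as the decimal separator), the same data can be averaged without tr by setting LC_ALL for that one command:

```shell
# Force the C locale for this invocation only, so datamash expects "." as
# the decimal separator and the input needs no conversion.
mean_c=$(printf '1.0\n2.5\n3.1\n4.3\n5.6\n5.7\n' | LC_ALL=C datamash mean 1)
printf '%s\n' "$mean_c"   # prints 3.7 (period instead of comma)
```

Setting the variable on the command itself leaves the locale of the rest of the shell session unchanged.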
Use case 3: Get the mean of a single column of numbers with a given decimal precision
Code:
echo -e '1\n2\n3\n4\n5\n5' | datamash -R 2 mean 1
Motivation:
In data analysis, it is often necessary to present results with a specific decimal precision to meet reporting standards or improve readability. This command illustrates how to define the precision of the output result directly from the command line without post-processing the results further.
Explanation:
- echo -e '1\n2\n3\n4\n5\n5': Produces a list of integers, each appearing on a new line.
- |: Pipe operator connecting echo to datamash.
- datamash -R 2 mean 1: Executes datamash with:
  - -R 2: Sets the decimal precision of the output result to two decimal places.
  - mean 1: Computes the mean of the first column.
Example Output:
3.33
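To see how the precision argument changes the result, the same input can be run at other precisions. This is a minimal sketch; the loop and the variable names are only illustrative, and LC_ALL=C is set so the decimal separator is a period regardless of the user's locale.

```shell
# Print the mean of the same six integers with 1 and then 3 decimal places.
rounded=$(
  for digits in 1 3; do
    printf '1\n2\n3\n4\n5\n5\n' | LC_ALL=C datamash -R "$digits" mean 1
  done
)
printf '%s\n' "$rounded"   # prints 3.3, then 3.333
```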
Use case 4: Get the mean of a single column of numbers ignoring “Na” and “NaN” (literal) strings
Code:
echo -e '1\n2\nNa\n3\nNaN' | datamash --narm mean 1
Motivation:
Data cleaning is a critical step in analysis, and dealing with missing or non-numeric values is a common issue. Using this command ensures that calculations are performed only on actual numeric values, skipping any ‘Na’ or ‘NaN’ entries that would otherwise break the calculation.
Explanation:
- echo -e '1\n2\nNa\n3\nNaN': Outputs a series of numbers with some ‘Na’ and ‘NaN’ placeholders, simulating a dataset with missing values.
- |: Pipe operator that passes the output of echo to datamash.
- datamash --narm mean 1: Executes datamash with:
  - --narm: Instructs datamash to ignore ‘Na’ and ‘NaN’ values in calculations, so that only real numbers contribute to the result.
  - mean 1: Calculates the mean of the first column.
Example Output:
2
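The effect of --narm is easiest to see by comparing runs with and without it. The following is a sketch under the assumption that datamash rejects a literal NA field as an invalid number when --narm is absent; the error message itself is discarded here.

```shell
# Without --narm, the literal NA field is not a valid number, so datamash
# exits with a non-zero status.
if ! printf '1\n2\nNA\n3\n' | datamash mean 1 >/dev/null 2>&1; then
  echo "without --narm: datamash failed on the NA field"
fi

# With --narm, NA and NaN fields are skipped and the mean covers 1, 2 and 3.
clean_mean=$(printf '1\n2\nNA\n3\nNaN\n' | datamash --narm mean 1)
printf '%s\n' "$clean_mean"   # prints 2
```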
Conclusion:
The datamash utility is a flexible tool for basic data manipulation involving numeric computation and aggregation on command-line input. The examples above cover summary statistics, locale-specific decimal separators, output precision, and datasets with missing values, all of which come up regularly in real-world data analysis.