How to Leverage the 'mlr' Command for Data Manipulation (with examples)
Miller, commonly referred to as ‘mlr’, is a powerful and versatile tool for the command line that processes and transforms structured data formats like CSV, TSV, and JSON. It offers functionality akin to traditional UNIX text processing tools like awk
, sed
, cut
, join
, and sort
, but specifically optimized for name-indexed data, making it ideal for data scientists, analysts, and engineers who work extensively with tabular datasets.
Use Case 1: Pretty-print a CSV file in a tabular format
Code:
mlr --icsv --opprint cat example.csv
Motivation:
Data analysts often require a quick and visually clear view of CSV data, especially when dealing with raw data to inspect or share in a legible format. By using Miller to pretty-print, users can transform the data into a tabular form that is easier to read and understand, which is particularly useful for large datasets or when starting exploratory data analysis.
Explanation:
--icsv
: Specifies that the input format is CSV. It ensures that Miller correctly interprets the data structure of the incoming file.--opprint
: Stands for “output pretty print,” which formats the data into aligned columns for improved readability.cat
: This command copies the data unmodified from the input to the output stream, appropriate for basic inspection.
Example Output:
field1 field2 field3
value11 value12 value13
value21 value22 value23
...
Use Case 2: Receive JSON data and pretty print the output
Code:
echo '{"hello":"world"}' | mlr --ijson --opprint cat
Motivation:
When working with JSON data, especially in the context of API responses or configuration files, data printed in a clear, structured format simplifies debugging and inspection. This usage of Miller allows developers and analysts to quickly and effectively display JSON objects in an organized fashion.
Explanation:
echo '{"hello":"world"}'
: Produces a JSON string to use as input for the command.--ijson
: Tells Miller that the input data is in JSON format.--opprint
: Formats the output in a human-readable table format.cat
: Outputs data without modification, useful for producing a neat display.
Example Output:
hello world
Use Case 3: Sort alphabetically on a field
Code:
mlr --icsv --opprint sort -f field example.csv
Motivation:
Sorting data alphabetically based on specific fields is a common need in data processing tasks, enabling users to organize information in a meaningful way. This can be essential when preparing data for reports, visualizations, or further analysis activities.
Explanation:
--icsv
: Indicates that the input format is CSV.--opprint
: Ensures that the sorted data is output in a readable format.sort -f field
: Sorts the data based on the specified field name, in this case, alphabetically.
Example Output:
field1 field2 field3
apple 52 note
banana 23 note
cherry 85 note
...
Use Case 4: Sort in descending numerical order on a field
Code:
mlr --icsv --opprint sort -nr field example.csv
Motivation:
Descending numerical sorting is particularly useful for data analysis tasks where users need to identify top or bottom values, such as finding the highest sales figures or lowest performing metrics. This functionality of Miller can cater to such needs efficiently.
Explanation:
--icsv
: Specifies input of CSV format.--opprint
: Outputs the data in a clean, readable table.sort -nr field
: Sorts the values in descending order numerically based on the specified field.
Example Output:
field1 field2 field3
42 high note1
39 medium note2
12 low note3
...
Use Case 5: Convert CSV to JSON, performing calculations and display those calculations
Code:
mlr --icsv --ojson put '$newField1 = $oldFieldA/$oldFieldB' example.csv
Motivation:
Combining format conversion with real-time computation highlights Miller’s ability to concurrently handle data transformations and complex operations. This use case is ideal for scenarios where data outputs need to be enriched or manipulated as part of a data pipeline.
Explanation:
--icsv
: Indicates input in CSV format.--ojson
: Specifies that the output should be JSON format.put '$newField1 = $oldFieldA/$oldFieldB'
: Executes a script to create a new field from a calculation based on existing fields.
Example Output:
[
{"newField1": 2.5},
{"newField1": 3.1},
{"newField1": 1.7}
]
Use Case 6: Receive JSON and format the output as vertical JSON
Code:
echo '{"hello":"world", "foo":"bar"}' | mlr --ijson --ojson --jvstack cat
Motivation:
Vertical JSON format is useful for enhanced readability, especially for complex JSON objects or for visualizing hierarchical data structures. This method of formatting aids in analyzing nested data at a glance, often used in logs or configuration files.
Explanation:
echo '{"hello":"world", "foo":"bar"}'
: Sends a JSON string to the command as input.--ijson
: Declares the input type as JSON.--ojson
: Specifies JSON output.--jvstack
: Outputs JSON in a “vertical stack” format enhancing readability.cat
: Outputs the provided data as-is.
Example Output:
{
"hello": "world",
"foo": "bar"
}
Use Case 7: Filter lines of a compressed CSV file treating numbers as strings
Code:
mlr --prepipe 'gunzip' --csv filter -S '$fieldName =~ "regular_expression"' example.csv.gz
Motivation:
Filtering and processing compressed CSV files directly without prior decompression saves time and computational resources. This becomes particularly valuable for managing data pipelines that handle large-scale datasets stored in a compressed format.
Explanation:
--prepipe 'gunzip'
: Decompresses a gzip file directly in the command pipeline for immediate processing.--csv
: Indicates CSV data format.filter -S '$fieldName =~ "regular_expression"'
: Filters lines based on a regular expression, treating numeric fields as strings for flexibility in pattern matching.
Example Output:
field1 field2 field3
10 match item1
21 match item2
45 no-match item3
...
Conclusion
These examples provide an insight into the versatility and efficiency of the ‘mlr’ command in handling and transforming structured data. Whether it is simply reformatting, sorting, or converting data between formats with additional calculations, Miller demonstrates its prowess as a multifaceted data manipulation tool. This makes it an invaluable asset in any data handling and transformation toolkit, aiding a wide range of tasks from preliminary data inspection to complex data processing in production systems.