How to Leverage the 'mlr' Command for Data Manipulation (with examples)

Miller, commonly referred to as ‘mlr’, is a powerful and versatile tool for the command line that processes and transforms structured data formats like CSV, TSV, and JSON. It offers functionality akin to traditional UNIX text processing tools like awk, sed, cut, join, and sort, but specifically optimized for name-indexed data, making it ideal for data scientists, analysts, and engineers who work extensively with tabular datasets.
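The name-indexed point is easy to see with a side-by-side comparison (a minimal sketch; the demo.csv file, its field names, and its values are all invented for illustration):

```shell
# Create a tiny illustrative CSV
printf 'field1,field2\na,b\n' > demo.csv

# Positional: classic cut must address the column by number
cut -d',' -f2 demo.csv

# Name-indexed: Miller's cut verb addresses the column by name
mlr --csv cut -f field2 demo.csv
```

Both commands print the field2 column, but the Miller version keeps working even if the column order in the file changes.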

Use Case 1: Pretty-print a CSV file in a tabular format

Code:

mlr --icsv --opprint cat example.csv

Motivation:

Data analysts often require a quick and visually clear view of CSV data, especially when dealing with raw data to inspect or share in a legible format. By using Miller to pretty-print, users can transform the data into a tabular form that is easier to read and understand, which is particularly useful for large datasets or when starting exploratory data analysis.

Explanation:

  • --icsv: Specifies that the input format is CSV. It ensures that Miller correctly interprets the data structure of the incoming file.
  • --opprint: Stands for “output pretty print,” which formats the data into aligned columns for improved readability.
  • cat: Miller’s cat verb, which passes each record through unmodified; appropriate for basic inspection or pure format conversion.

Example Output:

field1  field2  field3
value11 value12 value13
value21 value22 value23
...
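Assuming a small example.csv like the one shown above (the file name and contents are illustrative), the whole round trip can be reproduced in a shell:

```shell
# Create an illustrative example.csv
printf 'field1,field2,field3\nvalue11,value12,value13\nvalue21,value22,value23\n' > example.csv

# Pretty-print it as an aligned table
mlr --icsv --opprint cat example.csv
```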

Use Case 2: Receive JSON data and pretty print the output

Code:

echo '{"hello":"world"}' | mlr --ijson --opprint cat

Motivation:

When working with JSON data, especially in the context of API responses or configuration files, data printed in a clear, structured format simplifies debugging and inspection. This usage of Miller allows developers and analysts to quickly and effectively display JSON objects in an organized fashion.

Explanation:

  • echo '{"hello":"world"}': Produces a JSON string to use as input for the command.
  • --ijson: Tells Miller that the input data is in JSON format.
  • --opprint: Formats the output in a human-readable table format.
  • cat: Outputs data without modification, useful for producing a neat display.

Example Output:

hello
world

Use Case 3: Sort alphabetically on a field

Code:

mlr --icsv --opprint sort -f field example.csv

Motivation:

Sorting data alphabetically based on specific fields is a common need in data processing tasks, enabling users to organize information in a meaningful way. This can be essential when preparing data for reports, visualizations, or further analysis activities.

Explanation:

  • --icsv: Indicates that the input format is CSV.
  • --opprint: Ensures that the sorted data is output in a readable format.
  • sort -f field: Sorts records in ascending alphabetical (lexical) order by the field named field.

Example Output:

field1  field2  field3
apple   52      note
banana  23      note
cherry  85      note
...

Use Case 4: Sort in descending numerical order on a field

Code:

mlr --icsv --opprint sort -nr field example.csv

Motivation:

Descending numerical sorting is particularly useful for data analysis tasks where users need to identify top or bottom values, such as finding the highest sales figures or lowest performing metrics. This functionality of Miller can cater to such needs efficiently.

Explanation:

  • --icsv: Specifies input of CSV format.
  • --opprint: Outputs the data in a clean, readable table.
  • sort -nr field: Sorts the values in descending order numerically based on the specified field.

Example Output:

field1  field2  field3
42      high    note1
39      medium  note2
12      low     note3
...
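Both sort modes can be tried end-to-end with a throwaway file (the fruits.csv name and its contents are invented for the demo):

```shell
# Illustrative data: a name column and a numeric count column
printf 'name,count\nbanana,23\ncherry,85\napple,52\n' > fruits.csv

# Ascending alphabetical sort on 'name'
mlr --icsv --opprint sort -f name fruits.csv

# Descending numerical sort on 'count'
mlr --icsv --opprint sort -nr count fruits.csv
```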

Use Case 5: Convert CSV to JSON, performing calculations and displaying the results

Code:

mlr --icsv --ojson put '$newField1 = $oldFieldA/$oldFieldB' example.csv

Motivation:

Combining format conversion with real-time computation highlights Miller’s ability to concurrently handle data transformations and complex operations. This use case is ideal for scenarios where data outputs need to be enriched or manipulated as part of a data pipeline.

Explanation:

  • --icsv: Indicates input in CSV format.
  • --ojson: Specifies that the output should be JSON format.
  • put '$newField1 = $oldFieldA/$oldFieldB': Runs a DSL expression that adds newField1, computed from existing fields; the original fields are kept in the output.

Example Output:

[
{
  "oldFieldA": 5,
  "oldFieldB": 2,
  "newField1": 2.5
},
{
  "oldFieldA": 31,
  "oldFieldB": 10,
  "newField1": 3.1
}
]
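A minimal sketch of the same pipeline, with invented column names and values, shows the computed field appended alongside the originals:

```shell
# Illustrative CSV with two numeric columns
printf 'oldFieldA,oldFieldB\n5,2\n31,10\n' > ratios.csv

# Add a computed field; the original columns are kept in the JSON output
mlr --icsv --ojson put '$newField1 = $oldFieldA / $oldFieldB' ratios.csv
```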

Use Case 6: Receive JSON and format the output as vertical JSON

Code:

echo '{"hello":"world", "foo":"bar"}' | mlr --ijson --ojson --jvstack cat

Motivation:

Vertical JSON format is useful for enhanced readability, especially for complex JSON objects or for visualizing hierarchical data structures. This method of formatting aids in analyzing nested data at a glance, often used in logs or configuration files.

Explanation:

  • echo '{"hello":"world", "foo":"bar"}': Sends a JSON string to the command as input.
  • --ijson: Declares the input type as JSON.
  • --ojson: Specifies JSON output.
  • --jvstack: Outputs JSON in a “vertical stack” format, with one key-value pair per line for readability. In Miller 6 this is the default for JSON output.
  • cat: Outputs the provided data as-is.

Example Output:

{
  "hello": "world",
  "foo": "bar"
}
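When a compact form is preferable, for instance when piping into line-oriented tools, Miller 6 also offers a --no-jvstack flag that emits each record on a single line (the flag exists in Miller 6; older releases may behave differently):

```shell
# One record per line instead of the vertical default
echo '{"hello":"world", "foo":"bar"}' | mlr --ijson --ojson --no-jvstack cat
```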

Use Case 7: Filter lines of a compressed CSV file treating numbers as strings

Code:

mlr --prepipe 'gunzip' --csv filter -S '$fieldName =~ "regular_expression"' example.csv.gz

Motivation:

Filtering and processing compressed CSV files directly without prior decompression saves time and computational resources. This becomes particularly valuable for managing data pipelines that handle large-scale datasets stored in a compressed format.

Explanation:

  • --prepipe 'gunzip': Decompresses a gzip file directly in the command pipeline for immediate processing.
  • --csv: Indicates CSV data format.
  • filter -S '$fieldName =~ "regular_expression"': Keeps only the records whose fieldName value matches the regular expression. The -S flag requests that field values be treated as strings; in Miller 6 it is accepted as a no-op for backward compatibility, since type inference handles this automatically.

Example Output:

field1  field2  field3
10      match   item1
21      match   item2
...
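The full pipeline can be sketched with a file built on the spot (the file name, field values, and the anchored regular expression are all illustrative):

```shell
# Build a small gzip-compressed CSV
printf 'field1,field2,field3\n10,match,item1\n21,match,item2\n45,no-match,item3\n' | gzip > example.csv.gz

# Decompress on the fly and keep only rows whose field2 is exactly "match"
mlr --prepipe 'gunzip' --csv filter -S '$field2 =~ "^match$"' example.csv.gz
```

Note the anchored pattern: an unanchored "match" would also match the "no-match" row.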

Conclusion

These examples provide an insight into the versatility and efficiency of the ‘mlr’ command in handling and transforming structured data. Whether it is simply reformatting, sorting, or converting data between formats with additional calculations, Miller demonstrates its prowess as a multifaceted data manipulation tool. This makes it an invaluable asset in any data handling and transformation toolkit, aiding a wide range of tasks from preliminary data inspection to complex data processing in production systems.
