How to Filter and Select Data using 'tsv-filter' (with examples)
The tsv-filter
command-line tool is a powerful utility for processing Tab-Separated Value (TSV) files. It filters lines from a TSV file by running tests against individual fields, providing an efficient method for data cleansing and logical data processing tasks. Whether you’re dealing with large datasets or simply want to extract specific data points, tsv-filter
offers a variety of options to ease the process.
Print the lines where a specific column is numerically equal to a given number
Code:
tsv-filter -H --eq field_name:number path/to/tsv_file
Motivation:
Imagine working with a large TSV file that contains data about employees, and you’re tasked with finding entries where the employee’s age (recorded under the field name “age”) is exactly 30. Using tsv-filter
, you can swiftly extract these lines for further analysis or reporting.
Explanation:
tsv-filter
: Invokes the command to filter lines based on conditions.-H
: Tellstsv-filter
to treat the first line as a header, which allows the use of field names instead of numbers.--eq
: Specifies that the command should filter lines where the given field is numerically equal to the specified number.field_name:number
: Replacefield_name
with the actual column name, likeage
, andnumber
with the specific number, such as30
.path/to/tsv_file
: The path to your TSV file, which contains the data to be processed.
Example output:
John Doe Software Engineer 30
Jane Smith Project Manager 30
Print the lines where a specific column has numerical conditions like [eq]ual, [n]on [e]qual, [l]ess [t]han, etc.
Code:
tsv-filter --eq|ne|lt|le|gt|ge column_number:number path/to/tsv_file
Motivation:
Suppose you need to filter products in an inventory TSV file whose price is greater than $100. This can be done seamlessly without manually scanning through the entire file.
Explanation:
--eq|ne|lt|le|gt|ge
: These flags specify the condition:--eq
: Equal to--ne
: Not equal to--lt
: Less than--le
: Less than or equal to--gt
: Greater than--ge
: Greater than or equal to
column_number:number
: Replacecolumn_number
with the index of the column (starting from 1), andnumber
with the value for comparison.path/to/tsv_file
: The file path of the input TSV file.
Example output for --gt 3:100
:
Product A Electronics 120
Product B Furniture 150
Print the lines where a specific column meets string-based conditions
Code:
tsv-filter --str-eq|ne|in-fld|not-in-fld column_number:string path/to/tsv_file
Motivation:
Suppose you are tasked to filter entries where the department is exactly “Human Resources” or not equal to “Finance.” String comparison aids in textual data segmentation.
Explanation:
--str-eq|ne|in-fld|not-in-fld
: These are string-based flags:--str-eq
: String equals--ne
: Not equal--in-fld
: Part of--not-in-fld
: Not part of
column_number:string
: Replacecolumn_number
with the field index, andstring
with the string pattern you want to filter against.path/to/tsv_file
: Path to your TSV file.
Example output for --str-eq 2:Human Resources
:
John Doe Human Resources
Jane Smith Human Resources
Filter for non-empty fields
Code:
tsv-filter --not-empty column_number path/to/tsv_file
Motivation:
In datasets, especially customer feedback or records, it’s crucial to focus only on entries with relevant and filled data for a particular column, such as Email
.
Explanation:
--not-empty
: Filters lines where the given column is not empty.column_number
: The column index to check for non-empty values.path/to/tsv_file
: Location of the input file.
Example output for --not-empty 4
:
John Doe johndoe@example.com 30
Alice White alice@example.com 25
Print the lines where a specific column is empty
Code:
tsv-filter --invert --not-empty column_number path/to/tsv_file
Motivation:
To identify records where crucial information is missing (like lack of an Email address), you might need to extract lines where a field is empty.
Explanation:
--invert
: Inverts the result of the--not-empty
flag causing the command to select lines where the field is empty.--not-empty
: Checks for non-empty, and when inverted, filters empty columns.column_number
: Column index to check.path/to/tsv_file
: The file path.
Example output for --invert --not-empty 4
:
Sam Green
Julia Prest
Print the lines that satisfy two conditions
Code:
tsv-filter --eq column_number1:number --str-eq column_number2:string path/to/tsv_file
Motivation:
Suppose you’re tasked to find sales records where the quantity is more than 50 and the region is “North America”. This allows multi-condition filtering for more specific results.
Explanation:
--eq column_number1:number
: Numerical condition for one column.--str-eq column_number2:string
: String condition for another.path/to/tsv_file
: Input file path.
Example output:
ProductX North America 55
ProductY North America 70
Print the lines that match at least one condition
Code:
tsv-filter --or --eq column_number1:number --str-eq column_number2:string path/to/tsv_file
Motivation:
This is useful when you want to retrieve records under multiple possible criteria, such as purchases of either 50 items or from vendors located in “Canada”.
Explanation:
--or
: Apply a logical OR condition where either one of the conditions needs to be met.--eq column_number1:number
: Numeric check for one column.--str-eq column_number2:string
: Text check for another column.path/to/tsv_file
: Location of the data file.
Example output:
ProductA 50
ProductB Canada
Count matching lines, interpreting the first line as a header
Code:
tsv-filter --count -H --eq field_name:number path/to/tsv_file
Motivation:
If you need to quickly understand how many entries meet a particular condition, such as counting all employees aged 30, this command provides a direct line count.
Explanation:
--count
: Outputs only the count of lines that meet the criteria.-H
: Indicates that the TSV file includes a header.--eq field_name:number
: Conditions to test for equality in the data.path/to/tsv_file
: Path to file being analyzed.
Example output:
Number of matching lines: 10
Conclusion:
Using tsv-filter
, users can handle and manipulate TSV files to retrieve and analyze data effectively, whether through numerical or string matching, for both individual and complex combined criteria. This tool significantly eases up data preparation and cleaning tasks, assisting data analysts in producing more meaningful insights.