How to use the command 'duckdb' (with examples)

How to use the command 'duckdb' (with examples)

DuckDB is a command-line client for DuckDB, an in-process analytical SQL engine designed to execute complex query workloads efficiently. It is particularly suited for use cases involving analytical queries on large datasets. Essentially, it combines the rich SQL semantics used in big data with the convenience of single-node operations.

Use case 1: Start an interactive shell with a transient in-memory database

Code:

duckdb

Motivation: Starting an interactive shell with a transient in-memory database is useful when you need a quick workspace to perform exploratory data analysis or testing queries without the need to create a persistent database file. This is an excellent choice for prototyping and lightweight testing since it does not require disk operations for creating or managing files.

Explanation:

  • duckdb: Running the command without additional arguments launches an interactive shell backed by an in-memory database. This session allows you to execute SQL queries in real-time, storing all data temporarily, reflecting the state of operations until the shell is closed.

Example Output: Upon entering the command, you’ll be greeted with the DuckDB prompt (D) allowing SQL inputs:

D>

Use case 2: Start an interactive shell on a database file

Code:

duckdb path/to/dbfile

Motivation: This use case is beneficial when you want to connect to an existing database for data analysis, querying, or management operations. If the specified database file does not exist, DuckDB will automatically create a new one, allowing you either to start a new data project or continue working on an existing one seamlessly.

Explanation:

  • duckdb: The base command to start the DuckDB interface.
  • path/to/dbfile: Specifies the path to the DuckDB database file. If the path doesn’t point to an existing file, it creates a new database at this location.

Example Output: You will enter the DuckDB shell with the command line interface prompting:

D>

Use case 3: Directly query a CSV, JSON, or Parquet file

Code:

duckdb -c "SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parquet]'"

Motivation: This functionality enables you to perform direct SQL operations on a data file without the need to import it into a database first. It’s particularly advantageous when working with data files that are too large to load into RAM entirely or when you only need to extract certain subsets of data.

Explanation:

  • -c: The command flag tells DuckDB to execute the provided SQL command directly.
  • "SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parquet]'": A SQL query that selects all data from the specified file. The command in quotes uses the DuckDB capability to query data files directly as if they were database tables.

Example Output: Output will display the contents of the data file in a tabular format.

Use case 4: Run an SQL script

Code:

duckdb -c ".read path/to/script.sql"

Motivation: Executing an SQL script file is a convenient way to run a batch of SQL commands automatically. This use case allows you to automate tasks like database setup, data transformations, or executing a sequence of reports with minimal manual intervention.

Explanation:

  • -c: Executes the given command instead of opening a shell.
  • ".read path/to/script.sql": A DuckDB-specific command that runs all SQL statements contained in the specified SQL script file.

Example Output: The output will show results or confirmations for each SQL command executed from the script file.

Use case 5: Run query on database file and keep the interactive shell open

Code:

duckdb path/to/dbfile -cmd "SELECT DISTINCT * FROM tbl"

Motivation: This command lets you run a specific query immediately upon opening a database, useful for workflows where initial data retrieval, transformation, or checks are needed prior to further interactive exploration. It saves time by not requiring a separate entry of the command after opening the shell.

Explanation:

  • path/to/dbfile: The path to the DuckDB database file to operate on.
  • -cmd: Runs the specified SQL query before providing access to the interactive shell.
  • "SELECT DISTINCT * FROM tbl": Executes a query to retrieve distinct records from the table ’tbl’ in the database.

Example Output: Results from the query will be displayed, and then the interactive shell will remain open for further instructions.

Use case 6: Run SQL queries in file on database and keep the interactive shell open

Code:

duckdb path/to/dbfile -init path/to/script.sql

Motivation: This use case automates the execution of initial setup or routine SQL operations and maintains the shell for additional ad-hoc requests. It’s especially practical for a repetitive pre-processing step when working with databases every session.

Explanation:

  • path/to/dbfile: Directs to the DuckDB database file.
  • -init: Signals that the operations within the provided SQL script file should be run immediately upon starting.
  • path/to/script.sql: The script file that contains SQL statements to execute in sequence.

Example Output: Commands in the SQL script file execute on startup, with output generated accordingly, followed by awaiting further user input at the shell prompt.

Use case 7: Read CSV from stdin and write CSV to stdout

Code:

cat path/to/source.csv | duckdb -c "COPY (FROM read_csv('/dev/stdin')) TO '/dev/stdout' WITH (FORMAT CSV, HEADER)"

Motivation: This use case allows streaming data processing by reading from and writing to standard input/output. It’s highly useful for integrating DuckDB processing within a Unix pipeline, enabling dynamic data transformations without intermediate files.

Explanation:

  • cat path/to/source.csv: Uses the cat command to stream the contents of source.csv.
  • |: Pipes the output as input to the DuckDB command.
  • -c: Specifies a SQL command to execute.
  • "COPY (FROM read_csv('/dev/stdin')) TO '/dev/stdout' WITH (FORMAT CSV, HEADER)": Reads CSV data input from standard input, processes it, and outputs it as CSV to standard output with headers.

Example Output: The processed CSV data is output to the terminal or next pipeline command.

Use case 8: Display help

Code:

duckdb -help

Motivation: Accessing help provides guidance on all available options and usage syntax within DuckDB. It’s particularly helpful for new users or when trying to recall specific command options or functionalities.

Explanation:

  • -help: This flag requests the detailed help documentation from the DuckDB command, printing usage information to the console.

Example Output: Displays a list of command line options and a summary of their uses, akin to help documentation.

Conclusion:

DuckDB’s command-line capabilities facilitate a multitude of data processing tasks involving complex queries and large datasets efficiently. By leveraging its flexibility—ranging from basic shell interaction, file querying, script executions, to pipeline integrations—users can enhance and streamline data workflows seamlessly. Each use case exemplifies how DuckDB can be tailored for various data operations, epitomizing its prowess as a versatile analytical SQL engine.

Related Posts

How to use the command 'mkfs.ntfs' (with examples)

How to use the command 'mkfs.ntfs' (with examples)

The mkfs.ntfs command is a utility in Linux that allows users to create a New Technology File System (NTFS) on a specified partition or storage device.

Read More
How to use the command 'nxc ftp' (with examples)

How to use the command 'nxc ftp' (with examples)

The nxc ftp command is a powerful tool designed for penetration testing and exploiting FTP (File Transfer Protocol) servers.

Read More
How to use the command 'pnmtotiffcmyk' (with examples)

How to use the command 'pnmtotiffcmyk' (with examples)

The pnmtotiffcmyk command is a utility within the Netpbm package used to convert PNM (Portable AnyMap) images into CMYK (Cyan, Magenta, Yellow, and Key/Black) encoded TIFF (Tagged Image File Format) images.

Read More