How to use the command 'duckdb' (with examples)
DuckDB is a command-line client for DuckDB, an in-process analytical SQL engine designed to execute complex query workloads efficiently. It is particularly suited for use cases involving analytical queries on large datasets. Essentially, it combines the rich SQL semantics used in big data with the convenience of single-node operations.
Use case 1: Start an interactive shell with a transient in-memory database
Code:
duckdb
Motivation: Starting an interactive shell with a transient in-memory database is useful when you need a quick workspace to perform exploratory data analysis or testing queries without the need to create a persistent database file. This is an excellent choice for prototyping and lightweight testing since it does not require disk operations for creating or managing files.
Explanation:
duckdb
: Running the command without additional arguments launches an interactive shell backed by an in-memory database. This session allows you to execute SQL queries in real-time, storing all data temporarily, reflecting the state of operations until the shell is closed.
Example Output:
Upon entering the command, you’ll be greeted with the DuckDB prompt (D
) allowing SQL inputs:
D>
Use case 2: Start an interactive shell on a database file
Code:
duckdb path/to/dbfile
Motivation: This use case is beneficial when you want to connect to an existing database for data analysis, querying, or management operations. If the specified database file does not exist, DuckDB will automatically create a new one, allowing you either to start a new data project or continue working on an existing one seamlessly.
Explanation:
duckdb
: The base command to start the DuckDB interface.path/to/dbfile
: Specifies the path to the DuckDB database file. If the path doesn’t point to an existing file, it creates a new database at this location.
Example Output: You will enter the DuckDB shell with the command line interface prompting:
D>
Use case 3: Directly query a CSV, JSON, or Parquet file
Code:
duckdb -c "SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parquet]'"
Motivation: This functionality enables you to perform direct SQL operations on a data file without the need to import it into a database first. It’s particularly advantageous when working with data files that are too large to load into RAM entirely or when you only need to extract certain subsets of data.
Explanation:
-c
: The command flag tells DuckDB to execute the provided SQL command directly."SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parquet]'"
: A SQL query that selects all data from the specified file. The command in quotes uses the DuckDB capability to query data files directly as if they were database tables.
Example Output: Output will display the contents of the data file in a tabular format.
Use case 4: Run an SQL script
Code:
duckdb -c ".read path/to/script.sql"
Motivation: Executing an SQL script file is a convenient way to run a batch of SQL commands automatically. This use case allows you to automate tasks like database setup, data transformations, or executing a sequence of reports with minimal manual intervention.
Explanation:
-c
: Executes the given command instead of opening a shell.".read path/to/script.sql"
: A DuckDB-specific command that runs all SQL statements contained in the specified SQL script file.
Example Output: The output will show results or confirmations for each SQL command executed from the script file.
Use case 5: Run query on database file and keep the interactive shell open
Code:
duckdb path/to/dbfile -cmd "SELECT DISTINCT * FROM tbl"
Motivation: This command lets you run a specific query immediately upon opening a database, useful for workflows where initial data retrieval, transformation, or checks are needed prior to further interactive exploration. It saves time by not requiring a separate entry of the command after opening the shell.
Explanation:
path/to/dbfile
: The path to the DuckDB database file to operate on.-cmd
: Runs the specified SQL query before providing access to the interactive shell."SELECT DISTINCT * FROM tbl"
: Executes a query to retrieve distinct records from the table ’tbl’ in the database.
Example Output: Results from the query will be displayed, and then the interactive shell will remain open for further instructions.
Use case 6: Run SQL queries in file on database and keep the interactive shell open
Code:
duckdb path/to/dbfile -init path/to/script.sql
Motivation: This use case automates the execution of initial setup or routine SQL operations and maintains the shell for additional ad-hoc requests. It’s especially practical for a repetitive pre-processing step when working with databases every session.
Explanation:
path/to/dbfile
: Directs to the DuckDB database file.-init
: Signals that the operations within the provided SQL script file should be run immediately upon starting.path/to/script.sql
: The script file that contains SQL statements to execute in sequence.
Example Output: Commands in the SQL script file execute on startup, with output generated accordingly, followed by awaiting further user input at the shell prompt.
Use case 7: Read CSV from stdin
and write CSV to stdout
Code:
cat path/to/source.csv | duckdb -c "COPY (FROM read_csv('/dev/stdin')) TO '/dev/stdout' WITH (FORMAT CSV, HEADER)"
Motivation: This use case allows streaming data processing by reading from and writing to standard input/output. It’s highly useful for integrating DuckDB processing within a Unix pipeline, enabling dynamic data transformations without intermediate files.
Explanation:
cat path/to/source.csv
: Uses thecat
command to stream the contents ofsource.csv
.|
: Pipes the output as input to the DuckDB command.-c
: Specifies a SQL command to execute."COPY (FROM read_csv('/dev/stdin')) TO '/dev/stdout' WITH (FORMAT CSV, HEADER)"
: Reads CSV data input from standard input, processes it, and outputs it as CSV to standard output with headers.
Example Output: The processed CSV data is output to the terminal or next pipeline command.
Use case 8: Display help
Code:
duckdb -help
Motivation: Accessing help provides guidance on all available options and usage syntax within DuckDB. It’s particularly helpful for new users or when trying to recall specific command options or functionalities.
Explanation:
-help
: This flag requests the detailed help documentation from the DuckDB command, printing usage information to the console.
Example Output: Displays a list of command line options and a summary of their uses, akin to help documentation.
Conclusion:
DuckDB’s command-line capabilities facilitate a multitude of data processing tasks involving complex queries and large datasets efficiently. By leveraging its flexibility—ranging from basic shell interaction, file querying, script executions, to pipeline integrations—users can enhance and streamline data workflows seamlessly. Each use case exemplifies how DuckDB can be tailored for various data operations, epitomizing its prowess as a versatile analytical SQL engine.