How to use the command 'ocrmypdf' (with examples)
OCRmyPDF (invoked as ocrmypdf) is a command-line utility that adds an OCR text layer to scanned PDF files or images of text, producing a searchable PDF or PDF/A. It is useful for turning non-searchable scans into documents whose text can be searched, copied, and indexed, and for archiving workflows that require PDF/A compliance. Typical uses include document processing, digital archiving, and personal document management.
Use case 1: Create a new searchable PDF/A file from a scanned PDF or image file
Code:
ocrmypdf path/to/input_file path/to/output.pdf
Motivation: When you have a physical document that has been scanned and saved as a PDF or image file, it’s akin to having a photocopy without any interactive or searchable features. By creating a searchable PDF/A file with OCR (Optical Character Recognition), you transform the document into a digital asset that allows text searching, copying, and indexing. This is incredibly useful for archiving purposes, personal organization, or businesses needing to convert large volumes of scanned documents to a searchable database.
Explanation:
path/to/input_file: The location and name of the scanned input PDF or image file to process.
path/to/output.pdf: Where the newly created searchable PDF/A file is saved.
Example output: You start with a scanned image of a textbook page and end up with a PDF that allows you to search for keywords, highlight text, and extract needed information.
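For the large-volume case mentioned in the motivation, the same command can be wrapped in a loop. The following is a minimal sketch, not a definitive script: the function name ocr_dir and both directory arguments are hypothetical, and it assumes every input is a PDF whose file name should be kept.

```shell
# Sketch: OCR every PDF in src_dir into dest_dir, keeping file names.
# ocr_dir and its two arguments are hypothetical names for this example.
ocr_dir() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for input in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$input" ] || continue
        output="$dest_dir/$(basename "$input")"
        # Keep going if one file fails, but report it on stderr.
        ocrmypdf "$input" "$output" || echo "failed: $input" >&2
    done
}

# Usage: ocr_dir path/to/scans path/to/searchable
```

Writing to a separate directory keeps the original scans untouched, so a failed or poor OCR run can simply be repeated.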
Use case 2: Replace a scanned PDF file with a searchable PDF file
Code:
ocrmypdf path/to/file.pdf path/to/file.pdf
Motivation: When disk space is limited or you want to keep things simple, replacing the scanned PDF with its searchable counterpart avoids keeping two copies of every document. This suits cases where the original scan has no further value once it has been OCR processed. Because the original is overwritten, make sure you do not need it before running the command.
Explanation:
path/to/file.pdf: Given twice. The first occurrence is the original scanned PDF file; the second is where the searchable PDF is saved, which overwrites the original.
Example output: You have a scanned PDF stored on your system, and after running the command, you end up with the same file in the same location, but now it’s searchable.
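If you are not yet sure the original can be discarded, one cautious variant is to keep a backup copy until you have checked the result. This is a sketch under assumptions: the function name ocr_in_place and the ".orig" suffix are hypothetical choices, not part of ocrmypdf.

```shell
# Sketch: keep a backup copy before replacing a scan in place.
# ocr_in_place and the ".orig" suffix are hypothetical choices.
ocr_in_place() {
    file=$1
    backup="$file.orig"
    # Preserve the original (and its timestamps) before touching it.
    cp -p "$file" "$backup" || return 1
    if ocrmypdf "$file" "$file"; then
        echo "OCR done; original kept at $backup"
    else
        status=$?
        # Put the untouched copy back if OCR failed.
        mv "$backup" "$file"
        return "$status"
    fi
}

# Usage: ocr_in_place path/to/file.pdf
```

Once the searchable file looks right, the ".orig" copy can be deleted by hand.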
Use case 3: Skip pages of a mixed-format input PDF file that already contain text
Code:
ocrmypdf --skip-text path/to/input.pdf path/to/output.pdf
Motivation: Some documents mix scanned pages with born-digital pages that already contain searchable text. Running OCR on pages that already have text is unnecessary and can introduce errors or duplicated text. Skipping those pages saves processing time and leaves the existing text untouched.
Explanation:
--skip-text: Tells ocrmypdf to skip pages that already contain a text layer.
path/to/input.pdf: The path to the mixed-format input PDF file.
path/to/output.pdf: The path where the processed PDF is saved.
Example output: A PDF document that combines scanned business letters with digitally created invoices is processed, resulting in a searchable output without modifying the digital invoices.
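In a batch job it can be useful to learn which files were mixed-format rather than passing --skip-text everywhere. The sketch below tries a plain run first and retries with --skip-text only when ocrmypdf signals that the input already has text. It assumes exit code 6 carries that meaning, as described in the OCRmyPDF documentation; check this against your installed version. The function name ocr_mixed is hypothetical.

```shell
# Sketch: plain OCR first, retry with --skip-text if text is already present.
# Assumption: exit code 6 means "input already contains text"; verify this
# for your ocrmypdf version. ocr_mixed is a hypothetical name.
ocr_mixed() {
    input=$1
    output=$2
    if ocrmypdf "$input" "$output"; then
        return 0
    else
        status=$?
    fi
    if [ "$status" -eq 6 ]; then
        echo "text found in $input; retrying with --skip-text" >&2
        ocrmypdf --skip-text "$input" "$output"
    else
        # Any other failure is passed back to the caller unchanged.
        return "$status"
    fi
}

# Usage: ocr_mixed path/to/input.pdf path/to/output.pdf
```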
Use case 4: Clean, de-skew, and rotate pages of a poor scan
Code:
ocrmypdf --clean --deskew --rotate-pages path/to/input_file path/to/output.pdf
Motivation: Old documents, poor-quality scans, or awkwardly placed originals can lead to skewed, misaligned, or low-quality PDFs that are hard to read. By utilizing cleaning options, you can significantly enhance clarity and readability, ensuring that the digital version represents the best quality possible of the original document.
Explanation:
--clean: Removes scanning artifacts (noise) from each page before OCR to improve recognition. It relies on the external unpaper program, and the cleaned image is used only for OCR; the page image in the output is left as scanned (--clean-final keeps the cleaned image instead).
--deskew: Straightens pages that were scanned at a slight angle, which improves readability and OCR accuracy.
--rotate-pages: Detects pages that are sideways or upside down and rotates them to the correct orientation.
path/to/input_file: The path to the poor-quality scanned input document.
path/to/output.pdf: The path for the enhanced, searchable output PDF.
Example output: Crooked, misoriented scans of antique letters come out straightened, correctly oriented, and searchable.
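The --clean option depends on the external unpaper program, so a script that runs on several machines may want to check for it first. A minimal sketch, assuming it is acceptable to fall back to a run without cleaning; the function name ocr_poor_scan is hypothetical.

```shell
# Sketch: add --clean only when the unpaper program is available.
# ocr_poor_scan is a hypothetical name for this example.
ocr_poor_scan() {
    input=$1
    output=$2
    if command -v unpaper >/dev/null 2>&1; then
        ocrmypdf --clean --deskew --rotate-pages "$input" "$output"
    else
        echo "unpaper not found; running without --clean" >&2
        ocrmypdf --deskew --rotate-pages "$input" "$output"
    fi
}

# Usage: ocr_poor_scan path/to/input_file path/to/output.pdf
```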
Use case 5: Set the metadata of the searchable PDF file
Code:
ocrmypdf --title "title" --author "author" --subject "subject" --keywords "keyword; key phrase; ..." path/to/input_file path/to/output.pdf
Motivation: In many professional and academic settings, metadata is crucial for effective document organization. By setting specific metadata like titles, authors, subjects, and keywords, the document becomes more easily retrievable and professionally arranged within databases or storage systems.
Explanation:
--title "title": Sets the title of the PDF document.
--author "author": States the author of the document for reference or attribution.
--subject "subject": Provides a brief description of the subject matter.
--keywords "keyword; key phrase; ...": Lists relevant keywords or phrases to improve searchability.
path/to/input_file: The path to the original input document.
path/to/output.pdf: The path for the searchable output PDF with embedded metadata.
Example output: A research paper PDF gains structured metadata, ensuring it’s properly cataloged and easily discoverable in digital libraries or repositories.
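When many files need metadata, the values can come from shell variables instead of being typed for each file. The sketch below derives the title from the input file name and takes the author as an argument. The function name ocr_with_metadata and the naming rule are assumptions for illustration only.

```shell
# Sketch: derive the title from the file name and pass the author through.
# ocr_with_metadata and the title rule are hypothetical choices.
ocr_with_metadata() {
    input=$1
    output=$2
    author=$3
    # "reports/q3-summary.pdf" yields the title "q3-summary".
    title=$(basename "$input" .pdf)
    # Quoting keeps titles and author names with spaces as single arguments.
    ocrmypdf --title "$title" --author "$author" "$input" "$output"
}

# Usage: ocr_with_metadata path/to/input.pdf path/to/output.pdf "author"
```

If it is installed, a tool such as pdfinfo from Poppler can be used afterwards to confirm that the fields were written.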
Use case 6: Display help
Code:
ocrmypdf --help
Motivation: When you’re unsure of which options or parameters to use, the help command is your guide to understanding all functionalities that ocrmypdf offers. It acts as an instructional primer to help users, whether novices or experts, make full use of the tool.
Explanation:
--help: Prints a usage summary and the list of available options with their descriptions.
Example output: Upon executing, a comprehensive guide appears on the terminal detailing commands and usage instructions for all aspects of ocrmypdf.
Conclusion:
These use cases show how OCRmyPDF can convert, clean up, and organize scanned documents: creating searchable PDF/A files, replacing scans in place, skipping pages that already contain text, correcting poor scans, and setting metadata. Whether you are processing a single document or digitizing a large archive, the commands above are a starting point for producing searchable, accessible PDFs.