How to use the command 'tesseract' (with examples)
Tesseract is an open-source Optical Character Recognition (OCR) engine that converts images of text into machine-readable text. The command-line tool is useful for digitizing printed documents so that their text can be edited or searched; it is tuned mainly for printed text, and handwriting is typically recognized much less reliably. Tesseract supports many languages and lets you choose among several page segmentation modes, which makes it a common choice for OCR tasks.
Use case 1: Recognize text in an image and save it to output.txt
Code:
tesseract image.png output
Motivation:
This use case suits quickly extracting text from an image when you need it in digital form. Imagine you have a scanned document or a photograph of printed text that you want to edit or incorporate into a Word document or a spreadsheet. With this command, you can extract the text without typing it out manually, which saves time and reduces transcription errors.
Explanation:
tesseract: The command invokes the Tesseract OCR engine.
image.png: This is the input file name. You specify the path to the image from which you want to extract text.
output: This is the base name for the output text file. Tesseract will automatically add a .txt extension, producing an output.txt file containing the recognized text.
Example output:
After running the command, Tesseract will analyze ‘image.png’ and create ‘output.txt’, containing the text extracted from the image. If your image was a clear photograph of a printed page, the text in ‘output.txt’ should closely match the text in the image.
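When there are several images to process, the same command can be wrapped in a loop. The following is a minimal sketch, assuming the images are PNG files in a single directory; the function name ocr_dir is a hypothetical helper, not part of Tesseract.

```shell
# Hypothetical helper: run OCR on every PNG in a directory,
# writing NAME.txt next to NAME.png.
ocr_dir() {
    dir=${1:-.}
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue        # skip if the glob matched nothing
        tesseract "$img" "${img%.png}"   # Tesseract appends .txt itself
    done
}
```

Calling ocr_dir scans would run tesseract once for each PNG file in the scans directory and leave the resulting text files alongside the images.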
Use case 2: Specify a custom language (default is English)
Code:
tesseract -l deu image.png output
Motivation:
Suppose you have a document written in a language other than English, such as German, and you want to perform OCR on it. By specifying the language, you tell Tesseract to use the matching language data files, which must be installed, for accurate text recognition. This is particularly useful for multilingual projects or researchers dealing with international texts.
Explanation:
-l deu: This option sets the language for OCR. ‘deu’ is the ISO 639-2 code for German, telling Tesseract to use German language data.
image.png: The input image file.
output: The base name for the text file output.
Example output:
The command generates ‘output.txt’ with the German text extracted from ‘image.png’, using German-specific language data for improved accuracy.
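The -l option also accepts several languages joined with a plus sign, for example -l deu+eng, which can help with documents that mix languages. Below is a minimal sketch, assuming the data files for every listed language are installed; ocr_langs is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: OCR an image with one or more languages.
# Usage: ocr_langs IMAGE OUTPUT_BASE LANG [LANG...]
ocr_langs() {
    img=$1
    out=$2
    shift 2
    langs=$(IFS=+; echo "$*")   # join the remaining arguments with "+"
    tesseract -l "$langs" "$img" "$out"
}
```

For instance, ocr_langs image.png output deu eng is equivalent to running tesseract -l deu+eng image.png output.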
Use case 3: List the ISO 639-2 codes of available languages
Code:
tesseract --list-langs
Motivation:
You may need to know which languages are supported by your Tesseract installation, especially if working on a project involving multiple languages. This command provides a convenient way to check that the language you need is available, ensuring that your OCR tasks proceed without unnecessary interruptions or errors.
Explanation:
--list-langs: This option instructs Tesseract to display a list of available language codes, representing different languages for OCR.
Example output:
Running this command will produce a list similar to:
eng
deu
fra
spa
...
Each code corresponds to a specific language, helping you choose the correct one for your OCR tasks.
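In a script, the same list can be checked before starting an OCR job. The sketch below assumes the codes are printed one per line, as in the example output; has_lang is a hypothetical helper, and stderr is merged in case a given release prints the list there.

```shell
# Hypothetical helper: succeed only if the given language code is installed.
has_lang() {
    # -F fixed string, -x whole-line match, -q quiet (exit status only)
    tesseract --list-langs 2>&1 | grep -Fqx "$1"
}
```

A typical use would be has_lang deu && tesseract -l deu image.png output, which runs the OCR step only when German language data is present.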
Use case 4: Specify a custom page segmentation mode
Code:
tesseract --psm 6 image.png output
Motivation:
Different types of documents and images require different approaches to text segmentation. For example, a page with a single column of text may benefit from a different page segmentation mode than a newspaper page with multiple columns. Choosing a mode that matches the layout can make text extraction more accurate and reliable.
Explanation:
--psm 6: ‘PSM’ stands for Page Segmentation Mode. Mode ‘6’ tells Tesseract to assume a single uniform block of text. There are 14 modes, addressing different layout assumptions.
image.png: The input image file.
output: The base name for the text file output.
Example output:
The output ‘output.txt’ is generated with Tesseract applying mode ‘6’, optimizing for images containing a single block of text.
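When it is unclear which mode fits an image, one practical approach is to run a few candidates and compare the results by eye. This is a minimal sketch under that assumption; compare_psm and the .psmN naming scheme are hypothetical choices made for illustration.

```shell
# Hypothetical helper: OCR one image with several page segmentation modes.
# Usage: compare_psm IMAGE MODE [MODE...]
compare_psm() {
    img=$1
    shift
    for mode in "$@"; do
        # e.g. page.png with mode 6 is written to page.psm6.txt
        tesseract --psm "$mode" "$img" "${img%.*}.psm$mode"
    done
}
```

Running compare_psm image.png 3 6 11 would produce image.psm3.txt, image.psm6.txt, and image.psm11.txt for side-by-side comparison.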
Use case 5: List page segmentation modes and their descriptions
Code:
tesseract --help-psm
Motivation:
Understanding which page segmentation mode to use can be challenging without knowledge of what each mode does. This command provides detailed descriptions of each mode, empowering you to select the most appropriate one for the specific structure and complexity of your documents.
Explanation:
--help-psm: This option lists all available page segmentation modes along with brief descriptions of their functions.
Example output:
Running the command will yield a brief explanation of each mode, such as:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
...
10 Treat the image as a single character.
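To look up a single mode without reading the whole list, the output can be filtered. The sketch below assumes each mode is printed on its own line starting with its number, as in the listing above; psm_info is a hypothetical helper name.

```shell
# Hypothetical helper: print the description line for one mode number.
psm_info() {
    # Match lines whose first field is exactly the requested number.
    tesseract --help-psm 2>&1 | grep -E "^[[:space:]]*$1[[:space:]]"
}
```

For example, psm_info 6 prints only the line that describes mode 6.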
Conclusion:
Tesseract is a capable OCR tool with options that cover most text extraction needs. Features such as language selection and page segmentation modes let you adapt it to the documents you are digitizing. This article covers only a few of Tesseract’s capabilities; the built-in help lists the remaining options.