Get text from PDF Tesseract OCR

Get text from PDF Version 11 (Python)

Action group: Text recognition

Description

This action is designed to recognize text from the specified page of a PDF document and save the recognized text to a variable.

PDF file path - Path to PDF file to be recognized

Language of the text - Languages that the recognizer expects in the text

Page number - The number of the page of the file from which the text will be read

Result - Variable into which the recognized text will be saved

Property	Description	Type	Mandatory field
Parameters
PDF file path	The path to the PDF file for recognition	Robin.FilePath	Yes
Language of the text	Expected languages of the text in the PDF file. Dropdown list of options: Russian, English, Vietnamese, Arabic, Spanish, Portuguese, Indonesian, Persian, Turkish, Kazakh, Belarusian. Default value: Russian.	Robin.Collection	No
Page number	The page number of the file from which the text will be read	Robin.Numeric	No
Additional language	An additional language required for document recognition. Dropdown list of options: No, Russian, English, Vietnamese, Arabic, Spanish, Portuguese, Indonesian, Persian, Turkish, Kazakh, Belarusian. Default value: No. Selecting the same option in both the "Language of the text" and "Additional language" parameters does not cause an error; the duplicate is counted as one language.	Robin.Collection	No
Trained model	Tesseract trained model file in .traineddata format. Allows you to load your own model trained on the required fonts. If this parameter is set, it takes priority over the "Language of the text" and "Additional language" parameters.
Results
Result	Text recognized from the specified page of the PDF. If the document does not contain the specified page, a blank value will be stored.	Robin.Collection
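
The way the language parameters combine can be illustrated with a short Python sketch. It is not the action's actual implementation: the function names `resolve_languages` and `resolve_ocr_options` are hypothetical, and the mapping assumes the standard Tesseract language codes for the dropdown options. The sketch only shows the rules described in the table: a duplicate selection counts as one language, and a trained model, when given, takes priority over both language parameters.

```python
from pathlib import Path

# Assumed mapping from the dropdown options to standard Tesseract language codes.
LANGUAGE_CODES = {
    "Russian": "rus", "English": "eng", "Vietnamese": "vie", "Arabic": "ara",
    "Spanish": "spa", "Portuguese": "por", "Indonesian": "ind", "Persian": "fas",
    "Turkish": "tur", "Kazakh": "kaz", "Belarusian": "bel",
}


def resolve_languages(language="Russian", additional="No"):
    """Build a Tesseract language string such as "rus+eng"."""
    codes = []
    for name in (language, additional):
        if name == "No":  # "No" means no additional language was chosen
            continue
        code = LANGUAGE_CODES[name]
        if code not in codes:  # the same option chosen twice counts once
            codes.append(code)
    return "+".join(codes)


def resolve_ocr_options(language="Russian", additional="No", trained_model=None):
    """Return (lang, config) where a trained model overrides the languages."""
    if trained_model:
        model = Path(trained_model)
        # Tesseract looks up "<lang>.traineddata" inside the tessdata directory,
        # so the file name without its extension is used as the language name.
        return model.stem, f'--tessdata-dir "{model.parent}"'
    return resolve_languages(language, additional), ""
```

With these assumptions, selecting Russian in both dropdowns gives `rus`, and Russian plus English gives `rus+eng`. A custom file such as `invoice.traineddata` would be addressed as the language `invoice`.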

None.

There is a document in PDF format. The text from page 2 of the document needs to be retrieved.

Use the "Get text from PDF" action.

Result

The software robot completed successfully. The text from page 2 of the document has been retrieved.
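
For readers who want to reproduce a comparable step outside the platform, the following Python sketch reads one page of a PDF with Tesseract. It is an approximation under stated assumptions, not the action's implementation. It relies on the third-party packages `pdf2image` and `pytesseract` (plus the Poppler and Tesseract binaries), and the function name `get_text_from_pdf_page` and the file name `document.pdf` are hypothetical.

```python
def get_text_from_pdf_page(pdf_path, page_number, lang="rus"):
    """OCR a single page of a PDF; return "" if the page does not exist."""
    if page_number < 1:
        return ""  # pages are numbered from 1, so there is nothing to read

    # Third-party imports are kept inside the function so the sketch can be
    # loaded without them; they must be installed before a real page is read.
    from pdf2image import convert_from_path
    import pytesseract

    # Render only the requested page to an image. An out-of-range page number
    # is expected to produce an empty list (assumption about pdf2image).
    images = convert_from_path(
        str(pdf_path), dpi=300, first_page=page_number, last_page=page_number
    )
    if not images:
        return ""  # mirrors the blank result for a missing page
    return pytesseract.image_to_string(images[0], lang=lang)


# Usage matching the example above (requires the dependencies and a real file):
# text = get_text_from_pdf_page("document.pdf", 2)
```

The default `lang="rus"` mirrors the default value of the "Language of the text" parameter; several languages can be passed joined with `+`, for example `"rus+eng"`.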