Extract data to table Version 8 (Python)

Action group: Robin OCR 


Description

The action recognizes the text in a document, divides it into blocks, and returns it in tabular form.

Action icon

Parameter settings

Property | Description | Type | Filling example | Mandatory field

Parameters

Path to file | The path to the file to extract the data from. Supported formats: jpg, jpeg, bmp, png, tif, pdf. | Robin.FilePath | | Yes
Page number | The number of the page in the document from which to extract data. If the field is empty, data is retrieved from all pages. | Robin.Numeric | | No
Language | The expected language of the text to extract. | Robin.String | | No
Additional language | An additional language required for document recognition. | Robin.String | | No
Algorithm | If «Text» is selected, the action recognizes only text data. If «Table», the action recognizes only tabular data. If «Text and Table», the action recognizes any data. | Robin.String | | Yes
Distance between words | The maximum distance between words in the document's text data. Used to divide text into columns in the resulting table. The default value is 20 pixels. | Robin.Numeric | | No
Distance between lines | The maximum distance between lines in the document's text data. Used to divide text into lines in the resulting table. The default value is 1 pixel. | Robin.Numeric | | No
Folder path | The path to the folder where the image of the document page is saved, overlaid with the blocks into which the action has divided the data. To save the file, the «File name» field must also be filled in. | Robin.FolderPath | C:\doc\img | No
File name | The name of the document page image with overlaid blocks (no extension). If the action retrieves data from several pages, a separate file is created for each page, with an index added to its name. To save the file, the «Folder path» field must also be filled in. | Robin.String | | No
Overwrite | If the value is «true» and an image file with the same name, index, and extension exists in the specified folder, the new file overwrites it. If «false», the file is not overwritten and the action returns an error. | Robin.Boolean | true | No
Parameters | Additional parameters that affect the result and quality of text recognition. | Robin.String | | No
Sign | If the value is «true», the word «Part» and the ordinal number of the part are added before each recognized tabular or text part. If the value is «false», a blank line is inserted before each part. | Robin.Boolean | true | No
Trained model | A file with a trained Tesseract model in .traineddata format. | Robin.FilePath | | No

Results

Table | A table generated from data retrieved from the source document. | Robin.DataTable | |
Path to image with blocks | A collection of paths to the image files of document pages with overlaid blocks. | Robin.Collection | |
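
The supported formats listed for "Path to file" can be checked before the action is run. The following is a minimal sketch, not part of the action itself: the helper name is_supported is hypothetical, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# Extensions listed in the "Path to file" description.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".bmp", ".png", ".tif", ".pdf"}


def is_supported(path):
    """Return True if the file extension is in the supported list.

    Case-insensitive matching is an assumption, not documented behavior.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A robot could call such a check on the input path and stop early with a clear message instead of passing an unsupported file to the action.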

Special conditions of use

The action recognizes the text of the document, splits it into blocks according to the input parameters (distance between words and distance between lines), and returns the extracted data as a table. The source document does not have to contain a table.

The input document can contain:

  • only the text layer of a PDF document
  • only images
  • both a text layer and images.

The action is based on a text block extraction algorithm. Words and lines of the document are combined into blocks based on the maximum distance between words and the maximum distance between lines. Both values are set in the input parameters of the action.
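
The grouping idea can be illustrated with a short sketch. This is not the action's actual implementation: the Word structure, the function names, and the exact gap rules are assumptions chosen for illustration. Only the two thresholds and their defaults (20 pixels and 1 pixel) come from the parameter descriptions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with its bounding box in pixels (hypothetical model)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def group_rows(words, max_line_gap=1.0):
    """Start a new row when the vertical gap to the current row exceeds max_line_gap."""
    rows = []
    row_bottom = None
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if row_bottom is None or w.y - row_bottom > max_line_gap:
            rows.append([w])
            row_bottom = w.y + w.height
        else:
            rows[-1].append(w)
            row_bottom = max(row_bottom, w.y + w.height)
    return rows


def split_cells(row, max_word_gap=20.0):
    """Start a new cell when the horizontal gap to the previous word exceeds max_word_gap."""
    cells = []
    right_edge = None
    for w in sorted(row, key=lambda w: w.x):
        if right_edge is None or w.x - right_edge > max_word_gap:
            cells.append(w.text)
        else:
            cells[-1] += " " + w.text
        right_edge = w.x + w.width
    return cells


def words_to_table(words, max_word_gap=20.0, max_line_gap=1.0):
    """Combine both steps: rows by line distance, then cells by word distance."""
    return [split_cells(row, max_word_gap) for row in group_rows(words, max_line_gap)]
```

In this sketch, increasing the word distance merges neighboring words into one cell, while decreasing it produces more columns. The line distance affects rows in the same way.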

Examples of suitable documents: cash register receipts; documents containing tabular data without delimiters; documents containing continuous text.

The result depends on the value of the "Algorithm" parameter:

  • "Table": the action returns text from tables only, preserving the table markup.
  • "Text": the action returns all text from the source document. If tables are found, their text is extracted by distance rather than by table markup.
  • "Text and Table": the action recognizes and returns table text and plain text as separate parts, preserving the markup for tables.
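
The three "Algorithm" modes, together with the "Sign" parameter, can be pictured with the following sketch. The Part structure, the function name build_result, and the exact "Part N" label format are assumptions made for illustration. The real action may assemble its output differently.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """A detected part of the document (hypothetical model)."""
    kind: str                                       # "table" or "text"
    by_distance: list                               # rows built from word/line distances
    by_markup: list = field(default_factory=list)   # rows built from table markup (tables only)


def build_result(parts, algorithm, sign=True):
    """Pick the rows to return for each part according to the algorithm mode."""
    if algorithm == "Table":
        chosen = [p.by_markup for p in parts if p.kind == "table"]
    elif algorithm == "Text":
        # Everything is read by distance; table markup is ignored.
        chosen = [p.by_distance for p in parts]
    elif algorithm == "Text and Table":
        chosen = [p.by_markup if p.kind == "table" else p.by_distance for p in parts]
    else:
        raise ValueError(f"unknown algorithm: {algorithm!r}")

    rows = []
    for number, part_rows in enumerate(chosen, start=1):
        # Sign = true: a "Part N" label row; Sign = false: a blank row (label format assumed).
        rows.append([f"Part {number}"] if sign else [])
        rows.extend(part_rows)
    return rows
```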

Note: if the trained model does not work or is poorly trained, use the values from the "Additional language" list in place of the values in the "Language" dropdown list.

Example of use

Task

Recognize the table in the obrazec.pdf document and write the result to a table in CSV format.

Solution

Use the actions "Extract data to table", "Table to CSV". 

Implementation

  1. Assemble a robot scheme consisting of the actions "Extract data to table" and "Table to CSV".

  2. Set the parameters for the "Extract data to table" action. 
  3. Set the parameters for the "Table to CSV" action. 
  4. Click on the "Start" button in the top panel. 

Result

The software robot completed successfully. The data from the document was extracted to a table in CSV format.

Pages from the document are saved in .png format to the specified folder with the blocks highlighted.


The table is generated and saved in .csv format.
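
For readers who want to see what the final step amounts to, here is a minimal sketch of writing table rows as CSV with Python's standard csv module. It is not the "Table to CSV" action itself. The function name table_to_csv and the default semicolon delimiter are assumptions.

```python
import csv
import io


def table_to_csv(rows, delimiter=";"):
    """Serialize a list of rows to CSV text (the delimiter choice is an assumption)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Cells that contain the delimiter are quoted automatically by csv.writer, so text blocks with punctuation survive the round trip.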

 

 

 

 
