Sick Codes - Security Research, Hardware & Software Hacking, Consulting, Linux, IoT, Cloud, Embedded, Arch, Tweaks & Tips!

Convert PDF to TXT in GNU/Linux – How To Turn Images, Scans and PDF into TXT or DOCX Format! OCR Images & PDF Using Free & Open Source OCR.

by Sick Codes
October 5, 2020
in Tutorials

Two programs are recommended for converting images or PDF files to text:

To extract raw text, you can use tesseract.

Tesseract command-line OCR engine

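# ubuntu, debian, pop (the package is named tesseract-ocr)
sudo apt update && sudo apt install -y tesseract-ocr
# arch, manjaro (tesseract-data-eng provides the English language data)
sudo pacman -S tesseract tesseract-data-eng
# rhel, fedora, centos
sudo yum update -y || sudo dnf update -y
sudo yum install -y tesseract || sudo dnf install -y tesseract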
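
Before running any OCR, it can help to confirm that the binary is on your PATH and that at least one language pack is installed. The following is a minimal sketch; check_tesseract is a hypothetical helper name, not part of tesseract itself.

```shell
# Hypothetical helper: print the tesseract version and the installed
# language packs, or a hint if the binary is not on PATH yet.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # Older releases print the version to stderr, so merge both streams.
        tesseract --version 2>&1 | head -n 1
        tesseract --list-langs
    else
        echo "tesseract not found: install it with your package manager first" >&2
        return 1
    fi
}
```

Paste the function into your shell and run check_tesseract. If eng is missing from the list, install the English language data package for your distribution.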

To create more advanced documents, you can use gimagereader:

# ubuntu, debian, pop
sudo apt update -y && sudo apt install gimagereader -y
# arch, manjaro
sudo pacman -Syu gimagereader

RHEL, Fedora and CentOS users can find the RPM here: https://fedora.pkgs.org/30/fedora-updates-x86_64/gimagereader-gtk-3.3.1-1.fc30.x86_64.rpm.html

For example, we are going to convert the following image into plain text with tesseract, and then convert the same image with gimagereader to keep its formatting.

Convert PNG, JPG to TXT in GNU/Linux

PDF to TXT Document OCR in Linux

tesseract PDF-to-TXT-Document-OCR-in-Linux.png stdout -l eng

This yields:

Optional Licence Elements
Along with the basic rights and obligations set out in each CC licence, there are a set of
“optional’ licence elements which can be added by the creator of the work.
These elements allow the creator to select the different ways they want the public to use
their work. The creator can mix and match the elements to produce the CC licence they
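
The example above feeds tesseract a PNG. tesseract reads image files rather than PDFs, so a scanned PDF has to be rendered to images first. The sketch below assumes pdftoppm (from the poppler-utils package, which is not mentioned in the original post) is installed and that the document is in English; pdf_to_txt is a hypothetical helper name.

```shell
# Hypothetical helper: OCR every page of a PDF into a single text file.
# Usage: pdf_to_txt input.pdf [output.txt]
# Assumes pdftoppm (poppler-utils) and tesseract are installed.
# The body runs in a subshell so its variables do not leak into your session.
pdf_to_txt() (
    pdf="$1"
    out="${2:-${pdf%.pdf}.txt}"
    tmpdir="$(mktemp -d)" || exit 1
    # Render each page to a 300 DPI PNG named page-1.png, page-2.png, ...
    # (pdftoppm zero-pads the numbers for longer documents, so the glob
    # below still visits the pages in order).
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page" || { rm -rf "$tmpdir"; exit 1; }
    : > "$out"
    for img in "$tmpdir"/page-*.png; do
        # "stdout" as the output base makes tesseract print the text.
        tesseract "$img" stdout -l eng >> "$out"
    done
    rm -rf "$tmpdir"
)
```

Running pdf_to_txt scan.pdf would write scan.txt next to the PDF. 300 DPI is a common starting point for OCR, not a requirement; adjust it to suit your scans.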

This is great for block text, but what if we want to keep the formatting?

Then we can use gimagereader (which uses tesseract too!)

Step 1: Open gimagereader-gtk or gimagereader-qt and drag in the file you want to convert.

gimagereader convert pdf img png jpg to text or pdf in Linux using OCR

Step 2:

Convert Image or PDF to Plaintext using gimagereader

Option 1: Change OCR mode to Plain text and then click Recognize all:

Convert images to text in Linux using gimagereader

Option 2: Convert images to text in Linux and keep the formatting:

Convert images and pdf to text in Linux keep formatting

Now, export in your desired format!

You can even export images to PDF with the recognized text overlaid at its original position!

Export images as text pdf on Linux

You can select PDF with invisible text.

Invisible text means the page still looks like the original image, but you can highlight and copy the recognized text when viewing it in any PDF viewer.

You can also choose a suitable font for your exported PDF.

Overlay pdf file with original text using OCR in Linux

Now you can highlight text from the original image that is now converted to a PDF file!

Export PDF with overlay text that you can highlight in Linux
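
If you prefer the command line, tesseract can also write a PDF with a selectable text layer directly. This is a minimal sketch rather than a replacement for the gimagereader workflow, which gives you control over fonts and layout. image_to_searchable_pdf is a hypothetical helper name, and English is assumed as the language.

```shell
# Hypothetical helper: write a searchable PDF next to the source image.
# Usage: image_to_searchable_pdf scan.png   (creates scan.pdf)
# The body runs in a subshell so its variables do not leak into your session.
image_to_searchable_pdf() (
    img="$1"
    base="${img%.*}"
    # The trailing "pdf" selects tesseract's PDF output config;
    # tesseract appends the .pdf extension to the output base itself.
    tesseract "$img" "$base" -l eng pdf
)
```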

If you have any questions, or you would like us to convert files in bulk for you, send us a contact form with a Dropbox or Google Drive link and we will quote you a price for the time!

Let us know if you have any questions in the comments!

Sick.Codes
