PDF text extraction

Below we illustrate the usage of PdfToTextConverter. Please note that extracted texts still need manual curation if readability is a requirement. If automated extraction is often poor for a specific language, you might want to look into training tesseract for that language; that topic is not covered here.

Note: this guide assumes majordome has been installed with the optional pdftools dependencies, i.e. pip install majordome[pdftools]; it also assumes tesseract, poppler, and ImageMagick are available in the system path. Under Windows you might struggle to get them all working together; please check Majordome’s Kompanion for automatic installation.

Install dependencies on Ubuntu 22.04:

sudo apt install tesseract-ocr imagemagick poppler-utils

In case of Rocky Linux 9:

sudo dnf install tesseract tesseract-langpack-eng ImageMagick poppler-utils
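
Before creating a converter, it can help to confirm that the external tools are actually reachable from Python. The sketch below uses only the standard library; the executable names (`tesseract`, `pdftoppm`, `convert`) are assumptions about what a typical installation of these packages provides, not a statement of what majordome calls internally, and `find_missing` is a hypothetical helper.

```python
import shutil


def find_missing(executables):
    """Return the names in `executables` that are not found on the system path."""
    return [name for name in executables if shutil.which(name) is None]


# Assumed executable names; adjust them to match your installation
# (for instance, recent ImageMagick releases ship `magick` instead of `convert`).
required = ["tesseract", "pdftoppm", "convert"]

missing = find_missing(required)
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All executables found.")
```

If something is reported missing, revisit the installation commands above before going further.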
[1]:
from majordome import PdfToTextConverter

Assuming the dependencies are found in the path, it is simply a matter of creating a converter:

[2]:
converter = PdfToTextConverter()

For digitally generated PDFs (not scanned documents), it is much faster to avoid using OCR; below we show the metadata from a paper:

[3]:
data = converter("data/sample-pdf/paper.pdf", use_ocr=False)
data.meta
[3]:
{'/Author': "W. Dal'Maz Silva",
 '/CreationDate': "D:20170403224009+05'30'",
 '/Creator': 'Elsevier',
 '/CrossMarkDomains[1]': 'elsevier.com',
 '/CrossMarkDomains[2]': 'sciencedirect.com',
 '/CrossmarkDomainExclusive': 'true',
 '/CrossmarkMajorVersionDate': '2010-04-23',
 '/ElsevierWebPDFSpecifications': '6.5',
 '/Keywords': 'Hardness measurement; Martensite; Low-alloy steel; Precipitation',
 '/ModDate': "D:20170403224009+05'30'",
 '/Subject': 'Materials Science & Engineering A, 693 (2017) 225-232. doi:10.1016/j.msea.2017.03.077',
 '/Title': 'Carbonitriding of low alloy steels_ Mechanical and metallurgical responses',
 '/doi': '10.1016/j.msea.2017.03.077',
 '/robots': 'noindex'}
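
The date fields in this metadata use the PDF date string format (`D:YYYYMMDDHHmmSS` followed by a time zone offset). As a minimal sketch, assuming the full-length form seen above, such a string can be converted to a `datetime` with the standard library. `parse_pdf_date` is a hypothetical helper, not part of majordome, and it does not handle the shortened forms the PDF format also permits.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_pdf_date(value):
    """Parse a full-length PDF date string such as "D:20170403224009+05'30'"."""
    pattern = r"D:(\d{14})(?:([+-])(\d{2})'(\d{2})'?|(Z))?"
    match = re.fullmatch(pattern, value)
    if match is None:
        raise ValueError(f"Unsupported PDF date string: {value!r}")

    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    sign, hours, minutes, zulu = match.group(2, 3, 4, 5)

    if zulu:
        # A trailing 'Z' denotes UTC.
        return stamp.replace(tzinfo=timezone.utc)
    if sign is None:
        # No offset given: return a naive datetime.
        return stamp

    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        offset = -offset
    return stamp.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20170403224009+05'30'")
print(created.isoformat())
```

With the metadata above, the same call could be applied to `data.meta['/CreationDate']`.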

For scanned documents, OCR serves by default as the fallback method for text extraction, even when it has not been explicitly enabled:

[4]:
data = converter("data/sample-pdf/scanned.pdf", last_page=1)
data.content[:500]
[4]:
'549\n\n5.. Uber die von der molekularkinetischen Theorie\nder Wdirme geforderte Bewegung von in ruhenden\nFlissigkeiten suspendierten Teilchen;\nvon A, Einstein.\n\nIn dieser Arbeit soll gezeigt werden, daB nach der molekular-\nkinetischen Theorie der Warme in Flissigkeiten suspendierte\nKérper von mikroskopisch sichtbarer GréBe infolge der Mole-\nkularbewegung der Warme Bewegungen von solcher GréBe\nausfiihren miissen, daB diese Bewegungen leicht mit dem\nMikroskop nachgewiesen werden kénnen. Es ist méglic'
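
The output above shows typical OCR artifacts: words hyphenated across line breaks and hard newlines inside paragraphs. As a first curation step, these can be tidied with a couple of regular expressions. This is only a sketch under simple assumptions; `tidy_ocr_text` is a hypothetical helper, and it will also merge genuine hyphenated compounds that happen to fall at a line end. Misrecognized characters (such as the umlauts here) still need manual review or a better-trained OCR model.

```python
import re


def tidy_ocr_text(text):
    """Rejoin hyphenated line breaks and unwrap hard-wrapped paragraphs."""
    # Rejoin words split across a line break: "Mole-\nkular" -> "Molekular".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces; blank lines (paragraph breaks) are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


sample = "infolge der Mole-\nkularbewegung der Warme Bewegungen\nausfiihren miissen\n\nNext paragraph"
print(tidy_ocr_text(sample))
```

The same function could be applied to `data.content` before saving the extracted text.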