14 PDF Tools
aka majordome_utilities.pdftools
14.1 PdfToTextConverter
Below we illustrate the usage of PdfToTextConverter. Note that extracted text still requires manual curation if readability matters. If automated extraction is consistently poor for a specific language, you may want to look into training tesseract for it; that topic is not covered here.
Note: this section assumes that tesseract, poppler, and ImageMagick are available in the system path. Under Windows you might struggle to get them all working together; please check Majordome’s Kompanion for automatic installation.
Install dependencies on Ubuntu 22.04:
sudo apt install tesseract-ocr imagemagick poppler-utils
In case of Rocky Linux 9:
sudo dnf install tesseract tesseract-langpack-eng ImageMagick poppler-utils
Assuming the dependencies are found in the path, it is simply a matter of creating a converter:
from majordome_utilities.pdftools import PdfToTextConverter
converter = PdfToTextConverter()
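If the converter fails at runtime, a quick check that the external tools are visible on the path can help narrow down the problem. The sketch below is only an illustration: the executable names it looks for (`tesseract`, `pdftoppm`/`pdftotext`, `magick`/`convert`) are assumptions about what the packages above install, not a statement of which binaries the converter actually invokes, and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names per dependency; the binaries actually called
# by the converter may differ from these.
REQUIRED_TOOLS = {
    "tesseract": ("tesseract",),
    "poppler": ("pdftoppm", "pdftotext"),
    "ImageMagick": ("magick", "convert"),
}


def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the dependencies for which no candidate executable is on PATH."""
    missing = []
    for package, candidates in required.items():
        # A dependency counts as present if any of its candidates resolves.
        if not any(which(name) for name in candidates):
            missing.append(package)
    return missing


if __name__ == "__main__":
    print(find_missing_tools() or "all tools found")
```

Under Windows, keep in mind that `convert` may resolve to an unrelated system utility, so a positive result for ImageMagick there is only indicative.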
For generated PDFs (not scanned documents), it is much faster to avoid OCR; below we show the metadata extracted from a paper:
data = converter("data/docs/paper.pdf", use_ocr=False)
data.meta
{'/Author': "W. Dal'Maz Silva",
'/CreationDate': "D:20170403224009+05'30'",
'/Creator': 'Elsevier',
'/CrossMarkDomains[1]': 'elsevier.com',
'/CrossMarkDomains[2]': 'sciencedirect.com',
'/CrossmarkDomainExclusive': 'true',
'/CrossmarkMajorVersionDate': '2010-04-23',
'/ElsevierWebPDFSpecifications': '6.5',
'/Keywords': 'Hardness measurement; Martensite; Low-alloy steel; Precipitation',
'/ModDate': "D:20170403224009+05'30'",
'/Subject': 'Materials Science & Engineering A, 693 (2017) 225-232. doi:10.1016/j.msea.2017.03.077',
'/Title': 'Carbonitriding of low alloy steels_ Mechanical and metallurgical responses',
'/doi': '10.1016/j.msea.2017.03.077',
'/robots': 'noindex'}
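The date entries in the metadata above use the PDF date string format (`D:YYYYMMDDHHmmSS` followed by a time-zone offset). As a minimal sketch, assuming the full form shown above is present, such a value can be turned into a standard `datetime`; `parse_pdf_date` is a hypothetical helper and not part of the library.

```python
import re
from datetime import datetime, timedelta, timezone

# Full-form PDF date: D:YYYYMMDDHHmmSS, optionally followed by Z or +HH'mm'
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})'?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a full-form PDF date string into a datetime object."""
    match = _PDF_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {value!r}")

    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    zulu, sign, off_h, off_m = groups[6:]

    if zulu:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m))
        tz = timezone(delta if sign == "+" else -delta)
    else:
        tz = None  # no offset given: return a naive datetime

    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20170403224009+05'30'")
print(created.isoformat())  # 2017-04-03T22:40:09+05:30
```

Shortened date strings, which the PDF format also allows (for example year only), are deliberately rejected here to keep the sketch short.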
For scanned documents, OCR is used by default as a fallback method for text extraction, even when it has not been explicitly enabled:
data = converter("data/docs/scanned.pdf", last_page=1)
data.content[:500]
'549\n\n5.. Uber die von der molekularkinetischen Theorie\nder Wirme geforderte Bewegung von in ruhenden\nFlussigkeiten suspendierten Teilchen;\nvon A. Einstein.\n\nIn dieser Arbeit soll gezeigt werden, daB nach der molekular-\nkinetischen Theorie der Warme in Flissigkeiten suspendierte\nKorper von mikroskopisch sichtbarer GroBe infolge der Mole-\nkularbewegung der Wirme Bewegungen von solcher GrifSe\nausfiihren miissen, daB diese Bewegungen leicht mit dem\nMikroskop nachgewiesen werden konnen. Es ist moglic'
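The output above shows why curation is still needed: words are hyphenated across line breaks and several characters are misrecognized. A first mechanical cleanup pass could look like the sketch below. It only repairs layout (hyphenation, line wrapping, repeated spaces) and does not correct misread characters; `tidy_ocr_text` is a hypothetical helper and not part of the library.

```python
import re


def tidy_ocr_text(text: str) -> str:
    """Apply simple layout repairs to OCR output."""
    # Rejoin words split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single newlines (line wrapping) by spaces, while keeping
    # blank lines, which are treated as paragraph separators.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "nach der molekular-\nkinetischen Theorie\nder Warme\n\nvon A. Einstein."
print(tidy_ocr_text(sample))
```

Note that the first rule also removes legitimate hyphens that happen to fall at the end of a line, so the result should still be proofread.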