majordome.pdftools

class majordome.pdftools.PdfExtracted(meta: dict[str, Any], content: str)

Bases: object

Stores data extracted from a PDF file.

content: str

meta: dict[str, Any]
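As a rough illustration of the shape documented above, the following sketch defines a stand-in container with the same two fields. `PdfExtractedSketch` and the sample values are hypothetical; the real class lives in `majordome.pdftools` and may carry additional behavior.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PdfExtractedSketch:
    """Hypothetical stand-in mirroring the documented fields of PdfExtracted."""

    meta: dict[str, Any]
    content: str


# Sample values are made up for illustration only.
doc = PdfExtractedSketch(meta={"title": "Example", "pages": 3}, content="Hello PDF")
print(doc.meta["pages"], len(doc.content))
```

Keeping metadata and text in one object lets downstream code pass a single value around instead of a loose `(dict, str)` pair.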

class majordome.pdftools.PdfToTextConverter(tesseract: str | Path | None = None, pdftotext: str | Path | None = None, n_pages_warn: int = 100)

Bases: object

Performs text extraction from a PDF file.

Note that for Tesseract the path to the executable itself must be supplied (if it is not in PATH), not its containing directory. For Poppler, by contrast, supply the directory containing pdftotext and the other tools (if these are not in PATH).

Parameters:

tesseract (MaybePath, optional): Path to the Tesseract executable. If None, it is searched for in PATH.
pdftotext (MaybePath, optional): Path to the directory containing pdftotext. If None, it is searched for in PATH.
n_pages_warn (int, optional): Number of pages above which a warning is issued. Default is 100.
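The parameter semantics above (executable path versus containing directory, PATH fallback on None, and the page-count warning) can be sketched with the standard library alone. This is not the library's implementation: `resolve_tesseract`, `resolve_pdftotext`, and `warn_if_long` are hypothetical helpers written only to illustrate one plausible reading of the documented behavior.

```python
import shutil
import warnings
from pathlib import Path


def resolve_tesseract(tesseract=None):
    """Return the Tesseract executable path, or None if it cannot be found."""
    if tesseract is None:
        # Assumption: None means "look the executable up in PATH".
        return shutil.which("tesseract")
    # An explicit value is taken to be the executable itself.
    return str(Path(tesseract))


def resolve_pdftotext(pdftotext=None):
    """Return the pdftotext executable, given Poppler's directory or None."""
    if pdftotext is None:
        return shutil.which("pdftotext")
    # An explicit value is a directory; search only that directory.
    return shutil.which("pdftotext", path=str(pdftotext))


def warn_if_long(n_pages, n_pages_warn=100):
    """Issue a warning when the page count exceeds the threshold."""
    if n_pages > n_pages_warn:
        warnings.warn(f"PDF has {n_pages} pages (more than {n_pages_warn})")
        return True
    return False
```

The asymmetry is the point of the sketch: the Tesseract argument is used as given, while the Poppler argument is treated as a search directory.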

read(pdf_path: str | Path) → PdfReader | None

Performs some checks on the PDF, including that it is not encrypted. Per the annotated return type, the result is a PdfReader or None rather than a boolean.
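One way to picture a `PdfReader | None` result is a guard that hands back the reader only when it passes the encryption check. The sketch below is an assumption, not the library's code: it supposes the reader exposes an `is_encrypted` attribute (as pypdf's `PdfReader` does) and that None is returned for encrypted files. `ReaderLike`, `reader_or_none`, and `FakeReader` are hypothetical names, and the fake reader exists only so the example runs without a PDF library.

```python
from typing import Optional, Protocol


class ReaderLike(Protocol):
    """Hypothetical minimal reader interface used only in this sketch."""

    is_encrypted: bool


def reader_or_none(reader: ReaderLike) -> Optional[ReaderLike]:
    """Return the reader if it is not encrypted, otherwise None."""
    if reader.is_encrypted:
        return None
    return reader


class FakeReader:
    """Stand-in object so the sketch runs without any PDF library."""

    def __init__(self, is_encrypted: bool) -> None:
        self.is_encrypted = is_encrypted


plain = FakeReader(is_encrypted=False)
locked = FakeReader(is_encrypted=True)
print(reader_or_none(plain) is plain, reader_or_none(locked) is None)
```

Under this reading, callers would test the result against None before attempting any extraction.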
