majordome.pdftools
- class majordome.pdftools.PdfExtracted(meta: dict[str, Any], content: str)
Stores data extracted from a PDF file.
- class majordome.pdftools.PdfToTextConverter(tesseract: str | Path | None = None, pdftotext: str | Path | None = None, n_pages_warn: int = 100)
Performs text extraction from a PDF file.
Note that the two path arguments follow different conventions. For Tesseract, supply the path to the executable itself, not to its containing directory. For Poppler, supply the directory that contains pdftotext and the other tools. Either argument may be omitted if the corresponding tools are already in PATH.
- Parameters:
- tesseract : MaybePath, optional
Path to Tesseract executable. If None, searches in PATH.
- pdftotext : MaybePath, optional
Path to the Poppler directory containing pdftotext. If None, searches in PATH.
- n_pages_warn : int, optional
Number of pages above which a warning is issued. Default is 100.
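The conventions described by these parameters can be illustrated with a small standard-library sketch. It does not use majordome and is not its implementation: the helper names `resolve_tesseract`, `resolve_pdftotext`, and `warn_if_long` are hypothetical, and the exact checks performed by PdfToTextConverter are an assumption.

```python
from __future__ import annotations

import shutil
import warnings
from pathlib import Path


def resolve_tesseract(tesseract: str | Path | None = None) -> Path | None:
    """Hypothetical helper: expects the Tesseract executable itself."""
    if tesseract is None:
        # Fall back to a PATH lookup; returns None if nothing is found.
        found = shutil.which("tesseract")
        return Path(found) if found else None
    path = Path(tesseract)
    if path.is_dir():
        raise ValueError("expected the Tesseract executable, got a directory")
    return path


def resolve_pdftotext(poppler_dir: str | Path | None = None) -> Path | None:
    """Hypothetical helper: expects the directory holding Poppler's tools."""
    if poppler_dir is None:
        found = shutil.which("pdftotext")
        return Path(found) if found else None
    directory = Path(poppler_dir)
    if not directory.is_dir():
        raise ValueError("expected the directory containing pdftotext")
    # On Windows the file would typically be named pdftotext.exe.
    return directory / "pdftotext"


def warn_if_long(n_pages: int, n_pages_warn: int = 100) -> None:
    """Warn when the page count is above the threshold (assumed strict)."""
    if n_pages > n_pages_warn:
        warnings.warn(
            f"PDF has {n_pages} pages (more than {n_pages_warn}); "
            "extraction may be slow."
        )
```

In this sketch, passing a directory where the Tesseract executable is expected raises a `ValueError`, which mirrors the most likely mistake when configuring the two tools side by side.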
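- read(pdf_path: str | Path) → PdfReader | None
Performs some checks on the PDF, including whether it is encrypted. Per the signature, returns a PdfReader or None; None presumably indicates that the file is encrypted or failed a check.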