The general form of the CLI usage is:
python3 -m pdfsyntax COMMAND FILE
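For scripted use, the same invocation can be wrapped from Python. The sketch below relies only on the command form shown above; `build_command` and `run_pdfsyntax` are hypothetical helper names for illustration, not part of pdfsyntax, and `sys.executable` stands in for `python3`.

```python
import subprocess
import sys


def build_command(command, pdf_path):
    """Return the argument list for `python3 -m pdfsyntax COMMAND FILE`."""
    # sys.executable is the interpreter running this script (assumed to
    # be the one where pdfsyntax is installed).
    return [sys.executable, "-m", "pdfsyntax", command, str(pdf_path)]


def run_pdfsyntax(command, pdf_path):
    """Run one pdfsyntax command and return what it printed on stdout."""
    result = subprocess.run(
        build_command(command, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

With pdfsyntax installed, `print(run_pdfsyntax("overview", "file.pdf"))` would print the same text as the equivalent terminal command.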
You can get quick insights on a PDF file with these commands:
- overview: outputs text data about the structure and the metadata.
- disasm: outputs a dump of the file structure on the terminal.
- text: spatially extracts text content on all pages, as if it were a kind of scan.
- browse: outputs static HTML data that lets you browse the internal structure of the PDF file: the PDF source is pretty-printed and augmented with hyperlinks.
The overview command shows information about:
- the structure: Version, Pages, Revisions, etc.
- the metadata: Title, Author, Subject, etc.
The disasm command shows a terse, greppable view of the file's internal structure. Please refer to the Disassembler article for details.
The text command outputs the full text content with spatial awareness: the algorithm tries to respect the original layout, as if characters of all sizes were approximately rendered on a fixed-size grid.
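The fixed-size grid idea can be illustrated with a toy example. This is not pdfsyntax's actual algorithm, only a minimal sketch under stated assumptions: each text fragment is a hypothetical `(x, y, text)` tuple in PDF points, every character occupies one cell whatever its font size, and the cell dimensions (`cell_width`, `cell_height`) are arbitrary values chosen for illustration.

```python
def render_on_grid(fragments, cell_width=6.0, cell_height=12.0):
    """Place (x, y, text) fragments on a character grid and return the text."""
    if not fragments:
        return ""
    # PDF y-coordinates grow upward, so the largest y is the top row.
    top = max(y for _, y, _ in fragments)
    rows = {}
    for x, y, text in fragments:
        row = round((top - y) / cell_height)
        col = round(x / cell_width)
        line = rows.setdefault(row, [])
        # Pad the row with spaces up to the target column.
        if len(line) < col:
            line.extend(" " * (col - len(line)))
        # One character per cell; a later fragment overwrites any overlap.
        for offset, char in enumerate(text):
            position = col + offset
            if position < len(line):
                line[position] = char
            else:
                line.append(char)
    last_row = max(rows)
    # Rows with no fragment become empty lines, which keeps vertical gaps.
    return "\n".join(
        "".join(rows.get(r, [])).rstrip() for r in range(last_row + 1)
    )


# Hypothetical fragments: a title, then two columns on a lower line.
sample = [(0, 700, "Title"), (0, 676, "Name"), (60, 676, "Value")]
print(render_on_grid(sample))
```

The sample prints `Title`, an empty line, then `Name` and `Value` on the same line separated by six spaces, which shows how horizontal and vertical gaps survive as whitespace.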
The font listing shows the fonts used in the file as tabular data with the following columns:
- Name
- Type
- Encoding
- Object number and generation number, comma separated
- Number of pages where it occurs
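When post-processing this listing in a script, the columns above map naturally onto a small record type. The sketch below is an assumption-laden illustration: `FontEntry` and `parse_object_ref` are hypothetical names, the sample values are made up, and the way columns are separated in the real output is not specified here, so only the comma-separated object reference is parsed.

```python
from dataclasses import dataclass


@dataclass
class FontEntry:
    """One row of the font listing, using the columns described above."""
    name: str
    font_type: str
    encoding: str
    object_number: int
    generation_number: int
    page_count: int


def parse_object_ref(field):
    """Split an 'object,generation' field such as '12,0' into two ints."""
    object_number, generation_number = field.split(",")
    return int(object_number), int(generation_number)


# Illustrative values only, not taken from a real file.
obj, gen = parse_object_ref("12,0")
entry = FontEntry("Helvetica", "Type1", "WinAnsiEncoding", obj, gen, 3)
print(entry)
```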
The browse command generates HTML output that looks like the raw PDF file, with additional hyperlinks and information that expose its internal structure and the relations between its objects. Redirect the standard output to a file that you can open in your browser:
python3 -m pdfsyntax browse file.pdf > inspection_file.html
Please refer to the Browse article for details.
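The same redirection can be done from a Python script instead of the shell. This is a minimal sketch that only assumes the `browse` invocation shown above; `browse_command` and `write_browse_html` are hypothetical helper names, not part of pdfsyntax.

```python
import subprocess
import sys
import webbrowser
from pathlib import Path


def browse_command(pdf_path):
    """Return the argument list for `python3 -m pdfsyntax browse FILE`."""
    return [sys.executable, "-m", "pdfsyntax", "browse", str(pdf_path)]


def write_browse_html(pdf_path, html_path):
    """Run the browse command and save its standard output to html_path."""
    html_path = Path(html_path)
    # Equivalent of the shell redirection `> inspection_file.html`.
    with open(html_path, "wb") as out:
        subprocess.run(browse_command(pdf_path), stdout=out, check=True)
    return html_path


def open_in_browser(html_path):
    """Open the generated file in the default web browser."""
    webbrowser.open(Path(html_path).resolve().as_uri())
```

With pdfsyntax installed, `open_in_browser(write_browse_html("file.pdf", "inspection_file.html"))` would generate the page and open it.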
TO BE CONTINUED