Get pageid of a search object #62

khoivan88 · 2018-01-11T06:05:39Z

Hi, I am new to Python and pdfquery. I am writing a Python program that extracts information from PDF files and inserts it into a Word document. I am having trouble with one particular object, "minor spill". Specifically, I am trying to scrape the paragraph underneath "6.3 Methods and materials for containment and cleaning up". The content I want is "Contain spillage, and then collect with an electrically protected vacuum cleaner or by wet-brushing and place in container for disposal according to local regulations (see section 13). Keep in suitable, closed containers for disposal.", on page 2 of the PDF file. The problem is that, for this particular PDF file, my code also extracts "Product This combustible material may be burned in a chemical incinerator equipped with an afterburner and scrubber. Offer surplus and non-recyclable solutions to a licensed disposal company." on p. 5. Because I want to work with many PDF files that might have the "6.3..." content on different pages, I figure that if I can pass the pageid to the extraction, it should work.
My question: is there a way to get the pageid of an object (for example, "minor_spill" in my code)?
My code is below, and I have also attached the PDF file:
https://pastebin.com/rwseBSZV

Thank you very much!
PDF file:
932-66-1.pdf
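
To make the question concrete, here is a minimal sketch of the lookup I have in mind. It is not confirmed pdfquery usage: it assumes the parsed tree has `LTPage` elements carrying a `pageid` attribute with the text elements nested inside them, and it uses a small hand-built XML sample with the standard library instead of the real PDF. The helper name `find_pageid` and the sample tree are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for the tree a PDF parser might produce.
# Assumption: each page is an LTPage element with a "pageid" attribute.
SAMPLE_XML = """
<pdfxml>
  <LTPage pageid="2">
    <LTTextBoxHorizontal>
      <LTTextLineHorizontal>6.3 Methods and materials for containment and cleaning up</LTTextLineHorizontal>
    </LTTextBoxHorizontal>
  </LTPage>
  <LTPage pageid="5">
    <LTTextBoxHorizontal>
      <LTTextLineHorizontal>Product</LTTextLineHorizontal>
    </LTTextBoxHorizontal>
  </LTPage>
</pdfxml>
"""


def find_pageid(root, needle):
    """Return the pageid of the LTPage holding the first element whose text
    contains `needle`, or None if no such element sits inside an LTPage."""
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for element in root.iter():
        if element.text and needle in element.text:
            # Walk up the ancestors until an LTPage (or the top) is reached.
            node = element
            while node is not None and node.tag != "LTPage":
                node = parent_of.get(node)
            if node is not None:
                return node.get("pageid")
    return None


root = ET.fromstring(SAMPLE_XML)
pageid = find_pageid(root, "6.3 Methods and materials")
print(pageid)
```

If the real tree looks like this, the returned page id could then be used to restrict the extraction to that page, for example with a selector such as `LTPage[pageid="2"]`, so the "Product" paragraph on p. 5 would no longer match.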
