You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi, I am a newbie with Python and pdfquery . I am writing a python program to extract info from pdf files and then insert into a word document. I am having trouble with a particular object: "minor spill". Specifically, I am trying to scrap the content of the paragraph underneath "6.3 Methods and materials for containment and cleaning up" (the content I want is "Contain spillage, and then collect with an electrically protected vacuum cleaner or by wet-brushing and place in
container for disposal according to local regulations (see section 13). Keep in suitable, closed containers for disposal.", on page 2 of the pdf file. The problem is that for this particular pdf file, my code will also extract "Product This combustible material may be burned in a chemical incinerator equipped with an afterburner and scrubber. Offer surplus and non-recyclable solutions to a licensed disposal company." on p.5. Because I want to work with many pdf files that might have "6.3..." content on different page, I figure if I can pass the pageid in the extract then it should be fine.
My question is, is there a way you can get the pageid of a object (for example: "minor_spill" in my code.
My code is below and I also attach the pdf file: https://pastebin.com/rwseBSZV
Hi, I am a newbie with Python and pdfquery . I am writing a python program to extract info from pdf files and then insert into a word document. I am having trouble with a particular object: "minor spill". Specifically, I am trying to scrap the content of the paragraph underneath "6.3 Methods and materials for containment and cleaning up" (the content I want is "Contain spillage, and then collect with an electrically protected vacuum cleaner or by wet-brushing and place in
container for disposal according to local regulations (see section 13). Keep in suitable, closed containers for disposal.", on page 2 of the pdf file. The problem is that for this particular pdf file, my code will also extract "Product This combustible material may be burned in a chemical incinerator equipped with an afterburner and scrubber. Offer surplus and non-recyclable solutions to a licensed disposal company." on p.5. Because I want to work with many pdf files that might have "6.3..." content on different page, I figure if I can pass the pageid in the extract then it should be fine.
My question is, is there a way you can get the pageid of a object (for example: "minor_spill" in my code.
My code is below and I also attach the pdf file:
https://pastebin.com/rwseBSZV
Thank you very much!
PDF file:
932-66-1.pdf
The text was updated successfully, but these errors were encountered: