Is it possible to extract tables that span multiple pages? #531

Open
All-In-Coder opened this issue Jan 10, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@All-In-Coder

I have a PDF in which a table is spread across multiple pages. I need it as a single CSV or Excel file.
I have also attached a screenshot of the PDF.

Steps to reproduce the bug

If you run the code below, it extracts the first table correctly, but it does not extract the table below it.

Expected behavior

Both tables should be combined into one single table.

Code

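import os
from typing import Any, Dict

import camelot


# Wrapper function added so the early return is valid; the name is illustrative.
def extract_tables(pdf_path: str, tables_dir: str) -> Dict[str, Any]:
    extracted_data: Dict[str, Any] = {}
    try:
        tables = camelot.read_pdf(pdf_path, pages="all")  # Extract all pages
    except Exception as e:
        print(f"Error extracting tables from {pdf_path}: {e}")
        return extracted_data

    # Store table data as CSV and include path in JSON
    os.makedirs(tables_dir, exist_ok=True)
    for i, table in enumerate(tables):
        table_filename = f"table_{i + 1}.csv"
        table_path = os.path.join(tables_dir, table_filename)
        table.to_csv(table_path, index=False)  # store as CSV
        extracted_data[f"table_{i + 1}"] = table_path
    return extracted_data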
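
For reference, here is a minimal sketch of the kind of merging I am after. It is not a camelot feature, only a post-processing workaround, and it assumes that a table continued on the next page keeps the same number of columns and that camelot returns tables in page order. The helper names (`merge_continued_tables`, `write_csv`, `rows_from_pdf`) are made up for illustration. The rows come from `table.df.values.tolist()`, since each camelot table exposes its data as a pandas DataFrame in `table.df`.

```python
import csv
from typing import List

Row = List[str]


def merge_continued_tables(tables: List[List[Row]]) -> List[List[Row]]:
    """Merge consecutive tables that have the same number of columns.

    Assumption: a table continued on the next page keeps its column count.
    A repeated header row at the top of a continuation is dropped.
    """
    merged: List[List[Row]] = []
    for rows in tables:
        if not rows:
            continue  # ignore empty tables
        if merged and len(merged[-1][0]) == len(rows[0]):
            # Skip the first row if it repeats the header of the running table.
            start = 1 if rows[0] == merged[-1][0] else 0
            merged[-1].extend(rows[start:])
        else:
            # Different column count: treat it as a new table.
            merged.append([list(r) for r in rows])
    return merged


def write_csv(rows: List[Row], path: str) -> None:
    """Write one merged table to a single CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def rows_from_pdf(pdf_path: str) -> List[List[Row]]:
    """Read every table in the PDF and return its rows as plain lists."""
    import camelot  # imported lazily so the helpers above work without it

    tables = camelot.read_pdf(pdf_path, pages="all")
    return [t.df.values.tolist() for t in tables]
```

Usage would be something like `merged = merge_continued_tables(rows_from_pdf(pdf_path))` followed by `write_csv(merged[0], "table_1.csv")`. This only helps if the second table is detected at all. If it is not, passing `flavor="stream"` to `camelot.read_pdf` might be worth trying, but I have not confirmed that it works for this PDF.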


PDF

Screenshots

image

image

Environment

  • OS: [e.g. macOS]
  • Python version:
  • Numpy version:
  • OpenCV version:
  • Ghostscript version:
  • camelot version:

Additional context

All-In-Coder added the bug label on Jan 10, 2025