chore: update output schema for parse and extract_tables #66

Merged
merged 2 commits into from
Nov 19, 2024

SeisSerenata
Collaborator

Description

This PR changes the output schema of the parse and extract_tables functions so that they consistently return markdown content as a list instead of a single joined string. The list form gives downstream processing more flexibility. Code that still needs the previous single-string form can recover it by joining the list, which is how the updated tests compare against ground truth.

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

  • Updated all existing tests to handle the new list-based output format
  • Tests have been modified to join the markdown list elements when comparing with ground truth
  • All test cases pass with the new schema

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

Key changes:

  1. Modified parse() and extract_tables() to return markdown as a list instead of joining it
  2. Updated async_fetch() to maintain consistency with the new return format
  3. Updated all test cases to handle the new list-based output format
  4. Maintained backward compatibility by joining lists where needed for comparison
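
For callers adapting to the new return shape, the change can be sketched as below. This is a minimal illustration, not the library's code: `fake_parse` is a hypothetical stand-in for `parse()`, the idea that each list element corresponds to one page is an assumption, and the elapsed-time value is a placeholder. Only the `"\n".join(...)` step is taken from the updated tests.

```python
from typing import List, Tuple


def fake_parse(file_path: str) -> Tuple[List[str], str]:
    """Hypothetical stand-in for parse(): returns (markdown_list, elapsed_time)."""
    # Assumption: one list element per page. Real content would come from the parser.
    pages = ["# Page 1\nfirst page text", "# Page 2\nsecond page text"]
    return pages, "0.00 seconds"  # placeholder elapsed-time value


def join_markdown(markdown_list: List[str]) -> str:
    """Recover the old single-string form the same way the updated tests do."""
    return "\n".join(markdown_list)


# New schema: the first return value is a list of markdown strings.
markdown_list, elapsed_time = fake_parse(file_path="example.pdf")

# Old behaviour, reconstructed by the caller when a single string is needed.
markdown = join_markdown(markdown_list)
```

Keeping the list as the primary return value and joining only at the point of use lets callers that want per-element access skip the join entirely.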

Member

@lingjiekong lingjiekong left a comment

LGTM.

Member

@lingjiekong lingjiekong left a comment

LGTM with minor comment

@@ -52,7 +52,8 @@ def test_pdf_sync_parse(self):
 correct_output_file = "./tests/outputs/correct_pdf_output.txt"

 # extract
-markdown, elapsed_time = self.ap.parse(file_path=working_file)
+markdown_list, elapsed_time = self.ap.parse(file_path=working_file)
+markdown = "\n".join(markdown_list)
Member

nit: it might be better to check the result page by page.
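
One way to read this suggestion is to compare each list element against its own expected value rather than joining first. The sketch below is only an illustration under assumptions: `mismatched_pages` is a hypothetical helper, and it presumes the ground truth is available as a per-page list, which the current single-file `correct_pdf_output.txt` does not provide as-is.

```python
from typing import List, Optional


def mismatched_pages(actual: List[str], expected: List[str]) -> List[int]:
    """Return 1-based page numbers whose content differs or is missing on one side."""
    bad: List[int] = []
    for i in range(max(len(actual), len(expected))):
        got: Optional[str] = actual[i] if i < len(actual) else None
        want: Optional[str] = expected[i] if i < len(expected) else None
        if got != want:
            bad.append(i + 1)
    return bad


# Hypothetical data: the second page differs and a third page is unexpected.
actual_pages = ["page one", "page 2 (changed)", "extra page"]
expected_pages = ["page one", "page two"]
failures = mismatched_pages(actual_pages, expected_pages)
```

A per-page check like this reports which page diverged, whereas comparing the joined string only reports that something did.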

@lingjiekong lingjiekong merged commit a27a68b into main Nov 19, 2024
5 checks passed
@SeisSerenata SeisSerenata deleted the seis-dev branch December 5, 2024 09:09

2 participants