Splitter interprets any @ sign as new block start #488

Open
1 of 2 tasks
juthilo opened this issue Jul 26, 2024 · 0 comments

juthilo commented Jul 26, 2024

Describe the bug
The splitter's methods _move_to_comma_or_closing_curly_bracket and _move_to_closed_bracket each contain a check for unexpected block starts. Unfortunately, that check treats every @ as a block start, so it breaks the parsing of entries that contain an @ sign as raw text inside a field value.
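To illustrate the failure mode, here is a toy brace matcher with the same kind of check. It is a simplified stand-in and not bibtexparser's actual code; BlockAborted and find_closing_brace are hypothetical names made up for this sketch.

```python
class BlockAborted(Exception):
    """Hypothetical stand-in for the parser's abort signal."""


def find_closing_brace(text: str, start: int) -> int:
    """Return the index of the brace that closes the one at text[start].

    Toy model: any "@" seen before the closing brace is treated as an
    unexpected block start, which is the behaviour described above.
    """
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        elif ch == "@":
            # The naive check: no look at where on the line the "@" sits.
            raise BlockAborted(f"unexpected block start at index {i}")
    raise BlockAborted("end of input before closing brace")


# A field value with a literal "@", as in the title of the entry below.
field = "{LeQua @ {CLEF} 2022}"
try:
    find_closing_brace(field, 0)
    aborted = False
except BlockAborted:
    aborted = True
```

With this check in place, the field value is abandoned at the @ even though its braces are balanced and the value is valid.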

Reproducing

Version: 2.0.0b7

Code:
Parsing this example fails because of the @ in the title: a BlockAbortedException is raised and the block is added to failed_blocks.

import bibtexparser

test = bibtexparser.parse_string(
    """
    @inproceedings{DBLP:conf/cikm/EsuliM021,
      author       = {Andrea Esuli and Alejandro Moreo and Fabrizio Sebastiani},
      editor       = {Gao Cong and Maya Ramanath},
      title        = {LeQua @ {CLEF} 2022: {A} Shared Task for Evaluating Quantification Systems},
      booktitle    = {Proceedings of the {CIKM} 2021 Workshops co-located with 30th {ACM}
                      International Conference on Information and Knowledge Management {(CIKM}
                      2021), Gold Coast, Queensland, Australia, November 1-5, 2021},
      series       = {{CEUR} Workshop Proceedings},
      volume       = {3052},
      publisher    = {CEUR-WS.org},
      year         = {2021},
      url          = {https://ceur-ws.org/Vol-3052/abstract4.pdf},
      timestamp    = {Fri, 10 Mar 2023 16:22:33 +0100},
      biburl       = {https://dblp.org/rec/conf/cikm/EsuliM021.bib},
      bibsource    = {dblp computer science bibliography, https://dblp.org}
    }
    """
)
print(test.entries_dict['DBLP:conf/cikm/EsuliM021'])

Bibtex:

@inproceedings{DBLP:conf/cikm/EsuliM021,
      author       = {Andrea Esuli and Alejandro Moreo and Fabrizio Sebastiani},
      editor       = {Gao Cong and Maya Ramanath},
      title        = {LeQua @ {CLEF} 2022: {A} Shared Task for Evaluating Quantification Systems},
      booktitle    = {Proceedings of the {CIKM} 2021 Workshops co-located with 30th {ACM}
                      International Conference on Information and Knowledge Management {(CIKM}
                      2021), Gold Coast, Queensland, Australia, November 1-5, 2021},
      series       = {{CEUR} Workshop Proceedings},
      volume       = {3052},
      publisher    = {CEUR-WS.org},
      year         = {2021},
      url          = {https://ceur-ws.org/Vol-3052/abstract4.pdf},
      timestamp    = {Fri, 10 Mar 2023 16:22:33 +0100},
      biburl       = {https://dblp.org/rec/conf/cikm/EsuliM021.bib},
      bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Workaround
Monkey-patching the two methods by removing the @ check leads to a successful parse.

Remaining Questions (Optional)

  • I would be willing to contribute a PR to fix this issue.
  • This issue is a blocker, I'd be grateful for an early fix.

It says in the code that new blocks are identified by being after a new line. If that assumption is generally safe to make, I could remove the two checks altogether. The only other solution I could think of is replacing the "@" check with a tuple of the most common entry types, e.g. startswith(("@article", "@book", "@proceedings", ...)). Let me know if one of those works and I'll gladly prepare a PR.
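To make the two options concrete, here is a rough sketch of both checks as standalone helpers. None of this is taken from bibtexparser: at_line_start, looks_like_known_block, and KNOWN_BLOCK_PREFIXES are hypothetical names, and the prefix list is deliberately incomplete.

```python
def at_line_start(text: str, index: int) -> bool:
    """True if only whitespace separates text[index] from the preceding
    newline (or from the start of the input)."""
    line_start = text.rfind("\n", 0, index) + 1  # rfind gives -1 if no newline
    return text[line_start:index].strip() == ""


# Hypothetical and incomplete; custom entry types would be missed.
KNOWN_BLOCK_PREFIXES = (
    "@article", "@book", "@inproceedings", "@proceedings",
    "@string", "@comment", "@preamble",
)
_MAX_PREFIX_LEN = max(len(p) for p in KNOWN_BLOCK_PREFIXES)


def looks_like_known_block(text: str, index: int) -> bool:
    """True if the text at index starts with one of the known block types."""
    candidate = text[index:index + _MAX_PREFIX_LEN].lower()
    return candidate.startswith(KNOWN_BLOCK_PREFIXES)


sample = "title = {LeQua @ {CLEF} 2022},\n  @article{next,"
inline_at = sample.index("@")                 # the "@" inside the title
block_at = sample.index("@", inline_at + 1)   # the "@" that opens a block

inline_results = (at_line_start(sample, inline_at),
                  looks_like_known_block(sample, inline_at))
block_results = (at_line_start(sample, block_at),
                 looks_like_known_block(sample, block_at))
```

Both checks let the @ in the title through and still flag the real block start. The trade-offs differ, though: the line-start check would presumably still misfire when a wrapped field value happens to begin a line with @, while the prefix tuple would not recognise entry types missing from the list.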
