Better support for parsing LLM output #355

Open
verhovsky opened this issue Dec 12, 2023 · 0 comments
cmark-gfm is used by a number of apps that interface with text-generating Large Language Models (LLMs) (this one and this one are the ones I know of). These models produce a few characters of Markdown roughly every 200 ms (on my machine), and cmark-gfm is called repeatedly to render the text generated so far as Markdown. This is inefficient because, as far as I can tell, the entire generated Markdown has to be re-parsed from the beginning for every new token, even though everything except the latest token has already been parsed.
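To put a rough number on that cost, here is a minimal sketch in C. It assumes every token appends a fixed number of bytes and that parse time is proportional to the bytes scanned. Both function names are hypothetical, and neither calls into cmark-gfm.

```c
#include <stddef.h>

/* Total bytes scanned when the whole buffer is re-parsed after each of
 * `tokens` appends of `bytes_per_token` bytes (assumed constant). */
size_t bytes_scanned_full_reparse(size_t tokens, size_t bytes_per_token) {
    size_t total = 0;
    size_t buffer_len = 0;
    for (size_t i = 0; i < tokens; i++) {
        buffer_len += bytes_per_token; /* a new token arrives */
        total += buffer_len;           /* the whole buffer is parsed again */
    }
    return total;
}

/* Bytes scanned by a hypothetical incremental parser that only ever
 * looks at newly appended input. */
size_t bytes_scanned_incremental(size_t tokens, size_t bytes_per_token) {
    return tokens * bytes_per_token;
}
```

For 1000 tokens of 4 bytes each, the full re-parse strategy scans 2,002,000 bytes in total, versus 4,000 for the incremental one, so the work grows quadratically with the length of the output.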

cmark-gfm has a streaming interface, cmark_parser_feed and cmark_parser_finish, but it seems I need to call cmark_parser_finish every time I actually want a parse result, and after that I need to create a new parser; I can't feed more tokens and re-parse. I would have expected a way to call cmark_parser_feed, then something like cmark_parser_parse, then cmark_parser_feed again, or a more elaborate interface for editing the parse tree like the one tree-sitter has.

Also, while we're at it, the other issue is that the parse isn't stable while the input is still incomplete. Namely, a trailing single backtick ` should open a code span that runs until the end of the line/input even if there's no closing backtick yet. The current behavior leads to jitter in the UI: it first prints a literal backtick, and a few seconds later, when the LLM generates the closing backtick, it removes it and re-renders everything after it in monospace. This is also a problem for horizontal rules and bold/italic, though the latter definitely isn't doable because many people use single * characters for multiplication.
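Until something like this exists in the parser, a client could approximate it by patching the text before handing it to cmark-gfm. The following is only a sketch of that workaround, not a cmark-gfm feature: close_dangling_backticks is a hypothetical helper that looks at the last line only, and it ignores fenced code blocks and code spans that wrap across lines.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly malloc'ed copy of `src`. If the last line has an opened
 * but unclosed backtick run, a matching run is appended so the tail renders
 * as a code span immediately. The caller frees the result.
 * Returns NULL if allocation fails. */
char *close_dangling_backticks(const char *src) {
    size_t len = strlen(src);
    const char *nl = strrchr(src, '\n');
    const char *p = nl ? nl + 1 : src; /* start of the last line */
    size_t open_len = 0; /* length of the run that opened a span; 0 = none */

    while (*p) {
        if (*p == '\\' && open_len == 0 && p[1] != '\0') {
            p += 2; /* backslash escape outside a code span */
            continue;
        }
        if (*p != '`') {
            p++;
            continue;
        }
        size_t run = 0;
        while (p[run] == '`')
            run++;
        if (open_len == 0)
            open_len = run; /* this run opens a span */
        else if (run == open_len)
            open_len = 0;   /* a run of equal length closes it */
        p += run;
    }

    /* If the text ends in a backtick, appending more would only lengthen
     * that run, so leave the input unchanged in that case. */
    if (len > 0 && src[len - 1] == '`')
        open_len = 0;

    char *out = malloc(len + open_len + 1);
    if (!out)
        return NULL;
    memcpy(out, src, len);
    memset(out + len, '`', open_len);
    out[len + open_len] = '\0';
    return out;
}
```

The patched copy would be used for rendering only; the unmodified text stays the source of truth, so once the model emits the real closing backtick the helper stops appending one and the rendered output should not change.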

nerocui pushed a commit to nerocui/cmark-gfm that referenced this issue May 29, 2024
Otherwise we can get quadratic increase in size with deeply
nested structures.

See github#355.