Bug: The Dedupe YAML Array Values rule fails to detect differently escaped variations of the same value as duplicates #1217

Open
1 task done
rxlecky opened this issue Nov 11, 2024 · 28 comments
Labels
bug Something isn't working yaml YAML related issues or features

Comments

@rxlecky

rxlecky commented Nov 11, 2024

  • I have verified that I am on the latest version of the Linter

Describe the Bug

The Dedupe YAML Array Values rule fails to detect differently escaped variations of the same value as duplicates.

How to Reproduce

Here's an example markdown file with a duplicate value in its array property, only differing in how the value variations are escaped:

---
array: [a, "a", 'a']
---

# Untitled

... note text here ...

Expected Behavior

When running linter with dedupe-yaml-array-values rule turned on, I would expect the three instances of the value a to be collapsed into one. Instead, the property is left unchanged.

Config

{
  "ruleConfigs": {
    "add-blank-line-after-yaml": {
      "enabled": false
    },
    "dedupe-yaml-array-values": {
      "enabled": true,
      "dedupe-alias-key": true,
      "dedupe-tag-key": true,
      "dedupe-array-keys": true,
      "ignore-keys": ""
    },
    "escape-yaml-special-characters": {
      "enabled": false,
      "try-to-escape-single-line-arrays": true
    },
    "force-yaml-escape": {
      "enabled": false,
      "force-yaml-escape-keys": ""
    },
    "format-tags-in-yaml": {
      "enabled": false
    },
    "format-yaml-array": {
      "enabled": false,
      "alias-key": true,
      "tag-key": true,
      "default-array-style": "single-line",
      "default-array-keys": true,
      "force-single-line-array-style": "",
      "force-multi-line-array-style": ""
    },
    "insert-yaml-attributes": {
      "enabled": false,
      "text-to-insert": "aliases: \ntags: "
    },
    "move-tags-to-yaml": {
      "enabled": false,
      "how-to-handle-existing-tags": "Nothing",
      "tags-to-ignore": ""
    },
    "remove-yaml-keys": {
      "enabled": false,
      "yaml-keys-to-remove": ""
    },
    "sort-yaml-array-values": {
      "enabled": false,
      "sort-alias-key": true,
      "sort-tag-key": true,
      "sort-array-keys": true,
      "ignore-keys": "",
      "sort-order": "Ascending Alphabetical"
    },
    "yaml-key-sort": {
      "enabled": false,
      "yaml-key-priority-sort-order": "",
      "priority-keys-at-start-of-yaml": true,
      "yaml-sort-order-for-other-keys": "None"
    },
    "yaml-timestamp": {
      "enabled": false,
      "date-created": true,
      "date-created-key": "date created",
      "date-created-source-of-truth": "file system",
      "date-modified": true,
      "date-modified-key": "date modified",
      "date-modified-source-of-truth": "file system",
      "format": "dddd, MMMM Do YYYY, h:mm:ss a",
      "convert-to-utc": false,
      "update-on-file-contents-updated": "never"
    },
    "yaml-title": {
      "enabled": false,
      "title-key": "title",
      "mode": "first-h1-or-filename-if-h1-missing"
    },
    "yaml-title-alias": {
      "enabled": false,
      "preserve-existing-alias-section-style": true,
      "keep-alias-that-matches-the-filename": false,
      "use-yaml-key-to-keep-track-of-old-filename-or-heading": true,
      "alias-helper-key": "linter-yaml-title-alias"
    },
    "capitalize-headings": {
      "enabled": false,
      "style": "Title Case",
      "ignore-case-words": true,
      "ignore-words": "macOS, iOS, iPhone, iPad, JavaScript, TypeScript, AppleScript, I",
      "lowercase-words": "a, an, the, aboard, about, abt., above, abreast, absent, across, after, against, along, aloft, alongside, amid, amidst, mid, midst, among, amongst, anti, apropos, around, round, as, aslant, astride, at, atop, ontop, bar, barring, before, B4, behind, below, beneath, neath, beside, besides, between, 'tween, beyond, but, by, chez, circa, c., ca., come, concerning, contra, counting, cum, despite, spite, down, during, effective, ere, except, excepting, excluding, failing, following, for, from, in, including, inside, into, less, like, minus, modulo, mod, near, nearer, nearest, next, notwithstanding, of, o', off, offshore, on, onto, opposite, out, outside, over, o'er, pace, past, pending, per, plus, post, pre, pro, qua, re, regarding, respecting, sans, save, saving, short, since, sub, than, through, thru, throughout, thruout, till, times, to, t', touching, toward, towards, under, underneath, unlike, until, unto, up, upon, versus, vs., v., via, vice, vis-à-vis, wanting, with, w/, w., c̄, within, w/i, without, 'thout, w/o, abroad, adrift, aft, afterward, afterwards, ahead, apart, ashore, aside, away, back, backward, backwards, beforehand, downhill, downstage, downstairs, downstream, downward, downwards, downwind, east, eastward, eastwards, forth, forward, forwards, heavenward, heavenwards, hence, henceforth, here, hereby, herein, hereof, hereto, herewith, home, homeward, homewards, indoors, inward, inwards, leftward, leftwards, north, northeast, northward, northwards, northwest, now, onward, onwards, outdoors, outward, outwards, overboard, overhead, overland, overseas, rightward, rightwards, seaward, seawards, skywards, skyward, south, southeast, southwards, southward, southwest, then, thence, thenceforth, there, thereby, therein, thereof, thereto, therewith, together, underfoot, underground, uphill, upstage, upstairs, upstream, upward, upwards, upwind, west, westward, westwards, when, whence, where, whereby, wherein, whereto, wherewith, although, 
because, considering, given, granted, if, lest, once, provided, providing, seeing, so, supposing, though, unless, whenever, whereas, wherever, while, whilst, ago, according to, as regards, counter to, instead of, owing to, pertaining to, at the behest of, at the expense of, at the hands of, at risk of, at the risk of, at variance with, by dint of, by means of, by virtue of, by way of, for the sake of, for sake of, for lack of, for want of, from want of, in accordance with, in addition to, in case of, in charge of, in compliance with, in conformity with, in contact with, in exchange for, in favor of, in front of, in lieu of, in light of, in the light of, in line with, in place of, in point of, in quest of, in relation to, in regard to, with regard to, in respect to, with respect to, in return for, in search of, in step with, in touch with, in terms of, in the name of, in view of, on account of, on behalf of, on grounds of, on the grounds of, on the part of, on top of, with a view to, with the exception of, à la, a la, as soon as, as well as, close to, due to, far from, in case, other than, prior to, pursuant to, regardless of, subsequent to, as long as, as much as, as far as, by the time, in as much as, inasmuch, in order to, in order that, even, provide that, if only, whether, whose, whoever, why, how, or not, whatever, what, both, and, or, not only, but also, either, neither, nor, just, rather, no sooner, such, that, yet, is, it"
    },
    "file-name-heading": {
      "enabled": false
    },
    "header-increment": {
      "enabled": false,
      "start-at-h2": false
    },
    "headings-start-line": {
      "enabled": false
    },
    "remove-trailing-punctuation-in-heading": {
      "enabled": false,
      "punctuation-to-remove": ".,;:!。,;:!"
    },
    "footnote-after-punctuation": {
      "enabled": false
    },
    "move-footnotes-to-the-bottom": {
      "enabled": false
    },
    "re-index-footnotes": {
      "enabled": false
    },
    "auto-correct-common-misspellings": {
      "enabled": false,
      "ignore-words": "",
      "skip-words-with-multiple-capitals": false,
      "extra-auto-correct-files": []
    },
    "blockquote-style": {
      "enabled": false,
      "style": "space"
    },
    "convert-bullet-list-markers": {
      "enabled": false
    },
    "default-language-for-code-fences": {
      "enabled": false,
      "default-language": ""
    },
    "emphasis-style": {
      "enabled": false,
      "style": "consistent"
    },
    "no-bare-urls": {
      "enabled": false,
      "no-bare-uris": false
    },
    "ordered-list-style": {
      "enabled": false,
      "number-style": "ascending",
      "list-end-style": "."
    },
    "proper-ellipsis": {
      "enabled": false
    },
    "quote-style": {
      "enabled": false,
      "single-quote-enabled": true,
      "single-quote-style": "''",
      "double-quote-enabled": true,
      "double-quote-style": "\"\""
    },
    "remove-consecutive-list-markers": {
      "enabled": false
    },
    "remove-empty-list-markers": {
      "enabled": false
    },
    "remove-hyphenated-line-breaks": {
      "enabled": false
    },
    "remove-multiple-spaces": {
      "enabled": false
    },
    "strong-style": {
      "enabled": false,
      "style": "consistent"
    },
    "two-spaces-between-lines-with-content": {
      "enabled": false,
      "line-break-indicator": "  "
    },
    "unordered-list-style": {
      "enabled": false,
      "list-style": "consistent"
    },
    "compact-yaml": {
      "enabled": false,
      "inner-new-lines": false
    },
    "consecutive-blank-lines": {
      "enabled": false
    },
    "convert-spaces-to-tabs": {
      "enabled": false,
      "tabsize": 4
    },
    "empty-line-around-blockquotes": {
      "enabled": false
    },
    "empty-line-around-code-fences": {
      "enabled": false
    },
    "empty-line-around-horizontal-rules": {
      "enabled": false
    },
    "empty-line-around-math-blocks": {
      "enabled": false
    },
    "empty-line-around-tables": {
      "enabled": false
    },
    "heading-blank-lines": {
      "enabled": false,
      "bottom": true,
      "empty-line-after-yaml": true
    },
    "line-break-at-document-end": {
      "enabled": false
    },
    "move-math-block-indicators-to-their-own-line": {
      "enabled": false
    },
    "paragraph-blank-lines": {
      "enabled": false
    },
    "remove-empty-lines-between-list-markers-and-checklists": {
      "enabled": false
    },
    "remove-link-spacing": {
      "enabled": false
    },
    "remove-space-around-characters": {
      "enabled": false,
      "include-fullwidth-forms": true,
      "include-cjk-symbols-and-punctuation": true,
      "include-dashes": true,
      "other-symbols": ""
    },
    "remove-space-before-or-after-characters": {
      "enabled": false,
      "characters-to-remove-space-before": ",!?;:).’”]",
      "characters-to-remove-space-after": "¿¡‘“(["
    },
    "space-after-list-markers": {
      "enabled": false
    },
    "space-between-chinese-japanese-or-korean-and-english-or-numbers": {
      "enabled": false,
      "english-symbols-punctuation-before": "-+;:'\"°%$)]",
      "english-symbols-punctuation-after": "-+'\"([¥$"
    },
    "trailing-spaces": {
      "enabled": false,
      "twp-space-line-break": false
    },
    "add-blockquote-indentation-on-paste": {
      "enabled": false
    },
    "prevent-double-checklist-indicator-on-paste": {
      "enabled": false
    },
    "prevent-double-list-item-indicator-on-paste": {
      "enabled": false
    },
    "proper-ellipsis-on-paste": {
      "enabled": false
    },
    "remove-hyphens-on-paste": {
      "enabled": false
    },
    "remove-leading-or-trailing-whitespace-on-paste": {
      "enabled": false
    },
    "remove-leftover-footnotes-from-quote-on-paste": {
      "enabled": false
    },
    "remove-multiple-blank-lines-on-paste": {
      "enabled": false
    }
  },
  "lintOnSave": false,
  "recordLintOnSaveLogs": true,
  "displayChanged": true,
  "lintOnFileChange": false,
  "displayLintOnFileChangeNotice": false,
  "settingsConvertedToConfigKeyValues": true,
  "foldersToIgnore": [],
  "filesToIgnore": [],
  "linterLocale": "system-default",
  "logLevel": "TRACE",
  "lintCommands": [],
  "customRegexes": [],
  "commonStyles": {
    "aliasArrayStyle": "single-line",
    "tagArrayStyle": "single-line",
    "minimumNumberOfDollarSignsToBeAMathBlock": 2,
    "escapeCharacter": "\"",
    "removeUnnecessaryEscapeCharsForMultiLineArrays": false
  }
}

Logs

Running linter
rules before regular rules: 0 ms
Running Dedupe YAML Array Values
dedupe-yaml-array-values: 1.700000286102295 ms
---
array: [a, "a", 'a']
---

# [[Untitled]]

... note text here ...
Running Custom Regex
custom regex rules: 0.09999990463256836 ms
rules after regular rules: 0.20000028610229492 ms
rules running: 2.3000001907348633 ms

Additional Context

Add any other context about the problem here.

@rxlecky rxlecky added the bug Something isn't working label Nov 11, 2024
@pjkaufman
Collaborator

@rxlecky, thanks for reporting this issue. It should not be too hard to address from my understanding. It just means stripping all starting and ending ' or " when they are both present and then using that to remove duplicates. Then at the end it will need to use the setting for the default escape character to determine which escape character to add back, if any. This sounds reasonable enough to add, so I will see about adding it when I get time.

@pjkaufman pjkaufman added the yaml YAML related issues or features label Nov 13, 2024
@rxlecky
Author

rxlecky commented Nov 14, 2024

Thanks for taking the time to review and respond @pjkaufman! I don't have much experience with JS development but this seems like a good entry-level issue. Since you already took the time to analyse and break down what needs to be done, I'm happy to take up writing a PR to fix this issue.

@pjkaufman
Collaborator

pjkaufman commented Nov 15, 2024

@rxlecky that works for me. Really, I think this just needs to happen in getUniqueArray.

The following would likely need to change into a for loop that adds to a set:

return [...new Set(arr)];

Would become something like

const set = new Set<string>();
for (let i = 0; i < arr.length; i++) {
  // check for starting and ending with either ' or "
  // if so strip that from the value for the set check
  // check if it is in the set, if so skip
  // if a value was escaped, we will need to add back the escaping when we add it to the array
}

You may find isValueAlreadyEscaped and escapeStringIfNecessaryAndPossible helpful.

We may need a way to know if we should be using escaped strings when handling duplicates (i.e. a setting or something passed into the rule that lets the rule know to escape the value). But we could just add to an array alongside the set or maybe swap to a map that goes from the value to the index in the array, so we can update the value if we find that one of the values is escaped, so we can make sure that the value is escaped when we dedupe. I am not really sure on the best way to handle this.
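A minimal sketch of the loop described above, using the map-from-value-to-index variant. It only strips one pair of matching surrounding quotes and does not resolve escape sequences. The helper name stripMatchingQuotes and the choice to prefer the quoted spelling when both forms occur are assumptions made for illustration, not the Linter's actual code:

```typescript
// Hypothetical helper: removes one pair of matching surrounding quotes, if present.
function stripMatchingQuotes(value: string): string {
  if (value.length >= 2) {
    const first = value.charAt(0);
    const last = value.charAt(value.length - 1);
    if ((first === '"' || first === "'") && first === last) {
      return value.slice(1, -1);
    }
  }
  return value;
}

// Sketch of a quote-aware getUniqueArray.
// The map goes from the unquoted value to the index of the entry kept for it,
// so the kept entry can be swapped for a quoted spelling found later.
function getUniqueArray(arr: string[]): string[] {
  const indexByValue = new Map<string, number>();
  const result: string[] = [];
  for (const item of arr) {
    const bare = stripMatchingQuotes(item);
    const existing = indexByValue.get(bare);
    if (existing === undefined) {
      // first time this value is seen: keep it exactly as written
      indexByValue.set(bare, result.length);
      result.push(item);
    } else if (item !== bare && result[existing] === bare) {
      // the kept entry was unquoted but this duplicate is quoted: keep the quoted one
      result[existing] = item;
    }
    // otherwise it is a plain duplicate and is skipped
  }
  return result;
}
```

With the array from the bug report, `new Set(['a', '"a"', "'a'"])` still holds three entries because the raw strings differ, while the sketch collapses them into a single entry.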

The contributing guidelines may also be helpful for getting started with contributing to fix this issue: https://platers.github.io/obsidian-linter/contributing/bug-fix/

Feel free to reach out if you encounter an issue.

@rxlecky
Author

rxlecky commented Nov 17, 2024

Nice, I had a look the other day and I also identified the getUniqueArray method as the best place to implement the fix, so it's good we're on the same page.

I was going over the YAML specs to figure out all the edge cases related to escaping, to get an idea of the scope of this fix, and I found it will be a bit more work than I initially anticipated.

Here's the breakdown of required work according to my findings:

  1. Correctly detecting duplicates and squishing them into a single value:

    There are simple cases, like the ones that I listed in my original bug report, which would be solved by the simple trimming of escape quotes as you suggested. But then there are also a bit more intricate cases, such as when using escape sequences, in which case the simple trimming would not do the trick. Here are some examples of triplets of values that should be recognised as three instances of the same value:

    • val, "val", 'val'
    • quote', "quote'", 'quote'''
    • backslash\, "backslash\\", 'backslash\'

    For this reason I would suggest using the jsyaml.load() method to parse the values to take care of all the escape sequences resolution, and only then do the deduplication pass ourselves on top of that.

  2. Detecting when the value needs to be escaped:

    There are certain characters and character sequences that are not valid in plain scalar format - i.e. non-escaped value format - and the values containing them need to be escaped to be considered valid YAML values. Additionally, the invalid characters and character sequences differ, based on whether the array is written in single-line or multi-line format. Based on my testing so far, the formatYamlArrayValue method and the escapeStringIfNecessaryAndPossible method that it's using under the hood don't handle escaping based on illegal characters and character sequences at all. The only escaping they appear to be doing is for numeric values. So these functions would need to be updated with this functionality.

  3. Choosing the correct escape character:

    As you already said, we need to choose a strategy of how values are escaped after deduping. I suggest adding a new option to the rule, escapeStrategy, with two possible values: escapeOnlyIfNecessary and preserveEscapeCharacters. These would function as follows:

    • escapeOnlyIfNecessary: Regardless of what escape style the values were originally using, only escape the values if necessary, otherwise use plain scalar format. If values need escaping, respect the configured default escape character, regardless of which character was used in the original array.
    • preserveEscapeCharacters: If values in the original array were escaped, respect the user's choice of escape character, regardless of whether the values strictly need escaping or not. If the original array contains variants of the same value escaped with both single-quotes and double-quotes then the configured default escape character is used out of the two.

    Here we would also have to make sure that the force-yaml-escape rule overrides this, and always escapes the selected keys, regardless of the selected general escape strategy.
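The breakdown above could be sketched roughly as follows. This is a hand-rolled illustration, not the jsyaml.load() approach suggested in item 1, and it only resolves the escapes that appear in the example triplets (plus \n and \t). All function names are hypothetical, and needsQuoting is a placeholder for the detection work described in item 2:

```typescript
type QuoteChar = '"' | "'";
type EscapeStrategy = "escapeOnlyIfNecessary" | "preserveEscapeCharacters";

// Returns the quote character wrapping the raw value, or null for a plain scalar.
function quoteOf(raw: string): QuoteChar | null {
  if (raw.length >= 2) {
    if (raw.startsWith('"') && raw.endsWith('"')) return '"';
    if (raw.startsWith("'") && raw.endsWith("'")) return "'";
  }
  return null;
}

// Resolves a raw array entry to the string it denotes (partial escape support only).
function resolveScalar(raw: string): string {
  const quote = quoteOf(raw);
  if (quote === null) return raw; // plain scalar: taken literally
  const inner = raw.slice(1, -1);
  if (quote === "'") return inner.replace(/''/g, "'"); // '' is the only escape
  const simple: Record<string, string> = { n: "\n", t: "\t", '"': '"', "\\": "\\" };
  // unknown escapes are left untouched in this sketch
  return inner.replace(/\\(.)/g, (match: string, ch: string) => simple[ch] ?? match);
}

// Writes a value back out in the requested quoting style.
function quoteValue(value: string, quote: QuoteChar): string {
  return quote === "'"
    ? `'${value.replace(/'/g, "''")}'`
    : `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

function dedupeWithStrategy(
  arr: string[],
  strategy: EscapeStrategy,
  defaultQuote: QuoteChar,
  needsQuoting: (value: string) => boolean, // assumption: supplied by the caller
): string[] {
  // Map keeps first-seen order; each value remembers which quote styles were used.
  const quotesByValue = new Map<string, Set<QuoteChar>>();
  for (const raw of arr) {
    const value = resolveScalar(raw);
    const quotes = quotesByValue.get(value) ?? new Set<QuoteChar>();
    const quote = quoteOf(raw);
    if (quote !== null) quotes.add(quote);
    quotesByValue.set(value, quotes);
  }
  return Array.from(quotesByValue, ([value, quotes]) => {
    if (strategy === "preserveEscapeCharacters" && quotes.size > 0) {
      // both styles used -> default; otherwise whichever single style was used
      const other: QuoteChar = defaultQuote === '"' ? "'" : '"';
      return quoteValue(value, quotes.has(defaultQuote) ? defaultQuote : other);
    }
    return needsQuoting(value) ? quoteValue(value, defaultQuote) : value;
  });
}
```

Each of the three example triplets resolves to a single value under this sketch. The force-yaml-escape override mentioned above is not modelled here.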

Let me know if this sounds good to you and if you have any additional remarks. I will start slowly working my way through the list and address your feedback as I go. Also, let me know if you feel this work is beyond the extent of this issue and would prefer me to break it down into multiple PRs.

@rxlecky
Author

rxlecky commented Nov 18, 2024

Hey @pjkaufman, here's a little update from me. I started writing tests to cover the necessary edge cases that I mentioned but I quickly came across multiple bugs in the custom YAML parsing code, in particular the convertYAMLStringToArray method. This made me wonder whether it wouldn't be better to use a YAML parsing library instead of rolling our own parsing solution. I tried rewriting the method using the js-yaml library that's already in use but its capabilities are limited to just conversion between YAML and JS objects. It doesn't support parsing into some intermediate parse tree format, which would be much more useful for us since we're trying to preserve as much of the original format as possible, and all of that is lost in the conversion from/to JS objects. I went to have a look at some alternatives and found the yaml package. It is significantly more feature rich (including parsing into a syntax tree representation), it is the officially endorsed library by the YAML project, fully conforming to the YAML specs, and it's in active development, unlike js-yaml which hasn't seen an update in over two years.

Now I realise that this is beyond the scope of the small bugfix that this originally started as, but I think it would be worth switching to a more robust YAML package and leveraging it to simplify our own YAML handling code, rewriting it as a thin wrapper around the library code where possible. Not only would it help eliminate the existing bugs, but it would also help prevent more in the future, as doing the manual YAML parsing is rather error-prone, and frankly, it's just not worth reinventing the wheel. It may also help with developing new, more intricate features in the future. I saw that switching to this library was already discussed once in #617 but it seems it was never followed through. Would you be okay with me testing the waters and trying to implement some functionality with it? If it goes well and if it can replace most of our custom parsing code we could eventually phase out js-yaml entirely.

If the yaml package doesn't support all the necessary functionality, I'm not necessarily set on using it in particular and we can have a look for alternatives. But I think it's clear that js-yaml is not quite cutting it anymore and we'll need a replacement sooner or later. It would be good to hear @mnaoumov's opinion as well, since he had some concerns regarding its capability to do roundtrip parsing in the above-mentioned issue.

Let me know your thoughts on this @pjkaufman, I don't want to dive too deep without first consulting next steps with you.

@pjkaufman
Collaborator

@rxlecky, I would definitely be open to swapping to use that package. I actually was working on some changes around that, which are hitting some bugs in that parser around a specific scenario. However, I still think swapping to that library is probably the route to go. I had just started making the change on the branch here. It is on my own fork of the repo, but it does add a base for the change when sorting YAML. Now I can say it does seem to have an issue with dealing with multi-line arrays that are empty, but that may just be the one case in question.

I am definitely open to making the YAML experience more robust. I am just not sure on all cases. But if there is a good way with yaml to preserve the existing format while copying the value, I say go for it.

@pjkaufman
Collaborator

I believe my above branch does completely phase out the use of js-yaml, but it needs some typings added and properly positioned to make it work without the need to ts-ignore or do some of the hacky things that it is.

@rxlecky
Author

rxlecky commented Nov 18, 2024

Awesome, that's great news! Should I wait for your change to get merged before I start dabbling with the yaml package?

@pjkaufman
Collaborator

You can definitely start dabbling with it now. Feel free to yank some of the changes from the existing branch where needed. Right now I am not sure when it will be ready due to my availability.

@pjkaufman
Collaborator

I have created a draft PR for the changes to move to yaml as a package. My main concern is that the regex parsing will likely need to be updated before moving forward with it getting merged. This is because the regex value retrieval and empty multi-line arrays do not seem to play well together. I am hoping a fix can be put in upstream or a clarification can be made on what I am doing wrong. I entered a bug ticket here. So hopefully that allows for some better understanding of what is going on in the mentioned scenario.

@pjkaufman
Collaborator

So it looks like the bug is supposed to be fixed on the latest change that was merged this morning. I plan to do the following:

  • Work on a couple of features and get them added
  • Make a public release
  • Get the YAML package change merged while pointing to the commit hash of the fixed logic

I am hoping to get this done today and verify that the bug is no longer present.

@pjkaufman
Collaborator

Hopefully that will help some with any changes that are currently being made for this.

@rxlecky
Author

rxlecky commented Nov 23, 2024

Nice, glad that you managed to get things moving along! 👍

As for my progress, I decided to try rewriting the dedupe rule using the yaml library. I'm close to a working implementation, but still have some issues that need resolving. My thinking is that, ideally, we would use the parsing capabilities of this package to generate an abstract tree representation of the provided YAML, do all the processing on this representation, and stringify the results at the end. This should allow us to get rid of all the custom YAML parsing, regex extraction, etc. But I think we should implement these changes step by step and first try rewriting all the YAML rules using this new approach where applicable, before cleaning up the YAML code.

@rxlecky
Author

rxlecky commented Nov 23, 2024

Also, one thing I wanted to mention that I found out. The YAML.Document stringification doesn't always fully preserve the original document format and we'll probably need to use the CST.Document and CST.stringify method to try to keep as much of the original format as possible for our YAML processing. I just thought I'd mention that in case you had any more issues with formatting. The way you can get the CST.Document is by using the YAML.parseDocument method and passing it the {keepSourceTokens: true} option. The document can then be accessed on the root node as follows: yamlDocument.contents.srcToken

EDIT: Never mind me. While the method I described does produce CST tokens, it seems the preferred method of working with CSTs is generating them using a parser as follows: const [doc] = new YAML.Parser().parse(sourceYaml);

@pjkaufman
Collaborator

Hey @rxlecky , I am just getting back into the swing of things after the holidays and I saw your previous comment. Do you have any examples of manipulating the CST Tokens and the types needed to do so outside of the YAML package? So far I have had difficulty getting it to work with the CST. I am guessing I am just doing something wrong, but YAML Key Sort is not playing well with it since I have to set a new CST.Document instance if that is what the underlying CST Token starts out with.

@pjkaufman
Collaborator

I think that once I get that down, I can proceed with getting the PR merged that I have right now. I tried with the CST, but hit a brick wall. But I will try again Friday.

@rxlecky
Author

rxlecky commented Jan 17, 2025

Yeah, I too was struggling with the CST manipulation a little bit. I did manage to get it working to an extent, but there were some cases when the behaviour that I was getting didn't match what I thought was expected behaviour based on the documentation.

I took a step back from working on this issue specifically and first focused on figuring out how to work with the CSTs and what good patterns for working with them are, so that we can apply them when we refactor other YAML rules using the new package. I had to put that work on hold due to health issues but things are better now and I'd like to finish what I started. I'm thinking of contacting the package authors about the issues I encountered to make sure there are no bugs in the implementation before spending too much time on figuring out workarounds. Maybe if you also had issues, it would be worth combining our issues into one if it makes logical sense.

@pjkaufman
Collaborator

I did enter a bug report around the AST getting stringified incorrectly, but beyond that I have not really encountered many issues with how it works. Most of my problem has been on creating and manipulating the CST. I need to take more time looking into its intended manipulation since I just took a passing attempt at it, but could not figure out which methods and other functionality was baked in by default. So there are no issues I would consider bugs present yet. I did find a place that does manipulate the CST and write it back out here. But I have not yet had time to look deeper into it to see if it would work as a use case for the Linter.

@rxlecky
Author

rxlecky commented Jan 22, 2025

Alright, so I have spent the past couple of days trying to understand the CST API and comparing it to the AST API. Confirming my findings from before, I found the CST API to be extremely basic. Save for a few exceptions, the API for creating and manipulating individual CST nodes is non-existent. There is the CST.visit function for traversal and generalised operations on a CST tree but the functions that do the actual operations don't really exist. So the library generates the CST representation just fine but to make changes to the CST nodes, the node structure must be modified directly. This works okay for simple operations such as modifying scalar values or indentation changes but beyond that it gets more complex.

The AST API, on the other hand, is much more feature-rich. There are abstractions for scalars and collection nodes. The collections support the basic add, delete, has, get, set operations. New nodes can easily be created and added to the tree from their JS representations and existing nodes can easily be manipulated. However, as we already discovered earlier, ASTs don't fully preserve original formatting - they preserve quoting style, indentation, and general formatting, but drop things such as extra whitespace, comments, and everything else that cannot be serialised into the JS data representation.

So if our goal is to keep the formatting as close to the original as possible, we will have to do without ASTs for stringification and instead use CSTs. But in that case we'll have to re-implement in CST land a lot of the basic functionality that we get in AST API out-of-the-box and handle all the possible edge cases for preserving the formatting when possible and handle the cases where it can't be preserved too. I can see arguments for either of these approaches but I think preserving the formatting is a priority so I would personally at least try the CST route and see how much work it would be. What are your thoughts on this?

@pjkaufman
Collaborator

I am thinking we can probably do a hybrid approach. The CST would be used for stringification while the AST which is more convenient to use would be used for things like finding a value or node. But maybe I am being too optimistic based on https://github.com/apollographql/argocd-config-updater/blob/0217630ad16edf17d8ff98239cca3c4b29426ec9/src/yaml.ts.
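One way to picture the hybrid idea, independent of the yaml package's actual API: a parsed tree is used only to locate the character range of the node to change, and the replacement text is spliced into the original source so that everything outside that range (comments, spacing) stays byte-for-byte the same. The SpanEdit shape and applySpanEdits name are hypothetical, and the sketch assumes the edits do not overlap:

```typescript
// Hypothetical shape: a half-open [start, end) character range in the source
// plus the text that should replace it.
interface SpanEdit {
  start: number;
  end: number;
  text: string;
}

// Applies non-overlapping edits to the source, leaving all other text untouched.
function applySpanEdits(source: string, edits: SpanEdit[]): string {
  // Work from the last edit to the first so earlier offsets remain valid.
  const ordered = edits.slice().sort((a, b) => b.start - a.start);
  let result = source;
  for (const edit of ordered) {
    result = result.slice(0, edit.start) + edit.text + result.slice(edit.end);
  }
  return result;
}

// Usage: replace only the flow sequence, keeping the trailing comment and spacing.
const source = "array: [a, \"a\", 'a']   # keep me\n";
const start = source.indexOf("[");
const end = source.indexOf("]") + 1;
const updated = applySpanEdits(source, [{ start, end, text: "[a]" }]);
console.log(updated);
```

In a real rule the range would come from the parser rather than from indexOf, which is used here only to keep the sketch self-contained.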

@pjkaufman
Collaborator

The AST does preserve comments, but not always where they were originally. Whitespace can be rearranged quite often.

@rxlecky
Author

rxlecky commented Jan 22, 2025

I am thinking we can probably do a hybrid approach.

Yeah, if we do go down the CST route for stringification it still makes sense to use AST for almost everything else.

But maybe I am being too optimistic

I guess we just have to try implementing it and see if we hit any significant roadblocks.

And most importantly, we should evaluate if it is worth doing in the first place. The main reason why we're considering switching to this library is that we want to get rid of our custom YAML string handling code. But now it seems that the stringification using CSTs will still require a fair bit of custom code to make it work. So before we commit to this, we should try to evaluate whether working with CSTs is still better than the custom YAML code that we have currently.

What do you think is the best course of action now? I'd say we should first try rewriting existing functionality using the library and see how it goes - ideally in its own branch, at least at the start before we fully commit to this change.

@pjkaufman
Collaborator

I definitely agree that taking the more cautious approach with both the YAML AST and CST is the way to go. I have a branch that starts using the AST in this PR: #1227. So far it does alright at replacing some functionality, but it does not use the CST, so some things are not kept as they are. Ideally I would swap to include a CST where possible.

I think slowly proving out how we can replace the existing functionality is the way to go with this.

@rxlecky
Author

rxlecky commented Jan 29, 2025

Oh nice, well done!

I think slowly proving out how we can replace the existing functionality is the way to go with this.

Yeah, it would be ideal if we can make this change without any regressions. Is there any way I can be of assistance? I would be happy to have a look at the CST side of things, for example, if that would help.

@pjkaufman
Collaborator

It would definitely help if you could look at the CST logic. Right now I have not been making enough time in my schedule to really iron out what needs to happen with that to keep formatting as close to the original as possible.

@pjkaufman
Collaborator

Just as a heads up, I think I got something hacky working for the CST logic. I am getting the UTs passing for just yaml-key-sort. But once that is in place, I will push that up to the associated PR. I am hoping it will handle most of what needs to happen, but there probably needs to be a wrapper or something to help better move values from one document to another instead of just manually copying values everywhere, which is what I am doing right now.

@pjkaufman
Collaborator

I went ahead and merged the code despite it being a little messy. I would like to phase out the other logic that uses the regex where possible, but that may take some time yet. I will see about putting out a beta release for the changes made. It may be a couple of days as I try to fit in some other fixes.

@rxlecky
Author

rxlecky commented Feb 9, 2025

Good job on getting it merged! I'm sorry that I didn't come back to you afterwards, I got sick soon after and was unable to work. When I recover a bit I will have a look at the code and see if I can help with the clean-up. Feel free to @ me if you have anything I can be of help with in the meantime.
