-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Bug: The Dedupe YAML Array Values
rule fails to detect differently escaped variations of the same value as duplicates
#1217
Comments
@rxlecky , thanks for reporting this issue. It should not be too hard to address from my understanding. It just means stripping all starting and ending |
Thanks for taking the time to review and respond @pjkaufman! I don't have much experience with JS development but this seems like a good entry-level issue. Since you already took the time to analyse and break down what needs to be done, I'm happy to take up writing a PR to fix this issue. |
@rxlecky that works for me. Really, I think this just needs happen in getUniqueArray. The following would likely need to change into a for loop that adds to a set: return [...new Set(arr)]; Would become something like const set = new Set<string>();
for (let i = 0; i < arr.length; i++) {
// check for starting and ending with either ' or "
// if so strip that from the value for the set check
// check if it is in the set, if so skip
// if a value was escaped, we will need to add back the escaping when we add it to the array
} You may find isValueAlreadyEscaped and escapeStringIfNecessaryAndPossible helpful. We may need a way to know if we should be using escaped strings when handling duplicates (i.e. a setting or something passed into the rule that lets the rule know to escape the value). But we could just add to an array alongside the set or maybe swap to a map that goes from the value to the index in the array, so we can update the value if we find that one of the values is escaped, so we can make sure that the value is escaped when we dedupe. I am not really sure on the best way to handle this. The contributing guidelines may also be helpful for getting started with contributing to fix this issue: https://platers.github.io/obsidian-linter/contributing/bug-fix/ Feel free to reach out if you encounter an issue. |
Nice, I had a look the other day and I also identified the I was going over the YAML specs to figure out what are all the edge cases related to escaping to get the idea for scope of this fix and I found it will be a bit more work than I initially anticipated. Here's the breakdown of required work according to my findings:
Let me know if this sounds good to you and if you have any additional remarks. I will start slowly working my way through the list and address your feedback as I go. Also, let me know if you feel this work is beyond the extent of this issue and would prefer me to break it down into multiple PRs. |
Hey @pjkaufman, ho here's a little update from me. I started writing tests to cover the necessary edge cases that I mentioned but I quickly came across multiple bugs in the custom YAML parsing code, in particular the Now I realise that this is beyond the scope of a small bugfix that this originally started as but I think it would be worth switching to a more robust YAML package and leveraging it to simplify our own YAML handling code, rewriting it as a thin wrapper around the library code where possible. Not only would it help eliminate the existing bugs, but it would also help prevent more in the future as doing the manual YAML parsing is rather error-prone, and frankly, it's just not worth reinventing the wheel. It may also help developing new, more intricate features in the future. I saw that switching to this library was already discussed once in #617 but it seems it was never followed through. Would you be okay with me testing the waters and trying implementing some functionality with it? If it goes well and if it can replace most of our custom parsing code we could eventually phase out js-yaml entirely. If the yaml package doesn't support all the necessary functionality, I'm not necessarily set on using it in particular and we can have a look for alterantives. But I think it's clear that js-yaml is not quite cutting it anymore and we'll need a replacement sooner or later. It would be good to hear @mnaoumov's opinion as well, since he had some concerns regarding its capability to do roundtrip parsing in the above mentioned issue. Let me know your thoughts on this @pjkaufman, I don't want to dive too deep without first consulting next steps with you. |
@rxlecky , I would definitely be open to swapping to use that package. I actually was working on some changes around that which is hitting some bugs in that parser around a specific scenario. However, I still think swapping to that library is probably the route to go. I had just started making the change on the branch here. It is on my own fork of the repo, but it does add a base for the change when sorting YAML. Now I can say it does seem to have an issue with dealing with multi-line arrays that are empty, but that may just be the one case in question. I am definitely open to making the YAML experience more robust. I am just not sure on all cases. But if there is a good way with |
I believe my above branch does completely phase out the use of |
Awesome, that's great news! Should I wait for your change to get merged before I start dabbling with the |
You can definitely start dabbling with it now. Feel free to yank some of the changes from the existing branch where needed. Right now I am not sure when it will be ready due to my availability. |
I have created a draft PR for the changes to move to |
So it looks like the bug is supposed to be fixed on the latest change that was merged this morning. I plan to do the following:
I am hoping to get this done today and verify that the bug is no longer present. |
Hopefully that will help some with any changes that are currently being made for this. |
Nice, glad that you managed to get things moving along! 👍 As for my progress, I decided to try rewriting the dedupe rule using the |
Also, one thing I wanted to mention that I found out. The EDIT: Nevermind me. While the method I described does produce CST tokens, it seems the preferred method of working with CSTs is generating them using a parser as follows |
Hey @rxlecky , I am just getting back into the swing of things after the holidays and I saw your previous comment. Do you have any examples of manipulating the CST Tokens and the types needed to do so outside of the YAML package? So far I have had difficulty getting it to work with the CST. I am guessing I am just doing something wrong, but YAML Key Sort is not playing well with it since I have to set a new |
I think that once I get that down, I can proceed with getting the PR merged that I have right now. I tried with the CST, but hit a brick wall. But I will try again Friday. |
Yeah, I too was struggling with the CST manipulation a little bit. I did manage to get it working to an extent, but there were some cases when the behaviour that I was getting didn't match what I thought was expected behaviour based on the documentation. I took a step back from working on this issue specifically and first focused on figuring out how to work with the CSTs and what are good patterns for working with them so that we can apply that for when whe refactor other YAML rules using the new package. I had to put that work on hold due to health issues but things are better now and I'd like to finish what I started. I'm thinking of contacting the package authors about the issues I encountered to make sure there are no bugs in the implementation before spending too much time on figuring out workarounds. Maybe if you also had issues, it would be worth combining our issues into one if it makes logical sense. |
I did enter a bug report around the AST getting stringified incorrectly, but beyond that I have not really encountered many issues with how it works. Most of my problem has been on creating and manipulating the CST. I need to take more time looking into its intended manipulation since I just took a passing attempt at it, but could not figure out which methods and other functionality was baked in by default. So there are no issues I would consider bugs present yet. I did find a place that does manipulate the CST and write it back out here. But I have not yet had time to look deeper into it to see if it would work as a use case for the Linter. |
Alright, so I have spent the past couple of days trying to understand the CST API and comparing it to the AST API. Confirming my findings from before, I found the CST API to be extremely basic. Save for a few exceptions, the API for creating and manipulating individual CST nodes is non-existent. There is the The AST API, on the other hand, is much more feature-rich. There are abstractions for scalars and collection nodes. The collections support the basic add, delete, has, get, set operations. New nodes can easily be created and added to the tree from their JS representations and existing nodes can easily be manipulated. However, as we already discovered sooner, ASTs don't fully preserve original formatting - they preserve quoting style, indenation, and general formatting, but drop stuff such as extra whitespace, comments, and everything else that can not be serialised into the JS data representation. So if our goal is to keep the formatting as close to the original as possible, we will have to do without ASTs for stringification and instead use CSTs. But in that case we'll have to re-implement in CST land a lot of the basic functionality that we get in AST API out-of-the-box and handle all the possible edge cases for preserving the formatting when possible and handle the cases where it can't be preserved too. I can see arguments for either of these approaches but I think preserving the formatting is a priority so I would personally at least try the CST route and see how much work it would be. What are your thoughts on this? |
I am thinking we can probably do a hybrid approach. The CST would be used for stringification while the AST which is more convenient to use would be used for things like finding a value or node. But maybe I am being too optimistic based on https://github.com/apollographql/argocd-config-updater/blob/0217630ad16edf17d8ff98239cca3c4b29426ec9/src/yaml.ts. |
The AST does preserve comments, but not always where they were originally. Whitespace can be rearranged quite often. |
Yeah, if we do go down the CST route for stringification it still makes sense to use AST for almsot everything else.
I guess we just have to try implementing it and see if we hit any significant roadblocks. And most importantly, we should evaluate if it is worth doing in the first place. The main reason why we're considering switching to this library is that we want to get rid of our custom YAML string handling code. But now it seems that the stringification using CSTs will still require a fair bit of custom code to make it work. So before we commit to this, we should try evaluate whether working with CSTs is still better than our custom YAML code that we have currently. What do you think is the best course of action now? I'd say we should first try rewriting existing functionality using the library and see how it goes - ideally in its own branch, at least at the start before we fully commit to this change. |
I definitely agree that taking the more cautious approach with both the YAML AST and CST is the way to go. I have a branch that starts using the AST in this PR: #1227. So far it does alright at replacing some functionality, but it does not use the CST, so some things are not kept as they are. Ideally I would swap to include a CST where possible. I think slowly proving out how we can replace the existing functionality is the way to go with this. |
Oh nice, well done!
Yeah, it would be ideal if we can make this change without any regressions. Is there any way I can be of assistance? I would be happy to have a look at the CST side of things, for example, if that would help. |
It would definitely help if you could look at the CST logic. Right now I have not been making enough time in my schedule to really iron out what needs to happen with that to keep formatting as close to the original as possible. |
Just as a heads up, I think I got something hacky working for the CST logic. I am getting the UTs passing for just yaml-key-sort. But once that is in place, I will push that up to the associated PR. I am hoping it will handle most of what needs to happen, but there probably needs to be a wrapper or something to help better move values from one document to another instead of just manually coping values everywhere which is what I am doing right now. |
I went ahead and merged the code despite it being a little messy. I would like to phase out the other logic that uses the regex as possible, but that may take some time yet. I will see about putting out a beta release for the changes made. It may be a couple days as I try to fit in some other fixes. |
Good job on getting it merged! I'm sorry that I didn't come back you afterwards, I got sick soon after and was unable to work. When I recover a bit I will have a look at the code and see if I can help with clean up. Feel free to @ me if you have anything I can be of help with in the meantime. |
Describe the Bug
The
Dedupe YAML Array Values
rule fails to detect differently escaped variations of the same value as duplicates.How to Reproduce
Here's an example markdown file with a duplicate value in its array property, only differing in how the value variations are escaped:
Expected Behavior
When running linter with
dedupe-yaml-array-values
rule turned on, I would expect the three instances of the valuea
to be collapsed into one. Instead, the property is left unchanged.Config
Logs
Additional Context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: