-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
--vimgrep doesn't satisfy vim's spec for multi-line searches #1866
Comments
Could you please explain why the existing |
@BurntSushi Hmm, not sure if I understand the format of existing json then... Is there a way to find the information I need for multi-line matches already? Could you provide an example. |
@BurntSushi Also, for large number big multiline matches, it would be better not to print the text of matches. That I believe would be faster than the existing format that prints a lot of information. |
The format is documented here: https://docs.rs/grep-printer/0.1.5/grep_printer/struct.JSON.html I'm not changing the format unless there's a well motivated benchmark of a real use case in a real environment. If you get that, then we can evaluate options for improving performance like adding |
@BurntSushi Ok, here's one problem: Even if I use Regarding benchmarks: Let's look at the usecase: I am trying to integrate this in pure vimscript. To be able to find end line number in the multiline example below, I will need to process the string Source: https://github.com/vim/vim/blob/master/README_VIM9.md. Note: New vim-9 (Vim new) is still in the works, and many setups lag behind in vim versions anyway. Here's one example from my latex file {
"type": "match",
"data": {
"path": {
"text": "decoding.tex"
},
"lines": {
"text": "\\usepackage{blindtext}\n\\usepackage[a4paper,left=0in,right=0in,top=0in,bottom=0in]{geometry}\n\\usepackage{pdfpages}\n\n\\defaultfontfeatures{\n Color=000000,\n Ligatures=TeX,\n WordSpace={1.0,1.0,1.0},\n PunctuationSpace=0,\n LetterSpace=0\n}%Opacity=0.3,\n\\newcommand{\\llakkuratstyle}{\\addfontfeatures{Opacity=1.0}} \n"
},
"line_number": 9,
"absolute_offset": 147,
"submatches": [
{
"match": {
"text": "usepackage{blindtext}\n\\usepackage[a4paper,left=0in,right=0in,top=0in,bottom=0in]{geometry}\n\\usepackage{pdfpages}\n\n\\defaultfontfeatures{\n Color=000000,\n Ligatures=TeX,\n WordSpace={1.0,1.0,1.0},\n PunctuationSpace=0,\n LetterSpace=0\n}%Opacity=0.3,\n\\newcommand"
},
"start": 1,
"end": 271
}
]
}
} |
The output includes the byte offsets of the match. You can use those to compute column. And I note that "column" is an ill-defined concept in general. I'm sure vim has it more rigorously defined, but I don't know what that definition is and whether it's stable.
Capture groups aren't reported in the JSON output. Each
I see why you think this, and you might very well be right. But it's still just a guess. I was serious. I'm not adding flags to ripgrep based on guesses. This is my mantra: make it work, make it right, make it fast. Before adding flags, we should have real repeatable benchmarks based on real use cases in real environments that end users are having problems with. That's what I would like my standard to be. |
@BurntSushi Some good news: vim has a command Though I need to understand submatches a bit more... I have a bunch of questions in mind:
|
An inverted match is a line in which none of the input patterns match. Thus, the match is the line itself.
I believe it's true that the first match will begin on the first line emitted from ripgrep. I am not 100% sure that it's a rock solid guarantee, but I cannot immediately think of any cases where it wouldn't be true. (It would likely be a strange corner case.) |
@BurntSushi Ok thanks for the information. So in that case the meaning of "submatch" is the list of all matches starting on "line_number". If that is not the case then there is ambiguity in the definition+implementation. I mean if a submatch begins on some other line, then why would it not be in the submatch entry of that line... Did you implement the json emitter? Anyway when you have time, please clarify the definition of submatch (with confidence/expected behavior) & perhaps add it to the documentation. |
I don't see what's ambiguous about this:
The wording here looks precisely correct to me.
I am having a hard time understanding your confusion. To be clear, we're both looking at the "match" message, right? It is not a "line" message. It's a "match" message. ripgrep does not emit JSON objects for every line. It emits JSON objects for matches. In the default non-multiline mode, yes, every match will be limited to precisely one line and every submatch of that match will start and end on that line. That's what it means to be line oriented and it is standard grep behavior. But when multiline mode is enabled, then a match may span multiple lines. In that case, submatches may start on one line and end on another. But, yes, the first submatch should have a starting position on the first line in the "match" message. Once a line is included in a "match" message, it will not be included in any subsequent messages. That's the point of submatches: it enumerates all possible matches within the region emitted in the "match" message. (And in multiline mode, a "match" message could be the entire input!)
Yes. |
@BurntSushi What about the other sub matches (other than the first sub-match) for multi line matches. That’s the question I have. Can any of such sub-matches begin on lines other than the first line of match? So far in my simple non-multi-line testing, sub matches array just consists of all matches on the same line. The reason I ask this is: in my current implementation, I assume lnum (line number) of all submatches is “line_number” of match. If that’s not true for all cases, it would be good to have an example. |
Of course!
|
Or perhaps a bit simpler to read, try |
@BurntSushi Ok now it’s clear what sub match means. It could be a regex sub-pattern-match, or one of the multiple matches on the same line for entire regex. You did save lines in json by putting multiple independent full regex matches as “submatch” of a “match”. Hope you now get what is the source of confusion. I will implement a fully functional version of ripgrep in vim & compare it to something like ag (I guess it provides exact information I asked in this ticket). I suspect ripgrep will be slow for such multi line patterns because the start line number for multi line sub matches is not available (& one needs to find with splits on newline character (ouch)). Do you have benchmark files and patterns for such cases? |
Does it? Could you show me which command is being run to get such information?
The I think it's important to note that I didn't just push out the JSON emitter on a lark. I wrote it specifically for editor integration, and in particular, with VS Code in mind. I got feedback from them before publishing it. Any time you do "search in files" in VS Code, it's using
Personally, I think you're over-estimating the importance of the performance of this. I'm sure you'll be able to find places where this matters, but I suspect it will only occur when either the match frequency is extremely high or when the match itself is extremely large. And in those cases, performance is likely going to suffer for other reasons (latency for high frequency matches or allocating a huge chunk of memory for a large match). And that's why I ask for benchmarks. It has to be a real and meaningful case. Guesses aren't good enough. A big part of the reason for building ripgrep is speed. That's what I do. It's what I've been doing for a long time. I have some instincts about things, but it's never a substitute for actually measuring things in an environment that matches a real end user's environment as closely as possible. If you go off and implement this, get it working and you come back to me with "here's a case a real user hit (or is very likely to hit), and there's a user visible performance impact resulting specifically from having to count line numbers," then I'd be very very very happy to work with you on fixing that. But that's not what we have here. What we have here are guesses.
No, i don't. I don't work on editor integrations. |
@BurntSushi Sorry I realized it wasn't Thanks for the examples. I think I get it how to bend the current information to implement something efficient in vim. Perhaps it will require adding a function in vim to work more efficiently with byte offsets, which I will ask Bram (the creator of Vim). |
@BurntSushi Your example from above doesn't print the correct numbers with the printf 'foobar\nfoobar\nfoo quux' | rg -U 'foobar\nfoobar\nfoo|quux' --vimgrep Output:
I would have expected something like:
Or am I misunderstanding something here? |
@BurntSushi I agree with @bfrg, its a bug! You are printing the byte offset relative to beginning of "match region" instead of printing "character column" from beginning of line of "submatches". @bfrg good catch, I ignored looking at it as I was only interested in |
It turned out that --vimgrep wasn't quite getting the column of each match correctly. Instead of printing column numbers relative to the current line, it was printing column numbers as byte offsets relative to where the match began. To fix this, we simply subtract the offset of the line number from the beginning of the match. If the beginning of the match came before the start of the current line, then there's really nothing sensible we can do other than to use a column number of 1, which we now document. Interestingly, existing tests were checking that the previous behavior was intended. My only defense is that I somehow tricked myself into thinking it was a byte offset instead of a column number. Kudos to @bfrg for calling this out in #1866: #1866 (comment)
@bfrg Excellent catch! It's now fixed on master. :-) Interestingly, I had several tests confirming the current buggy behavior. In any case, the output on master is now:
|
Thank you for the quick fix. But shouldn't there be only two matches just like with the
or, to be consistent with the
For comparison: printf 'foobar\nfoobar\nfoo quux' | rg -U 'foobar\nfoobar\nfoo|quux' --json | jq 'select(.type == "match")' Output: {
"type": "match",
"data": {
"path": {
"text": "<stdin>"
},
"lines": {
"text": "foobar\nfoobar\nfoo quux"
},
"line_number": 1,
"absolute_offset": 0,
"submatches": [
{
"match": {
"text": "foobar\nfoobar\nfoo"
},
"start": 0,
"end": 17
},
{
"match": {
"text": "quux"
},
"start": 18,
"end": 22
}
]
}
} |
No. That part is correct.
So in multi-line mode, it does the same: it prints the entire match, even if it spans multiple lines, just like it would if you ran your query without To be honest,
There is no real desired consistency between output modes at this level. |
But there are only two matches in the multi-line example. Hence, there should be only two lines in the In my opinion, Example file:
Open the file in Vim and run: :vimgrep /foobar\nfoobar\nfoo\|quux/ % Vim will find the following two matches (open Vim's
This is the same result that I also get with I tried to compare the behavior with GNU/grep and git-grep using their PCRE regex engines but neither one seems to work correctly. printf 'foobar\nfoobar\nfoo quux' | grep -znP 'foobar\nfoobar\nfoo|quux' Output:
Unfortunately, grep doesn't provide a |
Yes, as I said, that is the intended behavior. I don't really know why you would expect the complete matches to not be printed. I don't think that's ever true for any ripgrep output mode. (Other than aggregate views or things that limit the maximum line length printed.) I don't see why it should be true for It's hard for me to say whether
Right. "grep" is a line oriented tool. Multi-line searches on work in things like GNU grep with weird hacks. |
@BurntSushi Regarding your "I don't really know why you would expect the complete matches to not be printed.". You are not printing complete matches, you are giving Regarding |
Breaking it down really carefully:
There are two overall matches in the input. The first is
This doesn't call out multi-line specifically (because That is exactly the behavior that ripgrep has. There is no special treatment given to
This seems like the crux of the confusion. I did not know this. This is a critical piece of information.
Sadly, I got the name from What happened was likely this:
Yes, ripgrep's column numbers are byte offsets. And "column number" is just as ambiguous as "character number." :-) I suspect the space of possible concrete definitions is at least:
These are all precise definitions with specific meanings. My guess is that by "character number" you mean "codepoint number." |
You make it sound like we should know what you don't know. There was nothing in your prior comments to come to that conclusion. Also, given you added the There are two to four sections worth reading to understand how Vim processes the lines returned by a regex engine. But I will summarize the story for you here in short.
Finally below is a mp4 that shows result of my integration of
grepintegration.mp4 |
Yes, you're right, I'm sorry. I spoke out of frustration. Thank you for the follow up details. If you don't mind, I'd like to classify this issue as a bug and re-purpose it to " If and when you get to a point where it's clear that the |
(I will dig into your links when I get some time, and I'll plan to have this fixed for the next release if it's feasible to do so. I expect it will be.) |
It turns out that the vimgrep format really only wants one line per match, even when that match spans multiple lines. We continue to support the previous behavior (print all lines in a match) in the `grep-printer` crate. We add a new option to enable the "only print the first line" behavior, and unconditionally enable it in ripgrep. We can do that because the option has no effect in single-line mode, since, well, in that case matches are guaranteed to span one line anyway. Fixes #1866
It turns out that the vimgrep format really only wants one line per match, even when that match spans multiple lines. We continue to support the previous behavior (print all lines in a match) in the `grep-printer` crate. We add a new option to enable the "only print the first line" behavior, and unconditionally enable it in ripgrep. We can do that because the option has no effect in single-line mode, since, well, in that case matches are guaranteed to span one line anyway. Fixes #1866
NOTE from BurntSushi: This issue has been re-classified as a bug that was discovered in the course of discussion. There's too much context here to split it apart, which GitHub doesn't really make easy to do anyway. So I've re-purposed this issue for the bug described in the title. The main details of the bug are described here: #1866 (comment)
Below is the original issue.
Describe your feature request
Vim recently added a new feature called "text properties" that can be used to highlight text. All it needs to know to highlight is:
Ideally there would be an option like
--json-pos
that just prints positions of matches for each file, perhaps in json format like following. If that is not possible or two cumbersome, is there a way I can already do this in a performance efficient way using ripgrep?The text was updated successfully, but these errors were encountered: