I've been toying around with memchr and memmap2, trying to see if I could match ripgrep's performance at counting lines in a file. Here's my test setup; beware that generating the file takes a few minutes, and the result is 68GB.
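(A minimal sketch of the kind of generator involved, assuming fixed-width ASCII lines; this is an illustration, not necessarily the exact script:)

```rust
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Illustrative generator, not necessarily the exact setup script:
    // repeatedly write a fixed 100-byte ASCII line until the target size
    // is reached. Shrink `target` for a quicker benchmark.
    let target: u64 = 68 * 1024 * 1024 * 1024; // ~68GB
    let line = format!("{}\n", "x".repeat(99)); // 99 bytes + newline
    let mut out = BufWriter::new(std::fs::File::create("lines.txt")?);
    let mut written: u64 = 0;
    while written < target {
        out.write_all(line.as_bytes())?;
        written += line.len() as u64;
    }
    out.flush()
}
```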
Here's my attempt:

main.rs

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args: Vec<_> = std::env::args().collect();
    let input = &args[1];
    // Optional second arg: thread count. Optional third arg: any combination
    // of 'a' (madvise sequential), 'p' (populate), 'h' (huge pages).
    let jobs: usize = args.get(2).and_then(|a| a.parse().ok()).unwrap_or(1);
    let advise: bool = args.get(3).map(|a| a.contains('a')).unwrap_or(false);
    let populate: bool = args.get(3).map(|a| a.contains('p')).unwrap_or(false);
    let huge: bool = args.get(3).map(|a| a.contains('h')).unwrap_or(false);
    let file = std::fs::File::open(input)?;
    let file_size = file.metadata()?.len() as usize;
    let mut options = memmap2::MmapOptions::new();
    if populate {
        options.populate();
    }
    if huge {
        // 2^21 = 2MiB huge pages.
        options.huge(Some(21));
    }
    let mmap = unsafe { options.len(file_size).map(&file)? };
    if advise {
        mmap.advise(memmap2::Advice::Sequential)?;
    }
    let count = if jobs > 1 {
        use rayon::prelude::*;
        // Split the map into `jobs` roughly equal chunks and count the
        // newlines in each chunk in parallel.
        mmap.par_chunks(file_size.div_ceil(jobs))
            .map(|chunk| memchr::memchr_iter(b'\n', chunk).count())
            .sum()
    } else {
        memchr::memchr_iter(b'\n', &mmap[..file_size]).count()
    };
    println!("{}", count);
    Ok(())
}
```

Cargo.toml

```toml
[package]
name = "linecount"
version = "0.1.0"
edition = "2021"

[dependencies]
memchr = "2.7.4"
memmap2 = "0.9.5"
rayon = "1.10.0"
```

Running it on my 14-core M4 MacBook:
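(For reference, an invocation under the argument scheme above looks something like `./target/release/linecount lines.txt 14 ap` after a `cargo build --release`; the path and flag string here are illustrative.)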
I know ripgrep is single-threaded for single-file searches, and it's perplexing me how it manages to attain such incredible performance. When I single-thread my naive implementation, it takes 6x longer than ripgrep. I've been reading the ripgrep source code, and I seem to be doing most things kind of similarly, I believe. And even with multithreading up to my core count, and various configurations, I still don't get within 2x of ripgrep's performance! Any idea what I'm doing wrong?
-
First thing to note here is that your benchmark is kinda brutal. It takes a very long time to run. I'd suggest smaller inputs. You still want something big enough that it's easy to measure throughput, and while 68GB might be the size of the real file you ultimately want to search, it's totally fine to decrease that for a benchmark. Like... 10GB maybe? I ended up just removing a zero from the count in the generation script.
With that I get these timings:
So your program is, as I would expect, quite a bit faster than ripgrep! Parallelism helps a bit, but overall isn't that much faster than just running single threaded. However, I am on Linux, not macOS, and I was running out of a ramdisk; I don't know what the macOS equivalent of that is. I do have an M2 mac mini, though:
And here are my timings on my M2:
That also matches what I'd expect. I think the data above suggests that something else is going wrong here. Maybe since your input is so big, it can't fit into memory and is never cached, which could mean that all you're really measuring is disk read time. However, if that were true, it should in theory impact ripgrep just as much as your program, so that is a bit of a mystery to me. It's possible something else is going wrong with your measurement process, but it's unclear to me what it could be. It might be worth rebooting and trying to re-create your measurements step by step.
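One cheap way to test the disk-bound theory, sketched here as a hypothetical diagnostic (the name `readbench` is made up): time a plain sequential read of the same file and compare its throughput to both programs. If the bare read is just as slow, you're measuring the disk, not the line counting.

```rust
use std::io::Read;
use std::time::Instant;

// Hypothetical diagnostic: measure raw sequential read throughput of a
// file, to check whether a benchmark on it is actually disk-bound.
fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: readbench <file>");
    let mut file = std::fs::File::open(&path)?;
    let mut buf = vec![0u8; 1 << 20]; // 1MiB read buffer
    let mut total: u64 = 0;
    let start = Instant::now();
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    let secs = start.elapsed().as_secs_f64();
    println!("read {} bytes at {:.2} GB/s", total, total as f64 / 1e9 / secs);
    Ok(())
}
```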
Oh. Well that answers everything