@@ -11,16 +11,16 @@ directory exists.
To try to test things out on a small subset (defined in `run_specs_small.conf`) with just 10 eval instances:
# Just load the config file
- venv/bin/benchmark-present --conf src/benchmark/presentation/run_specs_small.conf --local --max-eval-instances 10 --suite $SUITE --skip-instances
+ venv/bin/helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --local --max-eval-instances 10 --suite $SUITE --skip-instances
# Create the instances and the requests, but don't execute
- venv/bin/benchmark-present --conf src/benchmark/presentation/run_specs_small.conf --local --max-eval-instances 10 --suite $SUITE --dry-run
+ venv/bin/helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --local --max-eval-instances 10 --suite $SUITE --dry-run
# Execute the requests and compute metrics
- venv/bin/benchmark-present --conf src/benchmark/presentation/run_specs_small.conf --local --max-eval-instances 10 --suite $SUITE
+ venv/bin/helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --local --max-eval-instances 10 --suite $SUITE
# Generate assets for the website
- venv/bin/benchmark-summarize --suite $SUITE
+ venv/bin/helm-summarize --suite $SUITE
Notes:
- `--local` means we bypass the proxy server.
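
For orientation, a run specs conf file pairs each `RunSpec` description with metadata such as a `priority`. The snippet below is an illustrative sketch only (the descriptions, field names, and exact schema are assumptions, not copied from the repo); consult `run_specs_small.conf` itself for the real entries:

```
# Illustrative sketch only -- check run_specs_small.conf for the actual entries and schema.
entries: [
  {description: "mmlu:subject=philosophy,model=openai/davinci", priority: 1}
  {description: "boolq:model=openai/davinci", priority: 2}
]
```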
@@ -30,7 +30,7 @@ To run everything (note we're restricting the number of instances and
scenarios) in parallel:
# Generate all the commands to run in parallel
- venv/bin/benchmark-present --local --suite $SUITE --max-eval-instances 1000 --priority 2 --num-threads 8 --skip-instances
+ venv/bin/helm-run --local --suite $SUITE --max-eval-instances 1000 --priority 2 --num-threads 8 --skip-instances
# Run everything in parallel over Slurm
bash benchmark_output/runs/$SUITE/run-all.sh
@@ -39,11 +39,11 @@ scenarios) in parallel:
# tail benchmark_output/runs/$SUITE/slurm-*.out
# Generate assets for the website
- venv/bin/benchmark-present --local --suite $SUITE --max-eval-instances 1000 --skip-instances
- venv/bin/benchmark-summarize --suite $SUITE
+ venv/bin/helm-run --local --suite $SUITE --max-eval-instances 1000 --skip-instances
+ venv/bin/helm-summarize --suite $SUITE
# Run a simple Python server to make sure things work at http://localhost:8000
- benchmark-server
+ helm-server
# Copy all website assets to the `www` directory, which can be copied to GitHub pages for static serving.
sh scripts/create-www.sh $SUITE
@@ -56,15 +56,15 @@ Once everything has been sanity checked, push `www` to a GitHub page.
To estimate token usage without making any requests, append the `--dry-run` option:
- venv/bin/benchmark-run -r <RunSpec to estimate token usage> --suite $SUITE --max-eval-instances <Number of eval instances> --dry-run
+ venv/bin/helm-run -r <RunSpec to estimate token usage> --suite $SUITE --max-eval-instances <Number of eval instances> --dry-run
and check the output in `benchmark_output/runs/$SUITE`.
where `sum` indicates the estimated total number of tokens used for the specific `RunSpec`.
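
If you want to pull those numbers out programmatically rather than reading the files by hand, a small script along the following lines can help. It assumes each run directory under `benchmark_output/runs/$SUITE` contains a `stats.json` whose entries expose a nested metric name and a `sum` aggregate; the file name, layout, and metric keys are assumptions and may differ in your checkout.

```python
# Hedged sketch: summarize token-related stats from a dry run.
# Assumes benchmark_output/runs/<suite>/<run>/stats.json exists and is a list of
# stat dicts with a nested {"name": {"name": ...}} and a "sum" field.
import json
from pathlib import Path


def report_token_estimates(suite: str, output_root: str = "benchmark_output/runs") -> None:
    for stats_path in sorted(Path(output_root, suite).glob("*/stats.json")):
        for stat in json.loads(stats_path.read_text()):
            name = stat.get("name", {}).get("name", "")
            if "token" in name:  # keep any token-usage metric, whatever it is called
                print(f"{stats_path.parent.name}\t{name}\tsum={stat.get('sum')}")


report_token_estimates("v1")  # replace "v1" with your $SUITE
```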
For the OpenAI models, we use a
- [GPT-2 Tokenizer](https://github.com/stanford-crfm/benchmarking/blob/master/src/proxy/tokenizer/openai_token_counter.py#L12)
+ [GPT-2 Tokenizer](https://github.com/stanford-crfm/benchmarking/blob/master/src/helm/proxy/tokenizer/openai_token_counter.py#L12)
to estimate the token usage. The tokenizer will be downloaded and cached when running a dry run.
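
As a rough standalone illustration of the idea (not the repo's own code, which lives at the linked `openai_token_counter.py`), you can estimate a prompt's token count with the Hugging Face GPT-2 tokenizer:

```python
# Standalone sketch: estimate token usage for a prompt with the GPT-2 tokenizer.
# Requires the `transformers` package; the tokenizer files are downloaded and
# cached on first use, mirroring what happens during a dry run.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "The quick brown fox jumps over the lazy dog."
num_tokens = len(tokenizer.encode(prompt))
print(f"Estimated tokens for this prompt: {num_tokens}")
```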
## Final benchmarking (Infrastructure team only)
@@ -115,5 +115,5 @@ to estimate the token usage. The tokenizer will be downloaded and cached when ru
1. Create a screen session: `screen -S reproducible`.
1. `conda activate crfm_benchmarking`.
1. Run `python3 scripts/verify_reproducibility.py --models-to-run openai/davinci openai/code-cushman-001 together/gpt-neox-20b
- --conf-path src/benchmark/presentation/run_specs.conf --max-eval-instances 1000 --priority 2 &> reproducible.log`.
+ --conf-path src/helm/benchmark/presentation/run_specs.conf --max-eval-instances 1000 --priority 2 &> reproducible.log`.
1. Check the result at `reproducible.log`.