Build failure: postgresql on darwin #371242

Open
wolfgangwalther opened this issue Jan 5, 2025 · 24 comments
Labels
0.kind: build failure A package fails to build

Comments

@wolfgangwalther
Contributor

Steps To Reproduce

postgresql currently fails to build for all versions 15+ on x86_64-darwin. The failure occurs in the installCheckPhase. Versions 13 and 14 have the tests disabled anyway, so they still build.

Build log

Adding the following debug statement gives some insight into the failure:

      failureHook = ''
        find . -iname 'initdb.log' -print0 | xargs -0 cat
      '';
> running bootstrap script ... 2025-01-05 19:25:16.852 UTC [59779] FATAL:  could not create shared memory segment: Cannot allocate memory
> 2025-01-05 19:25:16.852 UTC [59779] DETAIL:  Failed system call was shmget(key=37036594, size=56, 03600).
> 2025-01-05 19:25:16.852 UTC [59779] HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMALL parameter.  You might need to reconfigure the kernel with larger SHMALL.
>         The PostgreSQL documentation contains more information about shared memory configuration.
> child process exited with exit code 1
> initdb: data directory "/private/tmp/nix-build-postgresql-15.10.drv-0/postgresql-15.10/src/interfaces/ecpg/test/tmp_check/data" not removed at user's request

Additional context

Hydra failures:

Notify maintainers

@NixOS/postgres @NixOS/darwin-maintainers


Note for maintainers: Please tag this issue in your PR.


Add a 👍 reaction to issues you find important.

@wolfgangwalther wolfgangwalther added the 0.kind: build failure A package fails to build label Jan 5, 2025
@wolfgangwalther
Contributor Author

Of course the simple thing to do would be to disable the tests for x86_64-darwin.

But maybe somebody has a better idea why it happens in that specific case?

@emilazy
Member

emilazy commented Jan 5, 2025

My guess is Rosetta 2 weirdness (Hydra emulates x86_64-darwin on aarch64-darwin). I don’t suppose you can ask the tests to create a smaller shared memory segment?

@Ma27
Member

Ma27 commented Jan 6, 2025

Correct me if I'm wrong, but isn't x86_64-darwin dying out anyways? Does upstream run their tests on that platform, so that we may be able to steal potential fixes?
I'm asking because I'd be fine with skipping the tests on x86_64-darwin, given that I don't know how worthwhile it is to invest time into fixing issues specific to this platform. I'm not going to stop anybody, but if it were up to me, I'd pick the easy route here.

@niklaskorz
Contributor

niklaskorz commented Jan 6, 2025

> Correct me if I'm wrong, but isn't x86_64-darwin dying out anyways?

The last Mac models with Intel CPUs were released in 2020 (running on 10th gen Intel CPUs), about half a year before the release of the M1.
macOS 17 (scheduled for end of 2026) is rumored to be the first macOS without Intel support, in which case, considering the current scheme of macOS versions we support, nixpkgs would likely support x86_64-darwin until end of 2032.

Edit: of course it's possible Rosetta will die before that, in which case nixpkgs might have to sunset x86_64-darwin support sooner than that as the builders rely on Rosetta

@reckenrode
Contributor

reckenrode commented Jan 6, 2025

> Correct me if I'm wrong, but isn't x86_64-darwin dying out anyways?

> The last Mac models with Intel CPUs were released in 2020 (running on 10th gen Intel CPUs), about half a year before the release of the M1. macOS 17 (scheduled for end of 2026) is rumored to be the first macOS without Intel support, in which case, considering the current scheme of macOS versions we support, nixpkgs would likely support x86_64-darwin until end of 2032.

We’re realigning the support window to match Apple’s. The plan after #352129 is to go to 11.3 for 25.05 and 14.4 for 25.11. If macOS 17 is the last version to support Intel hardware, then that would make 28.05 the last release to support x86_64-darwin on Intel hardware.

However, if Rosetta 2 is dropped, it’s likely that x86_64-darwin won’t be cached. All of the builders today are aarch64-darwin, and x86_64-darwin packages are built using Rosetta 2. Whether that results in x86_64-darwin being dropped sooner (before 28.05), I don’t know.

Personally, given past transitions, I expect Rosetta 2 to remain supported for one release after Intel hardware support is dropped. That’s going to suck for applications stuck on Intel, but it’s not like that stopped Apple from killing 32-bit support. (RIP my Steam library. Again.)

@sikmir
Member

sikmir commented Jan 6, 2025

> The last Mac models with Intel CPUs were released in 2020 (running on 10th gen Intel CPUs)

And it still just works for me. Sad news that nixpkgs will drop support some day.

@emilazy
Member

emilazy commented Jan 6, 2025

FWIW I daily drive x86_64-darwin for now and don’t think there are any plans to kill it off soon, but of course it does have a limited lifespan as a platform that gets anything like the level of support it currently does. If/when Rosetta 2 goes away we won’t be able to build packages for the platform any more, so that will probably be fatal, and since we track the same target OS version across x86_64-darwin and aarch64-darwin, the last OS release that supports Intel going out of support will also likely be the end (even if we let them diverge again, it’ll very much be a legacy platform by then and won’t benefit from aarch64-darwin fixes to the same extent). It’s possible that usage will drop low enough before that point that we decommission it sooner, but I wouldn’t expect that to happen in the next 2–3 years.

Thankfully NixOS will most likely continue to run pretty well on Intel Macs even after that!

@sikmir
Member

sikmir commented Jan 6, 2025

> Thankfully NixOS will most likely continue to run pretty well on Intel Macs even after that!

Hmm, installing NixOS on a Mac is a good idea :)

@wolfgangwalther
Contributor Author

> I don’t suppose you can ask the tests to create a smaller shared memory segment?

I tried and failed. The tests, too.

> My guess is Rosetta 2 weirdness (Hydra emulates x86_64-darwin on aarch64-darwin).

I'll probably just disable the tests for x86_64-darwin again. We haven't had them enabled for long anyway: until recently, the tests were not enabled on darwin at all. It seems we only tested aarch64-darwin in #358248.

@emilazy
Member

emilazy commented Jan 6, 2025

Yes, I think that’s fine in the absence of someone who feels like digging into the details. I’m guessing they might pass on native machines, but Hydra doesn’t have any native x86_64-darwin. aarch64-darwin tests are probably achieving reasonable coverage for x86_64-darwin too.

Sorry for the lengthy digression about the state of the platform 😅

wolfgangwalther added a commit to wolfgangwalther/nixpkgs that referenced this issue Jan 6, 2025
wolfgangwalther added a commit to wolfgangwalther/nixpkgs that referenced this issue Jan 6, 2025
@wolfgangwalther
Contributor Author

> My guess is Rosetta 2 weirdness (Hydra emulates x86_64-darwin on aarch64-darwin).

A very similar error now popped up in #371463 (comment) for a postgres extension built via buildPgrxExtension, which runs postgresql / initdb as part of the build process.

This time it's on aarch64-darwin, so it doesn't seem to be related to Rosetta 2. The community builder doesn't seem to be overloaded either.

What else could be a reason / workaround / fix for this kind of failure?

@wolfgangwalther
Contributor Author

Other packages fail with this same error. postgresql-simple on x86_64-darwin, for example:

https://hydra.nixos.org/build/281936070

It built fine as of https://hydra.nixos.org/build/280560529. This log has:

waiting for server to start....2024-11-25 10:51:20.073 UTC [39128] LOG:  starting PostgreSQL 16.5 on x86_64-apple-darwin24.1.0, compiled by clang version 16.0.6, 64-bit
2024-11-25 10:51:20.073 UTC [39128] LOG:  listening on Unix socket "/private/tmp/nix-build-postgresql-simple-0.7.0.0.drv-0/run/postgresql/.s.PGSQL.5432"
2024-11-25 10:51:20.083 UTC [39131] LOG:  database system was shut down at 2024-11-25 10:51:19 UTC
2024-11-25 10:51:20.093 UTC [39128] LOG:  database system is ready to accept connections

That shows us that initdb / postgres was running fine back then, and it was already running 16.5.

Thus, the appearance of this failure is related neither to us enabling the tests on darwin nor to updating postgresql.

What else changed?

> (Hydra emulates x86_64-darwin on aarch64-darwin).

@emilazy is this something that recently (since November 25 last year) changed?

If not, then something for darwin changed in nixpkgs which is causing those failures.

@emilazy
Member

emilazy commented Jan 12, 2025

IIRC Hydra had one single native x86_64-darwin builder until fairly recently. I think that went away earlier than November 25 though.

The builders have had some other changes though (macOS upgrades, using the Nix daemons to run builds, etc.), and Darwin has had a number of changes recently (the SDK rework – though that landed prior to November – but also the LLVM upgrade and so on). I’m not sure what it could be in this case. I’m also not sure who to ping for such an arcane failure; we have people who know Darwin and people who know Postgres, but I’m not sure if we have people who know both :)

I don’t think it would be the end of the world to just mark it as broken, but I do imagine Postgres on Darwin gets some use for development environments. Have you tried reproducing the failure on the community builder? git bisect start --first-parent is usually a good first step for these kinds of situations.

@wolfgangwalther
Contributor Author

> I don’t think it would be the end of the world to just mark it as broken

Well, if it was only x86_64-darwin, yeah. But since it also appeared on aarch64-darwin at least once (see #371242 (comment)), I don't think this would be enough.

> Have you tried reproducing the failure on the community builder? git bisect start --first-parent is usually a good first step for these kinds of situations.

That will be my next step, but I have too many builds running right now to even think about it :D

@wolfgangwalther
Contributor Author

So bisect leads me to fc9c333, which is of course a merge of staging-next. Before that commit, postgresql-simple for x86_64-darwin was still cached. Afterwards it isn't, and it fails with the same error mentioned above.

But when I take the commit where it was still passing, apply a trivial change (reordering the build inputs) to force a rebuild, and build it on the community builder, I get the same build failure again.

Thus, the problem does indeed seem to be introduced by some external factor, not by anything internal to nixpkgs.

> The builders have had some other changes though (macOS upgrades, using the Nix daemons to run builds, etc.)

Where can I track down which kind of changes were made to the hydra builders between 2024-11-25 (the last passing build) and 2024-12-23 (the merge of staging above)?

I assume the community builders would have gone through the same changes, but not necessarily at the same time.

The community builders are currently on macOS 15.2 (build 24C101). This version was released on December 11th: https://support.apple.com/en-us/100100. So that fits right into the window.

The changelog for this version also lists at least 4 items related to "Kernel", 3 of them mentioning "memory"...

I found a comment about something very similar here: https://wsjtx.groups.io/g/main/message/52474. The timeline doesn't match up 100%, because that's from September, but:

> But then I found in Sonoma that even after changing kern.sysv.shmmax to 52428800, I was getting the shared memory error [...].
>
> I noticed that there was a difference in another variable, kern.sysv.shmall. This was set to 1024 on the Sonoma Mac I just did the fresh install on, but on the other Sonoma Mac where [it] works, it's 25600.
>
> I therefore set the newly installed Sonoma Mac to this ( sudo sysctl -w kern.sysv.shmall=25600 ), in addition to making the kern.sysv.shmmax change. Now [it] runs fine on both of those Sonoma Macs.

This seems to be a problem with the configuration of both the hydra and the community builders.

I don't have a Mac myself, so I can't investigate / reproduce / fix it that way. If somebody could step up to investigate, that would be great.

@al3xtjames
Contributor

al3xtjames commented Jan 12, 2025

I'm seeing postgresql_17 fail on x86_64-darwin (native, not using Rosetta) due to a similar error:

# postmaster failed, examine "/private/tmp/nix-build-postgresql-17.2.drv-9/postgresql-17.2/src/test/modules/test_dsa/log/postmaster.log" for the reason
Bail out!make[3]: *** [../../../../src/makefiles/pgxs.mk:454: check] Error 2
make[3]: Leaving directory '/private/tmp/nix-build-postgresql-17.2.drv-9/postgresql-17.2/src/test/modules/test_dsa'

From postmaster.log:

2025-01-12 20:09:04.866 UTC postmaster[43460] FATAL:  could not create shared memory segment: No space left on device
2025-01-12 20:09:04.866 UTC postmaster[43460] DETAIL:  Failed system call was shmget(key=548508689, size=56, 03600).
2025-01-12 20:09:04.866 UTC postmaster[43460] HINT:  This error does *not* mean that you have run out of disk space.  It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached.
	The PostgreSQL documentation contains more information about shared memory configuration.
2025-01-12 20:09:04.866 UTC postmaster[43460] LOG:  database system is shut down

I'm using the default values for the shared memory sysctls:

IPC status from <running system> as of Sun Jan 12 14:17:51 CST 2025
shminfo:
	shmmax: 4194304	(max shared memory segment size)
	shmmin:       1	(min shared memory segment size)
	shmmni:      32	(max number of shared memory identifiers)
	shmseg:       8	(max shared memory segments per process)
	shmall:    1024	(max amount of shared memory in pages)

The tests still fail after increasing kern.sysv.shmall to 25600 and kern.sysv.shmmax to 52428800. I also tried to increase shmmni but it failed with a permission error (due to SIP?).

@wolfgangwalther
Contributor Author

Seems like those same errors already appeared in 2022: #198495.

Imho this implies that there must be a way to fix this, very likely via some system configuration.

@wolfgangwalther
Contributor Author

wolfgangwalther commented Feb 20, 2025

This is now happening for aarch64-darwin on hydra as well:
https://hydra.nixos.org/eval/1812132?filter=postgresql%5C_&compare=1812115#tabs-now-fail

Edit: Rebuilding seems to have fixed it. The failing builds were postgresql_15, postgresql_17 and postgresql_17_jit, IIRC. So there is some flakiness in play as well.

@wolfgangwalther wolfgangwalther changed the title Build failure: postgresql on x86_64-darwin Build failure: postgresql on darwin Feb 20, 2025
@wolfgangwalther
Contributor Author

I'm looking into this again and here are some observations:

  • At first I ran various builds of postgresql in different versions and some packages with postgresqlTestHook successfully, both on x86_64 and aarch64. I was not able to reproduce the error at all.
  • At some point, suddenly they started to fail. And then all of them started to fail with the same error.
  • All of this on the same machine, the community builder.
  • No specific high load / memory pressure or anything.
  • It also happens when I run initdb manually outside nix. So actually unrelated to nix and/or any sandboxing issues.

The current sysctl settings regarding shared memory are:

kern.sysv.shmmax: 4194304
kern.sysv.shmmin: 1
kern.sysv.shmmni: 32
kern.sysv.shmseg: 8
kern.sysv.shmall: 1024

One further note: The shmget call that is failing is trying to request 56 bytes of memory. It's... unlikely that I can still navigate in the shell and run initdb, if I have less than 56 bytes of memory available...
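To put the 56-byte request in perspective, here is a rough back-of-the-envelope sketch of what the limits above allow. It assumes 4096-byte pages and that each segment is rounded up to a whole page. Both are assumptions, not something verified on the builders, and the function names are made up for illustration. It only does the arithmetic and does not explain which limit the failing shmget call actually hits.

```bash
#!/usr/bin/env bash
# Limits as reported by sysctl above.
shmall_pages=1024   # kern.sysv.shmall, counted in pages
shmmni=32           # kern.sysv.shmmni, max number of identifiers
# Assumption: 4096-byte pages (Apple silicon uses 16 KiB pages instead).
page_size=4096

# Total bytes of System V shared memory permitted by shmall.
shm_total_bytes() {
  echo $(( shmall_pages * page_size ))
}

# Pages needed for one segment, assuming sizes round up to whole pages.
pages_for_segment() {
  local size=$1
  echo $(( (size + page_size - 1) / page_size ))
}

# How many segments of the given size fit before either limit is reached.
max_segments() {
  local size=$1
  local by_pages=$(( shmall_pages / $(pages_for_segment "$size") ))
  echo $(( by_pages < shmmni ? by_pages : shmmni ))
}
```

Under these assumptions, `shm_total_bytes` equals the shmmax value shown above, and `max_segments 56` is bounded by the number of identifiers rather than by the number of bytes.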

There is an interesting comment in the postgres source code about this topic here:

https://github.com/postgres/postgres/blob/2c53dec7f4407c022f8b83e1a63fe0ae1bbb4dc2/src/backend/port/sysv_shmem.c#L42-L67

After googling around for this specific error with size=56... I found this:

https://discourse.nixos.org/t/nixbld-leaving-around-shared-memory-segments/30043/5

This seems to be exactly on point. The community builder currently has some left-over shared memory segments:

> ipcs -ma
IPC status from <running system> as of Thu Feb 20 20:55:27 GMT 2025
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP NATTCH  SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
Shared Memory:
m 32833536 0x11b4e1a3 --rw------- _nixbld14   nixbld _nixbld14   nixbld      0     56  67280  67280 18:44:33 18:45:11 18:44:33
m 24510465 0x11b4edc2 --rw------- _nixbld14   nixbld _nixbld14   nixbld      0     56  69961  69961 18:45:40 18:48:00 18:45:40
m 32178178 0x11b4d6b6 --rw------- _nixbld14   nixbld _nixbld14   nixbld      0     56  65288  65288 18:43:06 18:44:09 18:43:06
m 26738692 0x11b5fe7a --rw------- _nixbld14   nixbld _nixbld14   nixbld      0     56  78786  78786 18:48:28 18:52:40 18:48:28
m 786443 0x11b85786 --rw------- _nixbld13   nixbld _nixbld13   nixbld      0     56  29841  29841 19:04:24 19:05:24 19:04:24

The timestamps show that they were created when I first started to work on the community builder today. I tried to run some derivations with postgresqlTestHook, which failed for other reasons or got stuck, and then I cancelled those builds. The first couple of tries worked fine, but then we were out of shared memory segments. And now, no other build will succeed.

Here's what happens, I think:

  • We run any derivation running a postgresql cluster, e.g. via postgresqlTestHook.
  • The derivation fails in the tests.
  • The cluster is not shut down properly. Shared Memory segments are not released.
  • Repeat a few times - and we can't run initdb anymore.

Since I don't have root access to the community builder, I can't clear those memory segments up for further testing.
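For someone who does have sufficient privileges, a clean-up could look roughly like the sketch below. It relies on the standard ipcs and ipcrm tools. The function names are hypothetical, and the column positions are taken from the `ipcs -ma` listing above, so they are an assumption about that output format. Whether removing these segments on a shared builder is appropriate is a separate question.

```bash
#!/usr/bin/env bash
# Print the IDs of shared memory segments that are owned by a Nix build user
# and have no attached processes. Reads `ipcs -ma` style output on stdin.
# Column positions follow the listing above: $2=ID, $5=OWNER, $9=NATTCH.
stale_nixbld_segments() {
  awk '$1 == "m" && $5 ~ /^_nixbld/ && $9 == 0 { print $2 }'
}

# Remove every stale segment. Needs root or the owning user.
# Not invoked automatically: call it explicitly after reviewing the list.
remove_stale_nixbld_segments() {
  ipcs -ma | stale_nixbld_segments | while read -r id; do
    ipcrm -m "$id"
  done
}
```

Running `ipcs -ma | stale_nixbld_segments` first only lists the candidate IDs, which makes it possible to check them before calling `remove_stale_nixbld_segments`.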

This seems to go wrong in a couple of places:

  • It's unclear why postgres even creates such big Sys V memory segments. We might be able to force it to use mmap.
  • We might not clean up resources properly in postgresqlTestHook, but I'm not sure whether we actually can.
  • Why does PostgreSQL leave those shared memory segments behind? Maybe it's killed and can't do anything about them anymore.
  • Should Nix clean this up?

@Ma27
Member

Ma27 commented Feb 21, 2025

Oof, thanks a lot for the investigation.

> Should Nix clean this up?

I'd argue that ultimately, yes it should. For two reasons:

  • if the build hangs for whatever reason and gets killed, it's up to Nix anyway to do this.
  • otherwise, the maintainers of each package doing something like this (granted, I don't know if anyone else is affected) would have to implement the cleanup themselves.

@wolfgangwalther
Contributor Author

> It's unclear why postgres even creates such big Sys V memory segments. We might be able to force it to use mmap.

I was fooled by the output again. While those memory segments are blocking further builds, they are not big: they are the same 56 bytes that can be seen in the SEGSZ column. This also means they can't be switched to mmap, because they are only allocated as a kind of lock on the data directory, and that always uses Sys V shared memory.

They are cleaned up on the Linux sandbox automatically, because Linux stores them on a tmpfs mounted at /dev/shm, which is entirely removed.

PostgreSQL removes them on INT, QUIT and TERM signals. But unsurprisingly doesn't on ABRT and KILL.
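The difference between the signals can be illustrated with a small stand-in. The sketch below is not PostgreSQL: a child shell creates a marker file (standing in for the shared memory segment) and removes it in a TERM handler. The function name and the marker file are made up for the illustration.

```bash
#!/usr/bin/env bash
# Start a child that creates a marker file and removes it in a TERM handler,
# send the child the given signal, and report whether the marker survived.
leftover_after_signal() {
  local sig=$1 dir marker pid
  dir=$(mktemp -d)
  marker=$dir/segment
  # The child installs its trap first, then creates the marker and idles.
  bash -c 'trap "rm -f \"$1\"; exit 0" TERM; : > "$1"; while :; do sleep 0.1; done' \
    _ "$marker" >/dev/null 2>&1 &
  pid=$!
  # Wait until the child is fully set up before signalling it.
  until [ -e "$marker" ]; do sleep 0.1; done
  kill -s "$sig" "$pid"
  wait "$pid" 2>/dev/null || true
  if [ -e "$marker" ]; then echo "leaked"; else echo "cleaned"; fi
  rm -rf "$dir"
}
```

`leftover_after_signal TERM` gives the handler a chance to run and clean up, while `leftover_after_signal KILL` terminates the child without running any handler, so the marker is left behind.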

So I guess we can't do anything on the nixpkgs side.

@wolfgangwalther
Contributor Author

wolfgangwalther commented Feb 21, 2025

Created NixOS/nix#12548 on the Nix side.

And https://git.lix.systems/lix-project/lix/issues/691 for Lix.

@toonn
Contributor

toonn commented Feb 22, 2025

Can't you trap the signal and send TERM to PostgreSQL before potentially repeating the ABRT/KILL?

@wolfgangwalther
Contributor Author

> Can't you trap the signal and send TERM to PostgreSQL before potentially repeating the ABRT/KILL?

IIUC, the whole point of KILL is that it can't be trapped.

I'm not exactly sure how nix kills the builder, though. Based on the following observations, I assume it must be via KILL. Consider this example:

nix-build --no-link -E "with import ./. {}; runCommand \"test\" {} \"echo hello; trap 'echo trapped' EXIT; sleep 1000\""

This will print hello, then sleep. When you cancel the build, it does not show "trapped" for me.

To double check, run it straight in bash:

bash -c "echo hello; trap 'echo trapped' EXIT; sleep 1000"

This will print "trapped" when killing.

So I don't think we can trap this.

8 participants