Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
* fix: separate signals for passive, active, and forced shutdown
`SIGTERM`: Passive shutdown stopping provisioner daemons from accepting new
jobs but waiting for existing jobs to successfully complete.
`SIGINT` (old existing behavior): Notify provisioner daemons to cancel in-flight jobs, wait 5s for jobs to be exited, then force quit.
`SIGKILL`: Untouched from before, will force-quit.
* Revert dramatic signal changes
* Rename
* Fix shutdown behavior for provisioner daemons
* Add test for graceful shutdown
I noticed in my logs that sometimes `coder ssh` doesn't gracefully disconnect from the coordinator.
The cause is the `closerStack` construct we use in that function. It has two paths to start closing things down:
1. explicit `close()` which we do in `defer`
2. context cancellation, which happens if the cli function returns an error
sometimes the ssh remote command returns an error, and this triggers context cancellation of the `closerStack`. That is fine in and of itself, but we still want the explicit `close()` to wait until everything is closed before returning, since that's where we do cleanup, including the graceful disconnect. Prior to this fix the `close()` just immediately exits if another goroutine is closing the stack. Here we add a wait until everything is done.
Beginnings of a solution to #12297
Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.
For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.
```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 147ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
pong from dogfood2 p2p via 95.217.xxx.yyy:42631 in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
preferred DERP 10013 (Europe Fly.io (Paris))
endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```
Man, graceful shutdown is hard. Even after my changes, we were still hitting a graceful shutdown race: https://github.com/coder/coder/runs/18886842123
The problem was that while we attempt a graceful shutdown at the SSH layer by closing the session for writing, we were not giving it a chance to complete before continuing to tear down the stack of closers, including one that closes the netstack, and thus drop the TCP connection before it closes.
Re-enables TestSSH/RemoteForward_Unix_Signal and addresses the underlying race: we were not closing the remote forward on context expiry, only the session and connection.
However, there is still a more fundamental issue in that we don't have the ability to ensure that TCP sessions are properly terminated before tearing down the Tailnet conn. This is due to the assumption in the sockets API, that the underlying IP interface is long
lived compared with the TCP socket, and thus closing a socket returns immediately and does not wait for the TCP termination handshake --- that is handled async in the tcpip stack. However, this assumption does not hold for us and tailnet, since on shutdown,
we also tear down the tailnet connection, and this can race with the TCP termination.
Closing the remote forward explicitly should prevent forward state from accumulating, since the Close() function waits for a reply from the remote SSH server.
I've also attempted to workaround the TCP/tailnet issue for `--stdio` by using `CloseWrite()` instead of `Close()`. By closing the write side of the connection, half-close the TCP connection, and the server detects this and closes the other direction, which then
triggers our read loop to exit only after the server has had a chance to process the close.
TODO in a stacked PR is to implement this logic for `vscodessh` as well.
Adds a Logger to cli Invocation and standardizes CLI commands to use it. clitest creates a test logger by default so that CLI command logs are captured in the test logs.
CLI commands that do their own log configuration are modified to add sinks to the existing logger, rather than create a new one. This ensures we still capture logs in CLI tests.
Fixes an issue where remote forwards are not correctly torn down when using OpenSSH with `coder ssh --stdio`. OpenSSH sends a disconnect signal, but then also sends SIGHUP to `coder`. Previously, we just exited when we got SIGHUP, and this raced against properly disconnecting.
Fixes https://github.com/coder/customers/issues/327
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
* chore: rename startup logs to agent logs
This also adds a `source` property to every agent log. It
should allow us to group logs and display them nicer in
the UI as they stream in.
* Fix migration order
* Fix naming
* Rename the frontend
* Fix tests
* Fix down migration
* Match enums for workspace agent logs
* Fix inserting log source
* Fix migration order
* Fix logs tests
* Fix psql insert
- (breaking) Protects Logger and LogBodies fields of codersdk.Client with its mutex. This addresses a data race in cli/scaletest.
- Fillets the existing cli/createworkspaces unit test and moves the testing logic there into the tests under scaletest/createworkspaces.
- Adds testutil.RaceEnabled bool const and conditionaly skips previously-skipped tests under scaletest/ if the race detector is enabled. This is unfortunate and sad, but I would prefer to have these tests at least running without the race detector than not running at all.
- Adds IgnoreErrors option to fake in-memory agent loggers; having the agents fail the test immediately when they encounter any sort of error isn't really helpful.
* fix(cli/ssh): Avoid connection hang when workspace is stopped
Two issues are addressed here:
1. We were not detecting disconnects due to waiting for Stdin to close
(disconnect would only propagate after entering input and failing to
write to the connection).
2. In other scenarios, where the connection drop is not detected, we now
also watch workspace status and drop the connection when a workspace
reaches the stopped state.
Fixes: https://github.com/coder/jetbrains-coder/issues/199
Refs: #6180, #6175
The VS Code extension has been refactored to use VS Code
Remote SSH instead of using the private API.
This changes the structure to continue using SSH, but
output network information periodically to a file.