We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998
I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.
The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.
I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.
This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.
fixes: #11294
Refactors our DRPC service definitions slightly.
In the previous version, I inserted the RPCs from the tailnet proto directly into the Agent service. This makes things hard to deal with because DRPC then generates a new set of methods with new interfaces with the `DRPCAgent_` prefixed. Since you can't have a single method that takes different argument types, we couldn't reuse the implementation of those RFCs without a lot of extra classes and pass-thru methods.
Instead, the "right" way to do it is to integrate at the DRPC layer. So, we have two DRPC services available over the Agent websocket, and register them both on the DRPC `mux`.
Since the tailnet proto RPC service is now for both clients and agents, I renamed some things to clarify and shorten.
This PR also removes the `TailnetAPI` implementation from the `agentapi` package, and the next PR in the stack replaces it with the implementation from the `tailnet` package.
Part of #10532
Adds a tailnet ClientService that accepts a net.Conn and serves v1 or v2 of the tailnet API.
Also adds a DRPCService that implements the DRPC interface for the v2 API. This component is within the ClientService, but needs to be reusable and exported so that we can also embed it in the Agent API.
Finally, includes a NewDRPCClient function that takes a net.Conn and runs dRPC in yamux over it on the client side.
* chore(Makefile): use golangci-lint version from dogfood Dockerfile
* chore(dogfood/Dockerfile): update golangci-lint to latest version
* chore(coderd): address linter complaints
re: #10528
Refactors PG Coordinator to work with the Tailnet v2 API, including wrappers for the existing v1 API.
The debug endpoint functions, but doesn't return sensible data, that will be in another stacked PR.
Fixes an issue where remote forwards are not correctly torn down when using OpenSSH with `coder ssh --stdio`. OpenSSH sends a disconnect signal, but then also sends SIGHUP to `coder`. Previously, we just exited when we got SIGHUP, and this raced against properly disconnecting.
Fixes https://github.com/coder/customers/issues/327
I've said it before, I'll say it again: you can't create a timed context before calling `t.Parallel()` and then use it after.
Fixes flakes like https://github.com/coder/coder/actions/runs/6716682414/job/18253279157
I've chosen just to drop `t.Parallel()` entirely rather than create a second context after the parallel call, since the vast majority of the test time happens before where the parallel call was. It does all the tailnet setup before `t.Parallel()`.
Leaving a call to `t.Parallel()` is a bug risk for future maintainers to come in and use the wrong context in the latter part of the test by accident.
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
* work around websocket deadline bug
Signed-off-by: Spike Curtis <spike@coder.com>
* Use test context to hold websocket open
Signed-off-by: Spike Curtis <spike@coder.com>
* Fix race creating test websocket
Signed-off-by: Spike Curtis <spike@coder.com>
* set write deadline to time.Time zero
Signed-off-by: Spike Curtis <spike@coder.com>
---------
Signed-off-by: Spike Curtis <spike@coder.com>
* feat: allow DERP headers to be set
* chore: remove custom flag
* Clone DERP header on client create
* Adjust to use interface to cast headers
---------
Co-authored-by: Kyle Carberry <kyle@carberry.com>
Prior to this change, DERP traffic would route from `coderd` to the
`CODER_ACCESS_URL` to reach the internal DERP server, which may have
resulted in slower connections due to proxying, or the failure of
web traffic entirely.
If your Coder deployment has a proxy in front of it, your traffic through
web terminals, apps, and port-forwarding is about to get a lot faster!
* feat: automatically use websockets if DERP upgrade is unavailable
This might be our biggest hangup for deployments at the moment...
Load balancers by default do not support the DERP protocol, so many
of our prospects and customers run into failing workspace connections.
This automatically swaps to use WebSockets, and reports the reason to
coderd.
In a future contribution, a warning will appear by the agent if it was
forced to use WebSockets instead of DERP.
* Fix nil pointer type in Tailscale dep
* Fix requested changes
* fix(tailnet): Improve start and close to detect connection races
* fix: Prevent agentConn use before ready via AwaitReachable
* fix(tailnet): Ensure connstats are closed on conn close
* fix(codersdk): Use AwaitReachable in DialWorkspaceAgent
* fix(tailnet): Improve logging via slog.Helper()
If an agent went away and reconnected, the wsconncache connection would
be polluted for about 10m because there would be two peers with the
same IP. The old peer always had priority, which caused the dashboard to
try and always dial the old peer until it was removed.
Fixes: https://github.com/coder/coder/issues/5292
* Add extio
* feat: Add `vscodeipc` subcommand for VS Code Extension
This enables the VS Code extension to communicate with a Coder client.
The extension will download the slim binary from `/bin/*` for the
respective client architecture and OS, then execute `coder vscodeipc`
for the connecting workspace.
* Add authentication header, improve comments, and add tests for the CLI
* Update cli/vscodeipc_test.go
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Update cli/vscodeipc_test.go
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Update cli/vscodeipc/vscodeipc_test.go
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Fix requested changes
* Fix IPC tests
* Fix shell execution
* Fix nix flake
* Silence usage
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* fix: Improve tailnet connections by reducing timeouts
This awaits connection ping before running a dial. Before,
we were hitting the TCP retransmission and handshake timeouts,
which could intermittently add 1 or 5 seconds to a connection
being initialized.
* Update Tailscale
This could actually cause connections to intermittently fail too
when a CPU is absolutely pegged. It just so happens that only
our runners have been that slow!
Fixes#4607.
* fix: Tidy up closes for nicer output
There was a context canceled message that would appear
because of traces, and this was using the wrong close.
I don't think it was causing any specific problems, but
it could make a replica warning appear on restart.
* Fix migration and experimental
* feat: HA tailnet coordinator
* fixup! feat: HA tailnet coordinator
* fixup! feat: HA tailnet coordinator
* remove printlns
* close all connections on coordinator
* impelement high availability feature
* fixup! impelement high availability feature
* fixup! impelement high availability feature
* fixup! impelement high availability feature
* fixup! impelement high availability feature
* Add replicas
* Add DERP meshing to arbitrary addresses
* Move packages to highavailability folder
* Move coordinator to high availability package
* Add flags for HA
* Rename to replicasync
* Denest packages for replicas
* Add test for multiple replicas
* Fix coordination test
* Add HA to the helm chart
* Rename function pointer
* Add warnings for HA
* Add the ability to block endpoints
* Add flag to disable P2P connections
* Wow, I made the tests pass
* Add replicas endpoint
* Ensure close kills replica
* Update sql
* Add database latency to high availability
* Pipe TLS to DERP mesh
* Fix DERP mesh with TLS
* Add tests for TLS
* Fix replica sync TLS
* Fix RootCA for replica meshing
* Remove ID from replicasync
* Fix getting certificates for meshing
* Remove excessive locking
* Fix linting
* Store mesh key in the database
* Fix replica key for tests
* Fix types gen
* Fix unlocking unlocked
* Fix race in tests
* Update enterprise/derpmesh/derpmesh.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* Rename to syncReplicas
* Reuse http client
* Delete old replicas on a CRON
* Fix race condition in connection tests
* Fix linting
* Fix nil type
* Move pubsub to in-memory for twenty test
* Add comment for configuration tweaking
* Fix leak with transport
* Fix close leak in derpmesh
* Fix race when creating server
* Remove handler update
* Skip test on Windows
* Fix DERP mesh test
* Wrap HTTP handler replacement in mutex
* Fix error message for relay
* Fix API handler for normal tests
* Fix speedtest
* Fix replica resend
* Fix derpmesh send
* Ping async
* Increase wait time of template version jobd
* Fix race when closing replica sync
* Add name to client
* Log the derpmap being used
* Don't connect if DERP is empty
* Improve agent coordinator logging
* Fix lock in coordinator
* Fix relay addr
* Fix race when updating durations
* Fix client publish race
* Run pubsub loop in a queue
* Store agent nodes in order
* Fix coordinator locking
* Check for closed pipe
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* chore: Add comments to indicate what each field on a network node means
* Update tailnet/coordinator.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* Update tailnet/coordinator.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* Update tailnet/coordinator.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* fix: Only hold `tailnet.*Conn.Close()` for a short duration
The long duration could be cause to a test deadlock.
* Add closed chan to listener struct