Commit Graph

146 Commits

Author SHA1 Message Date
Colin Adler 15157c1c40
chore: add network integration test suite scaffolding (#13072)
* chore: add network integration test suite scaffolding

* dean comments
2024-04-26 17:48:41 +00:00
Colin Adler 777dfbe965
feat(enterprise): add ready for handshake support to pgcoord (#12935) 2024-04-16 15:01:10 -05:00
Colin Adler e801e878ba
feat: add agent acks to in-memory coordinator (#12786)
When an agent receives a node, it responds with an ACK which is relayed
to the client. After the client receives the ACK, it's allowed to begin
pinging.
2024-04-10 17:15:33 -05:00
Marcin Tojek b6359b0a89
fix: ignore gomock temporary files (#12924) 2024-04-10 08:48:56 +00:00
Colin Adler 4d5a7b2d56
chore(codersdk): move all tailscale imports out of `codersdk` (#12735)
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
2024-03-26 12:44:31 -05:00
Colin Adler 66154f937e
fix(coderd): pass block endpoints into servertailnet (#12149) 2024-03-08 05:29:54 +00:00
Colin Adler e5d911462f
fix(tailnet): enforce valid agent and client addresses (#12197)
This adds the ability for `TunnelAuth` to also authorize incoming wireguard node IPs, preventing agents from reporting anything other than their static IP generated from the agent ID.
2024-03-01 09:02:33 -06:00
Spike Curtis 4e7beee102
feat: show tailnet peer diagnostics after coder ping (#12314)
Beginnings of a solution to #12297 

Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.

For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.

```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via  95.217.xxx.yyy:42631  in 147ms
pong from dogfood2 p2p via  95.217.xxx.yyy:42631  in 140ms
pong from dogfood2 p2p via  95.217.xxx.yyy:42631  in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
    preferred DERP 10013 (Europe Fly.io (Paris))
    endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```
2024-02-27 22:04:46 +04:00
Spike Curtis af3fdc68c3
chore: refactor agent routines that use the v2 API (#12223)
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.

This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
2024-02-23 11:04:23 +04:00
Dean Sheather 9861830e87
fix: never send local endpoints if disabled (#12138) 2024-02-20 15:51:25 +10:00
Cian Johnston a2cbb0f87f
fix(enterprise/coderd): check provisionerd API version on connection (#12191) 2024-02-16 18:43:07 +00:00
Spike Curtis 2d0b9106c0
fix: change servertailnet to register the DERP dialer before setting DERP map (#12137)
I noticed a possible race where tailnet.Conn can try to dial the embedded region before we've set our custom dialer that send the DERP in-memory.  This closes that race and adds a test case for servertailnet with no STUN and an embedded relay
2024-02-15 10:51:12 +04:00
Spike Curtis e5ba586e30
fix: fix graceful disconnect in DialWorkspaceAgent (#11993)
I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one. 

What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.

So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.
2024-02-05 14:01:37 +04:00
Spike Curtis bb99cb7d2b
chore: move FakeCoordinator to tailnettest (#11992)
Moves FakeCoordinator to tailnettest since it's reused in testing multiple packages in this stack of PRs.
2024-02-05 13:49:32 +04:00
Spike Curtis 646ac942b2
chore: rename FakeCoordinator for export (#11991)
Part of a stack that fixes graceful disconnect from the CLI to tailnet.  I reuse FakeCoordinator in a test for graceful disconnects.
2024-02-05 13:33:31 +04:00
Spike Curtis 520b12e1a2
fix: close MultiAgentConn when coordinator closes (#11941)
Fixes an issue where a MultiAgentConn isn't closed properly when the coordinator it is connected to is closed.

Since servertailnet checks whether the conn is closed before reinitializing, it is important that we check this, otherwise servertailnet can get stuck if the coordinator closes (e.g. when we switch from AGPL to PGCoordinator after decoding a license).
2024-01-31 00:38:19 +04:00
Spike Curtis d3983e4dba
feat: add logging to client tailnet yamux (#11908)
Adds logging to yamux when used for tailnet client connections, e.g. CLI and wsproxy.  This could be useful for debugging connection issues with tailnet v2 API.
2024-01-30 09:58:59 +04:00
Spike Curtis 1e8a9c09fe
chore: remove legacy wsconncache (#11816)
Fixes #8218

Removes `wsconncache` and related "is legacy?" functions and API calls that were used by it.

The only leftover is that Agents still use the legacy IP, so that back level clients or workspace proxies can dial them correctly.

We should eventually remove this: #11819
2024-01-30 07:56:36 +04:00
Colin Adler bc14e926d8
feat: add option to speedtest to dump a pcap of network traffic (#11848) 2024-01-29 09:57:31 -06:00
Spike Curtis 5cbb76b47a
fix: stop spamming DERP map updates for equivalent maps (#11792)
Fixes 2 related issues:

1. wsconncache had incorrect logic to test whether to send DERPMap updates, sending if the maps were equivalent, instead of if they were _not equivalent_.
2. configmaps used a bugged check to test equality between DERPMaps, since it contains a map and the map entries are serialized in random order. Instead, we avoid comparing the protobufs and instead depend on the existing function that compares `tailcfg.DERPMap`. This also has the effect of reducing the number of times we convert to and from protobuf.
2024-01-24 16:27:15 +04:00
Spike Curtis 059e533544
feat: agent uses Tailnet v2 API for DERPMap updates (#11698)
Switches the Agent to use Tailnet v2 API to get DERPMap updates.

Subsequent PRs will do the same for the CLI (`codersdk`) and `wsproxy`.
2024-01-23 14:42:07 +04:00
Spike Curtis 3e0e7f8739
feat: check agent API version on connection (#11696)
fixes #10531

Adds a check for `version` on connection to the Agent API websocket endpoint.  This is primarily for future-proofing, so that up-level agents get a sensible error if they connect to a back-level Coderd.

It also refactors the location of the `CurrentVersion` variables, to be part of the `proto` packages, since the versions refer to the APIs defined therein.
2024-01-23 14:27:49 +04:00
Spike Curtis eb12fd7d92
feat: make ServerTailnet set peers lost when it reconnects to the coordinator (#11682)
Adds support to `ServerTailnet` to set all peers lost before attempting to reconnect to the coordinator. In practice, this only really affects `wsproxy` since coderd has a local connection to the coordinator that only goes down if we're shutting down or change licenses.
2024-01-23 13:17:56 +04:00
Spike Curtis 5388a1b6d7
fix: use TSMP ping for reachability, not latency (#11749)
Use TSMP ping for reachability, but leave Disco ping for when we call Ping() since we often use that to determine whether we have a direct connection.

Also adds unit tests to make sure Ping() returns direct connection vs DERP correctly.
2024-01-22 17:37:15 +04:00
Spike Curtis 7ffd99cfe2
fix: use DiscoPing (partially reverts #11306) (#11744) 2024-01-22 12:40:21 +00:00
Spike Curtis 3d85cdfa11
feat: set peers lost when disconnected from coordinator (#11681)
Adds support to Coordination to call SetAllPeersLost() when it is closed. This ensure that when we disconnect from a Coordinator, we set all peers lost.

This covers CoderSDK (CLI client) and Agent.  Next PR will cover MultiAgent (notably, `wsproxy`).
2024-01-22 15:26:20 +04:00
Spike Curtis b7b936547d
feat: add setAllPeersLost to the configMaps subcomponent (#11665)
adds setAllPeersLost to the configMaps subcomponent of tailnet.Conn --- we'll call this when we disconnect from a coordinator so we'll eventually clean up peers if they disconnect while we are retrying the coordinator connection (or we don't succeed in reconnecting to the coordinator).
2024-01-22 12:12:15 +04:00
Spike Curtis f01cab9894
feat: use tailnet v2 API for coordination (#11638)
This one is huge, and I'm sorry.

The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.

There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
2024-01-22 11:07:50 +04:00
Spike Curtis 8910ac715c
feat: add tailnet v2 support to wsproxy coordinate endpoint (#11637)
wsproxy also needs to be updated to use tailnet v2 because the `tailnet.Conn` stores peers by ID, and the peerID was not being carried by the JSON protocol.  This adds a query param to the endpoint to conditionally switch to the new protocol.
2024-01-18 10:10:36 +04:00
Spike Curtis 07427e06f7
chore: add setBlockEndpoints to nodeUpdater (#11636)
nodeUpdater also needs block endpoints, so that it can stop sending nodes with endpoints.
2024-01-18 10:02:15 +04:00
Spike Curtis 5b4de667d6
chore: add setCallback to nodeUpdater (#11635)
we need to be able to (re-)set the node callback when we lose and regain connection to a coordinator over the network.
2024-01-18 09:51:09 +04:00
Spike Curtis e725f9d7d4
chore: stop passing addresses on configMaps constructor (#11634)
moving this out of the constructor so that setting this when creating a new `tailnet.Conn` triggers configuring the engine.
2024-01-18 09:43:28 +04:00
Spike Curtis a514df71ed
chore: add setDERPMap to configMaps (#11590)
Add setDERPMap
2024-01-18 09:34:30 +04:00
Spike Curtis 25e289e1f6
chore: add setAddresses to nodeUpdater (#11571)
Adds setAddresses to nodeUpdater
2024-01-18 09:24:16 +04:00
Spike Curtis 2aa3cbbd03
chore: add logging to nodeUpdater (#11569)
Add debug logging for nodeUpdater and configMaps
2024-01-17 14:15:45 +04:00
Spike Curtis 38d9ce5267
chore: add setStatus support to nodeUpdater (#11568)
Add support for the wgengine Status callback to nodeUpdater
2024-01-17 09:06:34 +04:00
Spike Curtis f6dc707511
chore: add DERPForcedWebsocket to nodeUpdater (#11567)
Add support for DERPForcedWebsocket to nodeUpdater
2024-01-17 08:55:45 +04:00
Spike Curtis 8701dbc874
chore: add nodeUpdater to tailnet (#11539)
Adds a nodeUpdater component, which serves a similar role to configMaps, but tracks information from tailscale going out to the coordinator as node updates.  This first PR just handles netInfo, subsequent PRs will
handle DERP forced websockets, endpoints, and addresses.
2024-01-11 09:29:42 +04:00
Spike Curtis 7005fb1b2f
chore: add support for blockEndpoints to configMaps (#11512)
Adds support for setting blockEndpoints on the configMaps
2024-01-11 09:18:31 +04:00
Spike Curtis 617ecbfb1f
chore: add support for peer updates to tailnet.configMaps (#11487)
Adds support to configMaps to handle peer updates including lost and disconnected peers
2024-01-11 09:11:43 +04:00
Spike Curtis 89e3bbe0f5
chore: add configMaps component to tailnet (#11400)
Work in progress on a subcomponent of the Conn which will handle configuring the wireguard engine on changes.  I've implemented setAddresses as the simplest case and added unit tests of the reconfiguration loop.

Besides making the code easier to test and understand, the goal is for this component to handle disconnect and loss updates about peers, and thereby, implement the v2 Tailnet API.

Further PRs will handle peer updates, status updates, and net info updates.

Then, after the subcomponent is implemented and tested, I will refactor Conn to use it instead of the current monolithic architecture.
2024-01-10 10:58:53 +04:00
Cian Johnston 4d2fe2685a
chore(coderd): extract api version validation to util package (#11407) 2024-01-05 10:22:07 +00:00
Spike Curtis 58873fa7e2
chore: remove unused context/cancel in tailnet Conn (#11399)
Spotted during code read; unused fields
2024-01-05 08:15:42 +04:00
Steven Masley dd05a6b13a
chore: mockgen archived, moved to new location (#11415)
* chore: mockgen archived, moved to new location
2024-01-04 18:35:56 -06:00
Spike Curtis 520c3a8ff7
fix: use TSMP for pings and checking reachability (#11306)
We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998

I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.

The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.

I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.

This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.

fixes: #11294
2024-01-02 15:53:52 +04:00
Spike Curtis 25f2abf9ab
chore: remove tailnet from agent API and rename client API to tailnet (#11303)
Refactors our DRPC service definitions slightly.

In the previous version, I inserted the RPCs from the tailnet proto directly into the Agent service.  This makes things hard to deal with because DRPC then generates a new set of methods with new interfaces with the `DRPCAgent_` prefixed.  Since you can't have a single method that takes different argument types, we couldn't reuse the implementation of those RFCs without a lot of extra classes and pass-thru methods.

Instead, the "right" way to do it is to integrate at the DRPC layer.  So, we have two DRPC services available over the Agent websocket, and register them both on the DRPC `mux`.

Since the tailnet proto RPC service is now for both clients and agents, I renamed some things to clarify and shorten.

This PR also removes the `TailnetAPI` implementation from the `agentapi` package, and the next PR in the stack replaces it with the implementation from the `tailnet` package.
2024-01-02 10:02:45 +04:00
Spike Curtis d257f8163d
feat: implement DERP streaming on tailnet Client API (#11302)
Implements DERPMap streaming from client API.

In a subsequent PR I plan to remove the implementation in coderd/agentapi in favor of the tailnet one
2024-01-02 08:07:57 +04:00
Dean Sheather e46431078c
feat: add AgentAPI using DRPC (#10811)
Co-authored-by: Spike Curtis <spike@coder.com>
2023-12-18 22:53:28 +10:00
Spike Curtis a58e4febb9
feat: add tailnet v2 Service and Client (#11225)
Part of #10532

Adds a tailnet ClientService that accepts a net.Conn and serves v1 or v2 of the tailnet API.

Also adds a DRPCService that implements the DRPC interface for the v2 API.  This component is within the ClientService, but needs to be reusable and exported so that we can also embed it in the Agent API.

Finally, includes a NewDRPCClient function that takes a net.Conn and runs dRPC in yamux over it on the client side.
2023-12-15 12:48:39 +04:00
Spike Curtis 30f032d282
feat: add tailnet ValidateVersion (#11223)
Part of #10532

Adds a method to validate a requested version of the tailnet API
2023-12-15 11:49:30 +04:00