coder/.vscode/settings.json

223 lines
4.1 KiB
JSON
Raw Normal View History

{
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"cSpell.words": [
"afero",
"agentsdk",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"apps",
"ASKPASS",
"authcheck",
"autostop",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"awsidentity",
"bodyclose",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"buildinfo",
"buildname",
"circbuf",
"cliflag",
"cliui",
"codecov",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"coderd",
"coderdenttest",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"coderdtest",
"codersdk",
"cronstrue",
"databasefake",
"dbfake",
"dbgen",
"dbtype",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"DERP",
"derphttp",
"derpmap",
"devel",
"devtunnel",
feat: Add high availability for multiple replicas (#4555) * feat: HA tailnet coordinator * fixup! feat: HA tailnet coordinator * fixup! feat: HA tailnet coordinator * remove printlns * close all connections on coordinator * impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * Add replicas * Add DERP meshing to arbitrary addresses * Move packages to highavailability folder * Move coordinator to high availability package * Add flags for HA * Rename to replicasync * Denest packages for replicas * Add test for multiple replicas * Fix coordination test * Add HA to the helm chart * Rename function pointer * Add warnings for HA * Add the ability to block endpoints * Add flag to disable P2P connections * Wow, I made the tests pass * Add replicas endpoint * Ensure close kills replica * Update sql * Add database latency to high availability * Pipe TLS to DERP mesh * Fix DERP mesh with TLS * Add tests for TLS * Fix replica sync TLS * Fix RootCA for replica meshing * Remove ID from replicasync * Fix getting certificates for meshing * Remove excessive locking * Fix linting * Store mesh key in the database * Fix replica key for tests * Fix types gen * Fix unlocking unlocked * Fix race in tests * Update enterprise/derpmesh/derpmesh.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Rename to syncReplicas * Reuse http client * Delete old replicas on a CRON * Fix race condition in connection tests * Fix linting * Fix nil type * Move pubsub to in-memory for twenty test * Add comment for configuration tweaking * Fix leak with transport * Fix close leak in derpmesh * Fix race when creating server * Remove handler update * Skip test on Windows * Fix DERP mesh test * Wrap HTTP handler replacement in mutex * Fix error message for relay * Fix API handler for normal tests * Fix speedtest * Fix replica resend * Fix derpmesh send * Ping async * Increase wait time of template version jobd * Fix race when closing replica sync * Add name to client * Log the derpmap being used * Don't connect if DERP is empty * Improve agent coordinator logging * Fix lock in coordinator * Fix relay addr * Fix race when updating durations * Fix client publish race * Run pubsub loop in a queue * Store agent nodes in order * Fix coordinator locking * Check for closed pipe Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-10-17 13:43:30 +00:00
"dflags",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"drpc",
"drpcconn",
"drpcmux",
"drpcserver",
"Dsts",
"embeddedpostgres",
"enablements",
"enterprisemeta",
"errgroup",
"eventsourcemock",
"Failf",
"fatih",
"Formik",
"gitauth",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"gitsshkey",
"goarch",
"gographviz",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"goleak",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"gonet",
"gossh",
"gsyslog",
"GTTY",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"hashicorp",
"hclsyntax",
"httpapi",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"httpmw",
"idtoken",
"Iflag",
"incpatch",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"ipnstate",
"isatty",
"Jobf",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"Keygen",
"kirsle",
"Kubernetes",
"ldflags",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"magicsock",
"manifoldco",
"mapstructure",
"mattn",
"mitchellh",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"moby",
"namesgenerator",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"namespacing",
"netaddr",
"netip",
"netmap",
"netns",
"netstack",
"nettype",
"nfpms",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"nhooyr",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"nmcfg",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"nolint",
"nosec",
"ntqry",
"OIDC",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"oneof",
"opty",
"paralleltest",
"parameterscopeid",
"pqtype",
"prometheusmetrics",
"promhttp",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"protobuf",
"provisionerd",
"provisionerdserver",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"provisionersdk",
"ptty",
"ptys",
"ptytest",
"quickstart",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"reconfig",
feat: Add high availability for multiple replicas (#4555) * feat: HA tailnet coordinator * fixup! feat: HA tailnet coordinator * fixup! feat: HA tailnet coordinator * remove printlns * close all connections on coordinator * impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * Add replicas * Add DERP meshing to arbitrary addresses * Move packages to highavailability folder * Move coordinator to high availability package * Add flags for HA * Rename to replicasync * Denest packages for replicas * Add test for multiple replicas * Fix coordination test * Add HA to the helm chart * Rename function pointer * Add warnings for HA * Add the ability to block endpoints * Add flag to disable P2P connections * Wow, I made the tests pass * Add replicas endpoint * Ensure close kills replica * Update sql * Add database latency to high availability * Pipe TLS to DERP mesh * Fix DERP mesh with TLS * Add tests for TLS * Fix replica sync TLS * Fix RootCA for replica meshing * Remove ID from replicasync * Fix getting certificates for meshing * Remove excessive locking * Fix linting * Store mesh key in the database * Fix replica key for tests * Fix types gen * Fix unlocking unlocked * Fix race in tests * Update enterprise/derpmesh/derpmesh.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Rename to syncReplicas * Reuse http client * Delete old replicas on a CRON * Fix race condition in connection tests * Fix linting * Fix nil type * Move pubsub to in-memory for twenty test * Add comment for configuration tweaking * Fix leak with transport * Fix close leak in derpmesh * Fix race when creating server * Remove handler update * Skip test on Windows * Fix DERP mesh test * Wrap HTTP handler replacement in mutex * Fix error message for relay * Fix API handler for normal tests * Fix speedtest * Fix replica resend * Fix derpmesh send * Ping async * Increase wait time of template version jobd * Fix race when closing replica sync * Add name to client * Log the derpmap being used * Don't connect if DERP is empty * Improve agent coordinator logging * Fix lock in coordinator * Fix relay addr * Fix race when updating durations * Fix client publish race * Run pubsub loop in a queue * Store agent nodes in order * Fix coordinator locking * Check for closed pipe Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-10-17 13:43:30 +00:00
"replicasync",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"retrier",
"rpty",
feat: Add high availability for multiple replicas (#4555) * feat: HA tailnet coordinator * fixup! feat: HA tailnet coordinator * fixup! feat: HA tailnet coordinator * remove printlns * close all connections on coordinator * impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * fixup! impelement high availability feature * Add replicas * Add DERP meshing to arbitrary addresses * Move packages to highavailability folder * Move coordinator to high availability package * Add flags for HA * Rename to replicasync * Denest packages for replicas * Add test for multiple replicas * Fix coordination test * Add HA to the helm chart * Rename function pointer * Add warnings for HA * Add the ability to block endpoints * Add flag to disable P2P connections * Wow, I made the tests pass * Add replicas endpoint * Ensure close kills replica * Update sql * Add database latency to high availability * Pipe TLS to DERP mesh * Fix DERP mesh with TLS * Add tests for TLS * Fix replica sync TLS * Fix RootCA for replica meshing * Remove ID from replicasync * Fix getting certificates for meshing * Remove excessive locking * Fix linting * Store mesh key in the database * Fix replica key for tests * Fix types gen * Fix unlocking unlocked * Fix race in tests * Update enterprise/derpmesh/derpmesh.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Rename to syncReplicas * Reuse http client * Delete old replicas on a CRON * Fix race condition in connection tests * Fix linting * Fix nil type * Move pubsub to in-memory for twenty test * Add comment for configuration tweaking * Fix leak with transport * Fix close leak in derpmesh * Fix race when creating server * Remove handler update * Skip test on Windows * Fix DERP mesh test * Wrap HTTP handler replacement in mutex * Fix error message for relay * Fix API handler for normal tests * Fix speedtest * Fix replica resend * Fix derpmesh send * Ping async * Increase wait time of template version jobd * Fix race when closing replica sync * Add name to client * Log the derpmap being used * Don't connect if DERP is empty * Improve agent coordinator logging * Fix lock in coordinator * Fix relay addr * Fix race when updating durations * Fix client publish race * Run pubsub loop in a queue * Store agent nodes in order * Fix coordinator locking * Check for closed pipe Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-10-17 13:43:30 +00:00
"SCIM",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"sdkproto",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"sdktrace",
"Signup",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"slogtest",
"sourcemapped",
"Srcs",
"stdbuf",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"stretchr",
"STTY",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"stuntest",
"tanstack",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"tailbroker",
"tailcfg",
"tailexchange",
"tailnet",
"tailnettest",
"Tailscale",
"tbody",
"TCGETS",
"tcpip",
"TCSETS",
"templateversions",
"testdata",
"testid",
"testutil",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"tfexec",
"tfjson",
"tfplan",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"tfstate",
"thead",
"tios",
"tmpdir",
"tokenconfig",
"tparallel",
"trialer",
"trimprefix",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"tsdial",
"tslogger",
"tstun",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"turnconn",
"typegen",
"typesafe",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"unconvert",
"Untar",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"Userspace",
"VMID",
"walkthrough",
"weblinks",
"webrtc",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"wgcfg",
"wgconfig",
"wgengine",
"wgmonitor",
"wgnet",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"workspaceagent",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"workspaceagents",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"workspaceapp",
"workspaceapps",
"workspacebuilds",
"workspacename",
feat: Add workspace application support (#1773) * feat: Add app support This adds apps as a property to a workspace agent. The resource is added to the Terraform provider here: https://github.com/coder/terraform-provider-coder/pull/17 Apps will be opened in the dashboard or via the CLI with `coder open <name>`. If `command` is specified, a terminal will appear locally and in the web. If `target` is specified, the browser will open to an exposed instance of that target. * Compare fields in apps test * Update Terraform provider to use relative path * Add some basic structure for routing * chore: Remove interface from coderd and lift API surface Abstracting coderd into an interface added misdirection because the interface was never intended to be fulfilled outside of a single implementation. This lifts the abstraction, and attaches all handlers to a root struct named `*coderd.API`. * Add basic proxy logic * Add proxying based on path * Add app proxying for wildcards * Add wsconncache * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * fix: Race when writing to a closed pipe This is such an intermittent race it's difficult to track, but regardless this is an improvement to the code. * Add workspace route proxying endpoint - Makes the workspace conn cache concurrency-safe - Reduces unnecessary open checks in `peer.Channel` - Fixes the use of a temporary context when dialing a workspace agent * Add embed errors * chore: Refactor site to improve testing It was difficult to develop this package due to the embed build tag being mandatory on the tests. The logic to test doesn't require any embedded files. * Add test for error handler * Remove unused access url * Add RBAC tests * Fix dial agent syntax * Fix linting errors * Fix gen * Fix icon required * Adjust migration number * Fix proxy error status code * Fix empty db lookup
2022-06-04 20:13:37 +00:00
"wsconncache",
feat: Add Tailscale networking (#3505) * fix: Add coder user to docker group on installation This makes for a simpler setup, and reduces the likelihood a user runs into a strange issue. * Add wgnet * Add ping * Add listening * Finish refactor to make this work * Add interface for swapping * Fix conncache with interface * chore: update gvisor * fix tailscale types * linting * more linting * Add coordinator * Add coordinator tests * Fix coordination * It compiles! * Move all connection negotiation in-memory * Migrate coordinator to use net.conn * Add closed func * Fix close listener func * Make reconnecting PTY work * Fix reconnecting PTY * Update CI to Go 1.19 * Add CLI flags for DERP mapping * Fix Tailnet test * Rename ConnCoordinator to TailnetCoordinator * Remove print statement from workspace agent test * Refactor wsconncache to use tailnet * Remove STUN from unit tests * Add migrate back to dump * chore: Upgrade to Go 1.19 This is required as part of #3505. * Fix reconnecting PTY tests * fix: update wireguard-go to fix devtunnel * fix migration numbers * linting * Return early for status if endpoints are empty * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Update cli/server.go Co-authored-by: Colin Adler <colin1adler@gmail.com> * Fix frontend entites * Fix agent bicopy * Fix race condition for the last node * Fix down migration * Fix connection RBAC * Fix migration numbers * Fix forwarding TCP to a local port * Implement ping for tailnet * Rename to ForceHTTP * Add external derpmapping * Expose DERP region names to the API * Add global option to enable Tailscale networking for web * Mark DERP flags hidden while testing * Update DERP map on reconnect * Add close func to workspace agents * Fix race condition in upstream dependency * Fix feature columns race condition Co-authored-by: Colin Adler <colin1adler@gmail.com>
2022-09-01 01:09:44 +00:00
"wsjson",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"xerrors",
"xstate",
feat: Add provisionerdaemon to coderd (#141) * feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
"yamux"
],
"cSpell.ignorePaths": ["site/package.json", ".vscode/settings.json"],
"emeraldwalk.runonsave": {
"commands": [
{
"match": "database/queries/*.sql",
"cmd": "make gen"
},
{
"match": "provisionerd/proto/provisionerd.proto",
"cmd": "make provisionerd/proto/provisionerd.pb.go"
}
]
},
"eslint.workingDirectories": ["./site"],
"files.exclude": {
"**/node_modules": true
},
"search.exclude": {
"scripts/metricsdocgen/metrics": true,
"docs/api/*.md": true
},
// Ensure files always have a newline.
"files.insertFinalNewline": true,
"go.lintTool": "golangci-lint",
"go.lintFlags": ["--fast"],
"go.lintOnSave": "package",
"go.coverOnSave": true,
"go.coverageDecorator": {
"type": "gutter",
"coveredGutterStyle": "blockgreen",
"uncoveredGutterStyle": "blockred"
},
// The codersdk is used by coderd another other packages extensively.
// To reduce redundancy in tests, it's covered by other packages.
// Since package coverage pairing can't be defined, all packages cover
// all other packages.
"go.testFlags": ["-short", "-coverpkg=./..."],
// We often use a version of TypeScript that's ahead of the version shipped
// with VS Code.
2023-01-25 16:15:06 +00:00
"typescript.tsdk": "./site/node_modules/typescript/lib",
"grammarly.selectors": [
{
"language": "markdown",
"scheme": "file",
"pattern": "docs/contributing/frontend.md"
}
]
}