Commit Graph

680 Commits

Author SHA1 Message Date
Kyle Carberry 2b41ac697e
ci: Add DataDog tracing (#163) 2022-02-04 18:24:58 -06:00
Bryan 94f71feeba
refactor: Add storybook + initial story (#118)
This hooks up `storybook`, which the front-end team has enjoyed using in the v1 codebase - it makes it quick and easy to view and test components in isolation.

The `<LoadingButton />` has a simple story added now, so if you run `yarn storybook`, you can preview it in various states:

![2022-01-31 19 24 24](https://user-images.githubusercontent.com/88213859/151908656-27dac0a8-9c6e-4353-ad25-3eafee979bd4.gif)

This will be helpful as we bring more front-end devs to help build v2 out.
2022-02-04 08:36:58 -08:00
Kyle Carberry fb020a5d1b
fix: Update pion/webrtc to fix ICE negotiation race (#153)
* Add trace logging for pion (dtls,ice,pc)

* Temporarily disable postgres tests to spend more cycles on mock tests

* experiment: Add trace logging for WebRTC offer and answer

* Use forked pion/webrtc

Co-authored-by: Bryan Phelps <bryan@coder.com>
2022-02-03 22:10:21 +00:00
Kyle Carberry e75bde4e31
feat: Add provisionerdaemon to coderd (#141)
* feat: Add history middleware parameters

These will be used for streaming logs, checking status,
and other operations related to workspace and project
history.

* refactor: Move all HTTP routes to top-level struct

Nesting all structs behind their respective structures
is leaky, and promotes naming conflicts between handlers.

Our HTTP routes cannot have conflicts, so neither should
function naming.

* Add provisioner daemon routes

* Add periodic updates

* Skip pubsub if short

* Return jobs with WorkspaceHistory

* Add endpoints for extracting singular history

* The full end-to-end operation works

* fix: Disable compression for websocket dRPC transport (#145)

There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and `dRPC` believe they own the `[]byte` being sent between them. This can lead to data races in which both `dRPC` and the websocket are writing.

This is just tracking some experimentation to fix that race condition.

## Run results: ##
- Run 1: peer test failure
- Run 2: peer test failure
- Run 3: `TestWorkspaceHistory/CreateHistory`  - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45
```
status code 412: The provided project history is running. Wait for it to complete importing!
```
- Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176
```
    workspacehistory_test.go:122: 
        	Error Trace:	workspacehistory_test.go:122
        	Error:      	Condition never satisfied
        	Test:       	TestWorkspaceHistory/CreateHistory
```
- Run 5: peer failure
- Run 6: Pass  
- Run 7: Peer failure

## Open Questions: ##

### Is `dRPC` or `websocket` at fault for the data race?

It looks like this condition specifically happens when `dRPC` decides to `SendError`. This constructs a new byte payload from [`MarshalError`](f6e369438f/drpcwire/error.go (L15)) - so `dRPC` has created this buffer and owns it.

From `dRPC`'s perspective, the callstack looks like this:
- [`sendPacket`](f6e369438f/drpcstream/stream.go (L253))
  - [`writeFrame`](f6e369438f/drpcwire/writer.go (L65))
    - [`AppendFrame`](f6e369438f/drpcwire/packet.go (L128))
      - with the data race finally happening here:
```go
// AppendFrame appends a marshaled form of the frame to the provided buffer.
func AppendFrame(buf []byte, fr Frame) []byte {
...
	out := buf
	out = append(out, control) // <---------
```

This should be fine, since `dRPC` created this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers onto it to create a proper frame.

Once `dRPC` is done writing, it _hangs onto the buffer and resets it here_: f6e369438f/drpcwire/writer.go (L73)

However... once the websocket implementation gets the buffer, it runs a `statelessDeflate` [here](8dee580a7f/write.go (L180)), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](a1a9cfc821/flate/stateless.go (L94)), which is where we get our race.

In the case where the `[]byte` isn't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly).
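
To make the double ownership concrete, here is a minimal, hypothetical repro of the pattern (not the actual coder or `dRPC` code): a producer reuses one scratch buffer while a consumer mutates the received frame in place, and `go run -race` flags the conflict.

```go
package main

// Hypothetical repro, not the actual code: the producer reuses a single
// scratch buffer (like dRPC's writer), while the consumer mutates the
// bytes it received in place (like the websocket's statelessDeflate).
func main() {
	scratch := make([]byte, 0, 64)
	frames := make(chan []byte, 8)
	done := make(chan struct{})

	go func() {
		defer close(done)
		for frame := range frames {
			frame[0] ^= 0xff // "compress" in place
		}
	}()

	for i := 0; i < 1000; i++ {
		// Reuses the same backing array the consumer may still be mutating.
		frames <- append(scratch[:0], byte(i))
	}
	close(frames)
	<-done
}
```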

### Why does cloning on `Read` fail?

We get a bunch of errors like:
```
2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
```

# UPDATE:

We decided we could disable websocket compression, which avoids the race because the in-place `deflate` operation is no longer run (a sketch of the accept-side change follows the run list below). Trying that out now:

- Run 1:  
- Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338
- Run 3:  
- Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168
- Run 5: 
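
For reference, this is roughly what disabling compression looks like on the accept side, assuming the `nhooyr.io/websocket` package; the handler shape here is illustrative, not the actual coderd route:

```go
package coderd

import (
	"net/http"

	"nhooyr.io/websocket"
)

// Illustrative accept-side change: with CompressionDisabled, the websocket
// layer never runs stateless deflate over a buffer dRPC still owns.
func handleDRPC(w http.ResponseWriter, r *http.Request) {
	conn, err := websocket.Accept(w, r, &websocket.AcceptOptions{
		CompressionMode: websocket.CompressionDisabled,
	})
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer conn.Close(websocket.StatusNormalClosure, "")
	// ... wrap with websocket.NetConn and hand off to the dRPC transport ...
}
```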

* fix: Remove race condition with acquiredJobDone channel (#148)

Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83

__Issue:__ There is a race on the `p.acquiredJobDone` channel - in particular, there can be a case where we're waiting on the channel to finish (in `close`) with `<-p.acquiredJobDone`, but in parallel, an `acquireJob` could've been started, which would create a new channel for `p.acquiredJobDone`. There is a similar race in `close(..)`ing the channel, which also came up in test runs.

__Fix:__ Instead of recreating the channel every time, we can use a `sync.WaitGroup` to accomplish the same functionality - a semaphore to make `close` wait for the current job to wrap up.
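
A minimal sketch of that shape (names are hypothetical, not the actual provisionerd code):

```go
package provisionerd

import "sync"

// Hypothetical sketch of the fix: a WaitGroup instead of a recreated
// channel, so close simply waits for any in-flight job to wrap up.
type daemon struct {
	activeJob sync.WaitGroup
}

func (d *daemon) acquireJob(run func()) {
	d.activeJob.Add(1) // one outstanding job
	go func() {
		defer d.activeJob.Done()
		run()
	}()
}

func (d *daemon) close() {
	// Semaphore-style wait: returns once the current job (if any) is done,
	// with no channel to race on recreating or double-closing.
	d.activeJob.Wait()
}
```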

* fix: Bump up workspace history timeout (#149)

This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32

Looking at the timing of the test:
```
    t.go:56: 2022-02-02 21:33:21.964 [DEBUG]	(terraform-provisioner)	<provision.go:139>	ran apply
    t.go:56: 2022-02-02 21:33:21.991 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.050 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.090 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.140 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.195 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.240 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    workspacehistory_test.go:122: 
        	Error Trace:	workspacehistory_test.go:122
        	Error:      	Condition never satisfied
        	Test:       	TestWorkspaceHistory/CreateHistory
```

It appears that the `terraform apply` job had just finished - with less than a second to spare before our `require.Eventually` timed out - but there was still work to be done (i.e., collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout.

Note that in the setup for this test, there is a similar project-history wait that uses 15s, so I borrowed that value here.
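
Concretely, the change is just the `waitFor` argument to testify's `require.Eventually`; the condition below is a placeholder, not the actual test helper:

```go
package coderd_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Placeholder standing in for the real poll against the workspace
// history endpoint; hypothetical, not the actual helper.
var workspaceHistoryCompleted func() bool

func TestWorkspaceHistory_CreateHistory(t *testing.T) {
	// Was: require.Eventually(t, workspaceHistoryCompleted, 5*time.Second, 25*time.Millisecond)
	// Now: give terraform the same 15s the project-history wait in setup uses.
	require.Eventually(t, workspaceHistoryCompleted, 15*time.Second, 25*time.Millisecond)
}
```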

In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that.

Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
Bryan 56b3ec18f4
chore: Fix dependabot path after moving package.json -> site/package.json (#133)
Missed in #128 - need to update the `dependabot.yml` to point to the correct package.json, which was moved from `/package.json` -> `/site/package.json`
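
The fix is a one-line `directory` change in `.github/dependabot.yml`; the schedule shown here is illustrative:

```yaml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/site" # was "/", before package.json moved
    schedule:
      interval: "daily"
```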
2022-02-01 14:44:01 -08:00
Bryan 78e652a268
refactor: Move package.json and other front-end collateral into 'site' (#128)
This refactors the front-end collateral to all live within `site` - so no `package.json` at the root.

The reason we had this initially is that the jest test run and NextJS actually require _two_ different `tsconfig`s - Next needs `jsx:"preserve"`, while jest needs `jsx:"react"` - and we were using `tsconfig`s at different levels of the hierarchy to manage this.

I changed this behavior to still use two different `tsconfig.json`s, which is mandatory - but just side-by-side in `site`.

Once that was fixed, it was easy to move everything into `site`.

Follow up from: https://github.com/coder/coder/pull/118#discussion_r796244577
2022-02-01 13:34:43 -08:00
Bryan 38867b0ad3
fix: Re-enable parallel run of Postgres-backed tests (#119)
@kylecarbs and I were debugging a gnarly postgres issue over the weekend, and unfortunately it looks like it is still coming up occasionally: https://github.com/coder/coder/runs/5014420662?check_suite_focus=true#step:8:35 - so I thought this might be a good testing Monday task.

Intermittently, the test would fail with something like a `401` - invalid e-mail, or a `409` - initial user already created. This was quite surprising, because the tests are designed to spin up their own, isolated database.

We tried a few things to debug this...

## Attempt 1: Log out the generated port numbers when running the docker image.

Based on the errors, it seemed like one test must be connecting to another test's database - that would explain why we'd get these conflicts! However, logging the port number that came from docker always showed a unique number... and we couldn't find evidence of one test connecting to another's database.

## Attempt 2: Store the database in a unique, temporary folder.

@kylecarbs and I found that there was a [volume](a83005b407/11/alpine/Dockerfile (L155)) for the postgres data... so @kylecarbs implemented mounting the volume to a unique, per-test temporary folder in https://github.com/coder/coder/pull/89

It sounded really promising... but unfortunately we hit the issue again!

## Attempt 3: This PR

After we hit the failure again, we noticed in the `docker ps` logs something quite strange:
![image](https://user-images.githubusercontent.com/88213859/151913133-522a6c2e-977a-4a65-9315-804531ab7d77.png)

When the docker image is run, it creates two port bindings: an IPv4 and an IPv6 one. These _should be the same_ - but surprisingly, they can sometimes be different. It isn't deterministic, and it seems to be more common when there are multiple containers running. Importantly, __they can overlap__, as in the above image.

Turns out, it seems this is a docker bug: https://github.com/moby/moby/issues/42442 - which may be fixed in newer versions.

To work around this bug, we have to manipulate the port bindings (like you would with `-p`) at the command line. We can do this with `docker`/`dockertest`, but it means we have to get a free port ahead of time to know which port to map.
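
A hedged sketch of that workaround with `ory/dockertest` (names, tag, and env are illustrative): reserve a free host port up front, then bind `5432/tcp` to it explicitly so the IPv4 and IPv6 mappings can't diverge.

```go
package database_test

import (
	"net"
	"testing"

	"github.com/ory/dockertest/v3"
	"github.com/ory/dockertest/v3/docker"
	"github.com/stretchr/testify/require"
)

// Reserve a free TCP port, release it, and hand it to docker explicitly.
func getFreePort(t *testing.T) string {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	require.NoError(t, err)
	defer listener.Close()
	_, port, err := net.SplitHostPort(listener.Addr().String())
	require.NoError(t, err)
	return port
}

func runPostgres(t *testing.T, pool *dockertest.Pool) *dockertest.Resource {
	resource, err := pool.RunWithOptions(&dockertest.RunOptions{
		Repository: "postgres",
		Tag:        "11",
		Env:        []string{"POSTGRES_PASSWORD=postgres"},
		// Pin the host port ourselves instead of letting docker assign
		// possibly mismatched IPv4/IPv6 ephemeral ports (moby#42442).
		PortBindings: map[docker.Port][]docker.PortBinding{
			"5432/tcp": {{HostIP: "0.0.0.0", HostPort: getFreePort(t)}},
		},
	})
	require.NoError(t, err)
	return resource
}
```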

With that fix in - the `docker ps` is a little more sane:
![image](https://user-images.githubusercontent.com/88213859/151913432-5f86bc09-8604-4355-ad49-0abeaf8cc0fe.png)

...and hopefully means we can safely run the containers in parallel again.
2022-02-01 09:22:02 -08:00
Jonathan Yu 515e55db33
chore: cancel concurrent builds with native feature (#116)
Use the native 'concurrency' configuration feature to cancel
concurrent builds, rather than the cancel-workflow-action.
This also allows us to reduce permissions for the workflow.
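
The native feature is a few lines of workflow YAML; the group key below is the common pattern, not necessarily the exact one used in #116:

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```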
2022-01-31 16:13:33 -08:00
Kyle Carberry 46d2550eda
ci: Remove code coverage step for Dependabot (#107) 2022-01-31 03:04:57 +00:00
Jonathan Yu 3fccfc5ef3
chore: add Stale to close old pull requests/issues (#98)
Add configuration for the Probot Stale bot, in order to close old
pull requests and issues.
2022-01-30 18:59:28 -08:00
Jonathan Yu 34fc62def5
chore: add Dependabot configuration (#97) 2022-01-30 18:53:37 -08:00
Kyle Carberry 9db5fb0952
refactor: Improve handshake resiliency of peer (#95)
* fix: Synchronize peer logging with a channel

We were depending on the close mutex to properly
report connection state. This ensures the RTC
connection is properly closed before returning.

* Disable pion logging

* Remove buffer

* Try ICE servers

* Remove flushed

* Add diagram explaining handshake

* Fix candidate accept ordering

* Add debug logging to peerbroker

* Fix send ordering

* Lock adding ICE candidate (see the sketch after this list)

* Add test for negotiating out of order

* Reduce connection to a single negotiation channel

* Improve test times by pre-installing Terraform

* Lock remote session description being applied

* Organize conn

* Revert to multi-channel setup

* Properly close ICE gatherer

* Improve comments

* Try removing buffered candidates

* Buffer local and remote messages

* Log dTLS transport state

* Add pion logging
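
The "Lock adding ICE candidate" / "Buffer local and remote messages" steps follow a standard WebRTC pattern; here's a hedged sketch (not the actual `peer` package) of holding remote candidates until the remote description is applied:

```go
package peer

import (
	"sync"

	"github.com/pion/webrtc/v3"
)

// Sketch only: queue remote candidates that arrive before the remote
// description, then flush them in order once it lands.
type candidateQueue struct {
	mu        sync.Mutex
	pc        *webrtc.PeerConnection
	pending   []webrtc.ICECandidateInit
	remoteSet bool
}

func (q *candidateQueue) AddCandidate(c webrtc.ICECandidateInit) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if !q.remoteSet {
		q.pending = append(q.pending, c) // out-of-order arrival: hold it
		return nil
	}
	return q.pc.AddICECandidate(c)
}

func (q *candidateQueue) SetRemoteDescription(sd webrtc.SessionDescription) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if err := q.pc.SetRemoteDescription(sd); err != nil {
		return err
	}
	q.remoteSet = true
	for _, c := range q.pending {
		if err := q.pc.AddICECandidate(c); err != nil {
			return err
		}
	}
	q.pending = nil
	return nil
}
```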
2022-01-30 20:11:18 -06:00
Kyle Carberry a7d6f4b673
ci: Lock PostgreSQL database creation (#94)
There have been race conditions when multiple instances
are created at once. This is an attempt to fix!
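
One plausible shape for such a lock (hypothetical, not necessarily the approach taken here) is a PostgreSQL advisory lock around database creation:

```go
package database

import (
	"context"
	"database/sql"
)

// Hypothetical sketch: serialize CREATE DATABASE across parallel CI
// instances with a session-level advisory lock. The name is assumed to
// be trusted, test-generated input.
func createDatabase(ctx context.Context, db *sql.DB, name string) error {
	// Advisory locks are per-session, so pin a single connection.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	const lockID = 1 // arbitrary key shared by every creator
	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", lockID); err != nil {
		return err
	}
	defer conn.ExecContext(ctx, "SELECT pg_advisory_unlock($1)", lockID)

	_, err = conn.ExecContext(ctx, "CREATE DATABASE "+name)
	return err
}
```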
2022-01-29 21:16:34 -06:00
Kyle Carberry f9e594fbad
ci: Run PostgreSQL with a scratch directory to improve CI durability (#89)
When running in parallel before, multiple PostgreSQL containers would
unintentionally interfere with each other's data. This ensures
the containers have separate data, and don't create a volume.

🌮 @bryphe-coder for the idea!
2022-01-29 18:39:59 -06:00
Kyle Carberry b3c5bb3576
feat: Compute project build parameters (#82)
* feat: Add parameter and jobs database schema

This modifies a prior migration, which is typically forbidden,
but because we're pre-production I felt grouping
would be helpful to future contributors.

This adds database functions that are required for the provisioner
daemon and job queue logic.

* feat: Compute project build parameters

Adds a projectparameter package to compute build-time project
values for a provided scope.

This package will be used to return which variables are being
used for a build, and can visually indicate the hierarchy to
a user.

* Fix terraform provisioner

* Improve naming, abstract inject to consume scope

* Run CI on all branches
2022-01-29 17:45:42 -06:00
Kyle Carberry b503c8b099
feat: Add parameter and jobs database schema (#81)
* feat: Add parameter and jobs database schema

This modifies a prior migration, which is typically forbidden,
but because we're pre-production I felt grouping
would be helpful to future contributors.

This adds database functions that are required for the provisioner
daemon and job queue logic.

* Add comment to acquire provisioner job query

* PostgreSQL hates running in parallel
2022-01-29 17:38:32 -06:00
Kyle Carberry 5d7112f0d7
ci: Pin the golangci-lint version to prevent breakage (#62)
* ci: Pin the golangci-lint version to prevent breakage

The main branch broke because golangci-lint released a new version.
This pins it, so hopefully it never happens again!

* Fix version string
2022-01-25 10:04:25 -06:00
Kyle Carberry 50d8151995
ci: Run tests using PostgreSQL database and mock (#49)
* ci: Run tests using PostgreSQL database and mock

This allows us to use the mock database for quick iterative testing,
and have confidence from CI using a real PostgreSQL database.

PostgreSQL tests are only run on Linux. They are *really* slow on macOS
and Windows runners, and don't provide much additional confidence (see the gating sketch after this list).

* Only run PostgreSQL tests once for speed

* Fix race condition of log after close

Not all resources were cleaned up immediately after a peer connection was
closed. Prior to this change, DataChannels could have a goroutine exit after Close().

* Fix comment
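
The gating mentioned above ("Skip pubsub if short", Linux-only PostgreSQL) commonly looks like the sketch below; names are illustrative, not the actual test:

```go
package database_test

import (
	"runtime"
	"testing"
)

// Illustrative gate for a PostgreSQL-backed test: -short runs and
// non-Linux platforms fall back to the mock database path instead.
func TestPubsub_PostgreSQL(t *testing.T) {
	t.Parallel()
	if testing.Short() {
		t.Skip("skipping PostgreSQL test in -short mode; the mock covers this path")
	}
	if runtime.GOOS != "linux" {
		t.Skip("PostgreSQL tests only run on Linux in CI")
	}
	// ... spin up the dockertest container and run assertions here ...
}
```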
2022-01-22 21:58:26 +00:00
Bryan 7b9347bce6
chore: Add linter for typescript code (#45)
- Add and configure `eslint`
- Add to build pipeline
- Fix lint failures
2022-01-20 22:00:14 -08:00
Kyle Carberry a461bc1454
test: Increase disconnectTimeout to reduce test flakes (#26)
* test: Increase disconnectTimeout to reduce test flakes

WebRTC uses UDP, which means a network connection is never truly open or closed. It uses timeouts to determine connection state; on a slow CI runner, these timeouts could be reached. Increasing this timeout should reduce flakes, but is unlikely to remove this flake entirely (see the sketch after this list).

* Fix close after offline

* Run tests in parallel
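
For context, pion exposes these timeouts via its `SettingEngine`; a hedged sketch with illustrative values, not the ones this PR landed on:

```go
package peer

import (
	"time"

	"github.com/pion/webrtc/v3"
)

// Hedged sketch: raise disconnectedTimeout so a slow CI runner isn't
// misreported as a dropped connection.
func newPeerConnection() (*webrtc.PeerConnection, error) {
	settings := webrtc.SettingEngine{}
	settings.SetICETimeouts(
		10*time.Second, // disconnectedTimeout: headroom for slow CI runners
		30*time.Second, // failedTimeout
		2*time.Second,  // keepAliveInterval
	)
	api := webrtc.NewAPI(webrtc.WithSettingEngine(settings))
	return api.NewPeerConnection(webrtc.Configuration{})
}
```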
2022-01-14 10:12:07 -06:00
Bryan 423611b001
chore: Add initial jest tests + code coverage (#13)
- Adds initial infra for running front-end tests (`jest`, `ts-jest`, `jest.config.js`, etc)
- Adds codecov integration for front-end code
2022-01-13 18:48:23 -08:00
Kyle Carberry 550c4fbbb3
ci: Run tests 3 times to reduce flakes (#20)
* ci: Run tests 10 times to reduce flakes

* Reduce runs to 3

* Use forked dependency

* Fix formatting
2022-01-13 12:05:39 -06:00
Bryan 92710ede54
chore: Add caching for node_modules (#19) 2022-01-13 09:11:52 -08:00
Bryan ace89161fb
feat(cdr): Initial UI scaffolding
This is testing out [Approach 3](https://www.notion.so/coderhq/Workspaces-v2-Initial-UI-Scaffolding-3b07d2847eed48839a7e6f0f2bb9bf56#56256f25d2954897a8ee315f0820cedd) in the UI scaffolding RFC.

Fixes https://github.com/coder/coder/issues/11

The folder structure looks like:
- `site`
    - `components` (buttons, empty state, etc)
    - `pages` (large sections of UI -> composition of components)
    - `theme` (files defining our palette)

Several components could be brought in essentially unmodified:
- `SplitButton`
- `EmptyState`
- `Footer`
- All the icons / logos
- Theming (removed several items that aren't necessary yet, though)

Other components had more coupling, and need more refactoring:
- `NavBar`
- `Confetti`

Current State:

![2022-01-06 17 16 31](https://user-images.githubusercontent.com/88213859/148475521-96e080cc-1d33-4b8e-a434-29e388936e3f.gif)

For a full working app, there's potentially a lot more to bring in:
- User / Account Settings Stuff
- Users Page
- Organizations Page
(and all the supporting dependencies)
2022-01-12 14:25:12 -08:00
Kyle Carberry 53cfa8a45a
feat: Create broker for negotiating connections (#14)
* feat: Create broker for negotiating connections

WebRTC requires an exchange of encryption keys and network hops to connect. This package pipes that exchange over gRPC. It will be used by all connecting clients and agents.

* Regenerate protobuf definition

* Cache Go build and test

* Fix gRPC language with dRPC

Co-authored-by: Bryan <bryan@coder.com>
2022-01-11 09:28:41 -06:00
Kyle Carberry 7c260f88d1
feat: Create provisioner abstraction (#12)
* feat: Create provisioner abstraction

Creates a provisioner abstraction that takes prior art from the Terraform plugin system. It's safe to assume this code will change a lot when it becomes integrated with provisionerd.

Closes #10.

* Ignore generated files in diff view

* Check for unstaged file changes

* Install protoc-gen-go

* Use proper drpc plugin version

* Fix serve closed pipe

* Install sqlc with curl for speed

* Fix install command

* Format CI action

* Add linguist-generated and closed pipe test

* Cleanup code from comments

* Add dRPC comment

* Add Terraform installer for cross-platform

* Build provisioner tests on Linux only
2022-01-08 11:24:02 -06:00
Bryan 2769f4c2e0
chore: Formatting - bring .prettierrc over from cdr/m (#9)
This brings over the same `.prettierrc` we used in `cdr/m`, and runs the formatter with the new settings.
2022-01-06 13:02:05 -08:00
Bryan 3a3161aa63
chore: Add semantic pull requests (#5)
Add a semantic-pull-requests configuration (like we have for `coder/m`), to validate commit messages.
2022-01-06 12:49:40 -08:00
Kyle Carberry a6b2dd76a0
chore: Add golangci-lint and codecov (#3)
* chore: Add golangci-lint and codecov

* Use consistent file names

* Format settings.json

* Add golangci-lint and codecov GitHub Actions

* Add base Go file for linting

* Add test coverage
2022-01-05 08:48:56 -06:00
Bryan 78973eaf3f
chore: Initial GHA workflow (#1)
This implements an initial GitHub Actions workflow for us - to be run on PRs and on `main` commits.

This just implements a really simple `style/fmt` check - running `prettier` on the `README.md`.

I assumed we'll stick with using a top-level `Makefile` for commands like in `m` and `link` - but open to alternatives, too!

Since I was adding a `package.json` and `node_modules` for this, I realized we were missing `.gitignore`s, so I added a subset of the ignore files from `coder/m`
2022-01-03 18:54:27 -08:00