* feat(api): Add agent shutdown lifecycle states
* feat(agent): Add shutdown_script support
* feat(agent): Add shutdown_script timeout
* feat(site): Support new agent lifecycle states
---
Co-authored-by: Marcin Tojek <marcin@coder.com>
* Add git auth providers schema
* Pipe git auth providers to the schema
* Add git auth providers to the API
* Add gitauth endpoint to query authenticated state
* Add endpoint to query git state
* Use BroadcastChannel to automatically authenticate with Git
* Add error validation for submitting the create workspace form
* Fix panic on template dry-run
* Add tests for the template version Git auth endpoint
* Show error if no gitauth is configured
* Add gitauth to cliui
* Fix unused method receiver
* Fix linting errors
* Fix dbauthz querier test
* Fix make gen
* Add JavaScript test for git auth
* Fix bad error message
* Fix provisionerd test race
See https://github.com/coder/coder/actions/runs/4277960646/jobs/7447232814
* Fix requested changes
* Add comment to CreateWorkspacePageView
This PR adds the prometheus metric coderd_workspace_builds_total.
It measures the total number of workspace builds, along with a number of labels intended to be useful for an operator debugging a failed workspace build trying to discover the scope of the issue.
* feat: Add connection_timeout and troubleshooting_url to agent
This commit adds the connection timeout and troubleshooting url fields
to coder agents.
If an initial connection cannot be established within connection timeout
seconds, then the agent status will be marked as `"timeout"`.
The troubleshooting URL will be present, if configured in the Terraform
template, it can be presented to the user when the agent state is either
`"timeout"` or `"disconnected"`.
Fixes#4678
On Windows, files in tar archives were stored with Windows
path-separators resulting in them being individual files as opposed to
contained in a folder.
This commit ensures Unix-based paths (slash) are being used inside tar
archives.
Exmple of previous output:
```
/tmp/provisionerd673501182/images:
/tmp/provisionerd673501182/:
README.md
images
images\base.Dockerfile
images\java.Dockerfile
images\node.Dockerfile
main.tf
```
Fixes#2815
* feat: use app wildcards for apps if configured
* feat: relative_path -> subdomain
- rename relative_path -> subdomain when referring to apps
- migrate workspace_apps.relative_path to workspace_apps.subdomain
- upgrade coder/coder terraform module to 0.5.0
* chore: Add nix shell for simple development setup
This enables contributors using Nix to set up their environment with ease.
* improve nix style, flake output schema
* fix error message
* Update scripts/build_go_slim.sh
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Update scripts/build_go_slim.sh
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Add UTC default for timezone and remove unnecessary goreleaser dependency
* Skip TZ test if localtime does not exist
Co-authored-by: Charlie Moog <moogcharlie@gmail.com>
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Initial support for metadata in provisioner API and Terraform provisioner
* add support for nullable metadata fields
* handle metadata fields in provisionerd and API
* fix: Improve `coder server` shutdown procedure
This commit improves the `coder server` shutdown procedure so that all
triggers for shutdown do so in a graceful way without skipping any
steps.
We also improve cancellation and shutdown of services by ensuring
resources are cleaned up at the end.
Notable changes:
- We wrap `cmd.Context()` to allow us to control cancellation better
- We attempt graceful shutdown of the http server (`server.Shutdown`)
because it's less abrupt (compared to `shutdownConns`)
- All exit paths share the same shutdown procedure (except for early
exit)
- `provisionerd`s are now shutdown concurrently instead of one at a
time, the also now get a new context for shutdown because
`cmd.Context()` may be cancelled
- Resources created by `newProvisionerDaemon` are cleaned up
- Lifecycle `Executor` exits its goroutine on context cancellation
Fixes#3245
* fix: Change uses of t.Cleanup -> defer in test bodies
Mixing t.Cleanup and defer can lead to unexpected order of execution.
* fix: Ensure t.Cleanup is not aborted by require
* chore: Add helper annotations
* Pass workspace owner email address to provisioner
* Remove owner_email and owner_username fields from agent metadata
* Add Git environment variables to example templates
* Remove "owner_name" field from provisioner metadata, use username instead
* Remove Git configuration from most templates, add documentation
* Proofreading/typo fixes from @mafredri
* Update example templates to latest version of terraform-provider-coder
- Adds distinct exit statuses to the bootstrap scripts
- Makes the bootstrap scripts loop forever trying to download the coder agent
- Surfaces and logs the status codes returned by the download tool
* fix: Add `trap` to agent startup script to sleep on failure
The Docker Terraform provider removes containers immediately on exit, making
it difficult to debug a failed container start with Coder. This will sleep on
exit and output a friendly log, which should assist with debugging failures.
* Update provisionersdk/agent.go
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Update provisionersdk/agent.go
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
This commit extracts the existing hard-coded agent scripts to their own files and embeds them using go:embed.
We can more easily use e.g. shellcheck to validate our agent scripts.
This commit changes the `coder agent` path in `ps` listing from
`/tmp/tmp.coderwWs87Y/coder agent` to `./coder agent`.
The path is also updated to `/tmp/coder.wWs87Y`.
There were two options considered for turning `./coder agent` into
`coder agent`:
1. Run `exec -a coder /path/to/coder agent`
2. Run `PATH=/path/to:$PATH exec coder agent`
Option 1 is not supported by `dash`, and thus discarded.
Option 2 duplicates functionality in `coder agent` which _appends_ the
path, here we would want to _prepend_ it to ensure we're starting the
downloaded `coder` binary in case there is a binary with a conflicting
name on the system.
Fixes#2407
* feat: Add app support
This adds apps as a property to a workspace agent.
The resource is added to the Terraform provider here:
https://github.com/coder/terraform-provider-coder/pull/17
Apps will be opened in the dashboard or via the CLI
with `coder open <name>`. If `command` is specified, a
terminal will appear locally and in the web. If `target`
is specified, the browser will open to an exposed instance
of that target.
* Compare fields in apps test
* Update Terraform provider to use relative path
* Add some basic structure for routing
* chore: Remove interface from coderd and lift API surface
Abstracting coderd into an interface added misdirection because
the interface was never intended to be fulfilled outside of a single
implementation.
This lifts the abstraction, and attaches all handlers to a root struct
named `*coderd.API`.
* Add basic proxy logic
* Add proxying based on path
* Add app proxying for wildcards
* Add wsconncache
* fix: Race when writing to a closed pipe
This is such an intermittent race it's difficult to track,
but regardless this is an improvement to the code.
* fix: Race when writing to a closed pipe
This is such an intermittent race it's difficult to track,
but regardless this is an improvement to the code.
* fix: Race when writing to a closed pipe
This is such an intermittent race it's difficult to track,
but regardless this is an improvement to the code.
* fix: Race when writing to a closed pipe
This is such an intermittent race it's difficult to track,
but regardless this is an improvement to the code.
* Add workspace route proxying endpoint
- Makes the workspace conn cache concurrency-safe
- Reduces unnecessary open checks in `peer.Channel`
- Fixes the use of a temporary context when dialing a workspace agent
* Add embed errors
* chore: Refactor site to improve testing
It was difficult to develop this package due to the
embed build tag being mandatory on the tests. The logic
to test doesn't require any embedded files.
* Add test for error handler
* Remove unused access url
* Add RBAC tests
* Fix dial agent syntax
* Fix linting errors
* Fix gen
* Fix icon required
* Adjust migration number
* Fix proxy error status code
* Fix empty db lookup
* feat: build armv7 linux releases
* upload ARM binaries to bin
* Only build arm 7 for Linux
* add ARM agent scripts
* fix: specify armv7 to match tf provider
* append arm version to slim builds
* use descript armv7 binary
* Add script mappings for each architecture
Co-authored-by: kylecarbs <kyle@carberry.com>
* fix: Update GIT_COMMITTER_NAME to use username
This was a mistake when adding the committer fields 🤦.
* fix: Use environment variables for agent authentication
Using files led to situations where running "coder server --dev" would
break `gitssh`. This is applicable in a production environment too. Users
should be able to log into another Coder deployment from their workspace.
Users can still set "CODER_URL" if they'd like with agent env vars!
Connections could fail when massive payloads were transmitted.
This fixes an upstream bug in dRPC where the connection would
end with a context canceled if a message was too large.
This adds retransmission of completion and failures too. If
Coder somehow loses connection with a provisioner daemon,
upon the next connection the state will be properly reported.
This fixes the dependency tree by adding recursion. It
now finds indirect connections and associates it with
an agent.
An example is attached which surfaced this issue.
Windows (reasonably) detected our CLI as a virus due to the name
being "sshd" for VS Code support. See:
https://github.com/microsoft/vscode-remote-release/issues/5699
This disables monitoring for our binary prior to run, which
fixes our Windows example.
These were added under the impression that there was significant
user-experience impact if multiple resources share the same name.
This hasn't proven to be true yet, so figured we'd take this out
until it becomes necessary.
* Improve CLI documentation
* feat: Allow workspace resources to attach multiple agents
This enables a "kubernetes_pod" to attach multiple agents that
could be for multiple services. Each agent is required to have
a unique name, so SSH syntax is:
`coder ssh <workspace>.<agent>`
A resource can have zero agents too, they aren't required.
* Add tree view
* Improve table UI
* feat: Allow workspace resources to attach multiple agents
This enables a "kubernetes_pod" to attach multiple agents that
could be for multiple services. Each agent is required to have
a unique name, so SSH syntax is:
`coder ssh <workspace>.<agent>`
A resource can have zero agents too, they aren't required.
* Rename `tunnel` to `skip-tunnel`
This command was `true` by default, which causes
a confusing user experience.
* Add disclaimer about editing templates
* Add help to template create
* Improve workspace create flow
* Add end-to-end test for config-ssh
* Improve testing of config-ssh
* Fix workspace list
* Fix config ssh tests
* Update cli/configssh.go
Co-authored-by: Cian Johnston <public@cianjohnston.ie>
* Fix requested changes
* Remove socat requirement
* Fix resources not reading in TTY
Co-authored-by: Cian Johnston <public@cianjohnston.ie>
This enables a "kubernetes_pod" to attach multiple agents that
could be for multiple services. Each agent is required to have
a unique name, so SSH syntax is:
`coder ssh <workspace>.<agent>`
A resource can have zero agents too, they aren't required.
This update exposes the workspace name and owner, and changes
authentication methods to be explicit. Implicit authentication
added unnecessary complexity and introduced inconsistency.
* ci: Update DataDog GitHub branch to fallback to GITHUB_REF
This was detecting branches, but not our "main" branch before.
Hopefully this fixes it!
* Add basic Terraform Provider
* Rename post files to upload
* Add tests for resources
* Skip instance identity test
* Add tests for ensuring agent get's passed through properly
* Fix linting errors
* Add echo path
* Fix agent authentication
* fix: Convert all jobs to use a common resource and agent type
This enables a consistent API for project import and provisioned resources.
* Add "coder_workspace" data source
* feat: Remove magical parameters from being injected
This is a much cleaner abstraction. Explicitly declaring the user
parameters for each provisioner makes for significantly simpler
testing.
* feat: Add graceful exits to provisionerd
Terraform (or other provisioners) may need to cleanup state, or
cancel actions before exit. This adds the ability to gracefully
exit provisionerd.
* Fix cancel error check
* feat: Add destroy to workspace provision job
This enables the full flow of create/update/delete.
* ci: Update DataDog GitHub branch to fallback to GITHUB_REF
This was detecting branches, but not our "main" branch before.
Hopefully this fixes it!
* Add basic Terraform Provider
* Rename post files to upload
* Add tests for resources
* Skip instance identity test
* Add tests for ensuring agent get's passed through properly
* Fix linting errors
* Add echo path
* Fix agent authentication
* fix: Convert all jobs to use a common resource and agent type
This enables a consistent API for project import and provisioned resources.
* Add "coder_workspace" data source
* feat: Remove magical parameters from being injected
This is a much cleaner abstraction. Explicitly declaring the user
parameters for each provisioner makes for significantly simpler
testing.
* feat: Add graceful exits to provisionerd
Terraform (or other provisioners) may need to cleanup state, or
cancel actions before exit. This adds the ability to gracefully
exit provisionerd.
* Fix cancel error check
* ci: Update DataDog GitHub branch to fallback to GITHUB_REF
This was detecting branches, but not our "main" branch before.
Hopefully this fixes it!
* Add basic Terraform Provider
* Rename post files to upload
* Add tests for resources
* Skip instance identity test
* Add tests for ensuring agent get's passed through properly
* Fix linting errors
* Add echo path
* Fix agent authentication
* fix: Convert all jobs to use a common resource and agent type
This enables a consistent API for project import and provisioned resources.
* Add "coder_workspace" data source
* feat: Remove magical parameters from being injected
This is a much cleaner abstraction. Explicitly declaring the user
parameters for each provisioner makes for significantly simpler
testing.
* feat: Add agent authentication based on instance ID
Each cloud has it's own unique instance identity signatures, which
can be used for zero-token authentication. This change adds support
for tracking by "instance_id", and automatically authenticating
with Google Cloud.
* Add test for CLI
* Fix workspace agent request name
* Fix race with adding to wait group
* Fix name of instance identity token
* Refactor parameter parsing to return nil values if none computed
* Refactor parameter to allow for hiding redisplay
* Refactor parameters to enable schema matching
* Refactor provisionerd to dynamically update parameter schemas
* Refactor job update for provisionerd
* Handle multiple states correctly when provisioning a project
* Add project import job resource table
* Basic creation flow works!
* Create project fully works!!!
* Only show job status if completed
* Add create workspace support
* Replace Netflix/go-expect with ActiveState
* Fix linting errors
* Use forked chzyer/readline
* Add create workspace CLI
* Add CLI test
* Move jobs to their own APIs
* Remove go-expect
* Fix requested changes
* Skip workspacecreate test on windows
* refactor: Rename ProjectParameter to ProjectVersionParameter
This was confusing with ParameterValue before. It still is a bit,
but this should help distinguish scope.
* Add project version resources table
* Allow project parameters to optionally have user and workspace
* Add dry run for provisioners
* Add resource detection on project import
* chore: Rename ProjectHistory to ProjectVersion
Version more accurately represents version storage. This
forks from the WorkspaceHistory name, but I think it's
easier to understand Workspace history.
* Rename files
* Standardize tests a bit more
* Remove Server struct from coderdtest
* Improve test coverage for workspace history
* Fix linting errors
* Fix coderd test leak
* Fix coderd test leak
* Improve workspace history logs
* Standardize test structure for codersdk
* Fix linting errors
* Fix WebSocket compression
* Update coderd/workspaces.go
Co-authored-by: Bryan <bryan@coder.com>
* Add test for listing project parameters
* Cache npm dependencies with setup node
* Remove windows npm cache key
Co-authored-by: Bryan <bryan@coder.com>
* feat: Add parameter querying to the API
* feat: Add streaming endpoint for workspace history
Enables a buildlog-like flow for reading job output.
* Fix empty parameter source and destination
* Add comment for usage of workspace history logs endpoint
* feat: Add history middleware parameters
These will be used for streaming logs, checking status,
and other operations related to workspace and project
history.
* refactor: Move all HTTP routes to top-level struct
Nesting all structs behind their respective structures
is leaky, and promotes naming conflicts between handlers.
Our HTTP routes cannot have conflicts, so neither should
function naming.
* Add provisioner daemon routes
* Add periodic updates
* Skip pubsub if short
* Return jobs with WorkspaceHistory
* Add endpoints for extracting singular history
* The full end-to-end operation works
* fix: Disable compression for websocket dRPC transport (#145)
There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing.
This is just tracking some experimentation to fix that race condition
## Run results: ##
- Run 1: peer test failure
- Run 2: peer test failure
- Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45
```
status code 412: The provided project history is running. Wait for it to complete importing!`
```
- Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176
```
workspacehistory_test.go:122:
Error Trace: workspacehistory_test.go:122
Error: Condition never satisfied
Test: TestWorkspaceHistory/CreateHistory
```
- Run 5: peer failure
- Run 6: Pass ✅
- Run 7: Peer failure
## Open Questions: ##
### Is `dRPC` or `websocket` at fault for the data race?
It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](f6e369438f/drpcwire/error.go (L15)) - so `dRPC` has created this buffer and owns it.
From `dRPC`'s perspective, the callstack looks like this:
- [`sendPacket`](f6e369438f/drpcstream/stream.go (L253))
- [`writeFrame`](f6e369438f/drpcwire/writer.go (L65))
- [`AppendFrame`](f6e369438f/drpcwire/packet.go (L128))
- with finally the data race happening here:
```go
// AppendFrame appends a marshaled form of the frame to the provided buffer.
func AppendFrame(buf []byte, fr Frame) []byte {
...
out := buf
out = append(out, control). // <---------
```
This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame.
Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: f6e369438f/drpcwire/writer.go (L73)
However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](8dee580a7f/write.go (L180)), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](a1a9cfc821/flate/stateless.go (L94)), which is where get our race.
In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly).
### Why does cloning on `Read` fail?
Get a bunch of errors like:
```
2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
```
# UPDATE:
We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now:
- Run 1: ✅
- Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338
- Run 3: ✅
- Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168
- Run 5: ✅
* fix: Remove race condition with acquiredJobDone channel (#148)
Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83
__Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs.
__Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up.
* fix: Bump up workspace history timeout (#149)
This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32
Looking at the timing of the test:
```
t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply
t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
workspacehistory_test.go:122:
Error Trace: workspacehistory_test.go:122
Error: Condition never satisfied
Test: TestWorkspaceHistory/CreateHistory
```
It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout.
Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here.
In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that.
Co-authored-by: Bryan <bryan@coder.com>
This brings an async service that parses and
provisions to life! It's separated from coderd
intentionally to allow for simpler testing.
Integration with coderd will come in another PR!