216 lines
12 KiB
Plaintext
216 lines
12 KiB
Plaintext
package ui
|
|
|
|
templ About() {
|
|
<div>
|
|
<h1>about</h1>
|
|
<p>
|
|
Converge is a utility for troubleshooting builds on continuous integration servers.
|
|
It solves a common problem where the cause of job failure is difficult to determine.
|
|
This is complicated further by the fact that build jobs are usually run on a build
|
|
farm where there is no access to the build agents or in more modern envrionments when
|
|
jobs are run in ephemeral containers.
|
|
</p>
|
|
|
|
<p>
|
|
With Converge it is possible to get remote shell access to such jobs. This works
|
|
by configuring the build job to connect to a Converge server using an agent program.
|
|
The agent program can be downloaded from within the CI job using curl or wget.
|
|
Next, an end-user can connect to the Converge server, a rendez-vous server, that connects
|
|
the client and server together based on a common identifier specified by both client and
|
|
server.
|
|
</p>
|
|
|
|
|
|
|
|
<h2>how it works</h2>
|
|
|
|
<p>
|
|
The basic principle of converge is described below. Access to a running remote continous integration
|
|
job is usually not possible without a lot of access to the backend environment where jobs are running.
|
|
However, the job can connect to a server running outside, and so can the client.
|
|
</p>
|
|
|
|
|
|
<div>
|
|
<img src="../static/images/converge.svg" style="max-width: 800px"/>
|
|
</div>
|
|
|
|
The connection between
|
|
client and agent is established as follows:
|
|
<ul>
|
|
<li>(1): the agent, started by the continuous integration job, connects to converge server through a websocket, this establishes a connection that
|
|
is similar to a TCP connection. In connecting, the agent specifies a
|
|
rendez-vous id. After connecting, the agenta and Converge server perform multiplexing of connections
|
|
over this single connection. This allows the agent to run an embedded SSH server and listen for incoming
|
|
connections, just like normally is done with a TCP listener. </li>
|
|
<li>(2): the client connects to converge server through SSH and also specifies the same rendez-vous id.
|
|
Since SSH by itself cannot connect over websockets, a helper program <code>wsproxy</code> is used as
|
|
a proxy command for SSH. Using <code>wsproxy</code>, the rendez-vous id is passed to the server as part
|
|
of the websocket URL. </li>
|
|
<li>(3): converge server connects the two connections after matching them based on the rendez-vous id.
|
|
Now when a connection is setup from a client, it can connect to the appropriate agent, identifie dby
|
|
rendez-vous id and setup a bi-directional connection. After this, Converge simply copies data between
|
|
client and agent. </li>
|
|
<li>(4): the agent runs an embedded SSH server and incoming connections to the agent are handed over to
|
|
that server. At this moment an end-to-end SSH session is established. </li>
|
|
<li>(5): The agent spawns a shell that receives input from the user. Output from the shell is communicated
|
|
back over the SSH session. The shell can be any shell (bash, cmd.exe, powershell.exe) or in fact any process.
|
|
At this point, the user is connected to a remote shell running in the continuous integration job.
|
|
</li>
|
|
</ul>
|
|
|
|
<p>There are a few special situations:
|
|
<ul>
|
|
<li> If no rendez-vous id is specified than a rendez-vous id is generated. </li>
|
|
<li> If the agent uses an id already in use by another agent, then converge server will
|
|
generate a new rendez-vous id. </li>
|
|
</ul>
|
|
The agent will always print the rendez-vous id and command required to connect to it.
|
|
</p>
|
|
|
|
<h2>security</h2>
|
|
|
|
<p>
|
|
The setup is such that the connection from client (end-user) to server (agent on CI job)
|
|
is end-to-end encrypted. The Converge server itself is no more than a bitpipe which pumps
|
|
data between client and agent.
|
|
</p>
|
|
|
|
<p>Using authorized keys is a secure way of connecting. When running the agent, the authorized keys
|
|
must be put in a file, allowing only the designated users to connect. The file containing authorized keys
|
|
can also be edited during a session with the agent, allowing more people to be added when required without
|
|
having to start over again.
|
|
Using authorized keys is made easy through the
|
|
<a href="usage.html">usage</a> page, which provides the exact commands to execute based
|
|
on the target environment. If users are hesitant to use their public key it is also possible
|
|
to generate a separate ssh key-pair using <code>ssh-keygen</code> and use that instead.
|
|
</p>
|
|
|
|
<p>To be able to use Converge, you must already have access to the configuration of a build job.
|
|
Having that access means it is possible to execute any command on a build agent. The Converge
|
|
agent is started by the build job and does not have any additional rights compared to what you
|
|
could script in the continous integration job definition.
|
|
</p>
|
|
|
|
<p>Converge does not provide any stealth features to hide it. The public sessions page show all
|
|
agents and clients including details about the clients and the agents. The idea is that it should
|
|
be light-weight and easy to use. There is no reason to hide the fact that someone is debugging
|
|
a continuous integration job. Also, all sessions are logged,both using standard kubernetes tooling
|
|
such as (fluentbit/filebeat, and loki/elasticsearch depending on the environment). This logging includes
|
|
only the details about the sessions, but not what the user is doing inside a session. Also, Converge
|
|
provides a prometheus metrics endpoint which allows user sessions to be tracked over time after
|
|
the fact. Thie data is also made accessible using a grafana dashboard.
|
|
</p>
|
|
|
|
<h2>SSH and SFTP</h2>
|
|
|
|
<p>
|
|
Both ssh and sftp are supported. Multiple concurrent sessions to same agent are allowed as well
|
|
as multiple agents are also allowed.
|
|
</p>
|
|
|
|
<h2>timeouts</h2>
|
|
|
|
<p>
|
|
There is a timeout mechanism in the agent such that jobs do not hang indefinitely
|
|
waiting for a connection. This mechanism is useful to make sure build agents do not keep
|
|
build agents occupied for a long time. By default, the agent exits with status 0 when
|
|
the last client exits after logging in. The timeout is an inactivity timeout. Activity is
|
|
detected as follows:
|
|
<ul>
|
|
<li><b>ssh</b>: any key press is considered activity</li>
|
|
<li><b>sftp</b>: any output from the server side is considered activity. This is done to
|
|
make sure that longer downloads cannot be killed by a timeout. A simple <code>ls</code> command
|
|
in an sftp session will also lead to activity since the server will output the result of the command. </li>
|
|
</ul>
|
|
</p>
|
|
<p>When the user touches a .hold file, the agent keeps waiting for connections even
|
|
after the last client logs out, taking into account the timeout. By default the agent
|
|
exits when the last user has logged out.
|
|
</p>
|
|
|
|
<h2>remote shell usage</h2>
|
|
|
|
<p>
|
|
The agent supports a <code>--shells</code> command-line option by which a comma-separated
|
|
list of shells can be prepended to the default search path for shells, e.g.
|
|
<code>--shells zsh,csh,sh</code> (linux) or <code>cmd,powershell</code> for
|
|
windows.
|
|
</p>
|
|
|
|
<p>
|
|
The agent sets an <code>agentdir</code> environment variable that points to
|
|
the directory where the agent is running.
|
|
</p>
|
|
|
|
<p>The user will get notifications from the agent any time something important happens such
|
|
as the session being close to timeout.
|
|
</p>
|
|
|
|
<h2>other tools</h2>
|
|
|
|
<p>Using available existing tools such as
|
|
<a href="https://github.com/namespacelabs/breakpoint">breakpoint</a> in combination
|
|
with a websocket tunneling tool such as
|
|
<a href="https://github.com/erebe/wstunnel">wstunnel</a> a similar solution can be
|
|
obtained. There are however some problems with these solutions that converge is
|
|
trying to address:
|
|
</p>
|
|
|
|
<p>
|
|
<ul>
|
|
<li>Breakpoint uses an embedded SSH server which is a really good idea but
|
|
uses the QUIC protocol for connecting to a rendez-vous server. The rendez-vous server then
|
|
exposes a random port for every client. This make deployment on kubernetes really hard
|
|
where fixed ports must be used and QUIC is also not a widely supported protocol.</li>
|
|
<li>The problem with the random ports can be solved by using wstunnel running together
|
|
with breakpoint server in a kubernetes pod, where wstunnel can forward traffic over an
|
|
external websocket connection to the local random port that breakpoint server is listening on.</li>
|
|
<li>breakpoint leaves it open on how users install the breakpoint executable (agent). </li>
|
|
<li>Because of the hacky nature of this setup, it is very difficult for users to use
|
|
and troubleshoot when things go wrong. </li>
|
|
</ul>
|
|
|
|
</p>
|
|
Converve server addresses these issues in the following ways:
|
|
<ul>
|
|
<li>Use the websocket protocol both for agents and for clients, providing a fixed port and
|
|
a supported protocol for kubernetes deployment. Websockets are also supported by
|
|
kubernetes ingress controllers so this makes it easy to deploy on kubernetes.
|
|
To make this work with SSH which does not natively support websockets, a proxycommand
|
|
<code>wsproxy</code> is provided that allows SSH to connect using websockets.
|
|
</li>
|
|
<li>Providing online documentation where the instructions take into account the
|
|
hostname and protocol where converge is running allowing users to cut and paste
|
|
instructions that can be used without modification. In the usage page the users
|
|
can even generate the correct agent startup commands and client connection commands
|
|
based on the type of shell they are connecting to. </li>
|
|
<li>Converge server provides out of the box downloads of required software. This makes sure
|
|
client and server are always up to date and can be downloaded in any continuous integration
|
|
job without having to package the required executables in an ad-hoc way.
|
|
In addition a protocol version check is done. </li>
|
|
<li>User-friendly error messages can be given to users in most cases when things do not work
|
|
out because of <code>wsproxy</code>. This is an SSH proxy command that communicates with converge
|
|
and provides additional information to the user. </li>
|
|
<li>A live screen showing the current sessions that are running. The sessions webpage provides
|
|
additional feedback about the running sessions. </li>
|
|
<li>Interactivity in the user's session with notifications about timeouts and a very
|
|
simple inactivity timeout mechanism. </li>
|
|
<li>Possibility for the user to define the remote shell to use. </li>
|
|
<li>Support for unix like bash shells and command prompt and powershell. </li>
|
|
<li>Observability w.r.t. non-functionals of converge and of agent and client sessions through
|
|
prometheus monitoring. For session monitoring, separate grafana dashboard is provided. </li>
|
|
</ul>
|
|
<p>
|
|
</p>
|
|
|
|
</div>
|
|
}
|
|
|
|
|
|
templ AboutTab() {
|
|
@BasePage(1) {
|
|
@About()
|
|
}
|
|
}
|