Deploy trin to network

First time Setup

Get access to the cluster repo (ask to be added to @trin-deployments)
Clone the cluster repo: git clone https://github.com/ethereum/cluster.git

Install dependencies within the cluster virtualenv:

```
cd cluster
python3 -m venv venv
. venv/bin/activate
pip install ansible
pip install docker
sudo apt install ansible-core
```

On macOS, you can run brew install ansible instead of the apt command.

Install keybase
Publish your PGP public key with keybase, using: keybase pgp select --import
- This fails if you don't have a PGP key yet. If so, create one with gpg --generate-key
Install sops
Contact @paulj to get your public PGP key added to the cluster repo
Contact @paulj to get your public SSH key added to the cluster nodes

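Make sure your PGP key is working by running the following from the root of the cluster repo (sops opens the decrypted file in your editor; close it without making changes):

```
sops portal-network/trin/ansible/inventories/dev/group_vars/secrets.sops.yml
```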

Log in to Docker with: docker login
Ask Nick to add you as a collaborator on the Docker Hub repositories
Needed for rebooting nodes:
- Install doctl
- Contact @paulj to get a doctl API key
- Set up the API key and confirm that it works by running: doctl auth init
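A quick way to sanity-check this setup is a small Bash sketch like the one below. It is not part of the cluster repo: the function names missing_tools and check_secrets_access are hypothetical, and the tool list is assumed from the steps above. The decrypt check relies on sops --decrypt writing the plaintext to stdout, so redirecting it to /dev/null tests your key without opening an editor or printing secrets.

```bash
#!/usr/bin/env bash
# Hypothetical setup checks; not shipped with the cluster repo.

# Print every tool name from the arguments that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Try to decrypt a SOPS file without printing or editing it.
# Succeeds only if one of your keys can decrypt the file.
check_secrets_access() {
  if sops --decrypt "$1" >/dev/null 2>&1; then
    echo "ok: can decrypt $1"
  else
    echo "failed: cannot decrypt $1" >&2
    return 1
  fi
}

# Usage, from the root of the cluster repo with the venv active:
#   missing_tools ansible-playbook ansible-galaxy sops keybase gpg docker doctl
#   check_secrets_access portal-network/trin/ansible/inventories/dev/group_vars/secrets.sops.yml
```

If missing_tools prints nothing, every listed tool was found. If check_secrets_access fails, see "Why can't I decrypt the SOPS file?" in the FAQ.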

Each Deployment

Prepare

Generally, we cut a new release before deploying; see the previous page for instructions.
Announce in Discord #trin that you're about to run the deployment
Make sure to schedule plenty of time to react to deployment issues

Update Docker images

Docker images are how Ansible delivers the binaries to the nodes. The Docker Hub repositories are located at hub.docker.com/u/portalnetwork.

Once the release is finished, the Docker images should already be tagged correctly. Double-check them:

portalnetwork/trin:prod - deployed to regular nodes (including STUN nodes and state-network nodes)
portalnetwork/bridge:prod - deployed to trin bridges

See "What do the Docker tags mean?" in the FAQ below for details on the tags.
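One way to double-check is to confirm that prod resolves to the same image as the release tag you just published. The sketch below assumes Docker is installed locally and can pull from Docker Hub; same_image is a hypothetical helper, and the version tag in the usage comment is only an example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if two image references resolve to the
# same image ID after pulling both from Docker Hub.
same_image() {
  local a="$1" b="$2" id_a id_b
  docker pull -q "$a" >/dev/null && docker pull -q "$b" >/dev/null || return 1
  id_a=$(docker image inspect --format '{{.Id}}' "$a") || return 1
  id_b=$(docker image inspect --format '{{.Id}}' "$b") || return 1
  [ "$id_a" = "$id_b" ]
}

# Usage (example version tag; use the one from your release):
#   same_image portalnetwork/trin:prod portalnetwork/trin:v0.1.2-a1b2c3d4 && echo "trin prod OK"
#   same_image portalnetwork/bridge:prod portalnetwork/bridge:v0.1.2-a1b2c3d4 && echo "bridge prod OK"
```

Comparing local image IDs after a pull avoids relying on how the tags look on the Docker Hub web page.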

Run ansible

Before deploying, check the monitoring tools to get a baseline of network health that you can compare against after the deployment, e.g.:
- Glados
- Grafana
Activate the virtual environment in the cluster repo:
```
. venv/bin/activate
```
Make sure you've pulled the latest master branch of the deployment scripts, to include any recent changes:
```
git pull origin master
```
Go into the Portal section of Ansible:
```
cd portal-network/trin/ansible/
```
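git pull origin master merges (or rebases, depending on your git config) origin's master into whatever branch is currently checked out, so it can be worth confirming that you are on a clean master first. A minimal Bash sketch; on_clean_master is a hypothetical helper, not part of the cluster repo. It deliberately ignores untracked files, since a venv/ directory inside the repo may not be git-ignored.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only when the current checkout is
# the master branch and has no uncommitted changes to tracked files.
on_clean_master() {
  local branch
  # Fails on a detached HEAD or outside a git repo.
  branch=$(git symbolic-ref --short HEAD 2>/dev/null) || {
    echo "HEAD is not on a branch (detached, or not a git repo)" >&2
    return 1
  }
  if [ "$branch" != master ]; then
    echo "on branch '$branch', not master" >&2
    return 1
  fi
  # -uno: untracked files (such as venv/) are not counted as changes.
  if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
    echo "working tree has local changes" >&2
    return 1
  fi
}

# Usage, from the root of the cluster repo:
#   on_clean_master && git pull origin master
```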

Run the deployment:

Regular nodes:

```
ansible-playbook playbook.yml --tags trin
```

State network nodes:

```
ansible-playbook playbook.yml --tags state-network
```
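If you want a record of each run (for example, to look up unreachable hosts afterwards), you could wrap the playbook call in a small Bash function. This is a sketch: run_playbook and LOG_DIR are hypothetical names, and chaining the two plays with && (so the second only runs if the first succeeds) is a choice, not something the playbooks require.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run playbook.yml for one tag, show the output, and keep
# a copy in $LOG_DIR/deploy-<tag>.log (LOG_DIR defaults to the current directory).
run_playbook() {
  local tag="$1"
  local log="${LOG_DIR:-.}/deploy-$tag.log"
  ansible-playbook playbook.yml --tags "$tag" 2>&1 | tee "$log"
  # Return ansible-playbook's exit status rather than tee's.
  return "${PIPESTATUS[0]}"
}

# Usage, from portal-network/trin/ansible/:
#   run_playbook trin && run_playbook state-network
```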

Run the Glados deployment, which updates Glados and its Portal client (currently configured as trin, but this could change):
```
cd ../../glados/ansible
ansible-playbook playbook.yml --tags glados
```
If you see the "couldn't resolve module/action 'community.docker.docker_compose_v2'" error, you might need to reinstall the community.docker collection:
```
ansible-galaxy collection install community.docker --force
```
Wait for the playbooks to complete
Launch a fresh trin node and check it against the bootnodes
ssh into a few random nodes, one of each kind, to check the logs:
- find the node's IP address
- identify the node type by its host name:
  - bootnode: trin-*-1
  - bridge node: trin-*-2
  - backfill node: trin-*-3
  - regular nodes: all remaining hosts
- log in to the node:
```
ssh ubuntu@$IP_ADDR
```
- check the logs, filtering out DEBUG lines:
```
sudo docker logs trin -n 1000 | grep -v DEBUG
```
- for Glados nodes, log in as devops instead:
```
ssh devops@$IP_ADDR
```
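The naming convention and log command above can be captured in two small Bash helpers. This is a sketch: node_type and node_logs are hypothetical names, node_logs assumes the ubuntu login used for trin nodes (not Glados), and it adds 2>&1 so that anything the container writes to stderr is filtered as well.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a node from its host name, following the
# trin-*-1 / -2 / -3 convention described above.
node_type() {
  case "$1" in
    trin-*-1) echo bootnode ;;
    trin-*-2) echo bridge ;;
    trin-*-3) echo backfill ;;
    trin-*)   echo regular ;;
    *)        echo unknown ;;
  esac
}

# Hypothetical helper: show the last 1000 log lines of the trin container on a
# trin node, dropping DEBUG lines. Not for Glados nodes, which use the devops user.
node_logs() {
  ssh "ubuntu@$1" 'sudo docker logs trin -n 1000 2>&1 | grep -v DEBUG'
}

# Usage:
#   node_type trin-ams3-2     # prints "bridge"
#   node_logs "$IP_ADDR"
```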
Check the monitoring tools to see whether network health is the same as or better than before the deployment. Glados might lag by 10-15 minutes, so keep checking back.

Communicate

Post in Discord that the network nodes have been updated.

Update these docs

Immediately after a release is the best time to improve these docs:

add a line of example code
fix a typo
add a warning about a common mistake
etc.

For more about working with mdbook, see the guide to Contribute to the book.

FAQ

What do the Docker tags mean?

Docker images are split into repositories:

portalnetwork/trin - built from docker/Dockerfile.trin
- runs the trin binary, the regular Portal Network node
portalnetwork/bridge - built from docker/Dockerfile.bridge
- runs the bridge binary, which is responsible for gossiping content into the Portal Network

All repositories use the following tags:

latest - built on every push to master
vX.Y.Z-<commit-hash> - built when a version git tag is pushed to master
stable - the version recommended for the community to use
- updated automatically to the latest vX.Y.Z-<commit-hash> image
prod - the deployed version
- usually the same as stable
- also updated automatically to the latest vX.Y.Z-<commit-hash> image
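To make the naming scheme concrete, here is a Bash sketch that interprets a tag according to the list above. describe_tag is a hypothetical helper, and it assumes the commit hash part is lowercase hexadecimal, as in v0.1.2-a1b2c3d4.

```bash
#!/usr/bin/env bash
# Hypothetical helper: explain a Docker tag using the scheme listed above.
describe_tag() {
  local re='^(v[0-9]+\.[0-9]+\.[0-9]+)-([0-9a-f]+)$'
  case "$1" in
    latest) echo "latest: built on every push to master" ;;
    stable) echo "stable: recommended for community use" ;;
    prod)   echo "prod: the deployed version" ;;
    *)
      if [[ $1 =~ $re ]]; then
        echo "release ${BASH_REMATCH[1]} at commit ${BASH_REMATCH[2]}"
      else
        echo "unrecognized tag: $1" >&2
        return 1
      fi
      ;;
  esac
}

# Usage:
#   describe_tag v0.1.2-a1b2c3d4   # prints "release v0.1.2 at commit a1b2c3d4"
```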

Note that building the Docker image from master takes some time. If you merge to master and immediately pull the latest Docker image, you won't get a build of that latest commit; you have to wait for the Docker build to complete. You should be able to see on GitHub when the Docker build is published.

Why can't I decrypt the SOPS file?

You might see this when running Ansible or the sops check:

```
Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  32F602D86B61912D7367607E6D285A1D2652C16B: FAILED
    - | could not decrypt data key with PGP key:
      | github.com/ProtonMail/go-crypto/openpgp error: Could not
      | load secring: open ~/.gnupg/secring.gpg: no such
      | file or directory; GPG binary error: exit status 2

  81550B6FE9BC474CA9FA7347E07CEA3BE5D5AB60: FAILED
    - | could not decrypt data key with PGP key:
      | github.com/ProtonMail/go-crypto/openpgp error: Could not
      | load secring: open ~/.gnupg/secring.gpg: no such
      | file or directory; GPG binary error: exit status 2

Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.
```

It means your key isn't working. Check with @paulj.

If using gpg and decryption problems persist, see this potential fix.

What do I do if Ansible says a node is unreachable?

You might see this during a deployment:

```
fatal: [trin-ams3-1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host XXX.XXX.XXX.XXX port XX: Connection timed out", "unreachable": true}
```

Retry once more. If it times out again, run the reboot script (it requires doctl; see First time Setup):

```
./reboot_node.sh <hostname1>,<hostname2>,...,<hostnameN>
```
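When several hosts time out, collecting their names by hand is error-prone. Below is a Bash sketch that pulls the host names out of saved ansible-playbook output and joins them with commas, the format the reboot script takes. unreachable_hosts and deploy.log are hypothetical names, and the pattern assumes the "fatal: [host]: UNREACHABLE!" format shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read ansible-playbook output on stdin and print the
# unreachable host names as one comma-separated, de-duplicated list.
unreachable_hosts() {
  grep -o 'fatal: \[[^]]*]: UNREACHABLE' |
    sed 's/^fatal: \[\(.*\)]: UNREACHABLE$/\1/' |
    sort -u |
    paste -s -d, -
}

# Usage: save the output of the retry run, then reboot what is still unreachable.
#   ansible-playbook playbook.yml --tags trin 2>&1 | tee deploy.log
#   ./reboot_node.sh "$(unreachable_hosts < deploy.log)"
```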

What if everything breaks and I need to roll back the deployment?

If you observe things breaking or (significantly) degraded network performance after a deployment, you might want to roll back to a previously working version until the breaking change can be identified and fixed. Keep in mind that you might want to roll back just the bridge nodes or the backfill nodes, rather than every node on the network.

Find the previously released version (e.g. v0.1.2)
Find the Docker tag that corresponds to that version:
1. Visit the tags page of the Docker Hub repository for the binary that you want to roll back (e.g. https://hub.docker.com/r/portalnetwork/trin/tags)
2. Find the tag that has the same prefix as the previous release version (e.g. v0.1.2-a1b2c3d4)
3. There should be only one such tag, but it can't hurt to confirm that the commit hash matches the expected value
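If you copy the tag names from the tags page into a file (one per line), the prefix search and uniqueness check in steps 2 and 3 can be scripted. A Bash sketch; release_tag and tags.txt are hypothetical names. Matching on the version followed by a dash keeps v0.1.2 from also matching tags such as v0.1.20-<commit-hash>.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read Docker tag names (one per line) on stdin and print
# the single tag for the given release version, e.g. v0.1.2 -> v0.1.2-a1b2c3d4.
# Fails if zero or several tags match.
release_tag() {
  local matches count
  # index() == 1 means the line starts with "<version>-" (a literal prefix, no regex).
  matches=$(awk -v prefix="$1-" 'index($0, prefix) == 1')
  count=$(printf '%s' "$matches" | awk 'END { print NR }')
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one tag for $1, found $count" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}

# Usage:
#   release_tag v0.1.2 < tags.txt
```

You still need to confirm that the commit hash in the printed tag is the one you expect.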

Pull this specific image locally:

```
docker pull portalnetwork/trin:v0.1.2-a1b2c3d4
```

Retag the image with the prod tag:

```
docker image tag portalnetwork/trin:v0.1.2-a1b2c3d4 portalnetwork/trin:prod
```

Push the newly tagged prod image to Docker Hub:
```
docker push portalnetwork/trin:prod
```
Re-run the Ansible playbook, which will deploy the newly updated image
- Use the --limit CLI flag if you only want to redeploy a subset of nodes, e.g.: ansible-playbook playbook.yml --tags trin --limit backfill_nodes
Verify that the network is back to regular operation.
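The pull, retag and push steps can be bundled into one Bash function, so the same sequence works for either repository (portalnetwork/trin or portalnetwork/bridge). A sketch only: rollback_prod is a hypothetical name, it assumes you are logged in to Docker Hub with push access, and it stops at the first failing step.

```bash
#!/usr/bin/env bash
# Hypothetical helper: point <repo>:prod at an earlier release tag and push it.
# Each step runs only if the previous one succeeded.
rollback_prod() {
  local repo="$1" tag="$2"
  docker pull "$repo:$tag" &&
    docker image tag "$repo:$tag" "$repo:prod" &&
    docker push "$repo:prod"
}

# Usage, then re-run Ansible from portal-network/trin/ansible/:
#   rollback_prod portalnetwork/trin v0.1.2-a1b2c3d4
#   ansible-playbook playbook.yml --tags trin --limit backfill_nodes
```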

Permission denied while using Docker on a local machine

You might get the following error when using Docker:

```
permission denied while trying to connect to the Docker daemon socket ...
```

By default, the Docker daemon runs as the root user. See https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user for details and possible solutions.

In short, the most common ways to handle this are:

prefix docker commands with sudo
add your user to the docker group
run the Docker daemon as a non-root user (rootless mode)
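For the docker-group option, here is a quick check of whether your user is already in the group. A Bash sketch; user_in_group is a hypothetical name. Note that id -nG <user> reads the group database, so it shows a new membership right after usermod, while your current shell only picks it up after you log out and back in (or run newgrp docker, as the Docker guide describes).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if user $1 is a member of group $2 according to
# the group database.
user_in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -qxF "$2"
}

# Usage:
#   user_in_group "$(id -un)" docker || echo "not in the docker group yet"
# The steps in the Docker post-install guide linked above are:
#   sudo groupadd docker
#   sudo usermod -aG docker "$USER"
#   newgrp docker
```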
