Deploy trin to network
First time Setup
- Get access to cluster repo (add person to @trin-deployments)
- git clone the cluster repo: https://github.com/ethereum/cluster.git
- Install dependencies within cluster virtualenv:
cd cluster
python3 -m venv venv
. venv/bin/activate
pip install ansible
pip install docker
sudo apt install ansible-core
On mac you can do brew install ansible instead of apt.
- Publish your pgp public key with keybase, using:
keybase pgp select --import
- This fails if you don't have a pgp key yet. If so, create one with
gpg --generate-key
- Contact @paulj, get public pgp key into cluster repo
- Contact @paulj, get public ssh key onto cluster nodes
- Make sure your pgp key is working by running:
sops portal-network/trin/ansible/inventories/dev/group_vars/secrets.sops.yml
- Log in to Docker with:
docker login
- Ask Nick to be added as collaborator on Docker repo
- Needed for rebooting nodes:
- Install doctl
- Contact @paulj to get doctl API key
- Make sure API key works by running:
doctl auth init
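A quick preflight check can catch a missing tool before the first deployment. The sketch below is only an illustration: the function names are hypothetical, and the tool list is an assumption collected from the commands mentioned in the setup steps. It only checks that each command is on PATH; it does not verify keys, logins, or API access.

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list, taken from the setup steps; adjust it to your machine.
SETUP_TOOLS="git python3 ansible-playbook docker gpg keybase sops doctl"

# Return 0 when every setup tool is installed, 1 otherwise.
check_setup() {
    # SETUP_TOOLS is deliberately unquoted so it splits into words.
    missing=$(missing_tools $SETUP_TOOLS)
    if [ -n "$missing" ]; then
        echo "Missing tools:" $missing
        return 1
    fi
    echo "All setup tools found"
}
```

Run check_setup after activating the virtualenv, since pip installs ansible there.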
Each Deployment
Prepare
- Generally we want to cut a new release before deployment; see the previous page for instructions.
- Announce in Discord #trin that you're about to run the deployment
- Make sure to schedule plenty of time to react to deployment issues
Update Docker images
Docker images are how Ansible moves the binaries to the nodes. Update the Docker tags with:
docker pull portalnetwork/trin:latest
docker pull portalnetwork/trin:latest-bridge
docker image tag portalnetwork/trin:latest portalnetwork/trin:testnet
docker image tag portalnetwork/trin:latest-bridge portalnetwork/trin:bridge
docker push portalnetwork/trin:testnet
docker push portalnetwork/trin:bridge
This step directs Ansible to use the current master version of trin. Read about the tags to understand more.
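The pull, tag, and push sequence above can be wrapped in a small helper so that a failed step stops the rest. This is a minimal sketch, not part of the cluster repo: the function names promote_tag and promote_all are hypothetical. It handles one image at a time instead of pulling both first, which should produce the same tags.

```shell
#!/bin/sh
REPO="portalnetwork/trin"

# Pull REPO:<src>, retag it as REPO:<dst>, and push the new tag.
# Stops at the first failing docker command.
promote_tag() {
    src="$1"
    dst="$2"
    docker pull "$REPO:$src" &&
        docker image tag "$REPO:$src" "$REPO:$dst" &&
        docker push "$REPO:$dst"
}

# Promote both images used by the deployment.
promote_all() {
    promote_tag latest testnet &&
        promote_tag latest-bridge bridge
}
```

Calling promote_all requires a prior docker login with push access to the repository.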
Run ansible
- Check the monitoring tools (Glados, for example) to record network health before deploying, so you can compare it against post-deployment health.
- Activate the virtual environment in the cluster repo:
. venv/bin/activate
- Make sure you've pulled the latest master branch of the deployment scripts, to include any recent changes:
git pull origin master
- Go into the Portal section of Ansible:
cd portal-network/trin/ansible/
- Run the deployment:
- Trin nodes:
ansible-playbook playbook.yml --tags trin
- State network nodes (check with the team if there is a reason not to update them):
- Lately we have not been regularly deploying state bridge nodes (they run for a long time and we don't want to restart them). To deploy all other state nodes, use the following command:
ansible-playbook playbook.yml --tags state-network --limit state_stun,state_bootnode,state_regular
- To deploy to all state network nodes:
ansible-playbook playbook.yml --tags state-network
- Run Glados deployment: updates glados + portal client (currently configured as trin, but this could change)
cd ../../glados/ansible
ansible-playbook playbook.yml --tags glados
- If you see the error "couldn't resolve module/action 'community.docker.docker_compose_v2'", you might need to reinstall the community.docker collection:
ansible-galaxy collection install community.docker --force
- Wait for completion
- Launch a fresh trin node, check it against the bootnodes
- ssh into random nodes, one of each kind, to check the logs:
- find an IP address
- node types
- bootnode: trin-*-1
- bridge node: trin-*-2
- backfill node: trin-*-3
- regular nodes: all remaining ips
- ssh ubuntu@$IP_ADDR
- check logs, ignoring DEBUG:
sudo docker logs trin -n 1000 | grep -v DEBUG
- Check monitoring tools to see if network health is the same as or better than before deployment. Glados might lag for 10-15 minutes, so keep checking back.
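The node numbering and the log check from the steps above can be sketched as two small shell helpers. The function names are hypothetical, and the classification assumes host names follow the trin-*-N pattern listed above, with every other host treated as a regular node.

```shell
#!/bin/sh
# Map a host name to its node type, using the numbering from the list above.
node_type() {
    case "$1" in
        trin-*-1) echo bootnode ;;
        trin-*-2) echo bridge ;;
        trin-*-3) echo backfill ;;
        *)        echo regular ;;
    esac
}

# Print the last 1000 log lines of the trin container on a node, minus DEBUG lines.
# Assumes $1 is an IP address you can reach over ssh as the ubuntu user.
check_logs() {
    ssh "ubuntu@$1" 'sudo docker logs trin -n 1000 | grep -v DEBUG'
}
```

For example, node_type trin-ams3-1 prints bootnode, and check_logs "$IP_ADDR" runs the same log command as the manual step.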
Communicate
Notify in Discord chat about the network nodes being updated.
Update these docs
Immediately after a release is the best time to improve these docs:
- add a line of example code
- fix a typo
- add a warning about a common mistake
- etc.
For more about working with mdbook in general, see the guide to Contribute to the book.
Celebrate
Another successful release! 🎉
FAQ
What do the Docker tags mean?
- latest: This image with trin is built on every push to master
- latest-bridge: This image with portal-bridge is built on every push to master
- angelfood: This tag is used by Ansible to load trin onto the nodes we host
- bridge: This tag is used by Ansible to load portal-bridge onto the nodes we host
Note that building the Docker image on git's master takes some time. If you merge to master and immediately pull the latest Docker image, you won't be getting the build of that latest commit. You have to wait for the Docker build to complete. You should be able to see on GitHub when the Docker build has finished.
Why can't I decrypt the SOPS file?
You might see this when running ansible, or the sops check:
Failed to get the data key required to decrypt the SOPS file.
Group 0: FAILED
32F602D86B61912D7367607E6D285A1D2652C16B: FAILED
- | could not decrypt data key with PGP key:
| github.com/ProtonMail/go-crypto/openpgp error: Could not
| load secring: open ~/.gnupg/secring.gpg: no such
| file or directory; GPG binary error: exit status 2
81550B6FE9BC474CA9FA7347E07CEA3BE5D5AB60: FAILED
- | could not decrypt data key with PGP key:
| github.com/ProtonMail/go-crypto/openpgp error: Could not
| load secring: open ~/.gnupg/secring.gpg: no such
| file or directory; GPG binary error: exit status 2
Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.
It means your key isn't working. Check with @paulj.
If using gpg and decryption problems persist, see this potential fix.
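When asking for help, it can be useful to know which keys SOPS tried. The sketch below pulls the fingerprints out of error output like the sample above. The function names are hypothetical, and the parsing assumes the "FINGERPRINT: FAILED" line format shown in that sample.

```shell
#!/bin/sh
# Read SOPS error output on stdin and print the fingerprint of every key that failed.
failed_fingerprints() {
    sed -n 's/^[[:space:]]*\([0-9A-Fa-f]\{40\}\): FAILED.*$/\1/p'
}

# Return 0 if fingerprint $1 appears in the SOPS error output on stdin.
# The comparison ignores letter case.
fingerprint_listed() {
    failed_fingerprints | grep -qi -x "$1"
}
```

Compare the result with the fingerprints printed by gpg --list-secret-keys. If your fingerprint is not listed, one possible reading is that the file is not encrypted to your key yet, which is worth mentioning when you contact @paulj.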
What do I do if Ansible says a node is unreachable?
You might see this during a deployment:
fatal: [trin-ams3-1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host XXX.XXX.XXX.XXX port XX: Connection timed out", "unreachable": true}
Retry once more. If it times out again, run the reboot script (see the First time Setup chapter for setup):
./reboot_node.sh <host name1>,<host name2>,...,<host nameN>
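If several nodes time out, the host list for the reboot script can be built from saved Ansible output. This is a sketch under assumptions: the function name is hypothetical, and the parsing relies on the "fatal: [host]: UNREACHABLE!" line format shown above.

```shell
#!/bin/sh
# Read Ansible output on stdin and print the unreachable host names
# as one comma-separated line, with duplicates removed.
unreachable_hosts() {
    sed -n 's/^fatal: \[\([^]]*\)]: UNREACHABLE!.*$/\1/p' | sort -u | paste -s -d, -
}
```

For instance, if the retry output was saved to a file (deploy.log is a hypothetical name), then ./reboot_node.sh "$(unreachable_hosts < deploy.log)" passes the hosts in the comma-separated form the script expects.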
What if everything breaks and I need to roll back the deployment?
If you observe things breaking or (significantly) degraded network performance after a deployment, you might want to roll back the changes to a previously working version until the breaking change can be identified and fixed. Keep in mind that you might want to roll back just the bridge nodes, or the backfill nodes, as opposed to every node on the network.
- Go to the commit from the previously released version tag. Click into the CI workflows for that commit and look for the docker-publish or docker-publish-bridge flow, depending on which images you want to roll back.
- In the logs for these flows, find the sha256 digest from the Publish docker image to Docker Hub step.
- Pull this specific image locally, using
docker pull portalnetwork/trin@sha256:<HASH>
- Retag the target image to this version. For example, if you want to re-deploy the bridges, do:
docker image tag portalnetwork/trin@sha256:6dc0577a2121b711ae0e43cd387df54c8f69c8671abafb9f83df23ae750b9f14 portalnetwork/trin:bridge
- Push the newly tagged bridge image to Docker Hub, e.g. docker push portalnetwork/trin:bridge
- Re-run the ansible script, which will use the newly updated image. Use the --limit cli flag if you only want to redeploy a subset of nodes, e.g. ansible-playbook playbook.yml --tags trin --limit backfill_nodes
- Verify that the network is back to regular operation.
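The pull, retag, and push steps of a rollback can be sketched as one helper that refuses anything that does not look like a sha256 digest. The function names are hypothetical and not part of the cluster repo. The digest still has to be copied from the CI logs by hand, and the ansible run remains a separate step.

```shell
#!/bin/sh
REPO="portalnetwork/trin"

# Return 0 if $1 looks like a sha256 digest: exactly 64 lowercase hex characters.
valid_digest() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{64}$'
}

# Pull the image with the given digest, retag it as REPO:<tag>, and push it.
# Usage: rollback_tag <sha256-hex> <tag>
rollback_tag() {
    digest="$1"
    tag="$2"
    if ! valid_digest "$digest"; then
        echo "not a sha256 digest: $digest" >&2
        return 1
    fi
    docker pull "$REPO@sha256:$digest" &&
        docker image tag "$REPO@sha256:$digest" "$REPO:$tag" &&
        docker push "$REPO:$tag"
}
```

For example, rollback_tag followed by the digest and the tag bridge repoints the bridge image; re-run the playbook afterwards so the nodes pick it up.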