Epsagon was one of the KubeCon sponsors this year that caught my eye. They have a tracing suite with a free tier that seemed interesting: one can track a thousand traces a month in the free tier, which includes alerting and monitoring.
So how easy is it to set up and use? And what, if any, are the limitations?
Setup
Everything is done via a wizard, such as setting up monitoring of clusters by running the generated command locally:
$ curl -L -o epsagon_k8s.sh https://k8s.epsagon.com && chmod +x epsagon_k8s.sh && ./epsagon_k8s.sh 12345-12345-12345-12345-12345
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 8732 100 8732 0 0 14291 0 --:--:-- --:--:-- --:--:-- 14291
Fetching epsagon-role.yaml
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1067 100 1067 0 0 4521 0 --:--:-- --:--:-- --:--:-- 4521
Welcome to Epsagon!
[... ASCII-art Epsagon logo ...]
Are you using Rancher Management System? [Y/N] n
Available clusters:
1. default
Choose clusters to integrate. Use spaces for multiple clusters, e.g: 1 2 3...
1
Now installing Epsagon to: default
Applying epsagon-role.yaml to default
namespace/epsagon-monitoring created
serviceaccount/epsagon-monitoring created
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
clusterrole.rbac.authorization.k8s.io/epsagon-prometheus created
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
clusterrolebinding.rbac.authorization.k8s.io/epsagon-prometheus created
Testing Epsagon connection to server https://192.168.1.74:6443...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1140 100 41 100 1099 1 37 0:00:41 0:00:29 0:00:12 8
Connection to Epsagon failed, please see: https://docs.epsagon.com/docs/environments-kubernetes
This showed a problem, as my k8s API is not exposed externally. Running the same onboarding against my AKS cluster, however, went fine.
https://app.epsagon.com/dashboards/overview
Instrumenting an App
The next step is to set up some form of tracing. Since I'm most comfortable in Node.js, let's just update some Express apps:
$ cat app.js
const express = require('express')
const epsagon = require('epsagon-frameworks');

epsagon.init({
  token: 'asdfasdf-asdf-asdf-asdf-asdfasdf',
  appName: 'MyApp',
  metadataOnly: false,
});

const app = express()
const port = 3000

app.get('/', (req, res) => {
  req.epsagon.label('key', 'run7');
  req.epsagon.setError(Error('My custom error'));
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Example app listening at http://localhost:${port}`)
})
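For completeness, the only dependencies this needs are express and epsagon-frameworks (the Epsagon Node.js library required above); a standard npm install covers both:
$ npm install --save express epsagon-frameworks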
Once launched into my cluster:
$ cat Dockerfile
FROM node:12-alpine
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
COPY . .
RUN npm install
EXPOSE 3000
CMD ["npm", "run", "start"]
$ cat Deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: idjexpress-deployment
  labels:
    app: idjexpress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: idjexpress
  template:
    metadata:
      labels:
        app: idjexpress
    spec:
      containers:
      - name: idjexpress
        image: idjohnson/expressapp:6
        ports:
        - containerPort: 3000
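For reference, getting the image into the cluster was just the standard build/push/apply flow; roughly the following (reconstructed here, with the tag matching the Deployment above):
$ docker build -t idjohnson/expressapp:6 .
$ docker push idjohnson/expressapp:6
$ kubectl apply -f Deployment.yaml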
I can hit the ingress (http://52.189.67.186:3000/) and see it respond
This showed no traces, however.
But it's clearly something I'm doing, since hitting the pod directly worked:
$ kubectl port-forward idjexpress-deployment-785c4bcdd7-49ntc 8080:3000
Forwarding from 127.0.0.1:8080 -> 3000
Forwarding from [::1]:8080 -> 3000
Handling connection for 8080
Handling connection for 8080
I also instrumented another Express demo, just for some different results:
$ kubectl port-forward helloworld-deployment-7bcb884994-qnhqh 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
Handling connection for 8080
Handling connection for 8080
At first no results
but after a few reloads, we can see some traces
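(The reloads were nothing fancy; a quick loop of curls against the forwarded port, something like the below, is enough to generate a handful of traces.)
$ for i in $(seq 1 20); do curl -s http://localhost:8080/ > /dev/null; done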
Looking at details:
Here we can see the trace, albeit rather simple:
We can expand on that:
We can also look at clusters
This gives us a nice view of some basics like Node health
We can also look at pods across our namespaces:
As well as deployment details:
This is pretty handy for tracking when changes occur.
Alerting
We can create alerts to things like Slack:
We can also do email alerts:
The email alert:
The slack alert:
From alerts we can see which ones are firing:
Now when we get errors, such as the “My custom error” above, we can see them aggregated on the Issues page.
Here we can choose to mute them; perhaps they are more of a “warning” and we need not see them all the time.
Once the alerts are resolved, we can see that as well:
User Management
Adding users
We can invite by email:
Pricing
Let’s dig into billing.
Of course we have been using a rather unlimited free trial following the sessions at KubeCon
The prices are listed monthly:
And for paying up front annually:
Here is where it can be hard to figure out. Clearly they are an APM company and thus price on traces, but then what about the general cluster observability? I must assume that comes as free icing on top of the traces.
Consider Datadog, which covers APM as well as logs and monitoring. For Datadog we pay $31 per “host” per month, and we get up to 5 hosts free for logging. Datadog also includes 1 million spans at 15-day retention, with $1.27 per million beyond that (at 7-day retention).
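To make that concrete: a modest three-node cluster would be 3 × $31 = $93/month on Datadog before any span overage, so which model comes out cheaper really depends on how many hosts you run versus how many traces you generate.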
I think my struggle is that the “simple” pricing on Epsagon actually makes me wary: what am I getting, and for how long?
There is a very expansive two-week trial, as well as a free tier with 1k traces a month that includes metrics and alerting.
Remote Access
I wanted to circle back on the error I hit when I went to onboard a private cluster:
Testing Epsagon connection to server https://192.168.1.74:6443...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1140 100 41 100 1099 1 37 0:00:41 0:00:29 0:00:12 8
Connection to Epsagon failed, please see: https://docs.epsagon.com/docs/environments-kubernetes
Something didn’t sit right about this. I looked at what was installed in the cluster presently monitored.
We have a cluster role “epsagon-prometheus”
$ kubectl get clusterrole epsagon-prometheus -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"ClusterRole","metadata":{"annotations":{},"name":"epsagon-prometheus"},"rules":[{"apiGroups":[""],"resources":["nodes","nodes/proxy","services","services/proxy","endpoints","pods","pods/proxy","pods/log","namespaces","configmaps"],"verbs":["get","list","watch"]},{"apiGroups":["extensions"],"resources":["ingresses"],"verbs":["get","list","watch"]},{"apiGroups":["extensions","apps"],"resources":["deployments"],"verbs":["get","list","watch"]},{"nonResourceURLs":["/metrics"],"verbs":["get"]}]}
  creationTimestamp: "2020-11-20T16:36:14Z"
  name: epsagon-prometheus
  resourceVersion: "3120"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/epsagon-prometheus
  uid: 76be1a97-db94-4f51-9f96-7307edf38092
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - services
  - services/proxy
  - endpoints
  - pods
  - pods/proxy
  - pods/log
  - namespaces
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - apps
  resources:
  - deployments
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
And that is bound to a service account:
$ kubectl get clusterrolebinding epsagon-prometheus -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"name":"epsagon-prometheus"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"epsagon-prometheus"},"subjects":[{"kind":"ServiceAccount","name":"epsagon-monitoring","namespace":"epsagon-monitoring"}]}
  creationTimestamp: "2020-11-20T16:36:14Z"
  name: epsagon-prometheus
  resourceVersion: "3121"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/epsagon-prometheus
  uid: 68723641-70f4-4cc6-aa49-9f6cc75203db
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: epsagon-prometheus
subjects:
- kind: ServiceAccount
  name: epsagon-monitoring
  namespace: epsagon-monitoring
$ kubectl get sa epsagon-monitoring -n epsagon-monitoring
NAME SECRETS AGE
epsagon-monitoring 1 11d
So what I think is happening is that they are polling my cluster externally, like a remote Kubernetes web viewer, as opposed to an agent pushing metrics out.
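If that's right (and this is purely my guess at the mechanism, not anything from their docs), the flow would look roughly like authenticating to the API server with the epsagon-monitoring service account's token and listing resources, something like:
$ TOKEN=$(kubectl -n epsagon-monitoring get secret \
    $(kubectl -n epsagon-monitoring get sa epsagon-monitoring -o jsonpath='{.secrets[0].name}') \
    -o jsonpath='{.data.token}' | base64 -d)
$ curl -sk -H "Authorization: Bearer $TOKEN" https://<api-server>:6443/api/v1/pods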
This has the advantage, for Epsagon, of not having to capture and store customers’ logs and metrics. While this method saves them bandwidth on data ingestion and storage costs, it does mean one has to expose one’s cluster publicly with broad read permissions on a great many things. For instance, that ClusterRole includes the verb “get” on the resource “pods/log”… what if I have PHI or PII in there?
By comparison, Datadog can sanitize via its config, and I believe Sumo Logic could do the same via the FluentD plugin.
Additionally, I didn’t ask them to monitor my cluster; installing the Epsagon Azure plugin was a one-liner for tracing. Datadog works the same way for APM tracing: for Node.js, in similar fashion, we just import the ‘dd-trace’ tracer and set the requisite tags (see their guide).
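As a rough sketch (the service name and tags below are placeholders, not from a real config), the Datadog equivalent of the earlier app.js would start something like this:
// dd-trace must be initialized before express is required so it can auto-instrument the framework
const tracer = require('dd-trace').init({
  service: 'my-express-app', // placeholder service name
  env: 'dev',
  tags: { team: 'demo' }     // arbitrary global tags
});
const express = require('express');

const app = express();
app.get('/', (req, res) => res.send('Hello World!'));
app.listen(3000);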
However, with cluster agents that run on the cluster itself, one accepts some compute and storage impact from said agent, but with the benefit of tight control.
Summary
We only touched the surface of Epsagon, and for a company that’s just two years old, they have a really solid offering. It’s still run by the founders in Tel Aviv and recently secured $10m in financing back in July, bringing their total to $30m in funding. In the States they are based in New York, and they have between 50 and 100 developers.
This tells me it’s a startup that is young and hungry, and often that means some great deals; but one also assumes some risk going all in on a suite that might easily be acquired by a competitor. Additionally, someone has to pay for that free storage, and without advertisements they need to convert folks out of the free tier to mitigate their cash burn.
There is a full-featured 14-day trial and, as mentioned previously, a nice entry-level free tier one can try out. The tech support is chat-based, quite fast, and overall pretty friendly and helpful, though I ultimately solved my own setup problem (I hadn’t triggered the alert right in my Express app).