Published: May 3, 2022 by Isaac Johnson
Two years ago, I reviewed an up-and-coming observability suite, Epsagon. At the time, as a two-year-old company, they were a hungry startup. While I found their product compelling, there was no free tier, and the 14-day demo was a nice amuse-bouche but not enough to get me to commit to the Pro tier.
Since then, they have been acquired, radically changed their pricing, and packed in more features. Let’s revisit Epsagon, look at what their Free tier offers, and show how we can integrate it into our systems.
Pricing changes
Let’s tackle the first thing I noticed.
Back in 2020, they had a Pro plan that was either billed monthly
or annually:
Today we see that has changed a lot:
We now essentially have a free tier up to 10M traces a month, plus a negotiate-with-sales option for “Enterprise”.
I could not find any published data on the Enterprise costs. However, in the past they charged around $90/user/feature for Pro.
Company profile
Epsagon was still a hot item in 2020, seeking VC funding and having done multiple rounds up through a Series A. However, in October of 2021 they were acquired by Cisco for $500m. At the time, TechCrunch noted that this followed Cisco’s earlier acquisition of AppDynamics (which it had bought for $3.7b).
Post-acquisition, Nitzan Shapira and Ran Ribenzaft stayed on as CEO and CTO, respectively. They have around 59 employees, with offices in NYC and Tel Aviv.
Since they are now folded into Cisco, we don’t have insight into their revenue, but we do know they had raised over $30m in VC funding prior to the half-billion-dollar acquisition.
Concerns
One concern that caught my attention was a blog post by Yosef Arbiv on Cisco’s Tech Blog site.
He talks about Epsagon’s journey with OpenTelemetry and notes in “Act III: Joining Cisco and the OpenTelemetry Community”:
“Shortly after the first successful experiment [SDK with OTel], we had a significant change in plans — Cisco acquired Epsagon. The Epsagon product was to be deprecated gradually, and we started to work on a new Full Stack Observability product.”
So is Cisco planning to gut Epsagon for IP and focus on AppD? I guess we’ll have to see.
Signup
We will click “Start Now” on the homepage to get to the sign-up page.
I chose to sign up with Google, and it quickly loaded the Quickstart page
Monitoring Kubernetes
Let’s choose to monitor Kubernetes. It then asks for the type of cluster, EKS or Other
I’ll give my cluster a name and it shows the helm commands with a pulsing red icon at the bottom letting me know it is awaiting a signal from their agent
Next, I’ll install the Epsagon agent(s)
builder@DESKTOP-QADGF36:~/Workspaces/dockerWithTests2$ helm repo add epsagon https://helm.epsagon.com
"epsagon" has been added to your repositories
builder@DESKTOP-QADGF36:~/Workspaces/dockerWithTests2$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "epsagon" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "sonarqube" chart repository
...Successfully got an update from the "uptime-kuma" chart repository
...Successfully got an update from the "dapr" chart repository
...Successfully got an update from the "kubecost" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "nginx-stable" chart repository
...Successfully got an update from the "sumologic" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "incubator" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "myharbor" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
builder@DESKTOP-QADGF36:~/Workspaces/dockerWithTests2$ helm install epsagon-agent \
> --set epsagonToken="99210ee9-c982-4185-a508-db0dd40fe8ba" \
> --set clusterName="MyOnPremK3s" \
> --set metrics.enabled=true \
> --set "metrics-agent.server.remoteWrite[0].url=https://collector.epsagon.com/ingestion?99210ee9-c982-4185-a508-db0dd40fe8ba,metrics-agent.server.remoteWrite[0].basic_auth.username=99210ee9-c982-4185-a508-db0dd40fe8ba,metrics-agent.server.remoteWrite[0].write_relabel_configs[0].target_label=cluster_name,metrics-agent.server.remoteWrite[0].write_relabel_configs[0].replacement=MyOnPremK3s" \
> epsagon/cluster-agent
NAME: epsagon-agent
LAST DEPLOYED: Fri Apr 29 06:59:04 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Epsagon cluster agent insalled successfully!
If I check what Helm installed, I can see the release:
$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
...snip...
epsagon-agent default 1 2022-04-29 06:59:04.279666936 -0500 CDT deployed cluster-agent-0.1.19 1.0.13
...snip...
This created three deployments with three pods
$ kubectl get deployments --all-namespaces | grep epsagon
default epsagon-agent-kube-state-metrics 1/1 1 1 2m46s
default epsagon-agent-cluster-agent 1/1 1 1 2m46s
default epsagon-agent-metrics-agent-server 1/1 1 1 2m46s
$ kubectl get pods | grep epsagon
epsagon-agent-kube-state-metrics-7cb699f7d6-6pzs4 1/1 Running 0 3m7s
epsagon-agent-cluster-agent-8bfb6955c-wt8wk 2/2 Running 0 3m7s
epsagon-agent-metrics-agent-server-55b558596b-cs7k5 2/2 Running 0 3m7s
Back in Epsagon, I went to my cluster dashboard
Clicking the name of my cluster shows some general overall details
As I just launched it, I was not surprised that I did not yet see metrics
We’ll come back to that later. For now, let’s click traces and see what shows up.
Not surprisingly, we have no traces as we’ve yet to instrument code or set up an “Application”
Service Map was similarly empty
Metrics
We can create metric graphs based on any of the collected data, using their Metric Query form or switching to PromQL if preferred.
Here I queried for “kube_node_info” and “kube_pod_status_unschedulable”
I took a moment to do some cleanup
It worked for a few minutes.
If we want to add such a metric to a dashboard, we need to have a dashboard first.
Dashboards
From Dashboards/My Dashboards, create a new Dashboard
We can call it “MyNewDashboard” and save
I’ll create a new metric query on unschedulable pods and nodes
then click “+ Add To Dashboard”
Here I can put in the Panel Title and Dashboard name and save
Going back to MyNewDashboard and refreshing shows the new panel. I can drag the lower-right corner to resize the panel (here I made it bigger)
Side note: Dapr.io
I realized the core issue with Dapr.io is that, even with upgrades, the Helm chart holds onto the x509 certs, which expire after one year.
$ kubectl logs perl-debugger-5967f99ff6-zqdtx
error: a container name must be specified for pod perl-debugger-5967f99ff6-zqdtx, choose one of: [perl-debugger daprd]
WS-C02GD09AMD6V:ansible isaac.johnson$ kubectl logs perl-debugger-5967f99ff6-zqdtx daprd
time="2022-04-29T14:23:35.598869933Z" level=info msg="starting Dapr Runtime -- version 1.7.2 -- commit 1f8c0abfc73bb14a71435519e56b2aecb4385f01" app_id=perl-debugger instance=perl-debugger-5967f99ff6-zqdtx scope=dapr.runtime type=log ver=1.7.2
time="2022-04-29T14:23:35.598896771Z" level=info msg="log level set to: info" app_id=perl-debugger instance=perl-debugger-5967f99ff6-zqdtx scope=dapr.runtime type=log ver=1.7.2
time="2022-04-29T14:23:35.598983816Z" level=info msg="metrics server started on :9090/" app_id=perl-debugger instance=perl-debugger-5967f99ff6-zqdtx scope=dapr.metrics type=log ver=1.7.2
time="2022-04-29T14:23:35.599086346Z" level=info msg="Initializing the operator client (config: appconfig)" app_id=perl-debugger instance=perl-debugger-5967f99ff6-zqdtx scope=dapr.runtime type=log ver=1.7.2
time="2022-04-29T14:23:40.619169409Z" level=fatal msg="error loading configuration: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-04-29T14:23:38Z is after 2022-04-02T13:37:45Z\"" app_id=perl-debugger instance=perl-debugger-5967f99ff6-zqdtx scope=dapr.runtime type=log ver=1.7.2
It seems this is a known issue. We can use the dapr CLI to force an update
$ dapr mtls renew-certificate -k --valid-until 365 --restart
⌛ Starting certificate rotation
ℹ️ generating fresh certificates
ℹ️ Updating certifcates in your Kubernetes cluster
ℹ️ Dapr control plane version 1.7.2 detected in namespace default
✅ Certificate rotation is successful! Your new certicate is valid through Sat, 29 Apr 2023 14:24:40 UTC
ℹ️ Restarting deploy/dapr-sentry..
ℹ️ Restarting deploy/dapr-operator..
ℹ️ Restarting statefulsets/dapr-placement-server..
✅ All control plane services have restarted successfully!
For reference, I had first upgraded Dapr via Helm:
kubectl replace -f https://raw.githubusercontent.com/dapr/dapr/v1.7.2/charts/dapr/crds/components.yaml
kubectl replace -f https://raw.githubusercontent.com/dapr/dapr/v1.7.2/charts/dapr/crds/configuration.yaml
kubectl replace -f https://raw.githubusercontent.com/dapr/dapr/v1.7.2/charts/dapr/crds/subscription.yaml
helm repo update
helm upgrade dapr dapr/dapr --version 1.7.2 --namespace dapr-system --wait
Kubernetes observability
Under Nodes, we can get details on the cluster’s nodes, such as the version and which node is the master.
The 3 dots at the end let us jump to various details
Say I choose to Jump To Pods from there on a node; we can then see the pods associated with that particular node
I found the search ability on Containers to be pretty useful. Here I searched for containers related to perl. I have two perl-related pods, and each is Dapr.io orchestrated, so as we would expect, four containers show up in the results
If I pick a container, say the one that is waiting in our results, we can get details such as the YAML and ports it is using
We can see the kind of Metrics collected
As I mentioned earlier, we would come back to the Kubernetes Cluster Metrics after some time had passed.
After several days, I came back and indeed, the CPU, Memory, Network and Disk metrics were populated, while the Application Metrics remained empty (rightfully so, as nothing had been instrumented yet)
Alerts
Let’s create an alert on the metrics we have collected so far
We’ll give it a name
then we can use the kube_node_spec_unschedulable metric
We can then define our alert frequency
For ease, I’ll set an email target and click Test to try it out
Since that never came through, I decided to set up a Teams webhook instead
We can test it in the notifications
Once saved, we can see it in the list
And we can see details
To test it in action, I’ll change the check to force a failure by alerting when the unschedulable count is less than 1
And we can now see the triggered alert
We can see the alert is now in an alarm state, and we can also mute it from here
I then updated the alert back to what it should be (a condition of unschedulable above 1 for over 5 minutes)
And soon we see the alert move to a good state
Instrumenting
Let’s create an app and instrument it. Since we are using Express in Node.js, let’s pick that
We can pick either an in-code setup or environment variables
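The in-code route amounts to initializing their SDK before the app does anything else. Below is a minimal sketch, assuming the epsagon-frameworks npm package their quickstart points at; the token and app name are placeholders rather than my real values, and the environment-variable route just supplies the same settings through the variables the quickstart page lists.

// server.js (top of the entrypoint) - hypothetical sketch, not my exact app
// npm install epsagon-frameworks
const epsagon = require('epsagon-frameworks');

epsagon.init({
  token: '<epsagon-token-from-the-dashboard>', // placeholder
  appName: 'dockerWithTests2-dev',             // placeholder
  metadataOnly: false,                         // send payloads, not just metadata
});

const express = require('express');
const app = express();

app.get('/', (req, res) => res.send('Hello World!'));
app.listen(3000, () => console.log('Example app listening on port 3000'));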
I tried running it through Nodemon and outside of it, but regardless, I saw no traces
builder@DESKTOP-72D2D9T:~/Workspaces/dockerWithTests2$ npm start
> nodewithtests@1.0.0 start
> nodemon --inspect=0.0.0.0:9229 server.js
[nodemon] 2.0.15
[nodemon] to restart at any time, enter `rs`
[nodemon] watching path(s): *.*
[nodemon] watching extensions: js,mjs,json
[nodemon] starting `node --inspect=0.0.0.0:9229 server.js`
Debugger listening on ws://0.0.0.0:9229/0c8a8279-c206-4330-a1f7-30f10ffd385f
For help, see: https://nodejs.org/en/docs/inspector
Example app listening on port 3000
^C
builder@DESKTOP-72D2D9T:~/Workspaces/dockerWithTests2$ node server.js
Example app listening on port 3000
I also tried converting the express app to a cjs file and yanking out the module syntax.
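For context, that conversion was just trading the ESM import syntax for CommonJS requires; a simplified sketch of the hello-world app (not my exact file) looks like this:

// server.cjs - CommonJS version of the Express hello-world (sketch)
// was: import express from 'express';
const express = require('express');

const app = express();

app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Example app listening on port 3000');
});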
Still no go:
~/Workspaces/dockerWithTests2$ node server.cjs
Example app listening on port 3000
I eventually settled on a calculator demo and in doing so, managed to get some form of traces going
SSO
They recently rolled out SSO. I swear the option on the login page appeared during the course of writing this blog. I do not recall seeing it a week ago when I started.
We can set up SSO on the Settings page:
Billing
We can see current usage on the Billing page. This could be useful in determining whether one needs a paid license.
Traces
Here we can see a couple of apps and their service map details
We can move over to Traces for a report of traces
and from there I can see a given trace, though as it’s just a hello-world Express app, there is not much to see
Besides the graph, we can view the timeline
and sequence, which only has data when we chain instrumented services
However, short of running a few instrumented services, building up a Docker Compose stack, or loading instrumented services into a cluster, I’m not sure of the best way to demo the trace data.
Summary
This brings me to my conclusion: I really am not interested in observability tools that don’t support OpenTelemetry. Even if I find the monitoring useful (and I do), I’m not planning to promote a suite to my employer that would require us to hardcode a specific vendor library in all our code. To that end, I generally want to see OpenTelemetry support, both to avoid vendor lock-in and to roll out observability solutions that do not require recompiling.
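To make that concrete, here is roughly what a vendor-neutral bootstrap looks like with the OpenTelemetry Node SDK. Treat it as a sketch: the package names reflect the OTel JS SDK as I understand it at the time of writing, and the collector URL is an assumed local endpoint. The point is that the exporter configuration, not the application code, decides which backend receives the traces.

// tracing.cjs - hypothetical OTel bootstrap, loaded with: node -r ./tracing.cjs server.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  // Any OTLP-capable backend works here (a local collector, Jaeger, a vendor endpoint);
  // swapping vendors is a config change, not a code change.
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  // Auto-instruments Express, http, and friends without touching application code.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();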
I cannot see why they avoided rolling out OTel. I found blog posts on their site walking through a great ETL demo using RabbitMQ and Jaeger with OpenCensus (now OTel) here and another comparing Jaeger and Zipkin for OpenTracing here, but in neither article do they cover the last mile: getting traces into Epsagon.
Epsagon does support many frameworks and offers both a low-touch (environment variable) and a higher-touch (configured library) option for many languages. If you are willing to rebuild containers, it would be easy to make an instrumented image just by changing the Dockerfile.
This means one has to weigh the unsettling possible future of the product under Cisco against the rather expansive free offering. 10M traces a month really covers a lot of small-to-medium companies.