Published: Nov 17, 2021 by Isaac Johnson
New Relic is a performance monitoring and logging tool that has been around since 2008, when it was founded by Lew Cirne (the company name is an anagram of his name). He remained CEO until July of this year, when he handed the role to Bill Staples, who came over from Adobe in 2020.
New Relic is a full observability suite like Datadog and Splunk. They expanded into container tracing and observability in 2020 when they acquired Pixie Labs, subsequently open-sourcing Pixie and integrating it into their offerings.
Pixie is a rather unique offering in that it uses eBPF for kernel-layer data and telemetry capture and runs on the cluster itself. This reduces how much data has to be transmitted in real time up to a cloud SaaS offering.
Let’s check out New Relic and see how we can use it as well as how it compares to other offerings.
Setup
Once we sign up for New Relic, we can use the “Add your Data” wizard in the upper right to navigate to the Kubernetes data ingestion wizard:
In the next screen we will name our cluster. This name will be reflected later in the helm deploy, and it is how we identify the cluster in our New Relic dashboards.
The next options toggle optional parts of the helm chart. For instance, if you are in a shared cluster and should limit monitoring to a specific namespace, or you lack the RBAC rights to monitor everything, you can choose “Setup for unprivileged mode”.
Lastly, we get the kubectl and helm commands to install the NewRelic agent into our cluster.
$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
newrelic-bundle default 1 2021-11-06 21:15:46.3882052 -0500 CDT deployed nri-bundle-3.2.4 1.0
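If we want to script that check rather than eyeball the table, a small helper can look for the release row and its deployed status. This is only a sketch: the function name release_deployed is made up here, and it reads the helm list table from standard input so it can be tried against the sample output above.

```shell
# release_deployed NAME: succeed if a `helm list` table on stdin has a row
# for NAME that contains a "deployed" status column.
# (Hypothetical helper name, not part of helm.)
release_deployed() {
  awk -v name="$1" '
    $1 == name { for (i = 2; i <= NF; i++) if ($i == "deployed") ok = 1 }
    END { exit ok ? 0 : 1 }
  '
}

# Usage against a live cluster (requires helm):
#   helm list --namespace default | release_deployed newrelic-bundle && echo "agent installed"
```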
If things go right, in a few minutes the page will refresh to indicate that they are indeed getting results from your cluster.
New Relic Kubernetes Dashboards
Here we see a nice overview of nodes, pods, deployments and more. Note, this cluster has no content installed. It’s just a fresh minikube.
We can then look at events to see a running catalogue of events.
Alerts
We can set up alerts under Infrastructure settings. There we can create an alert based on a metric, as in many tools. However, there is also a predefined option that covers “Host not reporting”, which was my initial goal.
We scroll down, set the threshold, and create an on-the-fly notification policy for email alerts.
Later we will cover sending other kinds of alerts.
Once saved, we can see the alert defined in our Alerts page.
Triggering an alert
In Hyper-V I paused the cluster VM to force an error condition.
Pretty soon I got an email about the Host Stopping:
When I got the message I “Acknowledge”d it to indicate I was aware of the condition:
When the condition was resolved, New Relic moved the Incident to the state of Closed:
And we got an email notification about that as well:
We can then see the incident under Incidents in New Relic
Adding more Notification channels
Email might be sufficient in some cases, but what if we want to page our Ops team using something like PagerDuty or OpsGenie?
In order to create a new notification channel, we need to select from the list below. In this case we will use OpsGenie
Side-step: creating a basic OpsGenie account
One nice thing about Atlassian products is that almost all of them have a small free tier. In this case, we will create a basic free OpsGenie account.
Back to New Relic, we continue the Notification Channel wizard having selected OpsGenie
We will use our API key and if we desire, add a team and/or tag on the notification channel.
Now, back in our Alerts, we can select this OpsGenie channel we made.
With the host not reporting, we can get an alert
We see we have both email and OpsGenie as our Notification Channels on this policy
We can also exclude that particular cluster from alerts if we determine we would rather not be paged when “Minikube” goes down:
Adding Primary cluster
Now that we experimented with minikube, let’s add our real cluster.
We’ll need to give it a name in add data
Then we can get the helm steps
Verify the right cluster
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
isaac-macbookpro Ready <none> 317d v1.21.5+k3s1
builder-hp-elitebook-850-g1 Ready <none> 12d v1.21.5+k3s1
isaac-macbookair Ready control-plane,master 317d v1.21.5+k3s1
hp-hp-elitebook-850-g2 Ready <none> 32d v1.21.5+k3s1
builder-hp-elitebook-850-g2 Ready <none> 7d5h v1.21.5+k3s1
anna-macbookair Ready <none> 205d v1.21.5+k3s1
and install:
$ kubectl apply -f https://download.newrelic.com/install/kubernetes/pixie/latest/px.dev_viziers.yaml && \
> kubectl apply -f https://download.newrelic.com/install/kubernetes/pixie/latest/olm_crd.yaml && \
> helm repo add newrelic https://helm-charts.newrelic.com && helm repo update && \
> helm upgrade --install newrelic-bundle newrelic/nri-bundle \
> --set global.licenseKey=casdfasdfasdfasdfasdfasdfasdfasdfasdfasdfL \
> --set global.cluster=homek3s \
> --namespace=default \
> --set newrelic-infrastructure.privileged=true \
> --set global.lowDataMode=true \
> --set ksm.enabled=true \
> --set prometheus.enabled=true \
> --set kubeEvents.enabled=true \
> --set logging.enabled=true \
> --set newrelic-pixie.enabled=true \
> --set newrelic-pixie.apiKey=px-api-123123-1231-1231-1231-123123123 \
> --set pixie-chart.enabled=true \
> --set pixie-chart.deployKey=px-dep-12312313-1231-1231-1231-123123123 \
> --set pixie-chart.clusterName=homek3s
...Successfully got an update from the "datawire" chart repository
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈Happy Helming!⎈
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/builder/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/builder/.kube/config
Release "newrelic-bundle" does not exist. Installing it now.
W1108 21:45:25.594574 999 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1108 21:45:26.099508 999 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1108 21:45:33.717699 999 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1108 21:45:38.739046 999 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1108 21:45:39.339646 999 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1108 21:45:54.547808 999 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: newrelic-bundle
LAST DEPLOYED: Mon Nov 8 21:45:22 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
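Pasting keys inline leaves them in shell history, so one option is to read the license key from the environment instead. The following is a rough sketch under a few assumptions: NR_LICENSE_KEY, HELM and nr_install are names invented for this example, and only a subset of the flags from the command above is shown (the Pixie keys and flags would need the same treatment).

```shell
# nr_install CLUSTER: run the nri-bundle install with the license key taken
# from the environment. NR_LICENSE_KEY, HELM and nr_install are names made
# up for this sketch; the chart values are copied from the command above.
nr_install() {
  cluster="$1"
  # Abort with a message if the key is missing, rather than installing blank.
  : "${NR_LICENSE_KEY:?NR_LICENSE_KEY must be set}"
  # HELM defaults to the real binary; set HELM=echo to preview the arguments.
  "${HELM:-helm}" upgrade --install newrelic-bundle newrelic/nri-bundle \
    --namespace=default \
    --set global.licenseKey="$NR_LICENSE_KEY" \
    --set global.cluster="$cluster" \
    --set newrelic-infrastructure.privileged=true \
    --set global.lowDataMode=true \
    --set ksm.enabled=true \
    --set prometheus.enabled=true \
    --set kubeEvents.enabled=true \
    --set logging.enabled=true
}

# Preview without touching the cluster:
#   NR_LICENSE_KEY=dummy HELM=echo nr_install homek3s
```

Swapping the binary for echo is just a cheap way to review the arguments before running the real install.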
Once launched, we get a window indicating NewRelic is waiting on data
We can see more interesting data with a multi-node cluster.
We can see some colours stand out right away. For instance, the bottom square tells us one of the hosts is using high CPU
Whereas the orange equal signs show high Memory usage
APM
In time, New Relic (Pixie via New Relic) finds our Open Telemetry endpoints and auto discovers our trace data. This caught me off guard.
With Datadog, I needed to redirect my OTel settings to a Datadog endpoint. Here, Pixie automatically started capturing APM data of services:
I can pick a service and immediately see Response time and Throughput. This was captured by Pixie on the cluster.
I can of course change the time window on the upper right, but if we click Map View we can see that Pixie figured out this pod, using Dapr, is sending data to the OTel service!
Unfortunately (see FIX below), the “run scripts” part of Pixie, which is linked throughout New Relic, never seemed to work for me. I had no idea why, though I tried many things to get it to work.
FIX on Pixie Page Errors
This has come up for others as well. It has to do with how New Relic’s Angular libraries try to leverage Auth0 behind the scenes. Chrome (and Chromium-based browsers, which most are now) blocking third-party cookies is what breaks the Pixie windows.
To fix, you can set your browser to allow third-party cookies:
As you see, I just changed from the default “Block” to “Allow” and the Pixie pages started to work!
We can now see the Pixie generated DNS graph:
Logs
We saw above that the echoapp was sending data to the OTel collector.
What if we wanted to find its logs?
We can use the Logs area to search for logs from a pod, and since it isn’t often used, we’ll set the date slider to 7 days.
We can select an entry to get more details:
Here we see that it was the Daprd container (not the app container) in the pod generating a message.
If I click the eye icon by the container, we are redirected to the log query, this time focussed just on that exact pod. Since this is a service with only one pod in the Replica Set, the results look the same.
One of the features I’ve always liked is the ability to live tail logs. Here we can do the same. Once we have something in the selector, we can pick Live tail to see a current live report of logs
Alerting on Log data
Say we want to alert if our OpenTelemetry collector goes down. Yeah, it’s noisy, but the absence of noise might be a failure we need to investigate.
Here we can click “Create alert condition” on the left of the above log results.
From there we can start to define the Alert condition.
First we will give it a condition name, then specify the thresholds. The graph changes colours as we tweak the values in the thresholds (making it easy to see when things would trigger).
Here we are setting a critical error at less than one message for 15 minutes and a warning at less than 1 for 5 minutes.
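To make the two thresholds concrete, here is a small sketch of how that evaluation could be reasoned about. It is one plausible reading of the rule, not New Relic’s implementation: it takes per-minute log counts (oldest first) and reports the state implied by the trailing run of minutes with fewer than one message. The function name classify_silence is hypothetical.

```shell
# classify_silence: read per-minute log counts (oldest first, one per line)
# and report the state implied by the trailing run of empty minutes.
# Thresholds mirror the condition above: 5 quiet minutes is a warning,
# 15 quiet minutes is critical.
classify_silence() {
  awk '
    { quiet = ($1 < 1) ? quiet + 1 : 0 }   # reset the run when logs appear
    END {
      if (quiet >= 15)     print "critical"
      else if (quiet >= 5) print "warning"
      else                 print "ok"
    }
  '
}

# Example: two busy minutes followed by five silent ones.
#   printf '%s\n' 3 2 0 0 0 0 0 | classify_silence
```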
Next we can define the policy or select an existing alert policy.
In this case, let’s define a new one.
Next, we need to go to “Alerts and AI” to define the Notification Channels for this policy:
Then add a Slack notification channel. If you haven’t created one, you may need to take a moment to make yourself a Slack App to add to your channel (as they don’t let you use generic global webhooks anymore).
Validation
We can test this by changing the condition.
To see what it would really look like, I’ll edit the condition to trigger on “Above” 1 logs.
You can also see how the “warning” and “critical” are highlighted on the graph.
Initially, I have no incidents as we need to see this “bad” condition for more than 5 minutes
but soon we will see an incident created as well as an alert
If I go to the incident, I can acknowledge it
which has the benefit of also updating Slack, should someone else have gotten the notice and wondered whether anyone was taking a look already
I can save the alert to be “Below” on the condition and immediately I see the Incident is resolved and Slack is updated
AKS
I wanted to do one more test, this time using AKS
$ az group create -n myakstest --location centralus
{
"id": "/subscriptions/d4c094eb-e397-asdf-asdf-fd877504a619/resourceGroups/myakstest",
"location": "centralus",
"managedBy": null,
"name": "myakstest",
"properties": {
"provisioningState": "Succeeded"
},
"tags": null,
"type": "Microsoft.Resources/resourceGroups"
}
$ az aks create -g myakstest -n myakstest1 --location centralus --network-plugin azure --network-policy azure --generate-ssh-keys
{
"aadProfile": null,
"addonProfiles": null,
"agentPoolProfiles": [
{
"availabilityZones": null,
"count": 3,
"enableAutoScaling": null,
"enableEncryptionAtHost": false,
...
Verification of create
$ az aks list -o table
Name Location ResourceGroup KubernetesVersion ProvisioningState Fqdn
---------- ---------- --------------- ------------------- ------------------- ------------------------------------------------------------
myakstest1 centralus myakstest 1.20.9 Succeeded myakstest1-myakstest-d4c094-35eb2884.hcp.centralus.azmk8s.io
$ rm -f ~/.kube/config && az aks get-credentials -n myakstest1 -g myakstest --admin
Merged "myakstest1-admin" as current context in /home/builder/.kube/config
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-nodepool1-15561397-vmss000000 Ready agent 17m v1.20.9
aks-nodepool1-15561397-vmss000001 Ready agent 17m v1.20.9
aks-nodepool1-15561397-vmss000002 Ready agent 16m v1.20.9
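Removing ~/.kube/config works on a throwaway machine, but it also drops every other cluster context. A gentler variant is to keep the AKS credentials in their own file and point KUBECONFIG at it. The sketch below assumes a per-cluster naming scheme of my own invention (the helper kubeconfig_for is hypothetical); the --file option of az aks get-credentials is what writes to that path.

```shell
# kubeconfig_for CLUSTER: print a per-cluster kubeconfig path under ~/.kube.
# (Hypothetical helper; the naming scheme is just a convention for this sketch.)
kubeconfig_for() {
  printf '%s/.kube/%s.config\n' "$HOME" "$1"
}

# Usage (requires the Azure CLI and kubectl):
#   export KUBECONFIG="$(kubeconfig_for myakstest1)"
#   az aks get-credentials -n myakstest1 -g myakstest --admin --file "$KUBECONFIG"
#   kubectl get nodes
```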
This time I won’t enable low data mode (the global.lowDataMode=true setting used on the k3s cluster), to see how that affects things
Installing:
$ kubectl apply -f https://download.newrelic.com/install/kubernetes/pixie/latest/px.dev_viziers.yaml && \
> kubectl apply -f https://download.newrelic.com/install/kubernetes/pixie/latest/olm_crd.yaml && \
> helm repo add newrelic https://helm-charts.newrelic.com && helm repo update && \
> kubectl create namespace newrelic ; helm upgrade --install newrelic-bundle newrelic/nri-bundle \
> --set global.licenseKey=c8asdfasdfasdfasdfasdfAL \
> --set global.cluster=TestAKS \
> --namespace=newrelic \
> --set newrelic-infrastructure.privileged=true \
> --set ksm.enabled=true \
> --set prometheus.enabled=true \
> --set kubeEvents.enabled=true \
> --set logging.enabled=true \
> --set newrelic-pixie.enabled=true \
> --set newrelic-pixie.apiKey=px-api-asdfasdf-asdf-asdf-asdf-2fb1b80c78b6 \
> --set pixie-chart.enabled=true \
> --set pixie-chart.deployKey=px-dep-asdfasdf-asdf-asdf-asdf-7d5798fe112b \
> --set pixie-chart.clusterName=TestAKS
customresourcedefinition.apiextensions.k8s.io/viziers.px.dev created
customresourcedefinition.apiextensions.k8s.io/catalogsources.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/clusterserviceversions.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/installplans.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/operatorconditions.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/operatorgroups.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/operators.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/subscriptions.operators.coreos.com created
"newrelic" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "newrelic" chart repository
...Successfully got an update from the "rancher-latest" chart repository
Update Complete. ⎈Happy Helming!⎈
namespace/newrelic created
Release "newrelic-bundle" does not exist. Installing it now.
NAME: newrelic-bundle
LAST DEPLOYED: Fri Nov 12 14:07:46 2021
NAMESPACE: newrelic
STATUS: deployed
REVISION: 1
TEST SUITE: None
We can see it come up in the Kubernetes explorer:
This cluster was set up with Azure CNI for a flat network. I was interested to see how that might look in the DNS Flow Graph:
Now let’s add a basic Vote app
$ helm repo add azure-samples https://azure-samples.github.io/helm-charts/
"azure-samples" already exists with the same configuration, skipping
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kuma" chart repository
...Successfully got an update from the "azure-samples" chart repository
...Successfully got an update from the "datadog" chart repository
...Successfully got an update from the "rancher-latest" chart repository
...Successfully got an update from the "newrelic" chart repository
Update Complete. ⎈Happy Helming!⎈
$ helm install azure-samples/azure-vote --generate-name
NAME: azure-vote-1636748326
LAST DEPLOYED: Fri Nov 12 14:18:46 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Azure Vote application has been started on your Kubernetes cluster.
Title: Azure Vote App
Vote 1 value: Cats
Vote 2 value: Dogs
The externally accessible IP address can take a minute or so to provision. Run the following command to monitor the provisioning status. Once an External IP address has been provisioned, brows to this IP address to access the Azure Vote application.
kubectl get service -l name=azure-vote-front -w
$ kubectl get service -l name=azure-vote-front -w
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
azure-vote-front LoadBalancer 10.0.223.187 52.154.40.139 80:31884/TCP 71s
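The -w flag keeps watching until interrupted, which is awkward in a script. As a rough alternative, the external IP can be pulled out of the same table once it is no longer pending. The helper name external_ip is invented for this sketch, and it assumes the default column layout shown above.

```shell
# external_ip: print the EXTERNAL-IP column of the first service row on
# stdin, or fail if it is empty or still a placeholder like <pending>.
# (Hypothetical helper; assumes the default `kubectl get service` columns.)
external_ip() {
  awk 'NR == 2 { ip = $4 }
       END { if (ip == "" || ip ~ /^</) exit 1; print ip }'
}

# Usage (requires kubectl; note there is no -w here):
#   kubectl get service -l name=azure-vote-front | external_ip
```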
Now we can see the new pods in the DNS Flow Graph
We can see the services under OpenTelemetry. Since I know I didn’t set up the OTel connector, this shows it was New Relic that picked them up
We can also see trace data as picked up automatically:
We can also see what Metrics are picked up:
We can also use the selector cluster_name: "TestAKS"
to live tail logs from the cluster
And the log details view shows all the metadata collected
Muting Rules
Something I haven’t seen elsewhere that I think is pretty damn nice is the ability to add muting rules.
For instance, I like to know when my on-prem cluster goes down, but I also need my Z’s. I could, of course, work with OpsGenie to mute alerts at night…
However, New Relic has a “Muting Rule” feature that lets us pick a policy to mute during set time windows. Perhaps we want to mute all alerts on the big “Release Night” (one time) or every night (as I do).
Here we can see a Nightly Rule set to mute alerts when I am likely asleep.
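The one subtle part of a nightly window is that it wraps past midnight. As an illustration of that logic only (this is not how New Relic evaluates muting rules, and the 22 to 6 window is a made-up example), a check could look like this:

```shell
# in_mute_window HOUR START END: succeed if HOUR (0-23, no leading zero)
# falls in the half-open window [START, END), which may wrap past midnight.
# (Hypothetical helper for illustration.)
in_mute_window() {
  hour="$1"; start="$2"; end="$3"
  if [ "$start" -le "$end" ]; then
    # Same-day window, e.g. 9 to 17.
    [ "$hour" -ge "$start" ] && [ "$hour" -lt "$end" ]
  else
    # Window wraps past midnight, e.g. 22 to 6.
    [ "$hour" -ge "$start" ] || [ "$hour" -lt "$end" ]
  fi
}

# Example: is 3am inside a 22:00 to 06:00 window?
#   in_mute_window 3 22 6 && echo "muted"
```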
Once saved, we can see that it’s “scheduled” for tonight and I could optionally disable it.
Pricing
New Relic pricing is a little different. Frankly, they changed it radically a year ago, moving from 3 products to one and pricing on a per-user model.
Like Datadog, they have a free tier, which gives me 1 free user and 100 GB of ingest a month. Here is where Pixie comes in, in a major way: by only sending the basic data up and letting me use Pixie to query data on my cluster, I greatly reduce cost and am able to do a whole lot while staying in the free tier.
Summary
I feel like I’ve only touched the tip of the iceberg on this product. I had not used it since my time at Intuit, and even then I was only loosely aware of it for alerting.
The fact is, I came in thinking I would quickly try and discard this offering. But the more I looked, the more impressed I became by features like muting rules and automatic tracing discovery with low-level metrics capture.
Once I sorted out that the Pixie issues were all due to some (perhaps ill-advised) reliance on third-party cookies, I found this to be the most feature-rich offering I’ve tried out to date.
The only outstanding question is what features are part of the free 1 user offering and which are only available at the paid levels.