NetObserv Operator is a Kubernetes / OpenShift operator for network observability. It deploys a monitoring pipeline that consists in:
- an eBPF agent, that generates network flows from captured packets
- flowlogs-pipeline, a component that collects, enriches and exports these flows
- when used in OpenShift, a Console plugin for flows visualization with powerful filtering options, a topology representation and more
Flow data is then available in multiple ways, each optional:
- As Prometheus metrics
- As raw flow logs stored in Loki
- As raw flow logs exported to a collector
You can install NetObserv Operator using OLM if it is available in your cluster, or directly from its repository.
Important
There hasn't been recent releases pushed to the community OperatorHub. This is mostly due to the lack of demand. Unless there is demand, going forward we only release the downstream NetObserv aka Network Observability operator for OpenShift. But there's nothing written in stone. Let us know if you would like that to change.
NetObserv Operator is available in OperatorHub with guided steps on how to install this. It is also available in the OperatorHub catalog directly in the OpenShift Console.
Please read the operator description in OLM.
After the operator is installed, create a FlowCollector
resource:
Refer to the Configuration section of this document.
A couple of make
targets are provided in this repository to allow installing without OLM:
git clone https://github.com/netobserv/network-observability-operator.git && cd network-observability-operator
USER=netobserv make deploy deploy-loki deploy-grafana
It will deploy the operator in its latest version, with port-forwarded Loki and Grafana (they are optional).
Note: the
loki-deploy
script is provided as a quick install path and is not suitable for production. It deploys a single pod, configures a 10GB storage PVC, with 24 hours of retention. For a scalable deployment, please refer to our distributed Loki guide or Grafana's official documentation.
To deploy the monitoring pipeline, this make
target installs a FlowCollector
with default values:
make deploy-sample-cr
Alternatively, you can grab and edit this config before installing it.
You can still edit the FlowCollector
after it's installed: the operator will take care about reconciling everything with the updated configuration:
kubectl edit flowcollector cluster
Refer to the Configuration section of this document.
Historically, Grafana Loki was a strict dependency but it isn't anymore. If you don't want to install it, you can still get the Prometheus metrics, and/or export raw flows to a custom collector. But be aware that some of the Console plugin features will be disabled. For instance, you will not be able to view raw flows there, and the metrics / topology will have a more limited level of details, missing information such as pods or IPs.
Pre-requisite: OpenShift 4.10 or above
If the OpenShift Console is detected in the cluster, a console plugin is deployed when a FlowCollector
is installed. It adds new pages and tabs to the console:
Charts on this page show overall, aggregated metrics on the cluster network. The stats can be refined with comprehensive filtering and display options. Different levels of aggregations are available: per zone, per node, per namespace, per owner or per pod/service. For instance, it allows to identify biggest talkers in different contexts: top X inter-namespace flows, or top X pod-to-pod flows within a namespace, etc.
The watched time interval can be adjusted, as well as the refresh frequency, hence you can get an almost live view on the cluster traffic. This also applies to the other pages described below.
The topology view represents traffic between elements as a graph. The same filtering and aggregation options as described above are available, plus extra display options e.g. to group element by node, namespaces, etc. A side panel provides contextual information and metrics related to the selected element.
This screenshot shows the NetObserv architecture itself: Nodes (via eBPF agents) sending traffic (flows) to the collector flowlogs-pipeline, which in turn sends data to Loki. The NetObserv console plugin fetches these flows from Loki.
The table view shows raw flows, ie. non aggregated, still with the same filtering options, and configurable columns.
These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with filters set to focus on that particular resource.
The FlowCollector
resource is used to configure the operator and its managed components. A comprehensive documentation is available here, and a full sample file there.
To edit configuration in cluster, run:
kubectl edit flowcollector cluster
As it operates cluster-wide on every node, only a single FlowCollector
is allowed, and it has to be named cluster
.
A couple of settings deserve special attention:
-
Agent features (
spec.agent.ebpf.features
) can enable more features such as tracking packet drops, TCP latency (RTT) and DNS requests and responses. -
Sampling
spec.agent.ebpf.sampling
: a value of100
means: one flow every 100 is sampled.1
means all flows are sampled. The lower it is, the more flows you get, and the more accurate are derived metrics, but the higher amount of resources are consumed. By default, sampling is set to 50 (ie. 1:50). Note that more sampled flows also means more storage needed. We recommend to start with default values and refine empirically, to figure out which setting your cluster can manage. -
Loki (
spec.loki
): configure here how to reach Loki. The default URL values match the Loki quick install paths mentioned in the Getting Started section, but you may have to configure differently if you used another installation method. You will find more information in our guides for deploying Loki: with Loki Operator, or an alternative "distributed Loki" guide. You should setspec.loki.mode
according to the chosen installation method, for instance useLokiStack
if you use the Loki Operator. Make sure to disable Loki (spec.loki.enable
) if you don't want to use it. -
Quick filters (
spec.consolePlugin.quickFilters
): configure preset filters to be displayed in the Console plugin. They offer a way to quickly switch from filters to others, such as showing / hiding pods network, or infrastructure network, or application network, etc. They can be tuned to reflect the different workloads running on your cluster. For a list of available filters, check this page. -
Kafka (
spec.deploymentModel: Kafka
andspec.kafka
): when enabled, integrates the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability. It's also an option to consider when you have a bursty traffic. This page provides some guidance on why to use Kafka. When configured to use Kafka, NetObserv operator assumes it is already deployed and a topic is created. For convenience, we provide a quick deployment using Strimzi: runmake deploy-kafka
from the repository. -
Exporters (
spec.exporters
) an optional list of exporters to which to send enriched flows. Currently, KAFKA and IPFIX are available (only KAFKA being actively maintained). This allows you to define any custom storage or processing that can read from Kafka or from an IPFIX collector. -
To enable availability zones awareness, set
spec.processor.addZone
totrue
.
More information on Prometheus metrics is available in a dedicated page: Metrics.md.
In addition to sampling and using Kafka or not, other settings can help you get an optimal setup without compromising on the observability.
Here is what you should pay attention to:
-
Resource requirements and limits (
spec.agent.ebpf.resources
,spec.agent.processor.resources
): adapt the resource requirements and limits to the load and memory usage you expect on your cluster. The default limits (800MB) should be sufficient for most medium sized clusters. You can read more about reqs and limits here. -
eBPF agent's cache max flows (
spec.agent.ebpf.cacheMaxFlows
) and timeout (spec.agent.ebpf.cacheActiveTimeout
) control how often flows are reported by the agents. The higher arecacheMaxFlows
andcacheActiveTimeout
, the less traffic will be generated by the agents themselves, which also ties with less CPU load. But on the flip side, it leads to a slightly higher memory consumption, and might generate more latency in the flow collection. There is a blog entry dedicated to this fine-tuning. -
It is possible to reduce the overall observed traffic by restricting or excluding interfaces via
spec.agent.ebpf.interfaces
andspec.agent.ebpf.excludeInterfaces
. Note that the interface names may vary according to the CNI used. -
Each component offers more advanced settings via
spec.agent.ebpf.advanced
,spec.processor.advanced
,spec.loki.advanced
andspec.consolePlugin.advanced
. The agent has environment variables that you can set throughspec.agent.ebpf.advanced.env
.
The FlowCollector
resource includes configuration of the Loki client, which is used by the processor (flowlogs-pipeline
) to connect and send data to Loki for storage. They impact two things: batches and retries.
-
spec.loki.writeBatchWait
andspec.loki.writeBatchSize
control the batching mechanism, ie. how often data is flushed out to Loki. Like in the eBPF agent batching, higher values will generate fewer traffic and consume less CPU, however it will increase a bit the memory consumption offlowlogs-pipeline
, and may increase a bit collection latency. -
spec.loki.advanced.writeMinBackoff
,spec.loki.writeMaxBackoff
andspec.loki.writeMaxRetries
control the retry mechanism. Retries may happen when Loki is unreachable or when it returns errors. Often, it is due to the rate limits configured on Loki server. When such situation occurs, it might not always be the best solution to increase rate limits (on server configuration side) or to increase retries. Increasing rate limits will put more pressure on Loki, so expect more memory and CPU usage, and also more traffic. Increasing retries will put more pressure onflowlogs-pipeline
, as it will retain data for longer and accumulate more flows to send. When all the retry attempts fail, flows are simply dropped. Flow drops are counted in the metricnetobserv_loki_dropped_entries_total
.
On the Loki server side, configuration differs depending on how Loki was installed, e.g. via Helm chart, Loki Operator, etc. Nevertheless, here are a couple of settings that may impact the flow processing pipeline:
-
Rate limits (cf Loki documentation), especially ingestion rate limit, ingestion burst size, per-stream rate limit and burst size. When these rate limits are reached, Loki returns an error when
flowlogs-pipeline
tries to send batches, visible in logs. A good practice is to define an alert, to get notified when these limits are reached: cf this example. It uses a metrics provided by the Loki operator:loki_request_duration_seconds_count
. In case you don't use the Loki operator, you can replace it by the same metric provided by NetObserv Loki client, namednetobserv_loki_request_duration_seconds_count
. -
Max active streams / max streams per user: this limit is reached when too many streams are created. In Loki terminology, a stream is a given set of labels (keys and values) used for indexing. NetObserv defines labels for source and destination namespaces and pod owners (ie. aggregated workloads, such as Deployments). So the more workloads are running and generating traffic on the cluster, the more chances there are to hit this limit, when it's set. We recommend setting a high limit or turning it off (0 stands for unlimited).
More performance fine-tuning is possible when using Kafka, ie. with spec.deploymentModel
set to Kafka
:
-
You can set the size of the batches (in bytes) sent by the eBPF agent to Kafka, with
spec.agent.ebpf.kafkaBatchSize
. It has a similar impact thancacheMaxFlows
mentioned above, with higher values generating less traffic and less CPU usage, but more memory consumption and more latency. We expect the default values to be a good fit for most environments. -
If you find that the Kafka consumer might be a bottleneck, you can increase the number of replicas with
spec.processor.kafkaConsumerReplicas
, or set up an horizontal autoscaler withspec.processor.kafkaConsumerAutoscaler
. -
Other advanced settings for Kafka include
spec.processor.kafkaConsumerQueueCapacity
, that defines the capacity of the internal message queue used in the Kafka consumer client, andspec.processor.kafkaConsumerBatchSize
, which indicates to the broker the maximum batch size, in bytes, that the consumer will read.
NetObserv is meant to be used by cluster admins, or, when using the Loki Operator (v5.7 or above), project admins (ie. users having admin permissions on some namespaces only). Multi-tenancy is based on namespaces permissions, with allowed users able to get flows limited to their namespaces. Flows across two namespaces will be visible to them as long as they have access to at least one of these namespaces.
Since FlowCollector v1beta2
, NetObserv is automatically configured with multi-tenancy enabled when spec.loki.mode
is LokiStack
.
To give flow logs access to a test
user, run:
oc adm policy add-cluster-role-to-user netobserv-reader test
More information about multi-tenancy can be found on this page.
Note that multi-tenancy is not possible without using the Loki Operator.
For a production deployment, it is also highly recommended to lock down the netobserv
namespace (or wherever NetObserv is installed) using network policies.
An example of network policy is provided here.
By default, communications between internal components are not secured. Note that, when using the Loki Operator, securing communication with TLS is necessary. There are several places where TLS can be set up:
- Connections to Loki (from the processor
flowlogs-pipeline
and from the Console plugin), by settingspec.loki.tls
. - With Kafka (both on producer and consumer sides), by setting
spec.kafka.tls
. Mutual TLS is supported here. - The metrics server running in the processor (
flowlogs-pipeline
) can listen using TLS, viaspec.processor.metrics.server.tls
. - The Console plugin server always uses TLS.
Please refer to the Architecture page.
Please refer to this documentation for everything related to building, deploying or bundling from sources.
Please refer to F.A.Q / Troubleshooting main document.
This project is licensed under Apache 2.0 and accepts contributions via GitHub pull requests. Other related netobserv
projects follow the same rules:
External contributions are welcome and can take various forms:
- Providing feedback, by starting discussions or opening issues.
- Code / doc contributions. You will find here some help on how to build, run and test your code changes. Don't hesitate to ask for help.