- handbook
- Company
- Company
- Board
- Communications
- Decision making
- Guides
- KPIs and OKRs
- principles
- Remote Work
- Security
- Asset Management Policy
- Business Continuity & Disaster Recovery Policy
- Information Security Roles and Responsibilities
- Operations Security Policy
- Risk Management Policy
- Third-Party Risk Management Policy
- Human Resources Security Policy
- Access Control Policy
- Incident Response Plan
- Cryptography Policy
- Information Security Policy and Acceptable Use Policy
- Secure Development Policy
- Data Management Policy
- strategy
- values
- Operations
- Product
- Feedback
- Market Segments
- Metrics
- Node-RED Dashboard
- personas
- Pricing Principles
- Principles
- Responsibilities
- Strategy
- Versioning
- Customer department
- Customer
- Customer Success
- Hubspot
- Marketing
- How we work
- Marketing
- Video
- Customer Stories
- Social Media
- blog
- Community
- Marketing - Website
- Webinars
- FlowFuse Messaging
- Sales
- Engineering & Design Practices
- Design
- Engineering
- Certified Nodes
- contributing
- Front End
- Packaging Guidelines
- Platform Ops
- Deployment
- Incident Response
- Observability
- Production Environment
- FlowFuse Dedicated
- Staging Environment
- Project Management
- Releases
- Security Policy
- tools
- Website A/B Testing
- Internal Operations
- People Ops
# Observability
Observability is the ability to understand the internal state and behavior of a system by analyzing its outputs, without requiring knowledge of its internal workings. In the context of DevOps, this means having a holistic view of your applications and infrastructure, including their health, performance, and any potential issues.
# Tools we use
# Prometheus
Prometheus is used to monitor the health of our applications and infrastructure. It collects metrics from various sources, including our applications, Kubernetes, and AWS. We use it to monitor the following:
- Application Metrics: Prometheus collects metrics from our applications, including HTTP requests, database queries, and background jobs.
- Kubernetes Metrics: Prometheus collects metrics from Kubernetes, including CPU and memory usage, pod status, and network traffic.
# Loki
Loki is a log aggregation system designed to work seamlessly with Prometheus. We use Loki to collect, store, and query logs from our applications and infrastructure. It complements Prometheus by providing a way to analyze logs alongside metrics.
# Grafana
Grafana is a popular open-source platform for creating, sharing, and managing dashboards. It complements Prometheus and Loki by providing a unified interface for visualizing and analyzing observability data. Key features include:
- Data Source Integration: Grafana supports various data sources, including Prometheus and Loki, making it an ideal choice for aggregating and visualizing metrics and logs in one place.
- Customizable Dashboards: Grafana offers extensive customization options, enabling you to build tailored dashboards that provide the insights you need.
- Alerting: You can set up alerting rules in Grafana to proactively monitor your systems based on your metrics and logs data.
# AWS CloudWatch
AWS CloudWatch is a monitoring and observability service that provides data and actionable insights for AWS, hybrid, and on-premises applications and infrastructure resources. In our case we use it to monitor infrastructure-related resources in AWS.
# Uptime Robot
Uptime Robot is used to monitor our public facing sites, including FlowFuse Cloud. This polls each endpoint at regular intervals and raises an alarm if an error is detected. The alerts are sent to #ops-uptime-alerts
in slack and emailed to the CTO.
# Alerting
Any alerts that have been configured in Grafana will post to the #ops-alerts
channel in slack. The alert, where appropriate, will include a link to the relevant section of the Incident Playbook.