#### Wait for LFX Insights features

The Linux Foundation has its own metrics dashboard, which is not currently ready to handle custom metrics. This [feature has been requested](https://portal.productboard.com/bqt1cgergoszsqheudb7leux/c/118-bringyourownconnector-byoc), but there is no commitment to deliver it.

Pros

* Linux Foundation has requested this service be used, and will be more likely to approve any ongoing metrics expenditures related to it.
  * Should be compliant with the Linux Foundation's own data policy
* No extra cost incurred directly by O3DE
* Not owned or maintained by Amazon or another private interest
* Dashboard is publicly accessible by default
  * Has a decent [dashboard UI](https://insights.lfx.linuxfoundation.org/projects/o3de-f/dashboard;quicktime=time_filter_3Y)

Cons

* Waiting on an unscheduled feature request from LFX, with no established ETA
* Unproven use case of custom pipeline/test/perf metrics and alarms
  * Likely that certain use cases will be blocked, and need further iteration by the LFX team
* Customers and contributors would wait until at least 2024 for metrics features to come online
* Downstream customers may not be able to reuse the same dashboards by copying configuration files (LFX appears to be available only to Linux Foundation projects)

#### AWS CloudWatch Metrics

CloudWatch provides a simple metrics solution, which includes the ability to set alarms. A "single metric" consists of a key, value, name, namespace, and up to ten "dimensions", which are a set of secondary/metadata key-value pairs. A moderately accurate UTC timestamp is effectively added for free. AWS CloudWatch primarily charges per **type** of custom metric, plus a small amount per API call that uploads metrics data, and for dashboards and alarms. Each POST request adding custom metrics is limited to 20 gzip-packed metrics. Estimated monthly CloudWatch costs for metrics plus a dashboard and 100 alarms are $112 for full metrics (4 types with 10 million API calls) versus $14 for failure-only metrics (5 types with 250k API calls). This is estimated across four or five metric types (Pipeline Run, Job-Stage Run, Test Result, Profiling Result), with Test Module Result added if the reduced load is selected. The less-homogeneous data in the performance metrics may prompt splitting them out into separate metrics categories. Increasing the types of metrics from 5 to 1810 would increase the cost by around $540 per month and is not advised. Optimizing this may require additional investigation.

It is ***very*** important to limit the types of custom metrics in CloudWatch. While new metrics groups could make dashboard partitioning and alarm-writing easier, if individual metric-types were all stored with a unique key then monthly costs could increase by $7000.
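
For illustration only, a minimal upload sketch using boto3 could look like the following. The `O3DE/Jenkins` namespace, metric name, and dimension keys are placeholder assumptions, not decisions made by this proposal. Note that CloudWatch bills each unique combination of namespace, metric name, and dimension values as its own custom metric, which is why the dimensions here are limited to low-cardinality values.

```python
"""Minimal sketch of a constrained CloudWatch upload.
Namespace, metric name, and dimension keys are illustrative assumptions."""
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def put_pipeline_metrics(pipeline_runs):
    """Upload per-pipeline-run durations, batched per request.

    Each unique namespace + metric name + dimension combination is billed as a
    separate custom metric, so dimensions are kept to low-cardinality values.
    """
    metric_data = [
        {
            "MetricName": "PipelineDuration",  # one fixed name per metric type
            "Dimensions": [
                {"Name": "Branch", "Value": run["branch"]},   # e.g. "Development"
                {"Name": "Result", "Value": run["result"]},   # Success/Fail/Abort
            ],
            "Value": run["duration_seconds"],
            "Unit": "Seconds",
        }
        for run in pipeline_runs
    ]
    # Batch uploads; this document assumes a 20-metric limit per request.
    for start in range(0, len(metric_data), 20):
        cloudwatch.put_metric_data(
            Namespace="O3DE/Jenkins",
            MetricData=metric_data[start : start + 20],
        )
```

Keeping uploads behind a structured script like this (rather than ad-hoc `put_metric_data` calls in pipeline code) is also the mitigation noted in the cons below for accidental growth in metric types.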

Estimated monthly cost: $20/month for pipeline, benchmarks, and sparse test metrics ($120/month for full test metrics)

Pros

* Immediately ready to use via the existing O3DE AWS account
* Proven tool with extensive documentation
* Affordable if used correctly
* Can share everything except access keys, such that customers can set up their own dashboards for their own projects

Cons

* Requires careful use to avoid generating a significant increase in metrics spend for O3DE / LF
  * All SIG Maintainers have the ability to merge changes to metrics (risk can be partially mitigated by structured upload scripts)
* Short-term data storage is limited to 14 days, after which historical metrics are aggregated into 5-minute groups for 63 days, and then into 1-hour groups for up to 15 months.
  * Additional storage costs to warehouse individual metrics or reports longer-term
* A public dashboard requires [configuring access control](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html) for AWS accounts, which seems inappropriate.
  * Could instead periodically publish metrics data through a Lambda-hosted script into a static HTML page hosted in S3, for a small extra cost
* Difficult to create custom widgets and views (prefers only simple graphs and alarms)
* May have data availability/latency concerns (metrics are tied to a single region, though us-east-1 should be fast enough for most users)

#### AWS OpenSearch

AWS OpenSearch (often called "ElasticSearch ELK" or "Kibana") is a service that runs live on EC2 instances and surfaces a web dashboard. It is a live toolbench-style service favored by data scientists.

Estimated monthly cost: Upward of $1000/month, dominated by EC2 instance costs for two on-demand (non-dedicated) instances for a production and a staging environment.

Pros

* Immediately ready to use via the existing O3DE AWS account
* We can share everything except access keys, such that customers can set up their own dashboards

Cons

* High cost to host the dashboard
* Linux Foundation has previously declined a proposal for an AWS OpenSearch dashboard, and would be unlikely to approve use of a similar service

#### AWS Athena

AWS Athena provides a big data framework for storing and parsing values in S3, but with no built-in dashboard tools. It is not a good option.

Athena has a monthly cost of $5 per terabyte scanned, plus S3 storage at $4-25 per terabyte depending on access speed. However, the total will heavily depend on what additional custom logic is written for dashboards and Lambda-based alarms to interact with its data, which would also incur a cost.

Estimated monthly cost: $50/month, with significant linear cost growth month-over-month unless we manually prune or merge our data ($1800/month for full test metrics, similarly linear)

Pros

* Immediately ready to use via the existing O3DE AWS account
* Proven tool with extensive documentation
* More control over what you build
* We can share everything except access keys, such that customers can set up their own dashboards

Cons

* Significant engineering cost to set up and maintain over time
* Need to define an S3 data schema to limit costs
* Need to create custom dashboards
* Need to write alarms as queries for a Lambda to run (a sketch follows below)
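
As a rough illustration of that last point, an "alarm" under the Athena option would be custom code along these lines: a scheduled Lambda runs a SQL query over the metrics tables and raises a notification when a threshold is crossed. The database, table, bucket, and SNS topic names, as well as the query and threshold, are hypothetical placeholders rather than a proposed schema.

```python
"""Sketch of a scheduled Lambda acting as an 'alarm' over Athena data.
Database/table/bucket/topic names are placeholders, not a real schema."""
import time
import boto3

athena = boto3.client("athena")
sns = boto3.client("sns")

FAILURE_RATE_QUERY = """
SELECT CAST(SUM(CASE WHEN result = 'Fail' THEN 1 ELSE 0 END) AS DOUBLE)
       / COUNT(*) AS failure_rate
FROM pipeline_runs
WHERE start_time > now() - interval '1' day
"""

def handler(event, context):
    # Start the query; results land in an S3 output location.
    execution = athena.start_query_execution(
        QueryString=FAILURE_RATE_QUERY,
        QueryExecutionContext={"Database": "o3de_metrics"},
        ResultConfiguration={"OutputLocation": "s3://example-metrics-bucket/athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (Athena is asynchronous).
    state = "QUEUED"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(2)
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Row 0 is the header row; row 1 holds the single computed value.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    failure_rate = float(rows[1]["Data"][0]["VarCharValue"])

    # "Alarm" behavior: notify a (hypothetical) SNS topic past a threshold.
    if failure_rate > 0.25:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:example-build-alarms",
            Subject="Pipeline failure rate alarm",
            Message=f"Daily pipeline failure rate is {failure_rate:.0%}",
        )
```

Every such alarm would need to be written, scheduled, and maintained by hand, which is part of the engineering cost called out above.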

This section estimates how much metrics data would be logged across the multiple pipelines, each of which has many jobs and many stages.

### Jenkins Jobs Metrics

Pipeline run metadata should be stored per pipeline run, and should include at minimum:

* Jenkins Pipeline execution ID (URL?)
* Branch Name (Development/Stabilization/etc.)
* Git commit hash
* UTC Start Time
* Total Duration
* Result (Success/Fail/Abort)
* Build parameters, such as:
  * CLEAN_OUTPUT_DIRECTORY
  * CLEAN_ASSETS
  * CLEAN_WORKSPACE
  * RECREATE_VOLUME

Each pipeline currently executes multiple build-jobs in parallel. The following information is useful for each stage of each job:

* Pipeline Stage Name (including variant info such as job name "Windows_Profile", and branch name "Development")
* UTC Start Time
* Duration
* Result (Success/Fail/Abort)
* Pipeline run metadata ID (foreign key)

#### Pipeline Metrics Volume

There are multiple pipelines in Jenkins, which currently have approximately the following daily cadences:

* 40x Automated PR Review
* 20x Branch Indexing
* 1x Nightly Clean Development
* 1x Nightly Incremental Development
* 1x Nightly Clean Stabilization

There are currently a total of 29 stages across the seven parallel jobs in Automated Review (Pull Requests) and Branch Indexing (Merge Consistency Checks) runs. There are approximately 40 pull request runs and 20 branch updates per day (across two branches).

Nightly builds have many more jobs, currently around 180 stages across 46 parallel jobs. Of these, Mac jobs comprise 29 stages across 9 jobs, which may not be used in the public infrastructure. It is difficult to estimate how this set of stages will change over time.

The daily load factor should be around `60x29 + 180x3 = 2280`. If Mac continues to not be included, this would be around 2200 sets of Jenkins-level metrics per day.

### Test Results

Results from individual tests will constitute the bulk of the metrics data. Currently only two parallel jobs execute tests in Automated Review, but each job contains thousands of tests. Transferring and storing this sizeable amount of data has time and storage costs, and likely needs its own optimizations.

A decent amount of the information in the test XMLs is relevant to metrics; however, the files also contain significant XML boilerplate. Instead of directly saving XML files, individual metrics can be extracted from them. As transforming this data has a time and hardware cost, it may be appropriate to execute it in a Lambda (or GitHub Action) that runs after the build pipeline (a sketch follows the field list below).

The following information is desired from every test:

* Result (Pass/Fail/Error/Skip)
* Test Name
* CTest Module Name
* Duration
* Pipeline Stage Name (foreign key)
* Pipeline run metadata ID (foreign key)
* Public SIG owner
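
A sketch of that extraction step is below, assuming JUnit-style XML test output. The module name and the two foreign keys are supplied from the surrounding job context, and the function and field names are illustrative, not part of this proposal.

```python
"""Sketch of extracting per-test metrics from a JUnit-style results XML.
Function and field names are illustrative assumptions."""
import xml.etree.ElementTree as ET

def extract_test_metrics(xml_path, module_name, pipeline_stage, pipeline_run_id):
    """Return one small record per test case instead of storing the raw XML."""
    records = []
    root = ET.parse(xml_path).getroot()
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            result = "Fail"
        elif case.find("error") is not None:
            result = "Error"
        elif case.find("skipped") is not None:
            result = "Skip"
        else:
            result = "Pass"
        records.append({
            "result": result,
            "test_name": case.get("name"),
            "module": module_name,                # CTest module containing the test
            "duration": float(case.get("time", 0.0)),
            "pipeline_stage": pipeline_stage,     # foreign key
            "pipeline_run_id": pipeline_run_id,   # foreign key
        })
    return records

def summarize(records):
    """Per-module pass/fail/error/skip counts (the reduced-volume option below)."""
    counts = {"Pass": 0, "Fail": 0, "Error": 0, "Skip": 0}
    for record in records:
        counts[record["result"]] += 1
    return counts
```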

#### Test Metrics Volume

There are currently around 43000 tests that run in each Automated Review and Branch Update test-job, a number which is expected to slowly grow over time. One path to reducing the scope of test metrics is to bundle them into reporting only on the module that contains each set of tests, of which there are currently 135 modules in Automated Review. This is a major tradeoff of data quality for size, and would reduce the ability to track which specific tests are failing. To reduce the data loss, it should also be paired with saving the total number of passed, failed, errored, and skipped tests per module.

A less extreme way to reduce the volume of data is to store only non-pass results, and assume that a "missing key" is a passed test. This reduces data fidelity in cases where tests are being renamed, added, removed, or never fail, but retains the ability to calculate pass-rate metrics for individual failing tests. This results in a variable metrics load which increases as more tests fail per run. Currently around 1/10 of test runs encounter a test failure. When such failures occur, the current average number of failures is around 2.5: it is quite often 1, and is rarely above 100. This should result in around a 99% reduction in data.

Another potential option to reduce data volume is to collect test metrics only from branch-update and periodic jobs, and not include the failures found during pull requests. This has the benefit of excluding the artificially noisy failures that occur in not-yet-ready pull requests, which frequently fail for legitimate reasons. However, the drawback would be not having data to highlight tests where developers get tripped up in pull requests. This would result in around a 50% reduction in data.

Nightly jobs currently contain nearly 2000 additional tests, across 34 modules. While these are broken out across a few parallel jobs to increase speed, the tests run across four variants (linux_profile, linux_profile_gcc_nounity, windows_profile, windows_debug).

The amount of test results should slowly expand over time as new modules are added. While applications and tests can be made more efficient and reliable, this also tends to encourage adding more features and tests.

The daily load factor for saving all test metrics would be `43000x2x60 + 2000x4x3 = 5,184,000` test-level metrics per day.
If only modules are reported, this would be `135x2x60 + 34x4x3 = 16,608` module-level metrics per day.
If modules and only test failures are reported, this would be `0.1x2.5x60 + 135x2x60 + 34x4x3 ~= 16,625` module-plus-failure metrics per day. At a 99.7% reduction and a minor tradeoff, this approach is recommended.

### Performance Profiling Results

There are multiple performance benchmarks, each of which emits at least one metric describing its execution. Currently there are:

* 1795 Micro-Benchmark metrics (across 10 modules)
  * Duration to complete OR iterations per unit of time OR bytes per unit of time
  * Maximum memory footprint
  * SIG Owner
* 10 Workflow benchmark metrics
  * Duration to complete
  * Maximum memory footprint
  * SIG Owner
* Each of these metrics may additionally want to record:
  * An error-signal for when exercising the benchmark crashes, hangs, or fails to run

These benchmarks execute only in the three periodic (nightly) builds, as often as twice per day, and within each build they run on both Windows and Linux. This makes the daily metrics load around `1805x3x2x2 = 21,660`.

There may be additional per-module metrics to record, similar to the test metrics. However, there is no value in aggregating this data into per-module metrics, as the primary value for benchmarks comes from tracking the individual values over time.

Providing the ability to upload and track metrics will encourage the current number of performance metrics to grow, as such metrics otherwise have low utility. A pessimistically-high estimate is that this will expand by 10x within a year.
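
For reference, the load-factor arithmetic above is collected in one place below. The figures are the document's own estimates; the variable names are just for readability.

```python
"""Back-of-envelope volume arithmetic from the estimates above."""

# Jenkins-level metrics: stages per run x runs per day
jenkins_daily = 29 * 60 + 180 * 3                 # = 2,280 stage/run records per day

# Test-level metrics
full_tests = 43000 * 2 * 60 + 2000 * 4 * 3        # = 5,184,000 test results per day
modules_only = 135 * 2 * 60 + 34 * 4 * 3          # = 16,608 module results per day
modules_plus_failures = modules_only + 0.1 * 2.5 * 60   # ~16,625 per day

# Benchmark metrics: ~1805 benchmarks x 3 nightlies x 2 runs/day x 2 platforms
benchmarks_daily = 1805 * 3 * 2 * 2               # = 21,660 per day

# Full-detail daily total, matching the ~5,200,000 in "Estimated Total" below
full_daily = jenkins_daily + full_tests + benchmarks_daily
print(jenkins_daily, full_tests, modules_plus_failures, benchmarks_daily, full_daily)
```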

### Estimated Total

Metrics systems commonly store metrics with dimensional values, grouping a "single metric" as multiple related values instead of only individual key-value pairs. Under this model, daily metrics would be around 5,200,000 and heavily dominated by test metrics. If only test modules and failures are logged, rather than all data for individual tests, this would instead be around 30,000. If metrics are allowed to grow naturally, within a year this would likely reach 6,000,000 daily metrics, versus 50,000 if only test modules and failures are recorded.

This pessimistically-high estimate equates to around 42,000,000 vs 350,000 metrics per week (180,000,000 vs 1,500,000 per month, which is 2.2bn vs 0.02bn per year). Estimated (compressed) data volume per month is 83 TB for full test metrics, and 2 TB for sparse test metrics. Given the volume of raw metrics and the time-limited value they provide, they should not be stored indefinitely. It would be appropriate to persist only condensed reports long-term.
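
If raw metric exports and condensed reports end up in an S3 bucket, one way to enforce that retention split is an S3 lifecycle rule that expires raw data while leaving reports in place. The bucket name, prefixes, and day counts below are placeholder assumptions, not agreed values.

```python
"""Sketch of a retention policy: expire raw metric exports, keep condensed
reports. Bucket name, prefixes, and day counts are placeholder assumptions."""
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-o3de-metrics",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-metrics",
                "Filter": {"Prefix": "raw/"},      # raw per-test/per-run exports
                "Status": "Enabled",
                "Expiration": {"Days": 90},        # drop raw data after ~3 months
            },
            {
                "ID": "archive-condensed-reports",
                "Filter": {"Prefix": "reports/"},  # condensed long-term reports
                "Status": "Enabled",
                # Keep reports, but move them to cheaper storage after 30 days.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"}
                ],
            },
        ]
    },
)
```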