Release Notes
Last updated
Was this helpful?
Last updated
Was this helpful?
2024-11-05
NOTE: This release deprecates legacy histograms. must be configured before upgrading to this release. If this is not done, nodes may not start up after the upgrade.
Fix use after free bug that could occasionally happen due to a race when fetching raw data.
Fix potential memory leak on certain oil/activity data operations.
Fix fetch bug where C-style calloc
allocations were being mixed with C++-style delete
s.
Add new paramter to whisper config, end_epoch_time
, that takes an epoch timestamp and directs the code to not look in whisper files if the fetch window starts after this time.
Fix bug where histogram ingestion data was not being sent properly during rebalance operations.
Fix bug where histogram rollup data was not being reconstituted during reconstitute operations.
Add get_engine
parameter to histogram data retrieval to allow pulling from either rollups or ingestion data.
No longer open new LMDB transactions when reading data for merging NNTBS blocks together.
Remove all support for legacy, non-sharded histograms.
Fix bug where if a raw shard rollup was aborted after being scheduled but before actually starting, multiple rollups could end up triggering at once.
Fix rename bug where type was not getting set when failing to send NNTBS data.
Add header to the /rename
endpoint to X-Snowth-Activity-Data-Mode
, which can accept either use_existing
or create_new
as values.
Treat MDB_CORRUPTED
, MDB_PAGE_FULL
, MDB_TXN_FULL
, MDB_BAD_TXN
, and ENOMEM
as LMDB corruption consistently when checking for errors.
When using the /merge/nntbs
endpoint to send data to a node, allow either updating the receiving node's activity data using the incoming NNTBS data or leaving it as is and not updating it.
Fix bug where activity data was not being updated correctly when inserting NNTBS data.
Fix bug where rollups were marked clean after a rollup had been kicked off asynchronously, resulting in a race that could lead to shards being incorrectly considered dirty.
Deprecate support for rebalancing data into a cluster with fewer NNTBS periods.
The /rename
endpoint will now detect when it gets a 500 error from the /merge/nntbs
endpoint and will return an error instead of spinning forever.
The /merge/nntbs
endpoint will no longer crash on detecting corrupt shards; it will offline the shards and return errors.
Various small fixes to reduce memory consumption, improve performance, and prevent possible crashes or memory corruption.
2024-07-17
Fix bug in build_level_index
where we were invoking a hook that called pcre_exec
with an uninitialized metric length.
Reduce spam in error log when trying to fetch raw data for a metric and there isn't any for the requested range.
Add new API endpoint, /rename
, to allow renaming a metric. This calculates where the new metric will live, sends the data for the metric to the new location, then deletes the old metric. This only works for numeric metrics.
Add new API endpoint, /full/canonical/<check uuid>/<canonical metric name>
that will allow deleting an exact metric from the system without using tag search.
Add ability to skip data after a given time when using the copy
sieve in snowth_lmdb_tool
.
2024-06-04
Avoid metric index corruption by using pread(2)
in jlog instead of mmap(2)
.
Deprecate max_ingest_age
from the graphite module. Require the validation
fields instead.
Change Prometheus module to convert nan
and inf
records to null
.
Add logging when when the snowth_lmdb_tool
copy operation successfully completes.
Fix bug where a node could crash if we closed a raw shard for delete, then tried to roll up another shard before the delete ran.
Fix bug where setting raw shard granularity values above 3w
could cause data to get written with incorrect timestamps during rollups.
Improve various listener error messages.
Add checks for timeouts in the data journal path where they were missing.
Improve graphite PUT error messages.
Fix NNTBS rollup fetch bug where we could return no value when there was valid data to return.
Fix bug where histogram rollup shards were sometimes not being deleted even though they were past the retention window.
2024-03-27
NOTE: This release bumps the metric index version from 4 to 5. Upon restart, new indexes will be built and the old ones will be deleted. This process will use a significant amount of memory while the indexes are being rebuilt. It will also cause the first post-update boot to take longer than usual.
Update index version from 4 to 5.
Automatically clean up old index versions on startup to make sure outdated indexes don't clog the disk.
Fix Ubuntu 20.04 specific bug where nodes could crash when trying to clean up status files when rolling up raw shards.
Fix issue with level indexes where data was being lost when deleting metrics on levels where the metric has multiple tags.
Fix issue where level indexes were incorrectly reporting that levels existed when all underlying metrics had been removed.
Add new API endpoints, /compact_indexes
and /invalidate_index_cache
, that allow forcing compaction and cache invalidation for specific accounts, respectively.
Fix rollup bug where raw shards could be prematurely deleted if a rollup was aborted due to corruption.
Fix various potential memory corruption issues.
Fix issue where jlog journal data could get corrupted.
2024-01-02
Fix bug where rollups were being flagged "not in progress" and "not dirty" when attempting to schedule a rollup and the rollup is already running.
Use activity ranges as part of query cache key. Previously, cached results from queries with a time range could be used to answer queries that had no time range, leading to incorrect results.
Fix logic bug where rollups were sometimes flagged as still being in progress after they were completed.
Account index WAL can keep growing without bounds due to a local variable value being squashed early.
Fix bug where the reconst_in_progress
file was not being cleaned up after reconstitute operations, which could block rollups and deletes from running.
The raw/rollup
and histogram_raw/rollup
API endpoints will no longer block if there is a rollup already running. They will also return sensible error messages.
Raw shard rollups will not be allowed to run unless all previous rollups have run at least once.
Fix bug where deferred rollups could cause the rollup process to lock up until the node is restarted.
Add REST endpoint POST/PUT /histogram/<period>/<check_uuid>/<metric_name>?num_records=<x>
which can be used with a json payload to directly insert histogram shard metric data for repair purposes.
Added configuration file field, //reconstitute/@max_simultaneous_nodes
, that will cause the reconstituting node to only hit a predetermined number of peer nodes at once. The default if not specified is "all peers". This setting can be used if a reconstitute places too much load on the rest of the cluster, causing degradation of service.
Disallow starting single-shard reconstitutes with merge enabled if the shard exists and is flagged corrupt.
Improve NNTBS error messages if unable to open a shard.
PromQL - improve error messages on invalid or unsupported range queries.
PromQL - fix range nested inside one or more instant functions.
Include maintenance mode when pulling lists of raw, histogram, or histogram rollup shards.
Use read-copy-update (RCU) for Graphite level indexes and the surrogate database. It allows more queries to run concurrently without affecting ingestion, and vice versa.
Defer rollups of raw shards if there is a rollup shard in maintenance that the raw shard would write to.
Reject live shard reconstitute requests on NNTBS or histogram rollup shards if there is a raw shard rollup in progress that would feed into them.
Fix bug where the system would report that a live reconstitute was not in progress, even when one was running.
Allow running single-shard or single-metric reconstitute on non-raw shards, even if the shard extends beyond the current time.
The reconstitute GUI no longer apppears when doing online reconstitutes.
Fix iteration bug when reconstituting NNTBS shards.
Added the merge_all_nodes
flag to live reconstitute which causes all available and non-blacklisted write copies to send metric data instead of only the "primary" available node.
Added the ability to repair a local database by reconstituting a single metric stream.
Fix bug where /fetch
would not proxy if the data for a time period was all in the raw database, but the relevant raw shards were offline.
2023-09-06
NOTE: This version updates RocksDB (raw database, histogram shards) from version 6 to 7. It is not possible to revert a node to a previous version once this version has been installed.
Add a new configuration parameter, //rest/@default_connect_timeout
, that allows configuring the connect timeout for inter-node communication. This was previously hardcoded to 3 seconds.
Graphite series
and series_multi
fetches now return 500 when there are no results and at least one node had an issue returning data.
Graphite series
and series_multi
fetches now return 200 with an empty results set on no data rather than a 404.
Fix bug on /find
proxy calls where activity ranges were being set incorrectly.
Add ability to filter using multiple account IDs in the snowthsurrogatecontrol
tool by providing the -a
flag multiple times.
Reduce usage of rollup debug log to avoid log spam.
Upgrade RocksDB from version 6.20.3 to version 7.9.2.
2023-07-28
Prevent ingestion stalls by setting a better eventer mask on socket read errors.
Fix bug where surrogate deletes running at the same time as level index compactions can cause crashes.
Improve scalability of lock protecting all_surrogates
in indexes.
Fix logging of old-data ingestion.
Don't stop a rollup in progress if new data is received; finish and reroll later.
Add ability to filter by account ID in the snowthsurrogatecontrol
utility by using the -a
flag.
Fix full-delete crash on malformed tag query.
Rewrite Graphite level-index query cache to remove race on lookup.
Remove surrogate2.
Some data was hidden post arts compaction, make sure it stays visible.
Fix bug where if a fetch to the /raw
endpoint exceeded the raw streaming limit (10,000 by default), the fetch would hang.
Reduce memory usage during extremely large /raw
fetches.
Fix bug where extremely large double values generated invalid JSON on /raw
data fetches.
Handle surrogate duplicates on migration from s2
to s3
.
Require all nodes for active_count
queries.
Add back-pressure to raw puts, allows the database to shed load by returning HTTP 503.
Add preliminary support for operating IRONdb clusters with SSL/TLS. This allows securing ingestion, querying, and intra-cluster replication. See for details. This feature should be considered alpha.
IMPORTANT NOTE: Any node running 0.23.7 or earlier MUST do a PRIOR to upgrading to 1.0.0. This is due to removal of support for the previous surrogate database format.
For older release notes, see .