Release Notes
Changes in 1.4.0
2024-11-05
NOTE: This release deprecates legacy histograms. Histogram shards must be configured before upgrading to this release. If this is not done, nodes may not start up after the upgrade.
Fix use after free bug that could occasionally happen due to a race when fetching raw data.
Fix potential memory leak on certain oil/activity data operations.
Fix fetch bug where C-style
calloc
allocations were being mixed with C++-styledelete
s.Add new paramter to whisper config,
end_epoch_time
, that takes an epoch timestamp and directs the code to not look in whisper files if the fetch window starts after this time.Fix bug where histogram ingestion data was not being sent properly during rebalance operations.
Fix bug where histogram rollup data was not being reconstituted during reconstitute operations.
Add
get_engine
parameter to histogram data retrieval to allow pulling from either rollups or ingestion data.No longer open new LMDB transactions when reading data for merging NNTBS blocks together.
Remove all support for legacy, non-sharded histograms.
Fix bug where if a raw shard rollup was aborted after being scheduled but before actually starting, multiple rollups could end up triggering at once.
Fix rename bug where type was not getting set when failing to send NNTBS data.
Add header to the
/rename
endpoint toX-Snowth-Activity-Data-Mode
, which can accept eitheruse_existing
orcreate_new
as values.Treat
MDB_CORRUPTED
,MDB_PAGE_FULL
,MDB_TXN_FULL
,MDB_BAD_TXN
, andENOMEM
as LMDB corruption consistently when checking for errors.When using the
/merge/nntbs
endpoint to send data to a node, allow either updating the receiving node's activity data using the incoming NNTBS data or leaving it as is and not updating it.Fix bug where activity data was not being updated correctly when inserting NNTBS data.
Fix bug where rollups were marked clean after a rollup had been kicked off asynchronously, resulting in a race that could lead to shards being incorrectly considered dirty.
Deprecate support for rebalancing data into a cluster with fewer NNTBS periods.
The
/rename
endpoint will now detect when it gets a 500 error from the/merge/nntbs
endpoint and will return an error instead of spinning forever.The
/merge/nntbs
endpoint will no longer crash on detecting corrupt shards; it will offline the shards and return errors.Various small fixes to reduce memory consumption, improve performance, and prevent possible crashes or memory corruption.
Changes in 1.3.0
2024-07-17
Fix bug in
build_level_index
where we were invoking a hook that calledpcre_exec
with an uninitialized metric length.Reduce spam in error log when trying to fetch raw data for a metric and there isn't any for the requested range.
Add new API endpoint,
/rename
, to allow renaming a metric. This calculates where the new metric will live, sends the data for the metric to the new location, then deletes the old metric. This only works for numeric metrics.Add new API endpoint,
/full/canonical/<check uuid>/<canonical metric name>
that will allow deleting an exact metric from the system without using tag search.Add ability to skip data after a given time when using the
copy
sieve insnowth_lmdb_tool
.
Changes in 1.2.1
2024-06-04
Avoid metric index corruption by using
pread(2)
in jlog instead ofmmap(2)
.Deprecate
max_ingest_age
from the graphite module. Require thevalidation
fields instead.Change Prometheus module to convert
nan
andinf
records tonull
.Add logging when when the
snowth_lmdb_tool
copy operation successfully completes.Fix bug where a node could crash if we closed a raw shard for delete, then tried to roll up another shard before the delete ran.
Fix bug where setting raw shard granularity values above
3w
could cause data to get written with incorrect timestamps during rollups.Improve various listener error messages.
Add checks for timeouts in the data journal path where they were missing.
Improve graphite PUT error messages.
Fix NNTBS rollup fetch bug where we could return no value when there was valid data to return.
Fix bug where histogram rollup shards were sometimes not being deleted even though they were past the retention window.
Changes in 1.2.0
2024-03-27
NOTE: This release bumps the metric index version from 4 to 5. Upon restart, new indexes will be built and the old ones will be deleted. This process will use a significant amount of memory while the indexes are being rebuilt. It will also cause the first post-update boot to take longer than usual.
Update index version from 4 to 5.
Automatically clean up old index versions on startup to make sure outdated indexes don't clog the disk.
Fix Ubuntu 20.04 specific bug where nodes could crash when trying to clean up status files when rolling up raw shards.
Fix issue with level indexes where data was being lost when deleting metrics on levels where the metric has multiple tags.
Fix issue where level indexes were incorrectly reporting that levels existed when all underlying metrics had been removed.
Add new API endpoints,
/compact_indexes
and/invalidate_index_cache
, that allow forcing compaction and cache invalidation for specific accounts, respectively.Fix rollup bug where raw shards could be prematurely deleted if a rollup was aborted due to corruption.
Fix various potential memory corruption issues.
Fix issue where jlog journal data could get corrupted.
Changes in 1.1.0
2024-01-02
Add preliminary support for operating IRONdb clusters with SSL/TLS. This allows securing ingestion, querying, and intra-cluster replication. See TLS Configuration for details. This feature should be considered alpha.
Fix bug where rollups were being flagged "not in progress" and "not dirty" when attempting to schedule a rollup and the rollup is already running.
Use activity ranges as part of query cache key. Previously, cached results from queries with a time range could be used to answer queries that had no time range, leading to incorrect results.
Fix logic bug where rollups were sometimes flagged as still being in progress after they were completed.
Account index WAL can keep growing without bounds due to a local variable value being squashed early.
Fix bug where the
reconst_in_progress
file was not being cleaned up after reconstitute operations, which could block rollups and deletes from running.The
raw/rollup
andhistogram_raw/rollup
API endpoints will no longer block if there is a rollup already running. They will also return sensible error messages.Raw shard rollups will not be allowed to run unless all previous rollups have run at least once.
Fix bug where deferred rollups could cause the rollup process to lock up until the node is restarted.
Add REST endpoint POST/PUT
/histogram/<period>/<check_uuid>/<metric_name>?num_records=<x>
which can be used with a json payload to directly insert histogram shard metric data for repair purposes.Added configuration file field,
//reconstitute/@max_simultaneous_nodes
, that will cause the reconstituting node to only hit a predetermined number of peer nodes at once. The default if not specified is "all peers". This setting can be used if a reconstitute places too much load on the rest of the cluster, causing degradation of service.Disallow starting single-shard reconstitutes with merge enabled if the shard exists and is flagged corrupt.
Improve NNTBS error messages if unable to open a shard.
PromQL - improve error messages on invalid or unsupported range queries.
PromQL - fix range nested inside one or more instant functions.
Include maintenance mode when pulling lists of raw, histogram, or histogram rollup shards.
Use read-copy-update (RCU) for Graphite level indexes and the surrogate database. It allows more queries to run concurrently without affecting ingestion, and vice versa.
Defer rollups of raw shards if there is a rollup shard in maintenance that the raw shard would write to.
Reject live shard reconstitute requests on NNTBS or histogram rollup shards if there is a raw shard rollup in progress that would feed into them.
Fix bug where the system would report that a live reconstitute was not in progress, even when one was running.
Allow running single-shard or single-metric reconstitute on non-raw shards, even if the shard extends beyond the current time.
The reconstitute GUI no longer apppears when doing online reconstitutes.
Fix iteration bug when reconstituting NNTBS shards.
Added the
merge_all_nodes
flag to live reconstitute which causes all available and non-blacklisted write copies to send metric data instead of only the "primary" available node.Added the ability to repair a local database by reconstituting a single metric stream.
Fix bug where
/fetch
would not proxy if the data for a time period was all in the raw database, but the relevant raw shards were offline.
Changes in 1.0.1
2023-09-06
NOTE: This version updates RocksDB (raw database, histogram shards) from version 6 to 7. It is not possible to revert a node to a previous version once this version has been installed.
Add a new configuration parameter,
//rest/@default_connect_timeout
, that allows configuring the connect timeout for inter-node communication. This was previously hardcoded to 3 seconds.Graphite
series
andseries_multi
fetches now return 500 when there are no results and at least one node had an issue returning data.Graphite
series
andseries_multi
fetches now return 200 with an empty results set on no data rather than a 404.Fix bug on
/find
proxy calls where activity ranges were being set incorrectly.Add ability to filter using multiple account IDs in the
snowthsurrogatecontrol
tool by providing the-a
flag multiple times.Reduce usage of rollup debug log to avoid log spam.
Upgrade RocksDB from version 6.20.3 to version 7.9.2.
Changes in 1.0.0
2023-07-28
IMPORTANT NOTE: Any node running 0.23.7 or earlier MUST do a surrogate3 migration PRIOR to upgrading to 1.0.0. This is due to removal of support for the previous surrogate database format.
Prevent ingestion stalls by setting a better eventer mask on socket read errors.
Fix bug where surrogate deletes running at the same time as level index compactions can cause crashes.
Improve scalability of lock protecting
all_surrogates
in indexes.Fix logging of old-data ingestion.
Don't stop a rollup in progress if new data is received; finish and reroll later.
Add ability to filter by account ID in the
snowthsurrogatecontrol
utility by using the-a
flag.Fix full-delete crash on malformed tag query.
Rewrite Graphite level-index query cache to remove race on lookup.
Remove surrogate2.
Some data was hidden post arts compaction, make sure it stays visible.
Fix bug where if a fetch to the
/raw
endpoint exceeded the raw streaming limit (10,000 by default), the fetch would hang.Reduce memory usage during extremely large
/raw
fetches.Fix bug where extremely large double values generated invalid JSON on
/raw
data fetches.Handle surrogate duplicates on migration from
s2
tos3
.Require all nodes for
active_count
queries.Add back-pressure to raw puts, allows the database to shed load by returning HTTP 503.
Older Releases
For older release notes, see Archived Release Notes.
Last updated
Was this helpful?