Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate all netdata strings #13570

Merged
merged 92 commits into from
Sep 5, 2022
Merged

Deduplicate all netdata strings #13570

merged 92 commits into from
Sep 5, 2022

Conversation

ktsaou
Copy link
Member

@ktsaou ktsaou commented Aug 25, 2022

This PR replaces all char * in RRD structures with STRING *.

Once this is done in all structures and is merged, it is expected to have a big impact on the memory footprint on busy servers, since STRING * deduplicate the strings. So, all these contexts, dimensions, variables, etc that have the same name across dimensions, charts, nodes, alarms, etc will use the memory only once.

  • rrdfamily
  • rrddim
  • rrdset
  • rrdcalc
  • alarm_entry
  • alert_config
  • rrdcalctemplate
  • rrdvar
  • rrddimvar
  • rrdsetvar
  • eval_variable
  • rrdhost
  • workers_utilization

Most of the changes are trivial (mostly a search/replace).


Also, this PR replaces AVL trees with DICTIONARIES for the following:

  • rrddim
  • rrdset
  • rrdhost
  • rrdvar
  • rrdfamily

This change was required, because with the STRINGs we now don't need hashes to compare them, but in order to lookup a value in the AVL tree, we had to create a STRING, which resulted in thousands of string lookup per second on the string dictionary, which made things slower. By replacing AVL with DICTIONARIES, we can now enjoy all the benefits of string deduplication, without any performance penalties.

The following indexes were not useful and have been removed:

  • rrdhost alarms_idx_name
    This was only needed during loading from disk on startup. It was replaced with a temporary dictionary.
  • rrdhost alarms_health_log_idx
    This was only needed to find if an alarm entry of an rrdcalc is repeating. It is now turned to a flag of the alarm entries for the index is not needed.

Although linked lists should be replaced by dictionaries, which also support reference counting, this still needs some work. Still all the code base relies on linked lists.

Until we eliminate them, most of the linked lists are now double linked, to provide constant insertion and deletion times:

  • rrdhost
  • rrdset
  • rrddim
  • rrddimvar
  • rrdsetvar
  • rrdcalc
  • rrdcalctemplate
  • workers utilization

For the double linked list management, there are 5 very useful macros in libnedata.h, so this is an attempt to standardize them a bit.


The value for users is important:

image

With the current implementation, on a single node, strings are allocated 10 times on the average (6k entries, referenced 60k times). I expect the final to be around 20 times. The impact on busy parent agents will be really great.

We also saved about 200 bytes per chart, because now the RRDSET.id is not preallocated.

On busy parents we observed several GBs in memory reduction of the agent.


Other fixes

  1. Chart labels are now written to SQLite a lot faster (one query per chart, instead of one query per label).
  2. Stream compression function restored - we shouldn't have now disconnects.
  3. On streaming start, the child first sends all chart and dimension definitions, before it starts streaming metric data.
  4. Workers utilization now supports custom metric types, either ABSOLUTE or INCREMENTAL. Already used to monitor streaming receiver bandwidth and streaming sender bandwidth and circular buffer utilization ratio.
  5. Fixed bug when charts were OBSOLETEd before any data collections and immediately were freed by the service thread.
  6. RRDCONTEXT initialization are now using SQLite prepared statements to be faster.
  7. RRDCONTEXT garbage collection was found not operational. Now it runs also on every agent restart and it has been rewritten to actually cleanup in one run.
  8. RRDCONTEXT triggers are now monitored (there is a chart for this in the netdata section)
  9. STRINGs are now monitored (there is a chart for this in the netdata section)
  10. STRINGs are now lockless. Only the Judy hashtable is protected with a read/write lock - all operations on reference counters are using atomic operations to work lock free.
  11. SQLite3 queries are now monitored (there is a chart for this in the netdata section)
  12. streaming receivers were attempting to execute received commands without reason. Now they first check if there are any outstanding data for it.
  13. RRDCONTEXT was infinitely re-triggering instance and context updates when a chart without retention on a parent, was streamed from a child (BEGIN, END) without setting values. Fixed: avoided the streaming of the empty chart, rrdcontext avoided the trigger on empty instances.
  14. RRDCONTEXT hooks about collected metrics and instances have been moved after RRDDIMs and RRDSETs have data written to the database, so that RRDCONTEXT can find their retention.

@thiagoftsm
Copy link
Contributor

Hello @ktsaou,

Your PR is not in draft mode, but I can see that we have empty checkbox in the description, is the PR ready for review?

Best regards!

@ktsaou
Copy link
Member Author

ktsaou commented Aug 26, 2022

@thiagoftsm I am done I think. Let's test it and merge it, so that I can continue with other changes.

@thiagoftsm
Copy link
Contributor

@thiagoftsm I am done I think. Let's test it and merge it, so that I can continue with other changes.

Thank you for the update, I will start testing it right now. 🤝

@thiagoftsm
Copy link
Contributor

Hello @ktsaou are you going to address this warning?

@thiagoftsm
Copy link
Contributor

Hello @ktsaou ,

Except for the already mentioned warnings from Codacy, I ran the following successful tests:

  • I ran coverity scan on this branch and it did not find warnings/erros on it.
  • I did not have any coredump when I ran the branch.
  • Cloud is working as expected
  • Alerts are working as expected.
  • I used JSON exporter and it works.
  • I used the script to test API access and I have the expected results.
  • Streaming is also running as expected.

After you take decision about Codacy, we can merge the PR.

@ktsaou
Copy link
Member Author

ktsaou commented Aug 26, 2022

Thank you @thiagoftsm !

I decided to proceed a little more. There are a few changes I need to do, that require this work to be merged. So, I will push more commits and then we will review it again.

I am doing thousands of little changes. If you or anyone else can actually review the changes, it would be great! Most of the changes are simple search-replace, but in a few cases, especially conditions, some changes were required.

So, if you can review the code, please do.

@thiagoftsm
Copy link
Contributor

So, if you can review the code, please do.

Hello @ktsaou ,

While I was testing your code, I also reviewed it, right now I am missing only the last commit. I will review it and test the modified codes. 🤝

database/rrdcalc.c Outdated Show resolved Hide resolved
@thiagoftsm
Copy link
Contributor

Hey @ktsaou I used all alarms present in our repo and I did not have any issue.

I am waiting confirmation that the PR is ready to approve it.

@thiagoftsm
Copy link
Contributor

thiagoftsm commented Sep 2, 2022

@ktsaou on the same Raspberry PI I am also having this error now:

2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0032@streaming/compressio:lz4_compressor_): STREAM_COMPRESSION: Compressor Reset
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0561@streaming/sender.c  :rrdpush_sender_): STREAM potinho [send to 192.168.1.6:19999]: established communication with a parent using protocol version 5
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0753@streaming/sender.c  :rrdpush_queue_i): STREAM potinho [send to 192.168.1.6:19999]: sending metric definitions...
2022-09-02 16:52:03: netdata FATAL : STREAM_SENDER[potinho] : (0012@database/rrdset.c   :__rrdset_check_): RRDSET 'system.idlejitter' should be read-locked, but it is not, at function need_to_send_chart_definition() at line 178 of file 'streaming/rrdpush.c' # : Success

2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0031@daemon/main.c       :netdata_cleanup): EXIT: netdata prepares to exit with code 1...
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0042@daemon/main.c       :netdata_cleanup): EXIT: cleaning up the database...
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (1504@database/rrdhost.c  :rrdhost_cleanup): Cleaning up database [1 hosts(s)]...
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (1459@database/rrdhost.c  :rrdhost_cleanup): Cleaning up database of host 'potinho'...
2022-09-02 16:52:03: netdata ERROR : STREAM_SENDER[potinho] : (0223@libnetdata/locks/loc:__netdata_rwloc): RW_LOCK: failed to obtain read lock (code 35)
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0140@database/sqlite/sqli:sql_close_conte): Closing context SQLite database
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0564@database/sqlite/sqli:sql_close_datab): Closing SQLite database
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0144@database/sqlite/sqli:add_stmt_to_lis): No statements pending to finalize
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0072@daemon/main.c       :netdata_cleanup): EXIT: removing netdata PID file '/run/netdata/netdata.pid'...
2022-09-02 16:52:03: netdata INFO  : STREAM_SENDER[potinho] : (0080@daemon/main.c       :netdata_cleanup): EXIT: all done - netdata is now exiting - bye bye...
EOF found in spawn pipe.
Shutting down spawn server event loop.

When I disable streaming I can run netdata on Raspberry PI (@stelfrag considering that child sends labels to parent, this could be a clue to help fix the previous).

@ktsaou
Copy link
Member Author

ktsaou commented Sep 2, 2022

Thank you @thiagoftsm ! Good find!

I updated need_to_send_chart_definition() to use the dictionary of dimensions and eliminate the need for a rrdset lock.

@ktsaou
Copy link
Member Author

ktsaou commented Sep 2, 2022

@thiagoftsm can you please check again if the previous crash of rrdeng_generate_legacy_uuid() is still reproducible?

@thiagoftsm
Copy link
Contributor

@ktsaou a quick update:

  • The issue with streaming was fixed and I have my Raspberry working as expected.
  • The other issue did not happen with the same high frequency of the previous, but the tests that I ran until now are not showing it anymore, so probably it is also fixed. I will continue testing to be sure that this is not happening and I will bring another update in few hours for this.

@thiagoftsm
Copy link
Contributor

@ktsaou the issues were fixed, I will rebase the branches and start another round of test.

@thiagoftsm
Copy link
Contributor

@stelfrag and @ktsaou, yesterday I ran another round of tests after commits, and I observed system stable and running fine on arm and x86_64, but I did not retest all possible alerts and access to different API endpoint, please, when you finish the development, please, write another comment for me to retest everything again to be sure we are not missing nothing. Thanks and have a nice weekend!

… lockless when they dont need to touch Judy - only Judy is protected with a read/write lock
…ed; string_length() renamed to string_strlen() to follow the paradigm of all other functions, STRING internal statistics are now only compiled with NETDATA_INTERNAL_CHECKS
@ktsaou
Copy link
Member Author

ktsaou commented Sep 4, 2022

Guys, I don't have any open issues about this PR.

Please review it. If no problem is found I want it merged today.

@thiagoftsm
Copy link
Contributor

Guys, I don't have any open issues about this PR.

Please review it. If no problem is found I want it merged today.

All right, it will be my high priority for today/tomorrow (September 5th). 🤝

@thiagoftsm
Copy link
Contributor

thiagoftsm commented Sep 5, 2022

Considering I am going to do different tests during the whole day, I will use this comment and I will be updating it until the tests end.

Compilation on different platforms

This PR was tested with success on:

Arch Operate System
x86_64 Slackware current
x86_64 on Vagrant VM Ubuntu 22.04
x86_64 on Vagrant VM Alma 9
x86_64 on Vagrant VM Debian 11
x86_64 on Vagrant VM FreeBSD 13.1-RELEASE-p2
ArmV7 Raspberry PI OS

Streaming

The streaming worked as expected on the following environment:

Arch Operate System Is child? Is compression enabled?
x86_64 Slackware current no yes
x86_64 on Vagrant VM FreeBSD 13.1-RELEASE-p2 yes no
ArmV7 Raspberry PI OS yes yes

I also tested in the scenario where parent did not have compression:

Arch Operate System Is child? Is compression enabled?
x86_64 Slackware current no no
x86_64 on Vagrant VM FreeBSD 13.1-RELEASE-p2 yes yes
ArmV7 Raspberry PI OS yes yes

I am rebasing branch now with latest @stelfrag commit before to proceed.

After the PR to be merged, I began to use master branch for next tests:

API access

I used our script to access thousands of URLs and everything worked as expected.

Health

All alerts present in this link were raised as expected.

Exporter

The exporting engine is working as expected, I used json and prometheus to test it.

@ktsaou ktsaou merged commit 5e1b95c into netdata:master Sep 5, 2022
@ktsaou
Copy link
Member Author

ktsaou commented Sep 5, 2022

It is merged guys! Thank you!

@thiagoftsm
Copy link
Contributor

@ktsaou and @stelfrag I am still testing this branch, I won't approve it, but I will update the comment #13570 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants