Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device limit and memory discussion for 1.8.1-dev #684

Closed
mgeramb opened this issue Nov 9, 2023 · 63 comments
Closed

Device limit and memory discussion for 1.8.1-dev #684

mgeramb opened this issue Nov 9, 2023 · 63 comments

Comments

@mgeramb
Copy link

mgeramb commented Nov 9, 2023

I using know the 1.8.1-dev version for some days without issue. But I have not so much success with increasing the number of devices.

Your limit with 80k seems to be really the limit where everything works fine. Now I have analyzed this limit a little bit.
I checked the maximum used heap and found that the 40k is the lower water mark which left free. So there should enough memory available for more devices. Next step was checking the maximum free block size, which is also about 40k. This means the memory is fragmented and a large block greater than 40k can not be reserved. For me this means, that Homekit seems to be use very large blocks of memory for answering the web request. So I take a look in the code and I found maybe the confirmation for this. There is one big block for creating the json for all devices. Then as second block will reserved which receives the encrypted body.

My theory now is, that it should be possible to reduce the memory consumption for this a lot with the following idea:
The library should reserve the memory for one encryption frame size. Then the loop starts through the devices. In this loop the memory for the body for only one device is needed. This memory will be copied to the buffer for encryption, if the frame is full the body part should be sent with chunked encoding to the network.

Do you think that my analyze is correct? Do you think that a change like this would be possible? Are you interested in such an change and would you support me?

Best,
Michael

@HomeSpan
Copy link
Owner

These are good suggestions. What should be relatively straightforward is for me to change my sendEncrypted() function so that it sends each frame as formed, rather than created a large buffer to store all frames at once. However, creating the initial buffer one Accessory at a time will be quite intrusive and would require a lot more changes because sendEncrypted() is a generic function used for all communication back to HomeKit - I would need to create a special set of routines that blended the JSON creation and Encryption at the same time. I'll start with the first change and let you know when ready for testing.

@mgeramb
Copy link
Author

mgeramb commented Nov 11, 2023

I have already an idea how the Accessory handling can be done. A context object should be created for this task. This object can handle the header, footer and sending the data. So, before the loop through the accessory starts, the context object has to be created. It will immediately create the header in a buffer. Then the loop starts and the context object will passed to the accessories. Each accessory will put the needed data to the context object which fills it in its internal buffer. Every-time if the context object has collected enough data, it will encrypt the data and send it to the network. If you want I could support you with the development, but I would need some hints where I found the loop through the accessories and information where you thing that such large buffers are created so that is worth to implement this mechanism.

@HomeSpan
Copy link
Owner

I've update release-1.8.1-dev to make the change I discussed above, which saves 15-20 KB for a device with 50 simple lightbulb Accessories. The size of the temporary buffer to store encrypted frames is now set to the minimum size needed for a single frame (which will be 1042 or fewer bytes). Please give it a try.

Operating on an Accessory-by-Accessory basis is going to be much more intrusive, and given other priorities and requests I'm not going to tackle, at least at this time. However, please feel free to give it a go if you'd like. If you look at the 'd' Serial Monitor Command around line 629 of HomeSpan.cpp you'll see the start of how the Accessory attributes JSON is constructed.

@mgeramb
Copy link
Author

mgeramb commented Nov 11, 2023

This is already a big step forward. I'am really suprised, also the standby memory consumption is lower, have anything else changed? These screenshots are with 38 channels mixed devices, it was the limit before:
Previous version:
image

New version:
Free Heap 89268

Especially the lower water mark is much more better!!!

Now I'am going to find the new limit 😀

@mgeramb
Copy link
Author

mgeramb commented Nov 11, 2023

50 devices and not limit in sight:
image
The interesting thing is, which function has used the largest buffer size.
It would be nice if you could take over my change to get the statistic: #685

@HomeSpan
Copy link
Owner

Further update: I changed the method by which the lists of required and optional Characteristics for each Service are stored. I was using C unordered_set containers but found that switching to C vector containers is much more memory efficient. For a device with a small number of Accessories the saving is not much. But for a device with many Accessories the memory saving is quite extensive. Changes are committed to the release-1.8.1-dev branch. Let me know if it makes a difference in your sketch.

I also added some error checking for NVS storage, which starts to become a limiting factor if you have a lot of Accessories and they have Characteristics with NVS storage set to true.

@mgeramb
Copy link
Author

mgeramb commented Nov 13, 2023

Hi, I have take over your changes in my branch. For moment it look very good to me. Lot of memory saved:
image
But regardless of your change, yesterday I found maybe that there is another unknown limit. With 50 devices everything worked fine, with 55 devices the software becomes unstable. I did not get anymore updates in Homekit and the logs show random reconnects to the WLAN. But no other error messages. Currently I have no idea where it comes from, but I don't think that this was an memory issue because the maximum free heap block and the lower heap limit was stable.

@mgeramb
Copy link
Author

mgeramb commented Nov 13, 2023

Today with the new version it seems to work fine up to 55 devices. With 60 devices it works only with one controller. And I got an error of a stack overflow. Unfortunately it seems that there is not easy way to change the stack size for the loop function with the arduino framework. But anyway, 55 devices are a big step forward and I expect that depending on the other SW running on the microcontrol, more devices should be possible. My program needs also a lot of RAM for the configuration of devices. These are my current statistics:
image

@HomeSpan
Copy link
Owner

Note that if you use homeSpan.autoPoll() instead of homeSpan.poll() (see API for details) you can set your own stack size. Of course this reduces heap memory, but if you are limited by stack size this may be an option.

@mgeramb
Copy link
Author

mgeramb commented Nov 14, 2023

This is might be a good option. But I currently do not understand the relation between stack usage and the number of accessories. I have only seen lists with loops through the accessories, so the stack consumption should not increase with the number of devices. Only if there would be a link between the accessories and a function call goes recursive through this links, the stack will be effected. So maybe the error is wrong and just the memory becomes corrupted somehow.

But independent of this, I can confirm that the new version with your and my changes runs stable since one day in a production environment. No memory leak, everything is fine and responsive.

@mgeramb
Copy link
Author

mgeramb commented Nov 18, 2023

I did now a lot of analyze. I found out, that the stack is a really a critical factor:

With 55 devices only 156 bytes free stack, but this seems to be stable.
I tried to use 58 devices, this basically works but sometimes I notice Wifi connection problems.

Then I checked the relation between stack usages and number of devices:

As you can see, there is a relation. For me the question is, why. All devices objects are created on heap. So something seems to happen in the web calls.

Then I checked if there is a relation between stack usage and Controller request:
Before controller request:

After controller request:

Maybe you have an idea what happen inside of the request, is there anywhere a linked list where on object is calling the next?

My next test was increasing the heap:

From perspective of memory, now it should not be a problem to create more devices, still 40k available on heap and enough memory available on the stack. But unfortunately it was not. A test with 60 devices fails:

I really do not understand what happens here, because the lower mark of the heap was always 40948 bytes.
With 58 devices I get errors like this:
image

It seems that 55 is the limit for a stable version.

@HomeSpan
Copy link
Owner

How are you determining the size of the stack? Also, what does "Used Channels" represent in your diagnostics?

@HomeSpan
Copy link
Owner

Are you using NVS to store Characteristic values? If so, this may be a limiting factor (you can turn this off by not providing a second argument to Characteristic instantiations).

@mgeramb
Copy link
Author

mgeramb commented Nov 18, 2023

"Used Channels" are the number of accessories, to get the stack watermark I use the function uxTaskGetStackHighWaterMark.
I'am looked in the code and assumed the the vector is maybe causing the large stack usage. I'am currently try to replace them to see if this is the issue. I'am not using NVS.

@mgeramb
Copy link
Author

mgeramb commented Nov 18, 2023

I did not find any issue and I did not find the reason for the large stack usages of HomeSpan.
I use now the autoPoll function with a 9216 bytes of stack size.

The autoPoll task needs 5736 bytes with 20 accessories :
Used channels 20

The autoPoll task needs 7448 bytes with 55 accessories :
Used channels 55

I tried a lot of different stack sizes, but always the same, with 58 accessories the wifi get sometimes lost. I don't think, that the stack is the limit, but I also to not believe that this extensive stack usage is normal, especially as it grows with the number of accessories.

Nevertheless , all the improvements are a step forward, please let me know how we continue with my PR. If you think that it can be used, I will to a cleanup to remove all debug information from the TempBuffer.

@frankonski
Copy link
Contributor

The error you are seeing, write(): fail on fd 54, errno: 11, "No more processes" has been documented in the issue at espressif/arduino-esp32#6129 , where this comment mentions it is fixed by this PR, which has been pulled in version 2.0.14 of arduino-esp here

Are you using arduino-esp32 version 2.0.14?

@mgeramb
Copy link
Author

mgeramb commented Nov 18, 2023

@frankonski I'am using currently 2.0.9. I will try to update
@HomeSpan I found out, that if a little bit more heap can be freed, 58 devices gets stable. Because vectors always double the reserved space, I trimmed now the vectors to the required size by adding lines like this:

  std::vector<SpanService*>(linkedServices).swap(linkedServices); // shrink vector to fit

This saves about 1k memory in my case

@HomeSpan
Copy link
Owner

I have an ESP32-S2 from Unexpected Maker in the Adafruit Featherboard configuration. It has an additional 8MB of PSRAM. I have been able to run up to 80 simple accessories as a bridge (just a lightbulb with On/Off). I'm tracking the high watermark for the stack (using just poll(), not autopoll()) and it is NOT affected by the number of Accessories (it gets as well as about 2,924). I then modified my sketch to include the Brightness Characteristic. The stack usage did not change, but heap memory did and I was not able to run as many Accessories (at least 60, but it dies at 80).

The ESP32-S2 is not throwing any errors: it basically just freezes when (or right after) trying to transmit the Accessories data to the Home App.

So this strongly suggests it's not a stack issue, but it is some sort of memory issue. I'm using 1.8.1-dev as per my latest posting in GitHub, which does not include any of your Context PR code. If you are seeing changes in the stack usage as a function of the number of Accessories this may mean it's something in your Context code?

Though I have 8MB of PSRAM, I am still running out of memory. By separately tracking low points of total and internal RAM separately, I can see that I am drawing down on BOTH the PSRAM and internal RAM, and the internal RAM gets quite low (10 kB at the minimum). This low-point DOES (roughly) track the number of Accessories.

From the ESP32 documentation of PSRAM, I know that not all processes can use that memory, especially those that use interrupts (which need to be in main RAM). To force usage of PSRAM, I changed the malloc() call in TempBuffer to ps_malloc(), though this did not have any impact on low internal RAM points.

My guess is that the vectors I'm using to store the Accessory/Service/Characteristics are perhaps being pulled from internal RAM, and this is drawdown causes internal RAM to get low enough that other processes that MUST use internal RAM are failing.

To see if that is the issues, I am going to force my vectors to use PSRAM instead. The std::vector class allows you to define your own memory allocator, and I've seen others post some example code of how this is done. Will give it a try and see what I may learn.

@mgeramb
Copy link
Author

mgeramb commented Nov 19, 2023

@frankonski Do you know how the arduino-esp32 version can be specified in a platformio project? I tried to upgrade the platform to the newest espressif32 version, but the newest arduino-esp32 version I get is 2.0.11

@mgeramb
Copy link
Author

mgeramb commented Nov 19, 2023

@HomeSpan I do no large allocation on the stack in my branch, but maybe this issue is somehow related to the arduino-esp32 version or something in the web client implementation. Maybe you can try my branch with your sketch, I would expect that this should improve the temporary heap usage for request to a minimum. Also would it be helpful to see if you run in the same stack issue.

@mgeramb
Copy link
Author

mgeramb commented Nov 19, 2023

@HomeSpan I have written now a new dyamic array pointer implementation. This are the features:

  • Most of the functions of vector supported (Only the find_if is currently not yet possible)
  • If only one element is stored in the array, no extra heap is allocated
  • It does never reserve more memory then needed (This means, that is should not be used for lists which are changed frequently, but this should be find for the lists with characteristics, services and accessories)

You can find the implementation of the dynamic pointer array in my PR #686

My app is now running with 60 devices, this was my target and I will run it now for longer time to see if it is stable, for now it seems to be.

@frankonski
Copy link
Contributor

frankonski commented Nov 19, 2023

Do you know how the arduino-esp32 version can be specified in a platformio project?

No idea, I don't use PlatformIO. However, I do use VSCode with some personal touch. Extensions installed:

  • Arduino (installed but not used to build)
  • C/C & extension pack & theme
  • Espressif IDF (installed, but not used to build)

I have Arduino IDE 2.x installed, that I launch from time to time to get automatic updates, which gets installed in C:\Users\<user>\AppData\Local\Arduino15\packages\esp32\hardware\esp32\2.0.14\.

I have arduino-cli.exe copied locally, which automatically uses whatever version of arduino-esp32 is installed in AppData.

I have a Compile task defined in tasks.json that simply kicks off arduino-cli.exe with my .ino project in verbose mode.

Ctrl-Shift-B compiles everything, in full scrooling-realtime verbose mode. I don't like the Arduino VSCode extension as even with verbose turned on, it only spits out the output at the very end when everything is complete (or failed). I prefer seeing the output scrolling during the build.

Also, having ESP-IDF extension installed makes serial monitoring SO better than the Serial Monitor from MS, as the MS one seems to buffer and lag, while using ESP-IDF extension, the serial monitor kicks off a terminal, so no lag when there is lots of output.

Lastly, I have another task that runs espota.py to do OTA update.

Bottom line:

  • Ctrl-Shift-B to compile
  • Ctrl-Shift-9 to flash OTA

@HomeSpan
Copy link
Owner

@mgeramb, I'm glad you've been able to get 60 devices running, but with regards to the PRs, I'm uncomfortable adding them to the production branches at this time. The techniques you used are really interesting but they add a lot of complexity I'm not sure I'll be able to effectively support at this juncture. I'd encourage you to keep the PRs active so there is a good record of the changes, but very shortly Espressif is going to release version 3 of the Arduino-ESP32 Board Manager for support of the H2 chip and thread. According the Espressif, this version has a lot of breaking changes which will likely require substantive modifications to HomeSpan just to get it to work without even using anything related the thread or the H2. I want to make sure the base code stays as clean as possible with only incremental changes until at least then.

I'll continue exploring other memory issues as well, but will try to avoid major changes for now.

@HomeSpan
Copy link
Owner

To close the loop on my PSRAM comment above, by using a custom allocator for vector that forces allocations from PSRAM, I can run 90 lightbulbs with a on/of and dimmer characteristics. The purpose of this test was to verify that PSRAM can be effectively used to extend the memory. It can, but the automated PSRAM handling by Espressif requires you to override malloc calls in various places with ps_malloc, including by using a customized allocator for vectors (that itself uses ps_malloc. By doing so, you keep enough RAM free for WiFi and other calls that cannot (or at least do not) use PSRAM.

@frankonski, this might help with some of your PSRAM questions/issues.

(I may try to make a generic PSRAM switch based on the whether a device has PSRAM).

@frankonski
Copy link
Contributor

I'm still reading & learning about the different memory models & types, even as of few minutes ago. So far, my recent understanding is that SPIRAM and PSRAM is the same thing. Using ps_malloc() seems to explicit allocate memory from PSRAM. Usin malloc() is the same as heap_caps_malloc_default(), which internally uses MALLOC_CAP_DEFAULT.

ok.

Now if calling heap_caps_get_info(&heapInfo,MALLOC_CAP_DEFAULT) and logging (heapInfo.total_free_bytes heapInfo.total_allocated_bytes) / 1024 gives me 4345 Kb (for total DRAM). I can assume that the full PSRAM/SPIRAM is available as heap memory in my case.

For as long as HomeSpan uses malloc(), things should be fine and use both internal and SPIRAM automatically. My build output shows -DBOARD_HAS_PSRAM by default, as I selected the board as being the WROVER-E. One thing I noticed: it looks like mBedTLS will use internal RAM by default (source) unless you change sdkconfig to use CONFIG_MBEDTLS_DEFAULT_MEM_ALLOC=y.

HomeSpan uses mBedTLS do some crypto-related operations. If mBedTLS is set to use internal RAM, the larger the data to encrypt (the more accessories you have?), the larger internal RAM would be used, right? Unless that setting above is used, technically speaking.

@mgeramb
Copy link
Author

mgeramb commented Nov 19, 2023

@frankonski

HomeSpan uses mBedTLS do some crypto-related operations. If mBedTLS is set to use internal RAM, the larger the data to encrypt (the more accessories you have?), the larger internal RAM would be used, right?

No, this is not exactly how it works. HomeSpan has to create a json for each calls from the HomeKit controller which contains infos about all accessories. This json file becomes bigger and need more temporary free memory on the heap if more accessories are defined. After this operation the buffer is divided in 1024 byte blocks which will be encrypted. So, this means, for encryption the needed buffer always have a maximum size of 1024 bytes. But in the current released version, again a large buffer is created with hold all encrypted parts for sending it afterwards to the network. The change in the 1.8.1-dev branch from @HomeSpan is, that the encrypted buffer will be immediately sent to the network. This saves memory for the large sending buffer. But for the json is still a large temporary buffer needed.
This is the place where my PR #686 come in place. It uses a fixed buffer size of 1024 for creating the json, every time if this buffer gets full it will be immediately encrypted and sent to the network. This saves the large temporary buffer for storing the complete json.
Another issue I found out is, that the vector implementation always doubles the reserved memory if the internal buffer becomes full. But a lot of vector lists are container lesser elements than the reserved memory provides. This is the second thing which I address in my PR. A created a own list implementation, with nearly 100% API compatibility to the vector, so a simple replace in the declaration is needed, only view places which are using find_if have to be rewritten, at least for now. And the memory management is quit simple, so an advancement for PSRAM could be done, but in my option, the larger buffers are needed for the elements like services, characteristics,... This means, maybe using the PSRAM in the TempBuffer could already be enough.
My board has no PSRAM, this means I can only use the normal RAM, but still with them, I am currently working with 60 accessories of different types.
This is my memory usage (My other application, a KNX Stack uses also a lot of memory). On heap is always approx. 52k free space and the main stack has also 3k free memory, but I hat to increased the memory for the autopoll task from 8 to 9k. Now the setup seems to be stable.
image

@HomeSpan
Copy link
Owner

One area where encryption memory is wasted in the SRP6 code. Here I pre-define a lot big integers, but none of this is needed after the initial pairing (SRP is only used to pair). In my server branch I re-factored this so that big integers are only initialized when needed and then destroyed right after.

There is also some wasted space in the TLV implementation that I also re-factored in the server branch.

I will eventually bring both of those into the main branch but am trying to balance new functionality versus optimization (e.g. I'm hoping to work on implementing descriptive constants for each of the Characteristics (per a separate thread), which will help new users who don't have access to the original HAP document (which Apple has pulled from their web site).

@mgeramb
Copy link
Author

mgeramb commented Nov 20, 2023

'm glad you've been able to get 60 devices running, but with regards to the PRs, I'm uncomfortable adding them to the production branches at this time. The techniques you used are really interesting but they add a lot of complexity I'm not sure I'll be able to effectively support at this juncture.

@HomeSpan I completely understand you, that you feel you not comfortable with the PR. The PR is just a Draft, Proof of Concept and discussion starting point and should currently not be merged. But let me explain the concept behind it, basically it should make the code much more easier than it is currently.

Simple example:

Current code:
nBytes =snprintf(cBuf,cBuf?64:0,"{\"accessories\":[");
compared to future code:
callContext->print("{\"accessories\":[");

As you can see here in this example, the second version is much more easier, no checks for null for cBuf, no counting of bytes anymore. Just write out what you want, including all of the formatting feature of sprintf. And the output can simple be going to a memory buffer, the console, the network, or maybe also encrypted to the network. So, where and how comes this magic?

I used a very common pattern in the SW development which is used in a lot of high performance implementation. I give you an example. I'am sure you know a language with have some kind of JsonConverter. It works like this, that it has a memory buffer inside to hold the complete json. This is good for small object, but for very large objects it will not work in this way. In this situation you normally use class like JsonWriters. A json writer just provides an interface for writing out date needed Data. But the json writer does not hold the complete json, instead it writes out the data to some target, and that is the interesting thing here, the target can be a network, a file or whatever you want where you need the data.

You also will see similar concepts for XML, Streams, StringWriters, filter chains, etc

Exactly the same I'am doing here. I have created an basically abstract interface class for writing out data with printf and print functions. Currently I named it CallContext, maybe not a good choice, maybe DataWriter or something else would be the better naming.
All the functions which have to write some data, just use this simple interface. Your objects do not know, what happens behind this, that is the nice thing in this story. Because of this, there is already a simplification of the code. You can not see this in my current PR, because I only have added the context class in one place, just to have a proof of concept and in my PR your code looks more ugly then it was before because I have added a lot of if statements to keep the code downward compatible. But in the future, all the if statements can be removed and the context class should be forwarded to all inner classes instead of the cBuf, and not in addition how it is now.

The next good thing in this concept is, that you can now decide completely independent what should happen with the data which is provided by your objects. Just by deriving a new class of the CallContext, you remember, it is basically just an interface (in my PR it has a little bit more code, but maybe this should be removed). In the derived class, you can decide what you want to do with the data, e.g. storing it in a large buffer as it was before or even better, directly write it out the network. Or you want to give out the data on the console? Just derive from the CallContext, implement one function which writes out the data to the console, or maybe you have to calculate a hashcode? Just derive a new class, implement one function with the hash calculation.

Let me summarize all this things:

  • Easier code in your objects without checking for nulls or counting bytes
  • Open for future needs because of the separation and abstraction what should be done with data
  • And last but not least, less memory consumption because there is no need to hold all the data in the memory.

Just check in your code snipped with your 90 accessories, how many bytes you need here in the heap:

int nBytes = homeSpan.sprintfAttributes(NULL);        // get size of HAP attributes JSON
TempBuffer <char> jBuf(nBytes 1);
homeSpan.sprintfAttributes(jBuf.get());

My implementation will need a maximum of 2048 bytes for buffering and encryption.

I expect, you just get a complete wrong picture of my PR because the integration is not yet fully done and your code becomes for this reason very ugly. If you want, I could provide on example how your code would be after the fully integration.
And don't forget, it is a starting point for discussion, we can discuss the naming, the needed interface and what kind of derived class of the CallContext are need.

And one more, maybe important information for your decision. I need this implementation for another large open source project, the (OpenKNX)[https://github.com/OpenKNX] project where I'am working on a (KNX-Homekit bridge)[https://github.com/OpenKNX/OAM-SmartHomeBridge]. So it is not only for my private usage, a lot of user will benefit from the improvement which makes it worth for me to spend so a lot of time on it.

If you have any question to understand the concept, please ask me, or if you want we can also make a remote session, it is important that you understand everything.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 2, 2024

Happy New Year!

I finally finished my refactoring of the TLV and SRP routines and have been able to re-focus on the memory-limitations above. Though the solution you proposed using contexts would work, I believe there is a much easier way.

The root of the issue is that I am using and sprintf(), which is arguably an old C-style of creating formatted text. Instead, I should be using standard C streams, such as cout << "x=" << x. This is because C streams have built-in buffering mechanisms that can be customized and overridden to our advantage.

To solve the problem, I replaced all my main sprintf() calls that produce JSON text with cout << statements. This actually makes the code simpler and easier to read.

Then, I created a customized stream buffer holding 1024 bytes. Whenever the buffer fills, it flushes the output to:

  • the Serial Monitor, with or without indented formatting AND/OR
  • an HTTP client, with or without HAP encryption

Rather than make these changes to the cout stream, I created a new stream called hapOut that inherits from std::ostream so offers all the same functionality as cout, but uses customized buffer logic. All the logic is at the bottom of Hap.h and Hap.cpp.

As you can see, it's a really small amount of code. But it is generic and it works for everything from printing the Accessory database to the screen, to producing HTTP web logs, to transmitting all HAP HomeKit data using the proper encryption keys for each HAP Client.

I've tested this with up to 70 Accessories and it works fine. You'll note that regardless of how large the Accessory database is, it never uses more than a few KB when printing to the screen.

The only issue I have found is that chunking the data into 1024 bytes packages and transmitting in a loop to a WiFi client can overrun the ESP32 WiFi buffers -- the symptoms are usually that the WiFi connection drops and can't re-connect. These are internal to the IDF and there is not much I can do to check if the transmission is completed before sending the next packet. To address this I added a 50ms delay after each transmission, which seems to solve the problem for up to 70 Accessories, and does not negatively impact operations with the exception that the weblog page take a little longer to load if you have 200 weblog records. However, I would like to find a more robust solution. Decreasing the size of the buffer seems to make things worse, not better, but I can't increase the size beyond 1024 if I want to continue encryption on the fly. To speed up the weblog, I may add a second buffer with a larger capacity that is used for unencrypted transmission.

Please give it a try and let me know what you think. I've added a simple example called Max Accessories to help stress the system.

I have not yet checked on a chip with a large amount to PSRAM to stress with 149 Accessories.

@mgeramb
Copy link
Author

mgeramb commented Jan 2, 2024

Very interesting news. But please check the heap usage before you start big changes in the code. In my tests the heap usages was very high for 149 devices and I do not know why. My only idea is, that maybe a base library like the webclient do some recurisve calls if the sending buffer gets filled faster than it could be sent.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 2, 2024

Yes, of course, all this was checked and usage is lower for like-to-to like comparisons. But I can push the envelope a lot further and that's where I'm trying to determine how much I can squeeze out of the memory before hitting a wall.

Also, the changes to recursive encrypted transmission code was made a while back. The specific change here is that the data prepared for the transmission is also built on the fly (which takes a little longer) instead of having it all prepared in a large buffer ahead of time. I thought avoiding that large buffer was the main point of the context code. That's basically what I implemented with streams.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 2, 2024

My understanding is that when you transmit data over a socket, the WiFi client, dynamically pulls heap memory for 2 copies of the data - one for the LwIp processing, and one for the MAC process. There are some configurable options if you compile from the IDF, but not from the Arduino-Esp library.

@mgeramb
Copy link
Author

mgeramb commented Jan 3, 2024

That is exactly what I was wondering, nothing should be allocated on the stack. Anyway, I will try your version give you feedback

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 3, 2024

I figured out the issue I was seeing with my new code. The code is operating as expected - I was measuring the wrong heap. I was using MALLOC_CAP_INTERNAL to monitor memory consumption, which includes IRAM. But IRAM can't be used for data unless guaranteed to be 32-bit aligned (such as array of uint32_t).

Hence, I thought I was seeing failures even though there was still some memory available, which was puzzling. But in reality I was using all the DRAM, so hit the limit I expected to hit.

I will expand the output of the 'm' command so the memory available for each heap type is clear and unambiguous.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 3, 2024

Should have mentioned above that this also addressed the need for a 50ms delay after each transmit. A delay of 1ms works fine.

@mgeramb
Copy link
Author

mgeramb commented Jan 3, 2024

My first test result with the new version: everything works like in my last version base on commit [beef66e]
The memory consumption on Heap, PSRAM and Stack is the nearly the same (Test with 79 devices in productive environment). I will go now with this version for a while to see the long time stability and behavior.

@mgeramb
Copy link
Author

mgeramb commented Jan 5, 2024

Hi, I afraid there is a major bug in the new version.

Everything works fine the last two days. Today I made a reboot of the device, but it comes not up again. In the terminal I saw this error happens, immediately after starting:

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 271414342, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0030,len:1184
load:0x40078000,len:13260
load:0x40080400,len:3028
entry 0x400805e4
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4008b1f0  PS      : 0x00060330  A0      : 0x801b5260  A1      : 0x3ffe3b40  
A2      : 0x00000000  A3      : 0x00000000  A4      : 0x000000d8  A5      : 0x00000000  
A6      : 0x56780403  A7      : 0x0000000d  A8      : 0x3ffc9ef8  A9      : 0x00000003  
A10     : 0x3ffc9b20  A11     : 0x3ffc9ef8  A12     : 0x3ffe3ac8  A13     : 0x3ffbb096  
A14     : 0x00000000  A15     : 0x3ffe3ae0  SAR     : 0x00000019  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x00000000  LBEG    : 0x4008b1f0  LEND    : 0x4008b1fb  LCOUNT  : 0x0000000c  


Backtrace: 0x4008b1ed:0x3ffe3b40 0x401b525d:0x3ffe3b50 0x4011a646:0x3ffe3b70 0x400fc03d:0x3ffe3ba0 0x4010e26a:0x3ffe3bd0 0x4010e2fb:0x3ffe3c00 0x4016cf73:0x3ffe3c20 0x40083415:0x3ffe3c50 0x40079306:0x3ffe3c90 |<-CORRUPTED

  #0  0x4008b1ed:0x3ffe3b40 in memset at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/machine/xtensa/memset.S:143
  #1  0x401b525d:0x3ffe3b50 in mbedtls_sha512_init at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/mbedtls/port/sha/parallel_engine/esp_sha512.c:106
  #2  0x4011a646:0x3ffe3b70 in HapOut::HapStreamBuffer::HapStreamBuffer() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HAP.cpp:1612 (discriminator 1)
  #3  0x400fc03d:0x3ffe3ba0 in HapOut::HapOut() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HAP.h:216 (discriminator 2)
  #4  0x4010e26a:0x3ffe3bd0 in __static_initialization_and_destruction_0(int, int) at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:48
  #5  0x4010e2fb:0x3ffe3c00 in _GLOBAL__sub_I_hapOut at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:2441
  #6  0x4016cf73:0x3ffe3c20 in do_global_ctors at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/startup.c:201
      (inlined by) start_cpu0_default at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/startup.c:444
  #7  0x40083415:0x3ffe3c50 in call_start_cpu0 at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/port/cpu_start.c:630

After several retries, I switched back to the commit beef66e, the devices comes up again and worked normally. I tried several reboots, all worked fine.
So I tried again with the lastest dev-1.8.1-dev branch.
The same error occured.
Switching back to commit beef66e and it was fine again.

It's not clear for me what happened, because the latest version worked already for several days. I hope the stack trace help to find the issue. Let me know if I can help with testing somehow.

I'am using the Adafruit ESP32 Feather V2, 8MB Flash 2MB PSRAM board with active PSRAM.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 5, 2024

Thanks very much for this --- I'll see what I can find based on the stack trace.

@frankonski
Copy link
Contributor

ditto. I switched back to the release-1.8.1-dev branch, got a crash I didn't copy down. Did a git reset to beef66e, then right at boot this is what was logged:

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0030,len:1344
load:0x40078000,len:13964
load:0x40080400,len:3600
entry 0x400805f0
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4008c003  PS      : 0x00060130  A0      : 0x8013f364  A1      : 0x3ffe3b90  
A2      : 0x00000000  A3      : 0x00000000  A4      : 0x000000d8  A5      : 0x00000000  
A6      : 0x00060320  A7      : 0x0000000d  A8      : 0x3ffc9d90  A9      : 0x00000003  
A10     : 0x3ffc99b8  A11     : 0x3ffc9d90  A12     : 0xb6eea556  A13     : 0x3ffc602c  
A14     : 0x007b7d48  A15     : 0x003fffff  SAR     : 0x00000019  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x00000000  LBEG    : 0x4017bad8  LEND    : 0x4017baef  LCOUNT  : 0x00000000  


Backtrace: 0x4008c000:0x3ffe3b90 0x4013f361:0x3ffe3ba0 0x400d7b3d:0x3ffe3bc0 0x400db8b9:0x3ffe3be0 0x400de55c:0x3ffe3c00 0x400f6c3f:0x3ffe3c20 0x40083be5:0x3ffe3c50 0x40079316:0x3ffe3c90 |<-CORRUPTED




ELF file SHA256: 370fc237c7b931df

E (532) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0

Then after rebooting from that exception, it is like the board rebooted using a previously-working partition. I had branched weblog-tls from e715d67, so I know this last one is good.

@mgeramb
Copy link
Author

mgeramb commented Jan 5, 2024

That is interesting, for me working (!!!) one is the beef66e. This version is crashing on your site?

The none working is the latest on release-1.8.1-dev 4269eca

@mgeramb
Copy link
Author

mgeramb commented Jan 5, 2024

From reading the stack trace, I would expect the bug in the commit 1f13906
Should I verify this assumption? Do you know if the commit before was a basically working version?

@mgeramb
Copy link
Author

mgeramb commented Jan 5, 2024

The f2cb880 shows an dump in the autoloop function

 #0  0x4008b089:0x3ffec8e0 in memcpy at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:163
  #1  0x401e3be5:0x3ffec8f0 in std::char_traits<char>::copy(char*, char const*, unsigned int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/char_traits.h:352
      (inlined by) std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/streambuf.tcc:90
  #2  0x4022e171:0x3ffec910 in std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/streambuf:458
  #3  0x401f12dd:0x3ffec930 in void std::__ostream_write<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/ostream_insert.h:50
  #4  0x401f1379:0x3ffec950 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/ostream_insert.h:101
  #5  0x401f1429:0x3ffec980 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/ostream:564
  #6  0x40103b3a:0x3ffec9a0 in Span::printfAttributes(int) at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:1245
  #7  0x40118e27:0x3ffec9e0 in HAPClient::getAccessoriesURL() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HAP.cpp:903
  #8  0x40115b9d:0x3ffeca30 in HAPClient::processRequest() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HAP.cpp:249
  #9  0x400fded1:0x3ffeca90 in Span::pollTask() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:270 (discriminator 5)
  #10 0x400debd4:0x3ffecb90 in Span::autoPoll(unsigned int, unsigned int, unsigned int)::{lambda(void*)#1}::operator()(void*) const at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.h:384 (discriminator 1)
  #11 0x400debf5:0x3ffecbc0 in Span::autoPoll(unsigned int, unsigned int, unsigned int)::{lambda(void*)#1}::_FUN(void*) at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.h:387

But this crash does not stop my device.

The commit 1f13906 crashes immediately after device start we the same stack trace which I posted in my previous posting.

@mgeramb
Copy link
Author

mgeramb commented Jan 5, 2024

I have tested now several time. The commit f2cb880 is the first one, which shows an exception.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 5, 2024

Are you both using PSRAM?

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 5, 2024

In a working version can you erase the WiFi using the 'X' command - this will prevent any communication after rebooting, which means none of the new code will be called. The only new code used will be the constructor for hapOut, which will narrow things down.

@mgeramb
Copy link
Author

mgeramb commented Jan 5, 2024

Did some testing with the latest version which normally crashed immediately. I changed in the constructor of HapOut::HapStreamBuffer the allocation from PSRAM to HS_MALLOC, because I saw the crash happens in the mbedtls_sha512_init function. This causes, that the device start working, but crashed now here:

 #0  0x4008b089:0x3ffec900 in memcpy at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:163
  #1  0x401e2825:0x3ffec910 in std::char_traits<char>::copy(char*, char const*, unsigned int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/char_traits.h:352
      (inlined by) std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/streambuf.tcc:90
  #2  0x4022cddd:0x3ffec930 in std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/streambuf:458
  #3  0x401eff1d:0x3ffec950 in void std::__ostream_write<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/ostream_insert.h:50
  #4  0x401effb9:0x3ffec970 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/bits/ostream_insert.h:101
  #5  0x401f0069:0x3ffec9a0 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/no-rtti/libstdc  -v3/include/ostream:564
  #6  0x40103cde:0x3ffec9c0 in Span::printfAttributes(int) at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:1238
  #7  0x401051da:0x3ffeca00 in Span::updateDatabase(bool) at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:1536
  #8  0x40114012:0x3ffeca70 in HAPClient::init() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HAP.cpp:106
  #9  0x400fd77b:0x3ffecad0 in Span::pollTask() at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.cpp:187
  #10 0x400ded80:0x3ffecbd0 in Span::autoPoll(unsigned int, unsigned int, unsigned int)::{lambda(void*)#1}::operator()(void*) const at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.h:383 (discriminator 1)
  #11 0x400deda1:0x3ffecc00 in Span::autoPoll(unsigned int, unsigned int, unsigned int)::{lambda(void*)#1}::_FUN(void*) at /Users/mgeramb/sources/OpenKNX/HomeSpan/src/HomeSpan.h:386

I have currently no test environment for none PSRAM devices

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 5, 2024

Can you run with PSRAM disabled?

@mgeramb
Copy link
Author

mgeramb commented Jan 6, 2024

It is related to PSRAM usage. I changed all the mallocs in the stream #733 and no crash anymore. Maybe the reason is the time of allocation, the stream will be instantiated directly at startup. Maybe also lazy loading could fix the problem, but not yet tested. I think the memory usage of 2k would be also ok for the normal heap.

@frankonski
Copy link
Contributor

Are you both using PSRAM?
I am. I only have a single board, Freenove ESP32 WROVER dev kit. Maybe I should invest in another board… suggestions?

@mgeramb
Copy link
Author

mgeramb commented Jan 6, 2024

I'am using the adafruit esp32 feather v2 board. But I'am not sure if this board is currently still in production, because it is out of stock in a lot of shops.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 7, 2024

Thanks @frankonski and @mgeramb for identifying the source of the issue. A few weeks ago the chip I was using to test PSRAM broke and I was not able to test the implementation of hapOut when running under PSRAM. However, I was able to find a few more chips in my stash with PSRAM (one ESP32-S2 and one ESP32-S3), and once I tested I reproduced exactly what you were seeing. The solution is indeed to NOT use PSRAM for the output stream. Since using just malloc() does not guarantee internal RAM will be used (the IDF will choose), you'll note in the code I used a version of malloc() that allows you to explicitly determine the type of RAM, so hapOut should never been assigned to PSRAM.

As an aside, when testing with the ESP32-S3 using PSRAM, the system kept crashing. However, it seemed to work for some sketches but not others. After spending 3 hours trying to determine the issue, I finally realized the problem was not the software, but rather that for my primary test sketch and hardware, I use a lot of pins (one RGB LED, one contact switch, one NEO Pixel, an I2C-based Temperature Sensor, two pushbuttons, and one touch sensor). This setup allows me to stress just about every module in HomeSpan. However, the PSRAM I have on the ESP32-S3 is Octal (OPI) and when enabled needs to use 8 external-facing pins, a number of which overlapped with those used in my test sketch. Something to watch out for if you use Quad or Octal PSRAM.

In any case, please feel free to test to the latest commit and let me know if you encounter any other issues.

@frankonski
Copy link
Contributor

Thanks, but while the cause has been found, that does not really explain why it crashes on allocating a simple char *..

Something to watch out for if you use Quad or Octal PSRAM.

hum… I will review the pins I use based on this post.

Nevertheless, I will try the latest commit soon.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 7, 2024

Thanks, but while the cause has been found, that does not really explain why it crashes on allocating a simple char *..

Yeah, I wish I knew exactly whay it was not working. Did some research and it seems like there is definitely an interaction between spinlock and PSRAM, though why put only impact an oStream is not obvious to me.

@frankonski
Copy link
Contributor

Something seems shady... Switching back to the release-1.8.1-dev branch, fully updated, the setWebLogCSS() callback seems to no longer have any effect. WebLogs are back to your default lightblue background color.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 7, 2024

I'll take a look. May have dropped something when I updated the weblog to use hapOut.

@HomeSpan
Copy link
Owner

HomeSpan commented Jan 7, 2024

Fixed and updated in 1.8.1-dev. I missed a conversion of the CSS String to a c-style string. Since my test sketch explicitly contains a CSS that turns everything gaudy colors, you think I would have noticed this. Too many late night-coding sessions...

More broadly, Strings are an Arduino-specific thing that are not actually based on std::string. I should have used std::string when I first created HomeSpan, since the rest of the C standard library knows what they are and iostreams make the conversion automatically. For Arduino Strings you must call the c_str() method when using with many standard C functions.

@HomeSpan
Copy link
Owner

I think we resolved all the memory and PSRAM issues, so I'll close out this thread. Please feel free to re-open if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants