-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Device limit and memory discussion for 1.8.1-dev #684
Comments
These are good suggestions. What should be relatively straightforward is for me to change my |
I have already an idea how the Accessory handling can be done. A context object should be created for this task. This object can handle the header, footer and sending the data. So, before the loop through the accessory starts, the context object has to be created. It will immediately create the header in a buffer. Then the loop starts and the context object will passed to the accessories. Each accessory will put the needed data to the context object which fills it in its internal buffer. Every-time if the context object has collected enough data, it will encrypt the data and send it to the network. If you want I could support you with the development, but I would need some hints where I found the loop through the accessories and information where you thing that such large buffers are created so that is worth to implement this mechanism. |
I've update release-1.8.1-dev to make the change I discussed above, which saves 15-20 KB for a device with 50 simple lightbulb Accessories. The size of the temporary buffer to store encrypted frames is now set to the minimum size needed for a single frame (which will be 1042 or fewer bytes). Please give it a try. Operating on an Accessory-by-Accessory basis is going to be much more intrusive, and given other priorities and requests I'm not going to tackle, at least at this time. However, please feel free to give it a go if you'd like. If you look at the 'd' Serial Monitor Command around line 629 of HomeSpan.cpp you'll see the start of how the Accessory attributes JSON is constructed. |
50 devices and not limit in sight: |
Further update: I changed the method by which the lists of required and optional Characteristics for each Service are stored. I was using C I also added some error checking for NVS storage, which starts to become a limiting factor if you have a lot of Accessories and they have Characteristics with NVS storage set to true. |
Note that if you use |
This is might be a good option. But I currently do not understand the relation between stack usage and the number of accessories. I have only seen lists with loops through the accessories, so the stack consumption should not increase with the number of devices. Only if there would be a link between the accessories and a function call goes recursive through this links, the stack will be effected. So maybe the error is wrong and just the memory becomes corrupted somehow. But independent of this, I can confirm that the new version with your and my changes runs stable since one day in a production environment. No memory leak, everything is fine and responsive. |
How are you determining the size of the stack? Also, what does "Used Channels" represent in your diagnostics? |
Are you using NVS to store Characteristic values? If so, this may be a limiting factor (you can turn this off by not providing a second argument to Characteristic instantiations). |
"Used Channels" are the number of accessories, to get the stack watermark I use the function |
The error you are seeing, Are you using arduino-esp32 version 2.0.14? |
@frankonski I'am using currently 2.0.9. I will try to update
This saves about 1k memory in my case |
I have an ESP32-S2 from Unexpected Maker in the Adafruit Featherboard configuration. It has an additional 8MB of PSRAM. I have been able to run up to 80 simple accessories as a bridge (just a lightbulb with On/Off). I'm tracking the high watermark for the stack (using just poll(), not autopoll()) and it is NOT affected by the number of Accessories (it gets as well as about 2,924). I then modified my sketch to include the Brightness Characteristic. The stack usage did not change, but heap memory did and I was not able to run as many Accessories (at least 60, but it dies at 80). The ESP32-S2 is not throwing any errors: it basically just freezes when (or right after) trying to transmit the Accessories data to the Home App. So this strongly suggests it's not a stack issue, but it is some sort of memory issue. I'm using 1.8.1-dev as per my latest posting in GitHub, which does not include any of your Context PR code. If you are seeing changes in the stack usage as a function of the number of Accessories this may mean it's something in your Context code? Though I have 8MB of PSRAM, I am still running out of memory. By separately tracking low points of total and internal RAM separately, I can see that I am drawing down on BOTH the PSRAM and internal RAM, and the internal RAM gets quite low (10 kB at the minimum). This low-point DOES (roughly) track the number of Accessories. From the ESP32 documentation of PSRAM, I know that not all processes can use that memory, especially those that use interrupts (which need to be in main RAM). To force usage of PSRAM, I changed the malloc() call in TempBuffer to ps_malloc(), though this did not have any impact on low internal RAM points. My guess is that the vectors I'm using to store the Accessory/Service/Characteristics are perhaps being pulled from internal RAM, and this is drawdown causes internal RAM to get low enough that other processes that MUST use internal RAM are failing. To see if that is the issues, I am going to force my vectors to use PSRAM instead. The std::vector class allows you to define your own memory allocator, and I've seen others post some example code of how this is done. Will give it a try and see what I may learn. |
@frankonski Do you know how the arduino-esp32 version can be specified in a platformio project? I tried to upgrade the platform to the newest espressif32 version, but the newest arduino-esp32 version I get is 2.0.11 |
@HomeSpan I do no large allocation on the stack in my branch, but maybe this issue is somehow related to the arduino-esp32 version or something in the web client implementation. Maybe you can try my branch with your sketch, I would expect that this should improve the temporary heap usage for request to a minimum. Also would it be helpful to see if you run in the same stack issue. |
@HomeSpan I have written now a new dyamic array pointer implementation. This are the features:
You can find the implementation of the dynamic pointer array in my PR #686 My app is now running with 60 devices, this was my target and I will run it now for longer time to see if it is stable, for now it seems to be. |
No idea, I don't use PlatformIO. However, I do use VSCode with some personal touch. Extensions installed:
I have Arduino IDE 2.x installed, that I launch from time to time to get automatic updates, which gets installed in I have I have a
Also, having ESP-IDF extension installed makes serial monitoring SO better than the Serial Monitor from MS, as the MS one seems to buffer and lag, while using ESP-IDF extension, the serial monitor kicks off a terminal, so no lag when there is lots of output. Lastly, I have another task that runs Bottom line:
|
@mgeramb, I'm glad you've been able to get 60 devices running, but with regards to the PRs, I'm uncomfortable adding them to the production branches at this time. The techniques you used are really interesting but they add a lot of complexity I'm not sure I'll be able to effectively support at this juncture. I'd encourage you to keep the PRs active so there is a good record of the changes, but very shortly Espressif is going to release version 3 of the Arduino-ESP32 Board Manager for support of the H2 chip and thread. According the Espressif, this version has a lot of breaking changes which will likely require substantive modifications to HomeSpan just to get it to work without even using anything related the thread or the H2. I want to make sure the base code stays as clean as possible with only incremental changes until at least then. I'll continue exploring other memory issues as well, but will try to avoid major changes for now. |
To close the loop on my PSRAM comment above, by using a custom allocator for vector that forces allocations from PSRAM, I can run 90 lightbulbs with a on/of and dimmer characteristics. The purpose of this test was to verify that PSRAM can be effectively used to extend the memory. It can, but the automated PSRAM handling by Espressif requires you to override @frankonski, this might help with some of your PSRAM questions/issues. (I may try to make a generic PSRAM switch based on the whether a device has PSRAM). |
I'm still reading & learning about the different memory models & types, even as of few minutes ago. So far, my recent understanding is that SPIRAM and PSRAM is the same thing. Using ok. Now if calling For as long as HomeSpan uses HomeSpan uses mBedTLS do some crypto-related operations. If mBedTLS is set to use internal RAM, the larger the data to encrypt (the more accessories you have?), the larger internal RAM would be used, right? Unless that setting above is used, technically speaking. |
No, this is not exactly how it works. HomeSpan has to create a json for each calls from the HomeKit controller which contains infos about all accessories. This json file becomes bigger and need more temporary free memory on the heap if more accessories are defined. After this operation the buffer is divided in 1024 byte blocks which will be encrypted. So, this means, for encryption the needed buffer always have a maximum size of 1024 bytes. But in the current released version, again a large buffer is created with hold all encrypted parts for sending it afterwards to the network. The change in the 1.8.1-dev branch from @HomeSpan is, that the encrypted buffer will be immediately sent to the network. This saves memory for the large sending buffer. But for the json is still a large temporary buffer needed. |
One area where encryption memory is wasted in the SRP6 code. Here I pre-define a lot big integers, but none of this is needed after the initial pairing (SRP is only used to pair). In my server branch I re-factored this so that big integers are only initialized when needed and then destroyed right after. There is also some wasted space in the TLV implementation that I also re-factored in the server branch. I will eventually bring both of those into the main branch but am trying to balance new functionality versus optimization (e.g. I'm hoping to work on implementing descriptive constants for each of the Characteristics (per a separate thread), which will help new users who don't have access to the original HAP document (which Apple has pulled from their web site). |
@HomeSpan I completely understand you, that you feel you not comfortable with the PR. The PR is just a Draft, Proof of Concept and discussion starting point and should currently not be merged. But let me explain the concept behind it, basically it should make the code much more easier than it is currently. Simple example: Current code: As you can see here in this example, the second version is much more easier, no checks for null for cBuf, no counting of bytes anymore. Just write out what you want, including all of the formatting feature of sprintf. And the output can simple be going to a memory buffer, the console, the network, or maybe also encrypted to the network. So, where and how comes this magic? I used a very common pattern in the SW development which is used in a lot of high performance implementation. I give you an example. I'am sure you know a language with have some kind of JsonConverter. It works like this, that it has a memory buffer inside to hold the complete json. This is good for small object, but for very large objects it will not work in this way. In this situation you normally use class like JsonWriters. A json writer just provides an interface for writing out date needed Data. But the json writer does not hold the complete json, instead it writes out the data to some target, and that is the interesting thing here, the target can be a network, a file or whatever you want where you need the data. You also will see similar concepts for XML, Streams, StringWriters, filter chains, etc Exactly the same I'am doing here. I have created an basically abstract interface class for writing out data with printf and print functions. Currently I named it CallContext, maybe not a good choice, maybe DataWriter or something else would be the better naming. The next good thing in this concept is, that you can now decide completely independent what should happen with the data which is provided by your objects. Just by deriving a new class of the CallContext, you remember, it is basically just an interface (in my PR it has a little bit more code, but maybe this should be removed). In the derived class, you can decide what you want to do with the data, e.g. storing it in a large buffer as it was before or even better, directly write it out the network. Or you want to give out the data on the console? Just derive from the CallContext, implement one function which writes out the data to the console, or maybe you have to calculate a hashcode? Just derive a new class, implement one function with the hash calculation. Let me summarize all this things:
Just check in your code snipped with your 90 accessories, how many bytes you need here in the heap:
My implementation will need a maximum of 2048 bytes for buffering and encryption. I expect, you just get a complete wrong picture of my PR because the integration is not yet fully done and your code becomes for this reason very ugly. If you want, I could provide on example how your code would be after the fully integration. And one more, maybe important information for your decision. I need this implementation for another large open source project, the (OpenKNX)[https://github.com/OpenKNX] project where I'am working on a (KNX-Homekit bridge)[https://github.com/OpenKNX/OAM-SmartHomeBridge]. So it is not only for my private usage, a lot of user will benefit from the improvement which makes it worth for me to spend so a lot of time on it. If you have any question to understand the concept, please ask me, or if you want we can also make a remote session, it is important that you understand everything. |
Happy New Year! I finally finished my refactoring of the TLV and SRP routines and have been able to re-focus on the memory-limitations above. Though the solution you proposed using contexts would work, I believe there is a much easier way. The root of the issue is that I am using and sprintf(), which is arguably an old C-style of creating formatted text. Instead, I should be using standard C streams, such as To solve the problem, I replaced all my main Then, I created a customized stream buffer holding 1024 bytes. Whenever the buffer fills, it flushes the output to:
Rather than make these changes to the As you can see, it's a really small amount of code. But it is generic and it works for everything from printing the Accessory database to the screen, to producing HTTP web logs, to transmitting all HAP HomeKit data using the proper encryption keys for each HAP Client. I've tested this with up to 70 Accessories and it works fine. You'll note that regardless of how large the Accessory database is, it never uses more than a few KB when printing to the screen. The only issue I have found is that chunking the data into 1024 bytes packages and transmitting in a loop to a WiFi client can overrun the ESP32 WiFi buffers -- the symptoms are usually that the WiFi connection drops and can't re-connect. These are internal to the IDF and there is not much I can do to check if the transmission is completed before sending the next packet. To address this I added a 50ms delay after each transmission, which seems to solve the problem for up to 70 Accessories, and does not negatively impact operations with the exception that the weblog page take a little longer to load if you have 200 weblog records. However, I would like to find a more robust solution. Decreasing the size of the buffer seems to make things worse, not better, but I can't increase the size beyond 1024 if I want to continue encryption on the fly. To speed up the weblog, I may add a second buffer with a larger capacity that is used for unencrypted transmission. Please give it a try and let me know what you think. I've added a simple example called Max Accessories to help stress the system. I have not yet checked on a chip with a large amount to PSRAM to stress with 149 Accessories. |
Very interesting news. But please check the heap usage before you start big changes in the code. In my tests the heap usages was very high for 149 devices and I do not know why. My only idea is, that maybe a base library like the webclient do some recurisve calls if the sending buffer gets filled faster than it could be sent. |
Yes, of course, all this was checked and usage is lower for like-to-to like comparisons. But I can push the envelope a lot further and that's where I'm trying to determine how much I can squeeze out of the memory before hitting a wall. Also, the changes to recursive encrypted transmission code was made a while back. The specific change here is that the data prepared for the transmission is also built on the fly (which takes a little longer) instead of having it all prepared in a large buffer ahead of time. I thought avoiding that large buffer was the main point of the context code. That's basically what I implemented with streams. |
My understanding is that when you transmit data over a socket, the WiFi client, dynamically pulls heap memory for 2 copies of the data - one for the LwIp processing, and one for the MAC process. There are some configurable options if you compile from the IDF, but not from the Arduino-Esp library. |
That is exactly what I was wondering, nothing should be allocated on the stack. Anyway, I will try your version give you feedback |
I figured out the issue I was seeing with my new code. The code is operating as expected - I was measuring the wrong heap. I was using MALLOC_CAP_INTERNAL to monitor memory consumption, which includes IRAM. But IRAM can't be used for data unless guaranteed to be 32-bit aligned (such as array of uint32_t). Hence, I thought I was seeing failures even though there was still some memory available, which was puzzling. But in reality I was using all the DRAM, so hit the limit I expected to hit. I will expand the output of the 'm' command so the memory available for each heap type is clear and unambiguous. |
Should have mentioned above that this also addressed the need for a 50ms delay after each transmit. A delay of 1ms works fine. |
My first test result with the new version: everything works like in my last version base on commit [beef66e] |
Hi, I afraid there is a major bug in the new version. Everything works fine the last two days. Today I made a reboot of the device, but it comes not up again. In the terminal I saw this error happens, immediately after starting:
After several retries, I switched back to the commit beef66e, the devices comes up again and worked normally. I tried several reboots, all worked fine. It's not clear for me what happened, because the latest version worked already for several days. I hope the stack trace help to find the issue. Let me know if I can help with testing somehow. I'am using the Adafruit ESP32 Feather V2, 8MB Flash 2MB PSRAM board with active PSRAM. |
Thanks very much for this --- I'll see what I can find based on the stack trace. |
ditto. I switched back to the
Then after rebooting from that exception, it is like the board rebooted using a previously-working partition. I had branched |
From reading the stack trace, I would expect the bug in the commit 1f13906 |
The f2cb880 shows an dump in the autoloop function
But this crash does not stop my device. The commit 1f13906 crashes immediately after device start we the same stack trace which I posted in my previous posting. |
I have tested now several time. The commit f2cb880 is the first one, which shows an exception. |
Are you both using PSRAM? |
In a working version can you erase the WiFi using the 'X' command - this will prevent any communication after rebooting, which means none of the new code will be called. The only new code used will be the constructor for hapOut, which will narrow things down. |
Did some testing with the latest version which normally crashed immediately. I changed in the constructor of HapOut::HapStreamBuffer the allocation from PSRAM to HS_MALLOC, because I saw the crash happens in the
I have currently no test environment for none PSRAM devices |
Can you run with PSRAM disabled? |
It is related to PSRAM usage. I changed all the mallocs in the stream #733 and no crash anymore. Maybe the reason is the time of allocation, the stream will be instantiated directly at startup. Maybe also lazy loading could fix the problem, but not yet tested. I think the memory usage of 2k would be also ok for the normal heap. |
|
I'am using the adafruit esp32 feather v2 board. But I'am not sure if this board is currently still in production, because it is out of stock in a lot of shops. |
Thanks @frankonski and @mgeramb for identifying the source of the issue. A few weeks ago the chip I was using to test PSRAM broke and I was not able to test the implementation of hapOut when running under PSRAM. However, I was able to find a few more chips in my stash with PSRAM (one ESP32-S2 and one ESP32-S3), and once I tested I reproduced exactly what you were seeing. The solution is indeed to NOT use PSRAM for the output stream. Since using just As an aside, when testing with the ESP32-S3 using PSRAM, the system kept crashing. However, it seemed to work for some sketches but not others. After spending 3 hours trying to determine the issue, I finally realized the problem was not the software, but rather that for my primary test sketch and hardware, I use a lot of pins (one RGB LED, one contact switch, one NEO Pixel, an I2C-based Temperature Sensor, two pushbuttons, and one touch sensor). This setup allows me to stress just about every module in HomeSpan. However, the PSRAM I have on the ESP32-S3 is Octal (OPI) and when enabled needs to use 8 external-facing pins, a number of which overlapped with those used in my test sketch. Something to watch out for if you use Quad or Octal PSRAM. In any case, please feel free to test to the latest commit and let me know if you encounter any other issues. |
Thanks, but while the cause has been found, that does not really explain why it crashes on allocating a simple char *..
hum… I will review the pins I use based on this post. Nevertheless, I will try the latest commit soon. |
Yeah, I wish I knew exactly whay it was not working. Did some research and it seems like there is definitely an interaction between spinlock and PSRAM, though why put only impact an oStream is not obvious to me. |
Something seems shady... Switching back to the |
I'll take a look. May have dropped something when I updated the weblog to use hapOut. |
Fixed and updated in 1.8.1-dev. I missed a conversion of the CSS String to a c-style string. Since my test sketch explicitly contains a CSS that turns everything gaudy colors, you think I would have noticed this. Too many late night-coding sessions... More broadly, Strings are an Arduino-specific thing that are not actually based on std::string. I should have used std::string when I first created HomeSpan, since the rest of the C standard library knows what they are and iostreams make the conversion automatically. For Arduino Strings you must call the |
I think we resolved all the memory and PSRAM issues, so I'll close out this thread. Please feel free to re-open if needed. |
I using know the 1.8.1-dev version for some days without issue. But I have not so much success with increasing the number of devices.
Your limit with 80k seems to be really the limit where everything works fine. Now I have analyzed this limit a little bit.
I checked the maximum used heap and found that the 40k is the lower water mark which left free. So there should enough memory available for more devices. Next step was checking the maximum free block size, which is also about 40k. This means the memory is fragmented and a large block greater than 40k can not be reserved. For me this means, that Homekit seems to be use very large blocks of memory for answering the web request. So I take a look in the code and I found maybe the confirmation for this. There is one big block for creating the json for all devices. Then as second block will reserved which receives the encrypted body.
My theory now is, that it should be possible to reduce the memory consumption for this a lot with the following idea:
The library should reserve the memory for one encryption frame size. Then the loop starts through the devices. In this loop the memory for the body for only one device is needed. This memory will be copied to the buffer for encryption, if the frame is full the body part should be sent with chunked encoding to the network.
Do you think that my analyze is correct? Do you think that a change like this would be possible? Are you interested in such an change and would you support me?
Best,
Michael
The text was updated successfully, but these errors were encountered: