Skip to content

Progmem F() Strings Manager #4270

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Hiddenvision opened this issue Feb 1, 2018 · 29 comments
Closed

Progmem F() Strings Manager #4270

Hiddenvision opened this issue Feb 1, 2018 · 29 comments
Labels
waiting for feedback Waiting on additional info. If it's not received, the issue may be closed.

Comments

@Hiddenvision
Copy link

Hiddenvision commented Feb 1, 2018

Again, Not an issue but more of an Improvement thought.

I read that using Progmem for read only strings is all cool and dandy
but the compiling/linking process does not optimise repeated text used.

I was toying with a Strings manager concept that would handle all the static strings.
So you could perhaps litter your code with F_(55) where strings are required.
Save all the Indexed Strings in a txt file with a little tool to (add/search/replace/delete) entries.

F_(int i); being a simple function to return a constructed string.

An example optimizedStrings.h file

const char HTTP[] PROGMEM = "http:";
const char DOTCOM[] PROGMEM = ".com";
const char DOTCOUK[] PROGMEM = ".co.uk";
const char WEBSITE[] PROGMEM = "visitme";
const char HELPSITE[] PROGMEM = "helpme";
const char THANKS[] PROGMEM = "Thanks";
const char MR[] PROGMEM = "Mr:";
const char MRS[] PROGMEM = "Mrs:";
const char SMITH[] PROGMEM = "Smith";
const char PAGE[] PROGMEM = "Page ";
const char NOTHING[] PROGMEM = "Nothing ";
const char UNIQUE[] PROGMEM = "unique";
const char UNKNOWN[] PROGMEM = "UNKNOWN";

String F_(int i){
 switch (i) {
  case 1:     return (String) FPSTR(MR) + FPSTR(SMITH);
  case 2:     return (String) FPSTR(MRS) + FPSTR(SMITH);
  case 3:     return (String) FPSTR(HTTP) + FPSTR(WEBSITE) + FPSTR(DOTCOM);
  case 3:     return (String) FPSTR(HTTP) + FPSTR(HELPSITE) + FPSTR(DOTCOUK); 
  case 55:   return (String) FPSTR(NOTHING) + FPSTR(UNIQUE);
  case 721: return (String) FPSTR(PAGE) + FPSTR(UNIQUE) + FPSTR(THANKS); 
  default: return (String) FPSTR(UNKNOWN) + String(i); 
 }
}

end of optimizedStrings.h

As you can see from this random example trying to create and maintain this sort of thing by hand would be a nightmare but if technically this could save some space then to automate the creation of the builder function and const char progmem bits, is relatively easy to do from my point.
As a precompile process all the user strings can be studied and duplicates optimised.
Simple little exe to buz thru an indexed txt file doing the pre optimising stuff and outputting a fresh include file ready for compile.

Just wondering if this would be worth looking at or would it be a wasted folly.
I am no pro coder, I'll slap on more code than sunblock, until the resources run low,
and I'll be the first to say I can over complicate a problem that does not exist just for the fun of it.
It may even be the worst thing to even consider and produce horrendous output,
happy to take any comment.!

Now that I can do OTA > 500k, perhaps I wont even think of it again.
But wasted space is ditto.
I did this on the little Atmel devices when I was running low on space, 32k was easy to fill.!
That was not automated, but like I said, 32k is easy to fill, so I had limited text to manage.

It could be calculated that saving space takes time, so there is always a trade-off.

Hv

@Makuna
Copy link
Collaborator

Makuna commented Feb 2, 2018

I don't understand why you wouldn't do the top part, then just use those directly. Its seems more readable than F_(52).

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 2, 2018

@Makuna
Yes Readability goes out the door but then code is not to be read, it is to executed.!
And the more that can be squeezed in for execution the better on these little devices.
With a little manager to search for and SEE the strings F_(42) would look ok.
Personally I like short easy to read code and try to keep all the clutter backstage.

But comes the question of how much code space is used for joining strings.
Is there a letter count threshold to make it worthwhile.?
Who knows such things.?

Serial.Print(F("Mr. Smith"));
Serial.Print((String) F("Mr.") + F("Smith"));
Serial.Print(F_(1));

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 2, 2018

Actually what would be good is a resource file area in an unused section of flash.
With an update process and some method of mapping that area to an F_() type function.

Then all the strings could live in upper flash just below the FS and not take up any of the limited Sketch space. If what I hear about the Maximum Sketch size being 1MB then to off-load static resources from
the sketch area could be handy.

Plus, updates would be faster if no resources had changed. (Extra bonus)

After having thought for a while,
I guess something like the 20K Flash allocation for simulated EEprom,
Then a version of the EEPROM libs remapped and tweeked with an index lookup.
too crazy for this time of day ???

that's it, the coffee ran out, the effects of the coffee expired, I'm done for today.

Hv.

@tonton81
Copy link

tonton81 commented Feb 2, 2018

speaking of “resources”, you shouldn’t be using the “String” class. Character arrays are always the best.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 2, 2018

@tonton81
Great tip, I know strings are lazy but, well that just suits my use from VB.
https://hackingmajenkoblog.wordpress.com/2016/02/04/the-evils-of-arduino-strings/

If I ever do a production thing I would take more care.!
Or pay someone else to write it properly.

I played around with the EEPROM libs,
Actually just renamed everything containing EEPROM to RESOURCES.
Changed the Sector pointer. and it compiled.
But I had not time to actually make use of or test it .
After reading the code I think it may be a little heavy for pulling individual strings
but an interesting thought none the less.

Hv.

@vdeconinck
Copy link
Contributor

Yes Readability goes out the door but then code is not to be read, it is to executed.!

I beg to disagree on that point.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 2, 2018

@vdeconinck
no worries, that's what we are all here for.
But honestly how often do you need to "see" code once it is written and working.?

As far as maintenance, having all the resources in the one place seems like an easier approach.
Oh and when I mentioned Strings, I was being generic and just talking about things with letter in them.

But hey that's just my lazy mind stretching it's legs,
Personally I think the use of any undeclared constant should be outlawed.

@vdeconinck
Copy link
Contributor

@Hiddenvision many experienced programmers agree that you should aim for 1.correctness 2.readabilty 3.performance, in that order, and 3. only if necessary ("premature optimization is the root of all evil").
Your proposal does not respect 2. and 3. (as you admitted) but it also does not respect 1. (nasty bug: you have two "case 3:" in your code).
In other words, you just proved my point: while you would think it's "working", it's not, and as it's not "readable" either, the bug went unnoticed and will byte you (or your users/customers) some time in the future.
You are very welcome to use such a manager in your code if you like but IMHO it wouldn't be wise to include it in this project (I admit I even fail to grasp how you could include it, as it's totally application dependent, but maybe that's just me).
Just my 2 cents.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 2, 2018

I'd be switching 2 & 3 around myself.
But yep number 1 is in the right spot.
No point in writing code if it don't work right. (or at least with acceptable flaws!)

My background comes more from hacking and reverse engineering so prefer assembler if honest but c and such things do make portability easier.
As far as readability, I guess it is just different.
I would still rate a good resource manager over doing it by hand.

and being able to edit all the "Strings" in one single file, it worked for me
but then my mind is pretty shattered so I like stuff to be easy to find

But lets be real, bugs are our own creation mostly,
typing different things sometimes gives different results.
Re the two Case 3:'s
That's what copy and pasting on a browser does for ya,
But the compiler will speak to you in firm words if used in real life.
Correct it, move on and try not to do it again.
Plus if the file creation was automated then any errors will be software created and not down to Human intervention.

At the end of the day the compiler does not care about what stuff looks like as long as it fits thru the sieve.

But don't think I do not understand the need for well structured stuff in it's place.
Any public facing code should always be well dressed.

However, I would be curious to know if doing such a thing would result in a "space" saving.

@devyte
Copy link
Collaborator

devyte commented Feb 4, 2018

@Hiddenvision you are correct: when strings are duplicate under normal use, the build process (supposedly) unifies duplicates. However, when those strings are moved to flash, then that optimization no longer works. Therefore, if there are duplicate strings (and we do have a few of those), then they are wastefully increasing the bin size. Removing duplicates therefore saves bin space.

As commented above, your original code structure has one Bad Idea: the switch statement with numerical index. This makes usage like F(55) very unreadable for posterity. True, code is to be executed, but after it executes the first time, that code has to be maintained. Bugs can appear, restructuring can be needed, rewrites, or even just a look to check on some behavior detail. It is because of these reasons that readability is important: code lives on after it has been written for the first time.

If strings are going to be moved to global space, one option is to have each string in its own variable, like you have above, and access them directly via that variable. The other option is to have all strings in a pointer array, and access via an index, in which case an enum with meaningful names is better than a numerical index that says nothing about the string. Here, you don't even need your function with the switch.

About your idea of a resource manager. I am currently doing a minor overhaul of our esp8266webserver, moving several strings to flash, others to global space and removing duplicates, and so on, and I have to say I find the process a bit painful, so I was thinking along similar lines. However, having a tool restructure your code for you makes me cringe a bit, especially when thinking about maintaining that tool cross-platform. Consider a subtle bug in that tool, which ends up messing up some user's code...
So, I thought of a simpler solution, at least as a first step: a string checker, that issues a build warning when it finds strings not in flash, or strings that are in flash that are duplicates.
Why just a checker? Most users who care enough about bin size also care about a warning-free build. At the very least, for those who don't care outright, they'll know that they could improve their code.
We could then eventually use that checker during our CI build process, and fail PRs that don't pass the check.

Problem for me is: there is no way I could justify spending the time to develop something like that on my end. Are you willing to tackle this yourself?

@devyte devyte added the waiting for feedback Waiting on additional info. If it's not received, the issue may be closed. label Feb 4, 2018
@everslick
Copy link
Contributor

IMHO and experience: you run out of RAM long before you run out of flash. at least in general. there might be applications where this could be different, but then you can think about special deduplication strategies that are tailored to that use case.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 4, 2018

@devyte Wise words indeed.
All of what people say is true, but there must be some compromise.

Just to be clear I was not thinking of anything that made changes to anyone's core code.

My thought,
You just type in a new string to the Resource-Manager and it give you a new index to use.
Or if you want to Check or edit the strings, enter an index and edit the string as desired.
No need to re-index so no need to back track thru other pages making changes.
Before compile, a separate "generated" include file is made.
the need for some sort of switch function is to "BUILD" the strings from the various parts that use duplicates. There maybe a better method to build these strings but something will need to exist.?

The Index's could be given "Names" but I fear that some of that may creep it's way into the binary depending on debug level as I am sure I see raw function names in the bin.

I would be happy to have a look at what is required to do this on a multi platform level but I would need to finish my own tool so it is publicly safe before I go creating larger monsters.

As mentioned my background is more taking stuff apart, understanding and improving,
and the tools I write are very unique to my needs and written as and while I need them.
Not to say my needs are unique, just the tools.!

I do "currently" have a HUGE amount of free time but not as much of the training and code discipline that maybe required (I score 0% in both!).
But glad to assist anyone else that may have more credentials.

@everslick
I think you maybe right, Ram is always desperate.
Then again almost everything is endangered on these embedded devices.
The first thing I do is run out of pins normally.

Most of my initial coding was done in .5k or 1k of PIC space.
Ahh, those long gone days of hunting for a 4 byte savings.
I used a few tricks on tightening up assembler code,
but again it was not as pretty to read but the silicon was happy.

Hv.

@devyte
Copy link
Collaborator

devyte commented Feb 4, 2018

@Hiddenvision please contact me via gitter, there's something I'd like to discuss in addition to this :)

@Hiddenvision
Copy link
Author

fire away.
you can email or PM on any place you find me.
Or post it here.

hv.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 5, 2018

Oh the side note.
When looking at the Bin I still see lots of "RAW" text.
Obviously my own strings but also Full function names and FULL paths to various libraries.

Example:
THIS IS A WASTE OF SPACE!!!\Arduino15\packages\esp8266\hardware\esp8266\2.4.0\cores\esp8266\spiffs_api.h

Someone mentioned trying debug NoAssert-NDEBUG
This did reduce the Bin size a little, 2k in sketch and 0.5k in dynamic. cool,
but still the long names and things.

If this sort of info being in the final binary is unavoidable then it should be considered to move things to a lower lever folder to avoid "LONG" paths being included.

Does anyone else experience this.?

Hv.

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 5, 2018

The long file names are gcc's internal macro __FILE__ expanded in debug messages. Last time I checked we cannot change that (there's no __FILENAMEONLY__-like macro available).
As they should be used only in debug messages, they should not appear in the final binary if the single Tool/DebugLevel NoAssert-NDEBUG option is used.
If this is not the case, then the string should be tracked down and catched into debugging macros.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 5, 2018

Hi David,
Easy for me to compile and dump all the strings from the Bin.

Re the filenames, Still don't understand why they have to end up in the final bin.
Unless it is for this debug macro.

#define __FILENAME__ ( strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 :  strrchr(__FILE__, '\\') ? strrchr(__FILE__, '\\') + 1 : __FILE__ )

#ifdef DEBUG
#define DEBUG_PRINT(str)    \
  Serial.print(String(system_get_free_heap_size()));      \
  Serial.print(F(": "));    \
  Serial.print(millis()/1000);     \
  Serial.print(F(", "));      \
  Serial.print(__PRETTY_FUNCTION__); \
  Serial.print(F(": "));      \
  Serial.print(str);
#define DEBUG_PRINTLN(str)    \
  DEBUG_PRINT(str); \
  Serial.println();
#else
#define DEBUG_PRINT(str)
#define DEBUG_PRINTLN(str)
#endif


I shall turn off my use of the __FILE__ and__PRETTY_FUNCTION__macro and see what remains.

Hv.

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 5, 2018

That's fine when DEBUG is defined. We need them.
DEBUG beeing undefined at least when using NoAssert option, the strings won't be included in the final release of the binary.
Did you enable the NoAssert option ?

edit if this DEBUG macro is yours, do you not define it before checking if the strings are present ?
you can use #if !defined(NDEBUG) instead of #ifdef DEBUG. It is the one defined with the NoAssert-NDEBUG option.

edit2 on unix, man assert to check why we are using the NDEBUG macro.

@Hiddenvision
Copy link
Author

Yep, NoAssert ticked in menus.
Oops, sounds like my use of #define DEBUG harks back to my early days of using SP as a handy multi-purpose register.!
Is DEBUG already in use.?
Anyway, did another recompile with all that turned off and still have the long file paths.
Obviously there are "Loads" of strings that are part of the standard libs.
Guess they have to be there.?

Should I extract all the Strings and share.?

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 5, 2018

Yes please, filenames should be there no more with the option

@earlephilhower
Copy link
Collaborator

There may be strings, but if they're not in .RODATA then they do not affect your free memory, only the code size. So please double-check these are not just in .TEXT and not .RODATA. "strings file.elf" is not a reliable indicator of where the string is located, you really do need to use objdump to check it.

I still see only 2 full paths in RODATA, abi.cpp and core_esp8266_main.cpp with -NDEBUG enabled in a random sample.

#4187 should remove a few add'l core strings, if it merges.

@devyte
Copy link
Collaborator

devyte commented Feb 5, 2018

@Hiddenvision btw, I can't find you on gitter. I think you have to log in there at least once first.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 5, 2018

@devyte , Sorry dude, What's Gitter.?
Ok found it, arrgh, not another site.

Bin_Text.txt

This was a quick strip of the Text found.
Actually less than I thought but then I only exported text strings > 6 long.

As you can C, I am the worst offender with duplicates, and wasted space all over the place.
But I guess closer to the bottom is where the other stuff is.

Ooo, actually the split is around line 770.
So near 50/50 blame.!.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 5, 2018

@earlephilhower .
Not sure where they were located but as it was the Bin size that I was concerned with I just looked in that.

Re #4187. Oh yer, I never gave that any thought. Der.
But it looks like the clever people are keeping an eye on things.

Having now created that text file, I must now tidy up my own work.
So much waste, Line count can be such a bitch..
but then hey, while the space is there, fill-er-up I say.

I think it was only the flash allocation and the failing http.updates with 2.4 that got me thinking this deep.

@earlephilhower
Copy link
Collaborator

@Hiddenvision you can run "objdump -s -j .rodata file.elf" (may need to use the xtensa-lx106-elf-objdump.exe if you're under Windows, it's located in the tools/xtensa-*/bin directory.

It's not critical if you only care about total upload size (i.e. trying to fit an OTA upgrade in <500KB), but the rodata stuff is very important if you're pressed for RAM and getting OOM exceptions.

@Hiddenvision
Copy link
Author

Hiddenvision commented Feb 5, 2018

@earlephilhower
Yep I guess it started with that 500K limit, but we are passed that now that the flash is allocated different.
#4252

if I need to dig deep I normally drop the elf into IDA.
But then I get confused thinking even more unwanted strings are in the BIN.!

@earlephilhower
Copy link
Collaborator

Closing as this has gone stale. If you've got a PR, please do feel free to submit it for review. Thanks.

@stefan123t
Copy link

To answer on @Hiddenvision initial questions I would like to add the following two points:

1.) Optimizing space usage by using some lookup table and trading it for some processing time is the foremost application of the LZV algorithm. So if you are really short on PROGMEM you may consider implementing LZV for ESP8266.

2.) FPSTR() will aquire PROGMEM in 4byte chunks for 32bit access. Hence the letter count you were asking for is in chunks of 4bytes and the remaining bytes are wasted. Maybe this should be considered in your algorithm by working on 4byte chunks.
const char SMIT[] PROGMEM = "Smit"; # no waste
const char SMITH[] PROGMEM = "Smith"; # wastes 3bytes of PROGMEM

Counting and Sorting of the 4byte patterns according to the frequency of their occurrence would be along the general LZV problem and solution.

For starters I would already be happy to have the Warning messages @devyte mentioned in his String checker suggestion.

@devyte
Copy link
Collaborator

devyte commented Jul 5, 2019

Unfortunately, the code checker idea hasn't been picked up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting for feedback Waiting on additional info. If it's not received, the issue may be closed.
Projects
None yet
Development

No branches or pull requests

9 participants