While I am doing another test run, let me try to explain this better.
Our DB is nearly 15 years old.
People have copied and pasted text in all kinds of encodings into the database. That text may or may not have been transformed to the encoding of the DB. In addition, over the years, the encoding of the DB has changed. It was not UNICODE in the beginning.
The same is true for keyboards. People have typed from all kinds of keyboards over the years. Sometimes this adds to the encoding problem, but from what I have seen it generally comes from copy-and-paste. Many people like to write their posts in a desktop editor and then copy and paste them into the forums.
So, running any generic encoding translation will not work for all encodings. If it did, this problem would have already been solved. Sometimes UNICODE does not work because there are encoded chars which are not part of UNICODE.
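To illustrate why a single generic translation cannot work, here is a minimal Python sketch. It is not our migration code; the function name `decode_candidates` and the three candidate encodings are only assumptions for the example. It shows that the same stored bytes give different text depending on which encoding you guess, and nothing in the bytes tells you which guess is right.

```python
def decode_candidates(data, encodings=("utf-8", "latin-1", "cp1251")):
    """Decode the same bytes under several guessed encodings.

    Returns a dict mapping encoding name to the decoded text,
    or None when the bytes are not valid in that encoding.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# One byte sequence as it might sit in an old table, origin unknown.
raw = b"caf\xe9"
for enc, text in decode_candidates(raw).items():
    print(enc, repr(text))
```

Here the bytes are invalid as UTF-8, read as "café" in Latin-1, and read as "caf" plus a Cyrillic letter in Windows-1251. A converter has to pick one, and for a forum with posters from all over the world any single pick is wrong for somebody.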
It's not a theory. It is a fact of years of having a busy forum with people all over the world copy-and-pasting their locally encoded text into our DB. Sometimes we get lucky and the encoding works.
All we can do is identify it and squash it, or ignore it.
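As a rough idea of what "identify it and squash it" can look like for one common case, here is a hedged Python sketch. It assumes the damage is UTF-8 text that was read as Windows-1252 at some point; the names `SUSPECT`, `looks_garbled` and `try_repair` are made up for this example, and the pattern is a heuristic, not a complete detector. Other kinds of damage need their own rules.

```python
import re

# Heuristic only: character pairs that typically appear when UTF-8 bytes
# were decoded as Windows-1252/Latin-1, plus the U+FFFD replacement char.
SUSPECT = re.compile("[\u00c2\u00c3][\u0080-\u00bf]|\u00e2\u20ac|\ufffd")


def looks_garbled(text):
    """Return True if the text contains a typical mojibake sequence."""
    return bool(SUSPECT.search(text))


def try_repair(text):
    """Undo one layer of 'UTF-8 read as cp1252'.

    If the round trip fails, the text was not damaged in this particular
    way, so it is returned unchanged instead of being guessed at.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


sample = "caf\u00c3\u00a9"  # "café" after a bad UTF-8 -> cp1252 decode
if looks_garbled(sample):
    print(repr(sample), "->", repr(try_repair(sample)))
```

Anything that is flagged but does not survive the round trip is left alone, which is the "or ignore it" branch: those posts need a human to look at the original.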
It's not critical either, because I can fix it after migration directly in the DB, as I have been doing today. The best place to fix it is in the legacy mysql DB when possible, but it can also be done in postgres, as long as no information was lost in the migration from mysql to postgres.
This is why I am kinda begging everyone to help test. I can write the code to fix the issue if I clearly see the issues. There are one million posts. The more people take a look, the more it helps.
Sorry to be begging... LOL. I have been working on this for months. My wife is starting to feel like she has no husband, which I can understand.
But I wanted everyone to understand why I have asked for this help.
This is exactly what I need..... (image from first post on this test)
-------------------------
Honestly, so far people have provided me with only about 3 or 4 links where this encoding issue comes up, and most of those are in non-public spam archives.
I don't want to be spending my time chasing outliers in two decades of encoding. Either there are issues or there are not. I am not going to spend my entire life chasing unimportant encoding issues to take a migration which is 99.99% perfect to 99.9999% perfect. It's not a good use of our time.
So, please provide detailed accounts of any remaining encoding issues, with links to the original and the migrated version.
Thanks.