Programming

04-28-2020

Administrator

19,118, 3,359

Join Date: Sep 2000

Last Activity: 15 July 2022, 8:51 AM EDT

Location: Asia Pacific, Cyberspace, in the Dark Dystopia

Posts: 19,118

Thanks Given: 2,351

Thanked 3,359 Times in 1,878 Posts

Quote:

Originally Posted by hicksd8

Hmmmmmmmmmmmmmm.......

Why do i keep getting apostrophe's converted to ? when i import - beets

Unicode apostrophe standardization - Style - MetaBrainz Community Discourse

This is all good... but I want to focus bottom up... only on the specific chars causing a problem in our DB.

That is what I am doing now ... finding the exact offending char and then finding the correct transform to cleanse it.

Please hold off on posting links unless the links contain solutions for the exact char we are having issues with (let's stick to a bottom-up approach, not top-down, for now).

I have all the chars we have found so far covered, so hold off (on these funny chars) until I get the various staging DBs synced.

Thanks.

Right now I have all the transforms I need based on what we have found so far. We can search for more in the next round. In other words, I know what the problem is. What we need is to find the offending chars and then fix them, bottom-up, because I am not going to run any code which "transforms" problems we have not identified and tested. I do not want the unintended consequences of running code and other transforms unless they solve a specific, clearly identified issue.
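A minimal sketch of the bottom-up idea, written in Python for illustration (the actual migration script is Ruby). Only sequences that have been identified and tested are listed, and everything else is left untouched. The table entries and the names `KNOWN_FIXES` and `cleanse` are hypothetical, not the transforms actually run against the forum DB.

```python
# Hypothetical table: identified offending sequence -> tested replacement.
# The entries are illustrative assumptions, not the forum's real list.
KNOWN_FIXES = {
    # UTF-8 bytes of U+2019 (right single quote) misread as Windows-1252
    "\u00e2\u20ac\u2122": "\u2019",
    # UTF-8 bytes of U+201C (left double quote) misread as Windows-1252
    "\u00e2\u20ac\u0153": "\u201c",
}


def cleanse(text: str) -> str:
    """Replace only the sequences listed in KNOWN_FIXES; leave the rest alone."""
    for bad, good in KNOWN_FIXES.items():
        text = text.replace(bad, good)
    return text


if __name__ == "__main__":
    # The first string contains a known offender; the second must pass through unchanged.
    print(cleanse("It\u00e2\u20ac\u2122s fine"))
    print(cleanse("caf\u00e9 and plain 'ascii'"))
```

Because the table is explicit, a new entry is added only after an ORIGINAL v. MIGRATED pair shows the exact problem, which avoids transforming text nobody has looked at.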

Will update soon.

Neo

04-28-2020

Moderator

2,327, 710

Join Date: Feb 2012

Last Activity: 3 May 2020, 3:12 AM EDT

Location: Devon, UK

Posts: 2,327

Thanks Given: 442

Thanked 710 Times in 578 Posts

Need to remap ASCII characters to Unicode?

hicksd8

04-28-2020

Quote:

Originally Posted by hicksd8

Need to remap ASCII characters to Unicode?

No. It's not that simple. If it were that simple, there would be no issue now. (The migration script has done encoding mapping from day 1. The DB is already UNICODE, and the Ruby script already does encoding mapping. It is not as simple as a "general remap", or it would be done already.)

Let's stick with the plan. Find links with problems. I know how to fix these if everyone will follow my original plan and provide specific links with specific issues (original v. migrated posts).

What I need are EXACT examples of the real problem in OUR DB (not theory). Thanks. The links I posted were directly related to the EXACT char problem I am working on today (right now). I used that code as a basis to directly address the problems you guys found.

We want to work this bottom-up. Bottom-up means finding the exact issues (not theory) and fixing the exact code for each encoding issue.

Please. I'm busy and need to get this done the way I know will work. The only way to get this done correctly and reliably is bottom-up. Not top-down theory and speculation.

Thanks.

What I need from testers in this thread is examples of ORIGINAL versus MIGRATED posts. I can take care of the rest (finding the encoding in the DB, finding the correct transform, writing the code, running it, testing it in the DB, etc.). Please stay on track looking for issues. That is the best way to help get this done.

I already have a tested, working solution for everything we have identified so far.

What I need are more examples of any error, anomaly or other data migration integrity issue, in two links (the original post and the migrated post).
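To show why the pair of links matters, here is a rough Python sketch of comparing an original post body with its migrated copy and reporting only the characters that differ, as code points. The function names and the sample strings are assumptions for illustration; the real comparison is done against the forum's own DBs.

```python
import difflib


def char_diffs(original: str, migrated: str):
    """Return (original_fragment, migrated_fragment) pairs where the texts differ."""
    matcher = difflib.SequenceMatcher(None, original, migrated, autojunk=False)
    return [
        (original[i1:i2], migrated[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def describe(fragment: str) -> str:
    """Render a fragment as space-separated code points such as U+2019."""
    return " ".join(f"U+{ord(ch):04X}" for ch in fragment)


if __name__ == "__main__":
    # Hypothetical sample: a curly apostrophe that became "?" after migration.
    original = "don\u2019t panic"
    migrated = "don?t panic"
    for before, after in char_diffs(original, migrated):
        print(f"{describe(before)} -> {describe(after)}")
```

With both versions in hand, the exact offending code point falls out of the comparison; with only the migrated text, there is nothing to compare against.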

Neo

04-28-2020

Here is the simple version.

If we have a post full of ORIGINAL v. MIGRATED links, it is easy for me to compare, come up with code, test and retest.

Without the links, or with links scattered all over the place (email, WhatsApp messages, carrier pigeon), it is hard for me to go back and test, and it takes me too much time because there is a great amount of work to do.

This is why I called for testing exactly as I did in my first call for testing:

Here is an image of what we need, from my first post on this caper:

Please Help Integrity Test New Discourse Forums V2

Quote from the link above:

[Image: Issue with Keyboard or Char Encoding During Migration-screen-shot-2020-04-28-31655-pmjpg]

Neo

04-28-2020

While I am doing another test run, let me try to explain this better.

Our DB is nearly 15 years old.

People have copied and pasted text in all kinds of encodings into the database. That text may or may not have been transformed to the encoding of the DB. In addition, over the years, the encoding of the DB has changed. It was not UNICODE in the beginning.

The same is true for keyboards. People have typed from all kinds of keyboards over the years. Sometimes this adds to the encoding problem, but generally it comes from copy-and-paste, from what I have seen. Many people like to write their posts in a desktop editor and then copy and paste them into the forums.

So, running any generic encoding translation will not work for all encodings. If it did, this problem would have already been solved. Sometimes UNICODE does not work because there are encoded chars which are not part of UNICODE.
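As a small illustration of why one generic translation cannot be right for every row, the Python sketch below decodes the same byte under three encodings. The byte `0x92` is an assumed example (it is the curly apostrophe in Windows-1252), not one taken from the forum DB, and `decode_attempts` is a hypothetical helper name.

```python
def decode_attempts(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding and record the decoded text, or None if decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None  # the bytes are not valid in this encoding
    return results


if __name__ == "__main__":
    raw = b"\x92"  # assumed sample byte, typical of pasted Windows "smart quotes"
    for enc, text in decode_attempts(raw).items():
        shown = "invalid" if text is None else f"U+{ord(text):04X}"
        print(f"{enc}: {shown}")
```

The same stored byte is an error in one encoding, a curly apostrophe in another, and a control character in a third, so the correct transform depends on where each piece of text originally came from.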

It's not a theory. It is a fact of years of running a busy forum with people all over the world copying and pasting their locally encoded text into our DB. Sometimes we get lucky and the encoding works.

All we can do, is identify it and squash it, or ignore it.

It's not critical either, because I can fix it after migration directly in the DB, as I have been doing today. The best place to fix it is in the legacy MySQL DB when possible, but it can also be fixed in Postgres, provided no information was lost in the migration from MySQL to Postgres.

This is why I am kinda begging everyone to help test. I can write the code to fix the issue if I clearly see the issues. There are one million posts. The more people take a look, the more it helps.

Sorry to be begging... LOL. I have been working on this for months. My wife is starting to feel like she has no husband, which I can understand.

But I wanted everyone to understand why I have asked for this help.

This is exactly what I need..... (image from first post on this test)

-------------------------

Honestly, so far people have provided me a total of about 3 or 4 links only where this encoding issue comes up and most of those are in non-public spam archives.

I don't want to be spending my time chasing outliers in two decades of encoding. Either there are issues or there are not. I am not going to spend my entire life chasing unimportant encoding issues to take a migration which is 99.99% perfect to 99.9999% perfect. It's not a good use of our time.

So, please provide detailed accounts of any remaining encoding issues, with links to the original and the migrated version.

Thanks.

Neo

04-28-2020

Here is one, but the issue is in the original DB.

Retry Logic But In Cron - UNIX for Beginners Questions & Answers - UNIX.COM Community

[Image: Issue with Keyboard or Char Encoding During Migration-screen-shot-2020-04-28-45232-pmjpg]

In the MySQL DB:

[Image: Issue with Keyboard or Char Encoding During Migration-screen-shot-2020-04-28-44927-pmjpg]

So, no reason to waste time on encoding issues which are not migration issues.

This illustrates the problem: chasing errors in the original DB which migrated exactly as they were posted.

This is why I need the ORIGINALS and the MIGRATED versions if anyone sees any issue.

However, if anyone knows the correct replacement for that strange stuff, I will add it to the translation.
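For anyone trying to work out what the strange characters actually are, a short Python sketch like the following lists each non-ASCII character with its code point and Unicode name. The helper name and the sample string are hypothetical; the real text would be copied from the affected post.

```python
import unicodedata


def describe_non_ascii(text: str):
    """List (char, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    # Hypothetical sample containing a curly apostrophe and a C1 control character.
    sample = "retry\u2019s log\x92ic"
    for ch, code_point, name in describe_non_ascii(sample):
        print(code_point, name)
```

Knowing the exact code points makes it much easier to propose a replacement that can be added to the translation.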

Neo