Friday, July 8, 2011

Moodle Issues after an upgrade! [Resolved!]

Part of my project in migrating to a load-balanced virtualized infrastructure involved running a minor version upgrade from 1.9.7 to 1.9.11 which addressed several bugs in our core Moodle usage. The project plan included checking the database before and after for corruption as well as re-indexing using mysqlcheck and myisamchk respectively, and no errors were detected, both in the Test environment as well as during the upgrade in the production environment. Basic checks were done post upgrade to ensure core moodle functionality. To my surprise (read: shock, horror, pain), two weeks after the upgrade, two major issues were reported by several lecturers:
    1. Quiz attempts were not being saved, where the quiz spanned multiple pages.
    2. Assignments (Advanced Uploading of Files) could not be submitted.

Tracking down this issue took some time, and since so many variables changed from the original environment and so much time had elapsed since the upgrade, reverting was not an option.
The order of troubleshooting included:
  •     Determining if the course/activity settings were at fault.
  •     Turning on Debugging to check for errors upon avtivity submits.
  •     Determining if my (tested) load balancer environment was causing the issue.
  •     Checking the new webserver configuration
  •     Checking the database configuration
  •     Checking and rechecking  and attempting to repair the database for corruption using variations of switches on the respective tools (mysqlcheck and myisamchk)
  •     Checking if caching was causing the issue (disabling memcached)
I then decided to try two different tests:
  •     Parallel installation of moodle on the same new environment/hardware.
  •     Try the current database on different hardware
The first test required using a copy of the existing moodle code, but on a fresh database. To my surprise (read: temporary relief), the environment was fine. The empty database with a test course worked perfectly. The next logical step assumed that the database may be the issue, as the previous test used the same moodle code, effectively eliminating it as the root cause.
The second test confirmed this. I migrated the same Moodle code plus a copy of the database to a clean server and the problem was replicated. At this point the epic horror dawned upon me on the possibilities ahead as visions of rebuilding from scratch haunted me.

At this point, only two variables needed investigating:
  •     Database Indexes
  •     Database Data/Structure
In my test environment (where the problem was replicated), I attempted to perform some last ditch attempts to recover the environment. Firstly, I attempted to strip all the indexes from the database and use the mysiamchk tool to rebuild.
I took a backup of the database (without indexes) using the following code:
mysqlhotcopy -u root -p yourpassword --addtodest --noindices db_name /path/to/backup/destination/
I then used the MYSQL restore method in my earlier post to restore the database to the test environment.

Since the indexes had to be rebuilt using myisamchk, it required the mysql service to be stopped. The indexes were rebuilt using:
myisamchk -rq /var/lib/mysql/db_name/*.MYI

I then ran the mysqlcheck command as in step 7 of my earlier post.

Again, no errors/issues were reported. I held my breath as I started the test quiz and assignment upload....and it failed again. That meant the data was royally screwed. (...and that's putting it nicely).
I then decided I was going to take a look at the database structure and perhaps even delve into the data itself and try to see where the issues were.

I took mysqldumps of the structure of the working database and the non-working database and compared them line by line in Notepad++ (a tedious task, but some real team effort came into play here), but came to the conclusion that the structure was fundamentally the same. Therefore, by the process of elimination, the root cause was the actual tables/data.
The structure was analyzed using the following:
mysqldump --no-data --host localhost --user=root -p db_name > /path/to/report.sql

On to the analysis of data. Analysis of the tables from the phpmyadmin interface showed no anomalies. In Moodle, there's a database report that is included with the base code which contains a bit more useful data and can be accessed via:
http://yourmoodleurl.com/admin/dbperformance.php

I ran the script and used the view tables link and started scouring the tables one by one, searching for anomalies. Interesting enough, I stumbled upon two very strange issues.
The auto_increment fields for two tables were HUGE, compared to the number of rows the tables contained:
  1. mdl_question_sessions
  2. mdl_assignment_submissions
The figures were large enough to be beyond the range of the ID field (BIGINT-10) and the names of the questionable tables coincided with the issues we were having. A faint light began to appear at the end of this proverbial tunnel.
The problem lay with the Primay Keys of these two tables. After searching for the Moodle Database schema on Moodle.org (almost a herculian task), and careful analysis of the stable structure I learned that the Primary keys were not foreign keys in other tables. The proverbial light grew a little bigger.

The only way forward at this point would be to delete the entire primary key fields on the two affected tables, recreate the field (and associated options) and reset the auto_increment to 1.

This operation was done from the phpmyadmin interface and needed to be tested significantly to ensure that other functionality would not be negatively affected. Success! It worked in the test environment. The light grew stronger...

I was especially wary about implementation in production, having a very narrow maintenance window to attempt the fix. After the database fix, I re-ran the myisamchk with repair as well as the mysqlcheck to err on the safe side. The operation was a success, and I finally saw the end of the tunnel...Problem solved.

Thoughts: I've yet to determine WHY the moodle upgrade script failed and caused this databse anomaly. From the front-end, the upgrade was successful and NO errors were reported. I've also to understand how the mysqlcheck and myisamchk could not detect that a table field contained data beyond it's configured range. Even turning on Debugging on the Moodle site failed to show anything useful about a failed insert to at least provide a tip in the general direction of the issue.