Agile Ajax

Using ActiveRecord to Migrate Legacy Data

birds.jpg
Here's your problem. You're converting an existing project to Rails. The existing system has a database that, in the best case, doesn't use Rails naming conventions. In the worst case, the old data is ill-structured, or uses a commercial database platform that you want to abandon.

Rather than deal with the ongoing pain of using Rails against the legacy database, you've decided to migrate the legacy data to a new database, using standard ActiveRecord constructs. Your next step is to migrate the data from old to new.

I'm making a few assumptions about your migration needs, hopefully they'll be not very restrictive:

  • You can connect to the existing database, either the live copy or (better) a recent snapshot.

  • Although you'll start migrating data during the development process, you'll need to continue migrating new data right up until your final switchover.

  • Performance is not a major issue for the migration.

My book, Professional Ruby On Rails, has more information on dealing with legacy databases.

After the jump, some ideas on how to use ActiveRecord to manage a data migration.

Set Up

What you want to do is create an ActiveRecord class for each table in the legacy database that you want to migrate. You don't want to mix these classes in with the main part of your app -- I like to put them in lib, or lib/migration. These classes will start out quite simple, so you can even start them all off in a single file.

To manage the connection to the legacy database, you'll need to add an entry to the database.yml file, call it legacy_development, and include the connection information for the database you are migrating from. You might also need a legacy_test database -- we'll talk about that in a second.

All your legacy classes should have a common base class that sets up the connection to the legacy database:

class AbstractLegacyRecord < ActiveRecord::Base
establish_connection "legacy_#{RAILS_ENV}"
self.abstract_class = true
end

The child classes can then set up their own specific naming options

class LegacyUsers < AbstractTitle
set_table_name "UserTable"
set_primary_key "UserTableId"
end

Naming conventions are a pain here. ActiveRecord lets you manage unusual table and key names, but if the legacy table has column names that start with capital letters, that's going to confuse the Ruby interpreter. One way around this is to put the following in the abstract parent class.

def self.normalize_attribute_names(*names)
names.each do |name|
define_method :"#{name.downcase}" do
read_attribute(:"#{name}")
end
define_method :"#{name.downcase}=" do |arg|
write_attribute(:"#{name}", arg)
end
end
end

That method creates new lower case getters and setters for each name passed in. (It's harmless if the name is already all lower case, although you could explicitly test for that. The list of names can either be explicitly hardcoded in the class or automatically generated via ActiveRecord#columns (how you test this code may be a factor in whether you can use this). Call this method at the class level for each subclass right after you set the table and key name. The lower case versions of each attribute will be available, but the mixed case version will not go away.

The Basic Plan

The basic algorithm of the migration is really simple. For each table in the legacy database that needs to be migrated, you want something like this:

LegacyModel.find(:all).each do |legacy|
legacy.to_model
end

Where legacy.to_model looks something like this (I use a separate new and save rather than create to allow for further error checking in more complex cases).

def to_model
model = new_model.new(:new_attribute => old_attribute,
:another_new_attribute => old_attribute)
# error checking
model.save
end

Of course, the migration code I'm working on now is about 1000 lines of code, so it can get more complex...

Getting Legacy Models

In the simple case, you'll be getting every row from the old database. However, you'll often need to apply more complicated logic to determine which data gets transitioned. (For example, data older than a certain age might not be needed, or user data for lapsed users, etc...).

If you are working against a live database, you'll probably want to have separate logic for the case where you are migrating everything from scratch as well as the case where you are making incremental updates to reflect the most recent changes to the legacy data without reloading everything. Depending on the make up of your data, the incremental update can be managed by tracking the highest ID already migrated, or by tracking the time of the latest migrated change.

You want the migration to be idempotent, which is our Computer Science Word of the Day, meaning that no matter how many times you run the migration, you get the same result. If a subsequent run of the migration adds duplicates of records that have already been migrated, that would be a problem. If you run the incremental migration twice in a row against the same snapshot, the second run should migrate no data. In general, you don't want to be in the business of policing when the data migration is run, you want the data migration to behave itself no matter how often it runs. The thing to keep in mind is that even if you think you will only run the migration once, you're probably going to run it multiple times.

It's helpful to add a column in the new database structure named legacy_id or something like that. This helps with incremental migrations, and also gives you traceability when looking at the new data and trying to figure out how it got there.

On a related note, starting each full migration with a db:reset will ensure that all the new database primary key id columns will start over at one. I find that consistency to be useful.

Relationships

Presumably there will be relationships among both the legacy models and your new models. My slight preference is not to specify ActiveRecord relationships in the legacy classes, on the grounds that a) messing with the legacy naming structure is a pain, b) if the data is being scrubbed en route to being migrated, it's not always the case that two objects that are related in the legacy database will both be translated together, c) keeping things separate makes it easier to migrate tables one at a time, and d) I feel just a tiny bit more in control of the database calls. That said, it's perfectly reasonable to specify ActiveRecord relationships on the legacy classes if that makes more sense for the structure of your data.

When creating related objects in your new database, there are a couple of things to keep in mind. Often, if your new database is more aggressively normalized than the old one, then a single legacy row could be converted into two or more new database objects (a common example is a user being split into a user and an address).

On the new database side, the belongs_to half of a relationship (the model with the foreign key in its table) can be created by setting either the entire data object (user.address = address) or just the id (user.address_id = id). The id-only way is useful because it means you can cache objects in some kind of keyword => id hash and save yourself a lot of lookups in the legacy database.

If you are setting the object from the many side of a one-to-many or many-to-many relationship, you can create the object in one line via the association proxy, user.address.create(:address1 => '225 W. Superior') or the push operator user.address << address. If you still want to work via ids only, Rails automatically creates a separate accessor for that user.address_ids = [1, 2, 3] or user.address_ids << 4.

Caching can really help performance here by preventing multiple data lookups for already migrated records. You need to be careful, though, caching every record that you migrate can cause really outrageous memory use when you try to migrate a lot of data. Caching just the relationship between a legacy id and a new id is sometimes enough to save lookups. If the data is somewhat chronological, you may be able to tell when you no longer need an object and can flush it from the cache.

Attribute Values and validation

One motivation for migrating data is to clean up anomalies in the original data set. Often, your Rails code will have validations defined that are stricter then the legacy code. Since your migration is going through ActiveRecord, those validations are applied as part of the data transfer. This is generally good, because it makes it easy to clean up the data, but it does mean that you'll need a way of working around invalid legacy data, since the newly created records won't save if the validations don't pass.

If you chose to throw out the invalid records, you should at least log them, the method call record.errors.full_messages will give you a reasonable text description of which validation failures blocked the save. In very early development, it might be reasonable to throw an exception and exit on an unexpected validation, but by the time you're ready to go for real, the migration script shouldn't stop for anything.

The other alternative would be to mitigate invalid data by replacing it with clean, or at least passible data. Whether this makes sense for you would depend on the details of your project.

Testing

Testing a migration script is subject to the same difficulties as testing any multi-database application in Rails, namely that you need to manually handle some of what the Rails test system normally does for you, most notably fixture data. There are two main options for testing.

The first is to create a second test database shadowing the main one, manually load fixture data to it, and run as you would a normal Rails unit test. One advantage of this mechanism when running migrations is that you typically are only reading the legacy database, so you may be able to, say use an SQL dump file from the real database as pre-loaded test data. The downside of this plan is that replicating the legacy database's schema could be time-consuming. If the legacy database is on a more esoteric database platform, you'll probably still want to create your test db in MySQL or something convenient, however that leaves the possibility of a mismatch between your test and runtime environments. Still, if it's feasible to replicate the legacy schema, I think this method will be easier for testing.

The other option is to use mock objects to fake the database connection. This can be something of a pain with ActiveRecord, but there are several recipes on the web that let you generically mock the database connection. In your specific migration, it may be easier to explicitly mock the find methods that are called by your migration script to return instances of the legacy models. Plusses of using mock objects are that the test script will generally run faster, and it may be quicker to set up if the remote schema is not easily replicated. The downsides are that you are virtually guaranteed a mismatch between the test and runtime environments, and my general quibble that I find mock testing to be somewhat brittle.

Executing

Your program structure should be in three layers

  • A thin Rake layer. You'll probably want to define at least "migrate all" and a "migrate all with database reset" tasks, you can break out tasks for individual tables as needed -- you may want to separate out particularly large datasets to make it easier to develop. Also, during development, it's helpful to be able to pass an argument limiting migration to a specific time slice, so you can cycle through acceptance testing more quickly.

  • A separate migration object to bridge between the rake and ActiveRecord layers. If your migration is simple, this can be in the Rake layer, but it's a good place to do things like hold caches of legacy information, set up log files and manage logging, specify which objects are being migrated, and the like.

  • The ActiveRecord layer described above.

Hope this helps, happy migrating!


If you liked this, please check out my book, Professional Ruby On Rails.

Topics:

Leave a comment

Powered by WP Hashcash

About Pathfinder

  • We design and build extraordinary applications for companies looking to make the next great idea a reality.
  • learn more

Topics

WordPress

Comments about this site: info@pathf.com