Wednesday, June 15, 2005

Better Living Through Code Optimization

Well, I think I optimized my option data importer enough to be practical this last weekend. This turned out to be quite a chore. Some of you may wonder why speed was even an issue at all. I mean, how long can it take to copy 6 megs of data from one place to another? The process of importation was not merely one of moving data around. I use a more complex structure to store my data than the form in which it arrives from Stricknet. When designing a database it's important to use a good "structure". When you have a well-structured database, you find that it is easy to extract information from it. It will also be easy to integrate new types of data into the database later. Extracting information from a poorly structured database is often such a chore that useful information never gets used or even discovered.

Among other things, a well-structured database stores each fact only once. One reason for this is to reduce overall storage requirements. An even better reason is that it makes it easier to maintain consistency across the database. As information changes and gets updated, you only have to update it in one place. If you're storing the same information in multiple places, great care has to be taken to update it in all of those different tables and records. Furthermore, there is a tendency for information to get lost, because incomplete updates leave inconsistencies that introduce uncertainty into the database.

Here's a practical example of what I mean. The data from Stricknet arrives as two tables. In the equity table, you have rows that look like this for each trading day:

Symbol, Company Name, , Open, High, Low, Last, Volume

AA,Alcoa Inc,,29.99,30.0,28.64,28.7,6591100

It would be very easy to import that data raw, as-is. One obvious problem with that, though, is that every day I would be writing the symbol and the company name into my database. Why do this for every date record? The symbol and company name don't change on a daily basis. Instead I have a table called "symbols", which looks kind of like this:

ID, Symbol, CompanyName

Then I have another table called "dailyequityprices" that looks something like this:

SymbolID, Open, High, Low, Last, Volume
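
In SQL, those two tables would look something like the sketch below. I'm using SQLite here just for illustration; the column types, the UNIQUE constraint, and the foreign key are my own guesses — only the column lists come from the descriptions above.

```python
import sqlite3

# A sketch of the two tables described above. Column types and
# constraints are assumptions; the column lists match the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE symbols (
        ID          INTEGER PRIMARY KEY,
        Symbol      TEXT NOT NULL UNIQUE,
        CompanyName TEXT
    );
    CREATE TABLE dailyequityprices (
        SymbolID INTEGER NOT NULL REFERENCES symbols(ID),
        Open     REAL,
        High     REAL,
        Low      REAL,
        Last     REAL,
        Volume   INTEGER
    );
""")
```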

So, when I import the data I first check whether there is a symbol record for "AA", and if there isn't, I write the record to symbols just once. Then for each date record I store just the ID of that symbol record instead of storing the text of the symbol and the company name. Not only is this more efficient, but now the equity data is cross-referenced to all of the data I already have on Alcoa in my database, because all of my existing data also references the same symbol record. For example, I already have earnings data, split data, dividend data, etc.
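
In code, that check-then-write logic looks roughly like this. This is a sketch using SQLite from the Python standard library; the table and column names are taken from the descriptions above, and the schema details are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols "
             "(ID INTEGER PRIMARY KEY, Symbol TEXT UNIQUE, CompanyName TEXT)")
conn.execute("CREATE TABLE dailyequityprices "
             "(SymbolID INTEGER, Open REAL, High REAL, "
             "Low REAL, Last REAL, Volume INTEGER)")

def get_symbol_id(conn, symbol, company_name):
    """Return the ID for symbol, writing a symbols record only if one
    doesn't already exist."""
    row = conn.execute("SELECT ID FROM symbols WHERE Symbol = ?",
                       (symbol,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO symbols (Symbol, CompanyName) VALUES (?, ?)",
        (symbol, company_name))
    return cur.lastrowid

# Write "AA"/"Alcoa Inc" once, then reference it by ID in the price row.
sid = get_symbol_id(conn, "AA", "Alcoa Inc")
conn.execute("INSERT INTO dailyequityprices VALUES (?, ?, ?, ?, ?, ?)",
             (sid, 29.99, 30.0, 28.64, 28.7, 6591100))
```

Importing the same symbol a second time returns the existing ID rather than writing a duplicate record.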

This cross-referencing tends to multiply the useful information in a database. When I import the daily price facts I haven't just imported the price of a particular stock on a particular day. I've created new information as well, through the cross-referencing to other related facts which already exist in my database. As new facts are imported into a well-structured relational database, the total information tends to grow faster than linearly.

Another practical benefit that I mentioned earlier is that if I have to change some information, I only have to change it in one place. For example, when the symbol "QQQ" changed to "QQQQ", I didn't have to change every price record, every dividend record, and every split record in my database. I just changed one symbol record.

I used to solve optimization problems regularly years ago during my short career as a game developer, but it has been a while since I've needed to undertake this type of chore. I guess it is like riding a bicycle, though. You never forget. Here are the general steps I use for optimizing code:

Step 1: Write bug-free, easy-to-understand, intuitive code. Ignore optimization issues.
The first step in writing optimized code is to not even try. You need a solid foundation on which to build the optimized code. If you try to optimize from the very beginning of a project, more than likely you'll only add complexity and bugs, without much of a speed increase. This is what I did: I wrote a very simple importer which imported the data one row at a time, reading from the database before writing every single record to make sure that duplicates were not introduced and that the consistency of relationships was maintained. Not all of these checks needed to be done for every record, but this was the easiest and most intuitive way to do it. I err on the side of caution by not eliminating checks that might be redundant.

So after writing my simple implementation of the importer, I found that it took well over an hour to run. I never actually ran it to completion for even one day of data. (From my retrospective measurements, I now know it would have taken the better part of a day.) It simply took too long.

Step 2: Measure everything.
Often, inefficiencies in a relatively few lines of code will account for the vast majority of the slowness. If you don't measure, you waste a lot of effort optimizing lots of code, with little effect on execution time.

The data comes in a zip file, which I decompress on the fly as I read the data in. Decompression is a slow process. One of my first ideas for optimization was that maybe I should decompress the zip files in batch ahead of time. So the first thing I did was time how long it takes just to decompress the zip file the way I was doing it. To do this I disabled all of the parsing code and database access and ran one file through. It took less than 0.8 seconds. Hmm... maybe the parsing is slow. So I enabled that as well. It still took around 0.8 seconds to execute. So then I knew there were no easy optimizations. To get a baseline, I disabled all but the simplest database operation: simply checking whether an equity symbol already exists for each line of text in the zip file. I executed this, and after over 4 hours it was still not done. Wow! What an eye-opener.
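
The timing itself doesn't need anything fancy. A little helper along these lines is enough to isolate each stage; the lambda below is just a stand-in for whatever stage you're measuring.

```python
import time

def timed(label, fn):
    """Run fn, print how long it took, and pass its result through.
    Wrapping each stage of a pipeline this way shows which stage
    actually accounts for the slowness."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Example: time a stand-in for the decompression stage.
payload = timed("decompress", lambda: b"x" * 6_000_000)
```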

Well, one way to speed up database record retrievals is to add strategic indexes to your database, so I added indexes throughout my database. Then I enabled everything and ran the importation again. It was still taking forever.

So, next I did an experiment. I ran a query on an empty table for every record in my options zip file. I would expect the actual SQL query to run in a trivial amount of time, yet this took something like 10 minutes. The lesson here was clear: merely opening a connection to the database and transmitting query code is a significant expense. I now knew what I would have to do: batch all of my database operations. For example, querying the database for 1 option symbol record takes about the same amount of time as querying for all 140,000+ option symbol records. So instead of doing those queries one at a time for each record in the zip file, I needed to read all of the information ahead of time.

Batching all of my database operations while still ensuring consistency required importing the data in batches, building hash tables in memory, and so on. In short, it greatly increased the complexity of my code, introduced many bugs that had to be found, and my code is not as robust as my original implementation. However, importing a Stricknet zip file now takes only 20 seconds for the first zip file. The time then slowly increases as the data grows; after running the importation overnight, with the vast majority of the data imported, the time per zip file is up to 5 minutes. 20 seconds to 5 minutes is much better than over 4 hours, so the trade-off in development time, complexity, and robustness in exchange for speed seems well worth it in this case. My original code probably would have taken months, if not years, to import all of the data.
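
The batched approach boils down to one read that loads every existing symbol into an in-memory hash table, then bulk writes for everything new. Here's a rough sketch of the idea, again using SQLite for illustration, with the schema details assumed; my real importer handles options data and many more consistency checks than this.

```python
import sqlite3

def batch_import(conn, rows):
    """rows: a list of (symbol, company, open, high, low, last, volume).
    One query loads all existing symbols into a dict, so per-row lookups
    never touch the database; new records then go in via bulk
    executemany calls instead of one round trip per record."""
    ids = dict(conn.execute("SELECT Symbol, ID FROM symbols"))
    # Collect symbols not yet in the database, deduped within the batch.
    new_symbols = {}
    for sym, name, *_ in rows:
        if sym not in ids and sym not in new_symbols:
            new_symbols[sym] = name
    conn.executemany(
        "INSERT INTO symbols (Symbol, CompanyName) VALUES (?, ?)",
        list(new_symbols.items()))
    # Reload the hash table so the new symbols have IDs too.
    ids = dict(conn.execute("SELECT Symbol, ID FROM symbols"))
    conn.executemany(
        "INSERT INTO dailyequityprices VALUES (?, ?, ?, ?, ?, ?)",
        [(ids[r[0]],) + tuple(r[2:]) for r in rows])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols "
             "(ID INTEGER PRIMARY KEY, Symbol TEXT UNIQUE, CompanyName TEXT)")
conn.execute("CREATE TABLE dailyequityprices "
             "(SymbolID INTEGER, Open REAL, High REAL, "
             "Low REAL, Last REAL, Volume INTEGER)")
batch_import(conn, [
    ("AA", "Alcoa Inc", 29.99, 30.0, 28.64, 28.7, 6591100),
    ("AA", "Alcoa Inc", 28.70, 29.1, 28.10, 28.9, 5120000),
])
```

However many price rows arrive in a batch, this does a fixed handful of database round trips instead of several per row, which is exactly where the hours went in the naive version.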

