So much for backtesting?
Well, I wrote some code this weekend to import my stricknet.com options data into my MySQL database. I had a few hours to burn tonight so I debugged it, and started importing one day of data as a test. The data from stricknet comes as a zip archivefor each day which contains 2 comma dileminated text files. The first one is about a 136KB uncompressed and contains end of day stock price data. There are around 2600 optionable stocks. The other one is about 6.1megs uncompressed and holds the options data. My program reads the files directly from the compressed zip archive and imports them into my MySQL database.
I need the data in a relational database. This gives me more flexibility, and makes it easier to do the complicated queries I will want to do as I develop quantitative options trading strategies. I also structure the data more efficiently than the simple comma deliminated list that stricknet provides. The comma deliminated lists contain much redundant information which I eliminate by using good data structure.
So anyway, I started importing the first zip file a long time ago and it's still not done. I don't know how long ago exactly. It wouldn't surprise me if this is taking over an hour for just one day's data. The stricknet data comes as 2 CD's and contains just a little over 2 years of data. There are what 240 trading days a year? 240 days * 2 years * 1 hour per day / 24 hours a day = 20 days to import the data...if it takes an hour to import each day's data. That estimate of 1 hour per day may be too low. This may be too long. If it takes that long I may never get all of the data imported and may never be able to do much backtesting.
I need to think of some optimizations. Maybe unzipping all of the data first would speed it up? I'll have to try that. I do a lot of checks to ensure the data-integrity as it gets imported. I suspect some are unnecessary, maybe eliminating some of these will speed it up a little. I could add some indicies to speed up the checks...but that would also slow down the inserting of new records.
I have another computer that I don't use for anything at all. It's a fairly fast computer. One of the first things I probably need to do is to set that up as a dedicated database server. Maybe that is what I do...Just start running the import program on that other computer 24 hours a day for a month or two.
This is a very discouraging development.
1 Comments:
I am doing reads to validate the data (ensure I don't duplicate records) as I write it. I've actually heavily optimized it this weekend. My tests cases suggest that I should have it down to a few minutes per file once I debug all the parts and get it together. I will blog about it later. The biggest culprit: overhead connecting to the database too frequently.
Post a Comment
<< Home