I recently worked with a few datasets that were a bit too sizeable to manage comfortably on my laptop. While the challenges were similar to those many of you have already experienced, I did identify a few steps that make things a little easier. The most important of these are process and preparation: the more I knew about what I was going to need to do, the smoother each step of the process went.
Most of the common challenges surround getting the data prepared for an analysis environment or tool. Usually ETL tools handle this, but what if you are preparing data for the first time and are not sure what that process looks like? Outlining your requirements for each step of the process is key to avoiding common pitfalls:
* What is the data?
* How big is it?
* Is it compressed?
* What are the staging environment's storage requirements?
* What tools will be used for analysis?
* How will the data be prepared?
* How will it be moved around?
* How will it be put into production?
In my scenario, I had a single file of web log data that was 47GB. I knew that with a file this size, no matter what decisions I made, it was going to take some time to accomplish each stage given the hardware resources I had (the aforementioned laptop). I did not have enough disk space to extract it and load it into a database on my laptop, and I expect a lot of people hit the same issue with files this size. So the next step was to rar the file, which took about an hour, and SCP it over to a local server on my network. Over SSH, I was able to unrar the file, connect to an Infobright command prompt, and start loading the data into a staging table.
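For anyone following along, here is a rough sketch of that transfer-and-stage round trip. The file names, host, database, table, and delimiters are all placeholders for illustration; Infobright accepts standard MySQL LOAD DATA syntax, so adjust the field and line terminators to match your log format.

```bash
# On the laptop: compress the 47GB log and copy it to the staging server
rar a weblog.rar weblog.log
scp weblog.rar user@staging-server:/data/incoming/

# On the staging server (over SSH): extract and load into a staging table
unrar e weblog.rar
mysql -u loader -p staging_db -e "
  LOAD DATA INFILE '/data/incoming/weblog.log'
  INTO TABLE weblog_staging
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';"
```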
It ultimately came to 157 million records. Where I ran into problems was with the file's line terminators. Because the file came from a Windows machine and I had moved it down and unrarred it on a Linux box, the line terminators were different, but I did not find that out until 30 minutes into the first load data exercise. I made the correction, and 38 minutes later I had a count of 157 million records. Something to note: I did use the reject file parameters for the data load and was actually surprised that it encountered no data errors during the load.
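If you hit the same thing, there are two easy ways around it: strip the carriage returns before loading, or declare them in the load statement. The reject-file session variables shown below are the Infobright settings as I remember them, so double-check the names against your version's documentation; paths and table names are again placeholders.

```bash
# Option 1: strip the Windows \r\n endings before loading
sed -i 's/\r$//' weblog.log        # or: dos2unix weblog.log

# Option 2: declare the terminators in the load itself, and capture any bad
# rows in a reject file (Infobright session variables as I recall them)
mysql -u loader -p staging_db -e "
  SET @bh_reject_file_path = '/data/incoming/weblog.reject';
  SET @bh_abort_on_count = 1000;
  LOAD DATA INFILE '/data/incoming/weblog.log'
  INTO TABLE weblog_staging
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\r\n';"
```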
Now I was able to use Toad to look at the actual data and sample it. This let me decide which data points were important and which I could optimize. The goal was to make the data a better fit for the analysis environment so it could be queried in a timely fashion. I made a few DDL changes to the table structure and tested the response times to my satisfaction.
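The specific changes will depend on your data, but for web logs in Infobright the usual wins are tightening column types and flagging low-cardinality strings as lookup columns. The column names below are made up purely to illustrate the idea, not the actual schema I used.

```bash
mysql -u loader -p analytics_db -e "
  -- Hypothetical columns: tighter types compress better and scan faster,
  -- and low-cardinality strings benefit from Infobright's lookup columns
  CREATE TABLE weblog_optimized (
    hit_time    DATETIME,
    status_code SMALLINT,
    bytes_sent  INT,
    user_agent  VARCHAR(255) COMMENT 'lookup',
    request_url VARCHAR(1000)
  ) ENGINE=BRIGHTHOUSE;

  INSERT INTO weblog_optimized
  SELECT hit_time, status_code, bytes_sent, user_agent, request_url
  FROM weblog_staging;"
```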
Since this was going to a production machine, I had to dump the data to a file, rar it up, send it to the server, and load the data again (a sketch of that round trip follows the list below). I do remember thinking to myself that this was quite an exercise. By the end of the day, I had the data in a production database, well optimized. Looking back at what I did to accomplish this taught me the following:
1. Always flowchart each step and document it
2. Be patient, but know as much as you can about the data itself
3. Know your environment(s) implicitly
4. Be prepared for something to fail, and know how to identify where the breakdown is
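For completeness, the dump-and-reload round trip mentioned above looked roughly like this; once more, paths, hosts, and table and column names are placeholders standing in for my actual setup.

```bash
# Dump the optimized table back out to a flat file for the trip to production
mysql -u loader -p analytics_db -e "
  SELECT hit_time, status_code, bytes_sent, user_agent, request_url
  FROM weblog_optimized
  INTO OUTFILE '/data/outgoing/weblog_optimized.txt'
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';"

# Then the same archive / copy / load cycle as before, against the production host
rar a weblog_optimized.rar /data/outgoing/weblog_optimized.txt
scp weblog_optimized.rar user@prod-server:/data/incoming/
```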
The key lesson is process and preparation. Whether you are working with 47GB of data or 47TB, the point still holds. Stay tuned and I'll let you know what I am doing with this data...