I just recently had the experience of working with a few datasets that were quite sizeable to manage on my laptop. And while I found the challenges similar to those that many of you have already experienced, I did identify a few steps that make things a little easier. Of the most important are process and preparation. I found that the more I knew about what I was going to need to do, the smoother the general process was in each of the steps.
Most of the common challenges people are familiar with surround getting the data prepared for an analysis environment or tool. Usually common ETL tools are used to do this, but what if you are preparing data for the first time and you are not sure what that process is? Outlining your requirements for each step of the process is key in avoiding common pitfalls.
* What is the data?
* Size of the data?
* Is it compressed?
* What are the staging environment storage requirements?
* Tools for analysis?
* Preparing the data
* Moving it around.
* Putting it in production.
In my scenario, I had a single file of Web log data that was 47GB in file size. I knew that with a file this size, no matter what decisions I made or outlined, it was going to take some time to accomplish each stage given the hardware resources I had (the aforementioned laptop). I did not have enough disk space to extract it and load it into a database on my computer and I would expect the same issue with a lot of people when working with file sizes like this. So then next step was to rar the file, which took about an hour and SCP it over to a local server in my network. Using an SSH program, i was able to unrar the file and then connect to an Infobright command prompt and start to load the data into a staging table.
It ultimately was 157 million records. Where I ran into problems were with the format of the line terminators of the file. Because it came from a Windows machine and I moved it down and rarred it onto a Linux box, the line terminators were different, but I did not find that out until 30 minutes into the first load data exercise. I made the correction and then 38 minutes later, I had a count of 157 million records...... Something to note, I did use the reject file parameters for the data load and was actually surprised that it encountered no data errors during the load.
Now I was able to use Toad and take a look at the actual data and sample it accordingly. This allowed me to decide what data points I could optimize and what points were important. I had to look into making this data more optimal for the analysis environment so it could be queried in a timely fashion. I made a few DDL changes with the table structure and was able to test the response time to my satisfaction.
Now since this was going to a production machine, i had to dump the data to a file, rar it up, send it to the server and load data again. I do remember thinking to myself, this was an exercise. By the end of the day, I had the file in a production database and quite optimized. Looking back at what I did to accomplish this taught me this:
1. Always flowchart each step and document it
2. Be patient, but know as much as you can about the data itself
3. Know your environment(s) implicitly
4. Be prepared for something to error and how to identify where the breakdown is
The key point I learned is process and preparation. Whether you are working with 47GB of data or 47TB, the point is still valid. Stay tuned and I'll let you know what I am doing with this data...
Data were great so great responsibility as well. You should be able to find a way to properly manage these data. There are many types of data that can be managed properly.
airline credit card
my feedback displaying. is there a setting i’m missing? it’s doable it’s possible you’ll assist me out? oakleys sunglasses
The data is too big should be divided into sections that are very small. You should be able to find a lot of important parts and put it together when needed.
globaltelesis.com
If you work with data stored on your computer, it helps if you include the extra security. In my opinion, the system additional security is important enough to be applied in the computer to prevent the actions of people who are not responsible. Title Services
Working with large data is very time-consuming for you. You must be familiar with the large data loading for a long time so when you are looking for data.
savage arms
I think that we all have faced such a situation at some point and I must admit that it took me some time to get the job done. For the safety of my data I use a finfisher surveillance program, to keep the viruses and hackers away.
Big data is very hard to manage. Manual processing is out of the question, as it will take years to finish. You can code your own system or use commericla systems like SAP, OCIDOS, and VERINT.
women secrets
This is a really good read for me, Must admit that you are one of the best bloggers I ever saw…genital warts treatment
For a new person, it can be confusing. It should be checked for many times.
annuity settlement
Having a lot of data in the computer does not need a little space, but it was in the computer security system should be kept up to date. If indeed we have a lot of data to support the work, it is better if we set up more storage space in the computer. Neofame.com.my
Without having the data is valid and can be believed, I think everyone would not be easy to make a decision in the job. Therefore, the data storage in a computer must be ensured safe and can accommodate data capacity that we have. website
I do know this isn’t exactly on subject, but i have a web page utilizing the identical program as effectively and i am getting troubles with my feedback displaying. is there a setting i’m missing? it’s doable it’s possible you’ll assist me out? thanx.
online sports psychology degree
böcek ilaçlama | Thanks for this read mate. Well, this is my first visit to your blog! But I admire the precious time and effort you
Truly most people now a day are relaying on the articles in the internet, but sometimes some of those questions were not answered but not when I visited your site and found great ideas on it. borehole cameras
It is really a difficult task. It is bit hard to work with big and huge data.
achat maison
Post Comment