Joinutility seperatorLogin utility separator Infobright.com
   
 
searching for numerical value in large tables
Posted: 23 September 2009 07:04 AM   Ignore ]  
Newbie
Rank
Total Posts:  49
Joined  2009-05-19

Hi,

I am back of the subject of my large tables with IP addresses in. On your advice I sorted the data before loading and that made a huge difference. I can do sub-second queries against billions of rows.

My question is: The above applies for the IP in column 1 - ie; I sorted the table on column 1. If I want to do a similar search but on the IP is column 2 then I have no option but to sort the table again on column 2 - is that correct?

This means I will have two ‘identical’ tables, one sorted on IP_1 and a second sorted on IP_2. I can then search on both tables and then do a JOIN. I want the two queries to complete and then do UNION ALL on the two results.

I am ok with doing this, but just wanted to check it was necessary for searching on either column.


Thanks

Profile
 
Posted: 01 October 2009 09:57 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  505
Joined  2008-08-18

Hi,

To your point, obviously it’s not possible to sort the table on two different columns. So if you really need the sub-second response on both columns, you probably need to load the data in 2 different tables.

Note, however, the data does not need to be totally sorted in order to get good performance. Because of the way the knowledge grid works across data packs, an approximate sort (or sort of each incremental load) will probably give you what you want.

Just another thought: What data type are you using to store the IP addresses? Querying on a numeric type will be faster than varchar. Try storing the IP address as BIGINT (note you need to use BIGINT instead of UNSIGNED INT because we don’t currently support the unsigned range):

CREATE TABLE INET (
  
IP_1N BIGINT,
  
IP_1A VARCHAR(15),
  
IP_2N BIGINT,
  
IP_2A VARCHAR(15),
  
otherdata ...
  ....
); 

Convert the IP addresses to usigned 32-bit values in your ETL process. Then in your queries you can use the INET_ATON() function

select otherdata from INET where IP_1N INET_ATON("10.20.30.40"); 

Not sure how much improvement you will get from this, but it may be worth a try.

Geoffrey

Signature 
Profile