Joinutility seperatorLogin utility seperator Infobright.com
   
 
Zip as number
Posted: 17 November 2008 11:35 PM   Ignore ]  
Newbie
Rank
Total Posts:  13
Joined  2008-10-29

Just thought I’d bounce this off the community.  Do you think using an integer to store zip would performance significantly better than using varchar(5)?  Searches seem a little slow.

Thanks in advance.

Profile
 
Posted: 18 November 2008 03:53 AM   Ignore ]   [ # 1 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  109
Joined  2008-08-18

Hi Kevin!

Your question is quite interesting. I do not have any test results in this area, so it would be very fine if you can make some and post them smile

Theoretically, if you search using first digits in the zip code, integers should perform well. Possibly (depends on data layout) better that varchar.
Varchars can have an edge if the search is based on trailing digits ( e.g. LIKE ‘___76’ ).
Additionally, integers can use less memory than VARCHARs/CHARs.

Profile
 
Posted: 18 November 2008 09:22 AM   Ignore ]   [ # 2 ]  
Jr. Member
Avatar
RankRank
Total Posts:  80
Joined  2008-08-18

It would be very interesting experiment. What type of queries/conditions do you expect?

In case of Varchars/chars we create so called Character Map Knowledge Nodes that are used by rough query evaluator. For every character we store on which position in a string it occurred in a given pack of data of a column. Thus, if queries search for patterns with fixed positions of some characters then character maps can speed up queries significantly by eliminating irrelevant data packs to be uncompressed.

As about occupied memory, any integer type is represented internally by 64 bit int. Varchar(5) would take 5 bytes. Size on disk depends on compression.

Thanks,
Piotr

Profile
 
Posted: 18 November 2008 10:46 AM   Ignore ]   [ # 3 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  109
Joined  2008-08-18

any integer type is represented internally by 64 bit int

64 bits yes, but when a value is retrieved from the table and used by some algorithm. In memory cache in a sequence of 64K integer values, a value can take 1,2 or 4 bytes. Eg. if 64K subsequent values are in the range which is 250 wide, then only 1 byte is necessary to remember each,e.g. for 10233, 10234, ..., 10410 we remember shift=10233 and values in one byte 0,1,…

Profile
 
Posted: 18 November 2008 11:31 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  12
Joined  2008-08-18

I suspect, in this particular example, using INT might work better, because ZIP code is a low cardinality field.

In case of high degree of clustering, you may get higher benefit from compression, maybe even get over 100:1. 

In case of relative randomness, DPNs and and P2P may be less selective; however, in case of INT, you may greatly benefit from histograms.

Based on these considerations, I would use INT for US zip codes. 

Regards,

Alex Esterkin

Signature 

Alex Esterkin
Infobright, Inc.

Profile
 
Posted: 20 November 2008 08:16 AM   Ignore ]   [ # 5 ]  
Newbie
Rank
Total Posts:  13
Joined  2008-10-29

Well, I just loaded all of the data.  I made a little mistake.  I had added a zip plus 4 field, and that’s an integer.  I was also going to convert the standard zip field to a numeric column, but forgot, so I’ll have to reload the data again (a few late nights recently wink

However.  Here’s a quick test for zip code 10128

select count(*) from anchor where ZIPPLUS4 between 101280000 and 101289999;
+----------+
| count(*) |
+----------+
| 74604 |
+----------+
1 row in set (0.28 sec)

select count(*) from anchor where ZIP = ‘10128’;
+----------+
| count(*) |
+----------+
| 74604 |
+----------+
1 row in set (14.16 sec)

The first query is the numeric field.  I crammed the zip and the plus 4 part togeter to get a number (101280000-101289999).  That .28 seconds is the first query run on that column.  The second query is against the varchar field.  14.16 seconds.  So, roughly 50 times as long?  I’ll post the numbers after I fix my little problem with the zip field not being numeric as well.

A semi-similar query:

select count(*) from anchor where fipscode = 36061;
+----------+
| count(*) |
+----------+
| 1706440 |
+----------+
1 row in set (0.11 sec)

select count(*) from anchor where state = ‘NY’ and countycd = ‘061’;
+----------+
| count(*) |
+----------+
| 1706440 |
+----------+
1 row in set (4.75 sec)

Obviously not the same, as the second query has an ‘and’ condition.  However, the first is a numeric field that I put together from the parts of the second.  That’s .11 seconds, hitting hundreds of millions of records. Pretty snazzy.

Profile
 
Posted: 20 November 2008 09:17 AM   Ignore ]   [ # 6 ]  
Jr. Member
Avatar
RankRank
Total Posts:  97
Joined  2008-08-18

Hi Kevin,

Thank you for all your tests. Looks like rough-set-based optimizations for integers work as expected.
However, in one of the next ICE releases there is an optimization for varchars, which may speed-up your second query (ZIP = ‘10128’). Would be interesting to confirm it… Anyway, if integers are possible, it is usually better to use them rather than varchars, unless you are not going to search for e.g. ZIP LIKE ‘_01__’.

Signature 
Profile
 
Posted: 20 November 2008 09:02 PM   Ignore ]   [ # 7 ]  
Newbie
Rank
Total Posts:  13
Joined  2008-10-29

Reloaded the data.  Tried the query on the numeric zip…

select count(*) from anchor where zip = 10128;
+----------+
| count(*) |
+----------+
| 74604 |
+----------+
1 row in set (0.01 sec)

That’s after a db restart.  That’s compared to the varchar version of 14 seconds and change.  Pretty sweet.  Can’t wait to see the app hitting it.

Profile
 
Posted: 21 November 2008 09:44 AM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  13
Joined  2008-10-29

Tried the application hitting the numerical version of zip.  It is much, much faster.  Hard to describe.  I tried it with hundreds of zip values as part of the query and its still really fast.

Profile
 
   
 
 
‹‹ CHAR or VARCHAR      Surrogate Keys ››