Showing posts with label keys. Show all posts
Showing posts with label keys. Show all posts

Sunday, March 25, 2012

Check integrity

Hi,

What methods can we use to check the integrity problems? Some records in my
database are having Foreign Keys but the database doesn't have any related
records with required Primary Keys. I need to scan the whole database and
delete (maybe with backup, maybe not) all wrong records.

Who can I do that? Is it need to write my own application to do that or we
have some standard way to fix these problems? I'm not a database
administrator and don't know these ways (yet). Can somebody help me with
advice?

Thanks.

Dmitri ShvetsovHi

If you have a foreign key constraint enabled for the given relationships
then this should not occur.
If they were created with the NOCHECK option then existing data may violate
the constraint, but if you would not be able to create the FK if you this
was the case and NOCHECK was not specified.

The easiest way to check the integrity is probably to drop and reapply the
FKs. Another alternative would be to use a construct such as

SELECT * FROM Referencing C WHERE NOT EXISTS ( SELECT * FROM Referenced P
WHERE P.PK = C.FK )

But doing this for a large database may take some time.

John

"Dmitri Shvetsov" <dshvetsov@.cox.net> wrote in message
news:Xzlnb.87999$Ms2.54382@.fed1read03...
> Hi,
> What methods can we use to check the integrity problems? Some records in
my
> database are having Foreign Keys but the database doesn't have any related
> records with required Primary Keys. I need to scan the whole database and
> delete (maybe with backup, maybe not) all wrong records.
> Who can I do that? Is it need to write my own application to do that or we
> have some standard way to fix these problems? I'm not a database
> administrator and don't know these ways (yet). Can somebody help me with
> advice?
> Thanks.
> Dmitri Shvetsov|||Hi

If you have a foreign key constraint enabled for the given relationships
then this should not occur.
If they were created with the NOCHECK option then existing data may violate
the constraint, but if you would not be able to create the FK if you this
was the case and NOCHECK was not specified.

The easiest way to check the integrity is probably to drop and reapply the
FKs. Another alternative would be to use a construct such as

SELECT * FROM Referencing C WHERE NOT EXISTS ( SELECT * FROM Referenced P
WHERE P.PK = C.FK )

But doing this for a large database may take some time.

John

"Dmitri Shvetsov" <dshvetsov@.cox.net> wrote in message
news:Xzlnb.87999$Ms2.54382@.fed1read03...
> Hi,
> What methods can we use to check the integrity problems? Some records in
my
> database are having Foreign Keys but the database doesn't have any related
> records with required Primary Keys. I need to scan the whole database and
> delete (maybe with backup, maybe not) all wrong records.
> Who can I do that? Is it need to write my own application to do that or we
> have some standard way to fix these problems? I'm not a database
> administrator and don't know these ways (yet). Can somebody help me with
> advice?
> Thanks.
> Dmitri Shvetsov|||Try using dbcc checkconstraints with all_constraints . It should return all
'bad' data in the database...

MC

"Dmitri Shvetsov" <dshvetsov@.cox.net> wrote in message
news:Xzlnb.87999$Ms2.54382@.fed1read03...
> Hi,
> What methods can we use to check the integrity problems? Some records in
my
> database are having Foreign Keys but the database doesn't have any related
> records with required Primary Keys. I need to scan the whole database and
> delete (maybe with backup, maybe not) all wrong records.
> Who can I do that? Is it need to write my own application to do that or we
> have some standard way to fix these problems? I'm not a database
> administrator and don't know these ways (yet). Can somebody help me with
> advice?
> Thanks.
> Dmitri Shvetsov
>|||Try using dbcc checkconstraints with all_constraints . It should return all
'bad' data in the database...

MC

"Dmitri Shvetsov" <dshvetsov@.cox.net> wrote in message
news:Xzlnb.87999$Ms2.54382@.fed1read03...
> Hi,
> What methods can we use to check the integrity problems? Some records in
my
> database are having Foreign Keys but the database doesn't have any related
> records with required Primary Keys. I need to scan the whole database and
> delete (maybe with backup, maybe not) all wrong records.
> Who can I do that? Is it need to write my own application to do that or we
> have some standard way to fix these problems? I'm not a database
> administrator and don't know these ways (yet). Can somebody help me with
> advice?
> Thanks.
> Dmitri Shvetsov
>

Tuesday, February 14, 2012

character strings as primary keys

Hello,

How bad is normalizing the database to the 3rd form, which requires that all fields depend on nothing but primary key. Consider the first table you create -- users. They have int primary keys, which duplicate the real primary keys -- user names. When user logs in, the user's entry is uniquely identified by its name, which is not primary key. The fundamental design rule -- avoid redundancy -- is violated. A VERY serious reason should be there for that.

Usually, design is compromised by redundancy for performance. Here, both copies are stored in one remote database, but integer keys may be located/used faster. Additionally, using long string references everywhere instead of short integer keys may save a lot of storage space (additionally increasing speed). How serious these impacts are? Am I missing something?

Usually, login names are not allowed to change. You have problems changing primary keys because all the foreign keys must be updated accordingly. Does it reveal that most user databases use character strings as primary keys?

Valentin,

Names are rarely UNIQUE -a primary requirement for a Primary Key. You only have to look at a telephone book to see that using Names as a Primary Key will be impossible. (Granted, a few 'small' countries mandate name uniqueness.)

From: http://howmanyofme.com/people/John_Smith/

There are 49,842 people in the U.S. named John Smith.

User Names, while sometimes, in small systems, seem unique, and seem suitable for Primary Key usage, will, over time prove to be a significant problem. Then the system will have to devolve to using ( Name + identifier ), e.g., JohnSmith1, JohnSmith2, etc. I have found that it is so problemmatic to use Names as keys that is is not worth even considering. As you indicate, Names 'should' allow for changes, i.e, marriage, divorce, etc. But the Primary Key 'should not' change. If you consider the domain security system, the User Name is NOT the Primary Key, and the Login Name is changable. The Primary Key is the SID, a unique identifier normally hidden from view, somewhat like using a IDENTITY field (or other indentifier) -in fact, a surrogate key. (Beware: simplistic explanition.)

It is extremely difficult to design a system for identifying People with a 'natural' Primary Key. Some adamately espouse using the SSN (in the USA) as a Primary Key. But it is NOT a naturally occuring characteristic of the individual -it is a surrogate key created by a remote system. It is legally protected, and increased being enforced, as 'sensitive' data. Most data systems 'should' NOT use the SSN as a Primary Key -of course, exceptions are allowed, or required, by law. So what does that leave as a naturally occuring characteristic of the entity -nothing really, except perhaps biometrics. Now try using the DNA sequence as a key -talk about size... (Though I know of some attempts to synthesize through the sequence redundancy to create smaller and more usable keys. -Think DNA/Fingerprint/Biometric databases.)

From: http://www.eogn.com/archives/news0202.htm

The fact that DNA is inherited and that each individual is the product of his/her progenitors means that DNA can be used to not only create unique identifications, but also to identify members of the same family, the same clan or tribal group, or the same population.

I hope I have sufficiently challenged your assertion that names are usable as primary keys.

... which duplicate the real primary keys -- user names.

A Passport number a NOT good Primary Key candidate -many people do not have a passport.

A Driver's License Number is NOT a good Primary Key candidate -many people do not have one.

So the real question is, in my system, for our business needs, what are appropriate surrogate keys?

|||By the character strings, I did not mean the real names. It is clear that real people can be namesakes. But in the internet, rarely anybody uses their numeric ids. For instance, Microsoft uses emails as user identifiers. The DNS maps between real IP addresses and memorizable symbolic names. People enjoy having fancy monikers.
|||Trying to grasp the subtle difference between "is unique" indeces and "unique key" columns, I came across the following text:


SQL Server 2005 Books Online

Unique Index Design Guidelines
Designing and Creating Databases > Indexes > Designing Indexes >

A unique index guarantees that the index key contains no duplicate values and therefore every row in the table is in some way unique. Specifying a unique index makes sense only when uniqueness is a characteristic of the data itself. For example, if you want to make sure that the values in the NationalIDNumber column in the HumanResources.Employee table are unique, when the primary key is EmployeeID, create a UNIQUE constraint on the NationalIDNumber column. If the user tries to enter the same value in that column for more than one employee, an error message is displayed and the duplicate value is not entered.

The engine designers foresee the situation where there is a redundancy -- two key columns in a table. But they do not explain why there may appear a need for such abnormal relation. Isn't EmployeeID redundant if they are identified by national id?

|||I think you are confusing a "primary key" with a "unique index/constraint".

I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".

Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

|||

Tom Phillips wrote:

I think you are confusing a "primary key" with a "unique index/constraint".


Yes, to things for the same (unique ID) purpose cannot be not confusing.

Tom Phillips wrote:


I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".


What is the problem to use the unique key "NationalID" for linking instead of EmploeeID primary key? Please, do not tell that the Microsoft decided so.

Tom Phillips wrote:


Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

This is what I wanted to see. It is not obvious, since the hash tables reach O(1) efficiency by hashing strings as well as serial numbers and the speed could easily be of the same order.

Tom Phillips wrote:


In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

In my case I want to know the reason for overcomplicating the relations by inferring the redundant IDs to the emails I already have. The redundancy does violate the normalization as I have explained before. Normalization = elimination of redundancy.

|||

Another reason not to use SSN. SSN is not guaranteed to be unique.

Credit agencies got burned by this in the past|||

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

Isn't EmployeeID redundant if they are identified by national id?

So in most situations, there will be an EmployeeID, Primary Key, and also a need to verify and be certain that there are no duplicate TaxIDs. So the Unique Constraint (Unique Index) helps in that respect.

With a Unique Constraint, you DO NOT have to have a value (as a Primary Key requires) -BUT if you do provide a value, it must be UNIQUE.

|||They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

From what you describe in your situation, lets say you have 2 tables:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) (this is the max size of an SMTP email address) UNIQUE

LoginData:
LoginID INT IDENTITY(1,1) PK
UserID INT FK to User
LoginDate DATETIME

What would happen if the user changes his/her email address? The answer, NOTHING.

What I think you are suggesting is the following:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) UNIQUE (this is the max size of an SMTP email address)

LoginData:
LoginID INT IDENTITY(1,1) PK
UserEmailAddress varchar(255) FK to User
LoginDate DATETIME

Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

|||Now, lets further complicate your issue by talking about adding some way to find "Last Login Date".

Normalization would say, you have it:

SELECT MAX(LoginData.LoginDate) FROM LoginData WHERE LoginID=@.LoginID

However, it is a terrible waste of time to scan 1 billion records.

FYI, in MS SQL (but no other engine) you could do:

SELECT TOP 1 LoginData.LoginDate FROM LoginData WHERE LoginID=@.LoginID ORDER BY LoginData.LoginDate DESC

So most people would just add "LastLoginDate datetime" to User and update it when the user logged in.

Yes, it violates normalization rules, but one field is nothing. Then it is 2, then 10, then 100 and then normalization is out the window. Smile

|||

Tom Phillips wrote:

They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

<skip>
Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

The fact that changing primary key (like object address in a program) is problematic does not mean that adding an extra reference you do no incur a redundancy.

Arnie Rowland wrote:

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

But I'm addressing the case when the symbolic name is unique and mandatory. In our app, user logs in by email. In this case, email is not just unique. Since it is used to uniquely identify users, it is a key (nulls are not allowed). The int ID uses the same purpose. It duplicates the email field. The same situations I have in 'groups' table where human-manageable groups must have unique names but we refer the groups by ID. In addition to being redundant, the design overcomplicates the things because user normally wants to see the group (s)he belongs in text rather than an integer id of the group.

Though, the keys are duplicated, I cannot bring an inconsistency example, which proves the redundancy. But 3rd normal form requires the record fields to depend only on the primary key. Meantime, users in my database are identified/refered/addressed/pointed to by email. It is a natural primary key. Integer IDs were artificially introduced to do the same job.

character strings as primary keys

Hello,

How bad is normalizing the database to the 3rd form, which requires that all fields depend on nothing but primary key. Consider the first table you create -- users. They have int primary keys, which duplicate the real primary keys -- user names. When user logs in, the user's entry is uniquely identified by its name, which is not primary key. The fundamental design rule -- avoid redundancy -- is violated. A VERY serious reason should be there for that.

Usually, design is compromised by redundancy for performance. Here, both copies are stored in one remote database, but integer keys may be located/used faster. Additionally, using long string references everywhere instead of short integer keys may save a lot of storage space (additionally increasing speed). How serious these impacts are? Am I missing something?

Usually, login names are not allowed to change. You have problems changing primary keys because all the foreign keys must be updated accordingly. Does it reveal that most user databases use character strings as primary keys?

Valentin,

Names are rarely UNIQUE -a primary requirement for a Primary Key. You only have to look at a telephone book to see that using Names as a Primary Key will be impossible. (Granted, a few 'small' countries mandate name uniqueness.)

From: http://howmanyofme.com/people/John_Smith/

There are 49,842 people in the U.S. named John Smith.

User Names, while sometimes, in small systems, seem unique, and seem suitable for Primary Key usage, will, over time prove to be a significant problem. Then the system will have to devolve to using ( Name + identifier ), e.g., JohnSmith1, JohnSmith2, etc. I have found that it is so problemmatic to use Names as keys that is is not worth even considering. As you indicate, Names 'should' allow for changes, i.e, marriage, divorce, etc. But the Primary Key 'should not' change. If you consider the domain security system, the User Name is NOT the Primary Key, and the Login Name is changable. The Primary Key is the SID, a unique identifier normally hidden from view, somewhat like using a IDENTITY field (or other indentifier) -in fact, a surrogate key. (Beware: simplistic explanition.)

It is extremely difficult to design a system for identifying People with a 'natural' Primary Key. Some adamately espouse using the SSN (in the USA) as a Primary Key. But it is NOT a naturally occuring characteristic of the individual -it is a surrogate key created by a remote system. It is legally protected, and increased being enforced, as 'sensitive' data. Most data systems 'should' NOT use the SSN as a Primary Key -of course, exceptions are allowed, or required, by law. So what does that leave as a naturally occuring characteristic of the entity -nothing really, except perhaps biometrics. Now try using the DNA sequence as a key -talk about size... (Though I know of some attempts to synthesize through the sequence redundancy to create smaller and more usable keys. -Think DNA/Fingerprint/Biometric databases.)

From: http://www.eogn.com/archives/news0202.htm

The fact that DNA is inherited and that each individual is the product of his/her progenitors means that DNA can be used to not only create unique identifications, but also to identify members of the same family, the same clan or tribal group, or the same population.

I hope I have sufficiently challenged your assertion that names are usable as primary keys.

... which duplicate the real primary keys -- user names.

A Passport number a NOT good Primary Key candidate -many people do not have a passport.

A Driver's License Number is NOT a good Primary Key candidate -many people do not have one.

So the real question is, in my system, for our business needs, what are appropriate surrogate keys?

|||By the character strings, I did not mean the real names. It is clear that real people can be namesakes. But in the internet, rarely anybody uses their numeric ids. For instance, Microsoft uses emails as user identifiers. The DNS maps between real IP addresses and memorizable symbolic names. People enjoy having fancy monikers.
|||Trying to grasp the subtle difference between "is unique" indeces and "unique key" columns, I came across the following text:


SQL Server 2005 Books Online

Unique Index Design Guidelines
Designing and Creating Databases > Indexes > Designing Indexes >

A unique index guarantees that the index key contains no duplicate values and therefore every row in the table is in some way unique. Specifying a unique index makes sense only when uniqueness is a characteristic of the data itself. For example, if you want to make sure that the values in the NationalIDNumber column in the HumanResources.Employee table are unique, when the primary key is EmployeeID, create a UNIQUE constraint on the NationalIDNumber column. If the user tries to enter the same value in that column for more than one employee, an error message is displayed and the duplicate value is not entered.

The engine designers foresee the situation where there is a redundancy -- two key columns in a table. But they do not explain why there may appear a need for such abnormal relation. Isn't EmployeeID redundant if they are identified by national id?

|||I think you are confusing a "primary key" with a "unique index/constraint".

I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".

Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

|||

Tom Phillips wrote:

I think you are confusing a "primary key" with a "unique index/constraint".


Yes, to things for the same (unique ID) purpose cannot be not confusing.

Tom Phillips wrote:


I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".


What is the problem to use the unique key "NationalID" for linking instead of EmploeeID primary key? Please, do not tell that the Microsoft decided so.

Tom Phillips wrote:


Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

This is what I wanted to see. It is not obvious, since the hash tables reach O(1) efficiency by hashing strings as well as serial numbers and the speed could easily be of the same order.

Tom Phillips wrote:


In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

In my case I want to know the reason for overcomplicating the relations by inferring the redundant IDs to the emails I already have. The redundancy does violate the normalization as I have explained before. Normalization = elimination of redundancy.

|||

Another reason not to use SSN. SSN is not guaranteed to be unique.

Credit agencies got burned by this in the past|||

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

Isn't EmployeeID redundant if they are identified by national id?

So in most situations, there will be an EmployeeID, Primary Key, and also a need to verify and be certain that there are no duplicate TaxIDs. So the Unique Constraint (Unique Index) helps in that respect.

With a Unique Constraint, you DO NOT have to have a value (as a Primary Key requires) -BUT if you do provide a value, it must be UNIQUE.

|||They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

From what you describe in your situation, lets say you have 2 tables:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) (this is the max size of an SMTP email address) UNIQUE

LoginData:
LoginID INT IDENTITY(1,1) PK
UserID INT FK to User
LoginDate DATETIME

What would happen if the user changes his/her email address? The answer, NOTHING.

What I think you are suggesting is the following:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) UNIQUE (this is the max size of an SMTP email address)

LoginData:
LoginID INT IDENTITY(1,1) PK
UserEmailAddress varchar(255) FK to User
LoginDate DATETIME

Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

|||Now, lets further complicate your issue by talking about adding some way to find "Last Login Date".

Normalization would say, you have it:

SELECT MAX(LoginData.LoginDate) FROM LoginData WHERE LoginID=@.LoginID

However, it is a terrible waste of time to scan 1 billion records.

FYI, in MS SQL (but no other engine) you could do:

SELECT TOP 1 LoginData.LoginDate FROM LoginData WHERE LoginID=@.LoginID ORDER BY LoginData.LoginDate DESC

So most people would just add "LastLoginDate datetime" to User and update it when the user logged in.

Yes, it violates normalization rules, but one field is nothing. Then it is 2, then 10, then 100 and then normalization is out the window. Smile

|||

Tom Phillips wrote:

They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

<skip>
Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

The fact that changing primary key (like object address in a program) is problematic does not mean that adding an extra reference you do no incur a redundancy.

Arnie Rowland wrote:

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

But I'm addressing the case when the symbolic name is unique and mandatory. In our app, user logs in by email. In this case, email is not just unique. Since it is used to uniquely identify users, it is a key (nulls are not allowed). The int ID uses the same purpose. It duplicates the email field. The same situations I have in 'groups' table where human-manageable groups must have unique names but we refer the groups by ID. In addition to being redundant, the design overcomplicates the things because user normally wants to see the group (s)he belongs in text rather than an integer id of the group.

Though, the keys are duplicated, I cannot bring an inconsistency example, which proves the redundancy. But 3rd normal form requires the record fields to depend only on the primary key. Meantime, users in my database are identified/refered/addressed/pointed to by email. It is a natural primary key. Integer IDs were artificially introduced to do the same job.

Sunday, February 12, 2012

Char or varchar for a primary key?

Hi,
I'm planning the structure of a SqlServer 2005 database for a new
application.
The requirement is that primary keys must be "natural"; i.e. in the table
Customers the primary key will be a max. 10 characters string (but the
string may be filled i.e. with only 5 charachters).

Should I define these primary keys as char[5] or varchar[5]?
I'm interested in your opinion in particular about performace issue, because
there will be tables with millions of records...

Thanks,
Davide.>

Quote:

Originally Posted by

Should I define these primary keys as char[5] or varchar[5]?


Sorry, I intended char[10] or varchar [10]|||D. (d@.d.com) writes:

Quote:

Originally Posted by

I'm planning the structure of a SqlServer 2005 database for a new
application.
The requirement is that primary keys must be "natural"; i.e. in the table
Customers the primary key will be a max. 10 characters string (but the
string may be filled i.e. with only 5 charachters).
>
Should I define these primary keys as char[5] or varchar[5]?
I'm interested in your opinion in particular about performace issue,
because there will be tables with millions of records...


char(10) would make sense if key values are almost always 10 characters
long, but if the distribution varies with, say, 5 as the average varchar
would be better.

What sort of strings do you expect? If the values will be digits and upper-
case characters, you way want to consider a binary collation for the column,
at least if your default collation is a Windows collation.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||"D." wrote:

Quote:

Originally Posted by

>
Hi,
I'm planning the structure of a SqlServer 2005 database for a new
application.
The requirement is that primary keys must be "natural"; i.e. in the table
Customers the primary key will be a max. 10 characters string (but the
string may be filled i.e. with only 5 charachters).
>
Should I define these primary keys as char[5] or varchar[5]?
I'm interested in your opinion in particular about performace issue, because
there will be tables with millions of records...
>
Thanks,
Davide.


Assuming you will not change existing primary key values often (or
ever), the performance between CHAR and VARCHAR comes down to the space
requirements.

VARCHAR claims two bytes for the string length, plus the actual number
of characters in the string. So if the average primary key length
exceeds 8, you are better off with CHAR(10), otherwise you could use
VARCHAR(10).

Because of some other considerations (the 'first' VARCHAR column will
cost an additional 5 bytes per row), when in doubt, I would choose CHAR
over VARCHAR. In your case, I would choose VARCHAR(10) if the average
length is 6 or smaller. Otherwise I would choose CHAR(10).

Gert-Jan|||>

Quote:

Originally Posted by

What sort of strings do you expect? If the values will be digits and
upper-
case characters, you way want to consider a binary collation for the
column,
at least if your default collation is a Windows collation.
>


It sounds good!
Yes, my keys will be only uppercase and digits (some other symbols are
allowed, like dot and hyphen)

How do I set binary collation on a single column?
Do you think that this will improve performance on lookups?
Do you think that having a single column with a different collation will not
decrease performance?

Davide.|||>

Quote:

Originally Posted by

Because of some other considerations (the 'first' VARCHAR column will
cost an additional 5 bytes per row), when in doubt, I would choose CHAR
over VARCHAR. In your case, I would choose VARCHAR(10) if the average
length is 6 or smaller. Otherwise I would choose CHAR(10).
>


Usually the key is fully filled, so I think I'll try to change varchar to
char.

Davide.|||D. (d@.d.com) writes:

Quote:

Originally Posted by

How do I set binary collation on a single column?


With the COLLATE clause:

CREATE TABLE mytable (
col char(10) COLLATE Latin1_General_BIN2 NOT NULL,
...

Quote:

Originally Posted by

Do you think that this will improve performance on lookups?


Yes, since comparison is a straight byte-comparison you gain some cycles,
particularly if your default collation is a Windows collation. It's
diffiuclt to say exactly how much you will gain, because there is a lot
of if depends. For a simple lookup, it's may be only 5-10%. For an
operation as "col LIKE '%str%' it may be drastic as a factor of seven.

If your default collation is an SQL collation (one there the name starts
with SQL), the gain is likely to be so small, that it's not worth the
pain. Note that this only applies if you use char/varchar. For
nchar/nvarchar there is no difference between SQL and Windows collations.

Quote:

Originally Posted by

Do you think that having a single column with a different collation will
not decrease performance?


It will not, but there will be more hassle with programming. And it
would not be the only column with that collation, if there are other
tables with foreign keys to this table.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||Without more specs, I would go with CHAR(n) and a CHECK() constraint
that uses a regular expression to validate it.