Showing posts with label consider. Show all posts
Showing posts with label consider. Show all posts

Tuesday, February 14, 2012

character strings as primary keys

Hello,

How bad is normalizing the database to the 3rd form, which requires that all fields depend on nothing but primary key. Consider the first table you create -- users. They have int primary keys, which duplicate the real primary keys -- user names. When user logs in, the user's entry is uniquely identified by its name, which is not primary key. The fundamental design rule -- avoid redundancy -- is violated. A VERY serious reason should be there for that.

Usually, design is compromised by redundancy for performance. Here, both copies are stored in one remote database, but integer keys may be located/used faster. Additionally, using long string references everywhere instead of short integer keys may save a lot of storage space (additionally increasing speed). How serious these impacts are? Am I missing something?

Usually, login names are not allowed to change. You have problems changing primary keys because all the foreign keys must be updated accordingly. Does it reveal that most user databases use character strings as primary keys?

Valentin,

Names are rarely UNIQUE -a primary requirement for a Primary Key. You only have to look at a telephone book to see that using Names as a Primary Key will be impossible. (Granted, a few 'small' countries mandate name uniqueness.)

From: http://howmanyofme.com/people/John_Smith/

There are 49,842 people in the U.S. named John Smith.

User Names, while sometimes, in small systems, seem unique, and seem suitable for Primary Key usage, will, over time prove to be a significant problem. Then the system will have to devolve to using ( Name + identifier ), e.g., JohnSmith1, JohnSmith2, etc. I have found that it is so problemmatic to use Names as keys that is is not worth even considering. As you indicate, Names 'should' allow for changes, i.e, marriage, divorce, etc. But the Primary Key 'should not' change. If you consider the domain security system, the User Name is NOT the Primary Key, and the Login Name is changable. The Primary Key is the SID, a unique identifier normally hidden from view, somewhat like using a IDENTITY field (or other indentifier) -in fact, a surrogate key. (Beware: simplistic explanition.)

It is extremely difficult to design a system for identifying People with a 'natural' Primary Key. Some adamately espouse using the SSN (in the USA) as a Primary Key. But it is NOT a naturally occuring characteristic of the individual -it is a surrogate key created by a remote system. It is legally protected, and increased being enforced, as 'sensitive' data. Most data systems 'should' NOT use the SSN as a Primary Key -of course, exceptions are allowed, or required, by law. So what does that leave as a naturally occuring characteristic of the entity -nothing really, except perhaps biometrics. Now try using the DNA sequence as a key -talk about size... (Though I know of some attempts to synthesize through the sequence redundancy to create smaller and more usable keys. -Think DNA/Fingerprint/Biometric databases.)

From: http://www.eogn.com/archives/news0202.htm

The fact that DNA is inherited and that each individual is the product of his/her progenitors means that DNA can be used to not only create unique identifications, but also to identify members of the same family, the same clan or tribal group, or the same population.

I hope I have sufficiently challenged your assertion that names are usable as primary keys.

... which duplicate the real primary keys -- user names.

A Passport number a NOT good Primary Key candidate -many people do not have a passport.

A Driver's License Number is NOT a good Primary Key candidate -many people do not have one.

So the real question is, in my system, for our business needs, what are appropriate surrogate keys?

|||By the character strings, I did not mean the real names. It is clear that real people can be namesakes. But in the internet, rarely anybody uses their numeric ids. For instance, Microsoft uses emails as user identifiers. The DNS maps between real IP addresses and memorizable symbolic names. People enjoy having fancy monikers.
|||Trying to grasp the subtle difference between "is unique" indeces and "unique key" columns, I came across the following text:


SQL Server 2005 Books Online

Unique Index Design Guidelines
Designing and Creating Databases > Indexes > Designing Indexes >

A unique index guarantees that the index key contains no duplicate values and therefore every row in the table is in some way unique. Specifying a unique index makes sense only when uniqueness is a characteristic of the data itself. For example, if you want to make sure that the values in the NationalIDNumber column in the HumanResources.Employee table are unique, when the primary key is EmployeeID, create a UNIQUE constraint on the NationalIDNumber column. If the user tries to enter the same value in that column for more than one employee, an error message is displayed and the duplicate value is not entered.

The engine designers foresee the situation where there is a redundancy -- two key columns in a table. But they do not explain why there may appear a need for such abnormal relation. Isn't EmployeeID redundant if they are identified by national id?

|||I think you are confusing a "primary key" with a "unique index/constraint".

I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".

Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

|||

Tom Phillips wrote:

I think you are confusing a "primary key" with a "unique index/constraint".


Yes, to things for the same (unique ID) purpose cannot be not confusing.

Tom Phillips wrote:


I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".


What is the problem to use the unique key "NationalID" for linking instead of EmploeeID primary key? Please, do not tell that the Microsoft decided so.

Tom Phillips wrote:


Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

This is what I wanted to see. It is not obvious, since the hash tables reach O(1) efficiency by hashing strings as well as serial numbers and the speed could easily be of the same order.

Tom Phillips wrote:


In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

In my case I want to know the reason for overcomplicating the relations by inferring the redundant IDs to the emails I already have. The redundancy does violate the normalization as I have explained before. Normalization = elimination of redundancy.

|||

Another reason not to use SSN. SSN is not guaranteed to be unique.

Credit agencies got burned by this in the past|||

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

Isn't EmployeeID redundant if they are identified by national id?

So in most situations, there will be an EmployeeID, Primary Key, and also a need to verify and be certain that there are no duplicate TaxIDs. So the Unique Constraint (Unique Index) helps in that respect.

With a Unique Constraint, you DO NOT have to have a value (as a Primary Key requires) -BUT if you do provide a value, it must be UNIQUE.

|||They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

From what you describe in your situation, lets say you have 2 tables:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) (this is the max size of an SMTP email address) UNIQUE

LoginData:
LoginID INT IDENTITY(1,1) PK
UserID INT FK to User
LoginDate DATETIME

What would happen if the user changes his/her email address? The answer, NOTHING.

What I think you are suggesting is the following:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) UNIQUE (this is the max size of an SMTP email address)

LoginData:
LoginID INT IDENTITY(1,1) PK
UserEmailAddress varchar(255) FK to User
LoginDate DATETIME

Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

|||Now, lets further complicate your issue by talking about adding some way to find "Last Login Date".

Normalization would say, you have it:

SELECT MAX(LoginData.LoginDate) FROM LoginData WHERE LoginID=@.LoginID

However, it is a terrible waste of time to scan 1 billion records.

FYI, in MS SQL (but no other engine) you could do:

SELECT TOP 1 LoginData.LoginDate FROM LoginData WHERE LoginID=@.LoginID ORDER BY LoginData.LoginDate DESC

So most people would just add "LastLoginDate datetime" to User and update it when the user logged in.

Yes, it violates normalization rules, but one field is nothing. Then it is 2, then 10, then 100 and then normalization is out the window. Smile

|||

Tom Phillips wrote:

They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

<skip>
Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

The fact that changing primary key (like object address in a program) is problematic does not mean that adding an extra reference you do no incur a redundancy.

Arnie Rowland wrote:

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

But I'm addressing the case when the symbolic name is unique and mandatory. In our app, user logs in by email. In this case, email is not just unique. Since it is used to uniquely identify users, it is a key (nulls are not allowed). The int ID uses the same purpose. It duplicates the email field. The same situations I have in 'groups' table where human-manageable groups must have unique names but we refer the groups by ID. In addition to being redundant, the design overcomplicates the things because user normally wants to see the group (s)he belongs in text rather than an integer id of the group.

Though, the keys are duplicated, I cannot bring an inconsistency example, which proves the redundancy. But 3rd normal form requires the record fields to depend only on the primary key. Meantime, users in my database are identified/refered/addressed/pointed to by email. It is a natural primary key. Integer IDs were artificially introduced to do the same job.

character strings as primary keys

Hello,

How bad is normalizing the database to the 3rd form, which requires that all fields depend on nothing but primary key. Consider the first table you create -- users. They have int primary keys, which duplicate the real primary keys -- user names. When user logs in, the user's entry is uniquely identified by its name, which is not primary key. The fundamental design rule -- avoid redundancy -- is violated. A VERY serious reason should be there for that.

Usually, design is compromised by redundancy for performance. Here, both copies are stored in one remote database, but integer keys may be located/used faster. Additionally, using long string references everywhere instead of short integer keys may save a lot of storage space (additionally increasing speed). How serious these impacts are? Am I missing something?

Usually, login names are not allowed to change. You have problems changing primary keys because all the foreign keys must be updated accordingly. Does it reveal that most user databases use character strings as primary keys?

Valentin,

Names are rarely UNIQUE -a primary requirement for a Primary Key. You only have to look at a telephone book to see that using Names as a Primary Key will be impossible. (Granted, a few 'small' countries mandate name uniqueness.)

From: http://howmanyofme.com/people/John_Smith/

There are 49,842 people in the U.S. named John Smith.

User Names, while sometimes, in small systems, seem unique, and seem suitable for Primary Key usage, will, over time prove to be a significant problem. Then the system will have to devolve to using ( Name + identifier ), e.g., JohnSmith1, JohnSmith2, etc. I have found that it is so problemmatic to use Names as keys that is is not worth even considering. As you indicate, Names 'should' allow for changes, i.e, marriage, divorce, etc. But the Primary Key 'should not' change. If you consider the domain security system, the User Name is NOT the Primary Key, and the Login Name is changable. The Primary Key is the SID, a unique identifier normally hidden from view, somewhat like using a IDENTITY field (or other indentifier) -in fact, a surrogate key. (Beware: simplistic explanition.)

It is extremely difficult to design a system for identifying People with a 'natural' Primary Key. Some adamately espouse using the SSN (in the USA) as a Primary Key. But it is NOT a naturally occuring characteristic of the individual -it is a surrogate key created by a remote system. It is legally protected, and increased being enforced, as 'sensitive' data. Most data systems 'should' NOT use the SSN as a Primary Key -of course, exceptions are allowed, or required, by law. So what does that leave as a naturally occuring characteristic of the entity -nothing really, except perhaps biometrics. Now try using the DNA sequence as a key -talk about size... (Though I know of some attempts to synthesize through the sequence redundancy to create smaller and more usable keys. -Think DNA/Fingerprint/Biometric databases.)

From: http://www.eogn.com/archives/news0202.htm

The fact that DNA is inherited and that each individual is the product of his/her progenitors means that DNA can be used to not only create unique identifications, but also to identify members of the same family, the same clan or tribal group, or the same population.

I hope I have sufficiently challenged your assertion that names are usable as primary keys.

... which duplicate the real primary keys -- user names.

A Passport number a NOT good Primary Key candidate -many people do not have a passport.

A Driver's License Number is NOT a good Primary Key candidate -many people do not have one.

So the real question is, in my system, for our business needs, what are appropriate surrogate keys?

|||By the character strings, I did not mean the real names. It is clear that real people can be namesakes. But in the internet, rarely anybody uses their numeric ids. For instance, Microsoft uses emails as user identifiers. The DNS maps between real IP addresses and memorizable symbolic names. People enjoy having fancy monikers.
|||Trying to grasp the subtle difference between "is unique" indeces and "unique key" columns, I came across the following text:


SQL Server 2005 Books Online

Unique Index Design Guidelines
Designing and Creating Databases > Indexes > Designing Indexes >

A unique index guarantees that the index key contains no duplicate values and therefore every row in the table is in some way unique. Specifying a unique index makes sense only when uniqueness is a characteristic of the data itself. For example, if you want to make sure that the values in the NationalIDNumber column in the HumanResources.Employee table are unique, when the primary key is EmployeeID, create a UNIQUE constraint on the NationalIDNumber column. If the user tries to enter the same value in that column for more than one employee, an error message is displayed and the duplicate value is not entered.

The engine designers foresee the situation where there is a redundancy -- two key columns in a table. But they do not explain why there may appear a need for such abnormal relation. Isn't EmployeeID redundant if they are identified by national id?

|||I think you are confusing a "primary key" with a "unique index/constraint".

I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".

Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

|||

Tom Phillips wrote:

I think you are confusing a "primary key" with a "unique index/constraint".


Yes, to things for the same (unique ID) purpose cannot be not confusing.

Tom Phillips wrote:


I "primary key" is the link between other tables. For example, you are not, and should not, be linking Employees to Addressses by the "EmployeeName" field, you use the "EmployeeID".


What is the problem to use the unique key "NationalID" for linking instead of EmploeeID primary key? Please, do not tell that the Microsoft decided so.

Tom Phillips wrote:


Also, you NEVER EVER use a link between tables which the user could to change. Otherwise, you need to update every single table which links to the item.

For linking tables, you want very fast linking. Integers are 1000 times faster links than using a character string.

This is what I wanted to see. It is not obvious, since the hash tables reach O(1) efficiency by hashing strings as well as serial numbers and the speed could easily be of the same order.

Tom Phillips wrote:


In your case, you are only wanting the "loginname" to be UNIQUE, not the link between tables. They are two totally different things and conform to the normalization rule because you don't duplicate "loginname" in other tables.

In my case I want to know the reason for overcomplicating the relations by inferring the redundant IDs to the emails I already have. The redundancy does violate the normalization as I have explained before. Normalization = elimination of redundancy.

|||

Another reason not to use SSN. SSN is not guaranteed to be unique.

Credit agencies got burned by this in the past|||

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

Isn't EmployeeID redundant if they are identified by national id?

So in most situations, there will be an EmployeeID, Primary Key, and also a need to verify and be certain that there are no duplicate TaxIDs. So the Unique Constraint (Unique Index) helps in that respect.

With a Unique Constraint, you DO NOT have to have a value (as a Primary Key requires) -BUT if you do provide a value, it must be UNIQUE.

|||They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

From what you describe in your situation, lets say you have 2 tables:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) (this is the max size of an SMTP email address) UNIQUE

LoginData:
LoginID INT IDENTITY(1,1) PK
UserID INT FK to User
LoginDate DATETIME

What would happen if the user changes his/her email address? The answer, NOTHING.

What I think you are suggesting is the following:

User:
UserID INT IDENTITY(1,1) PK
UserName varchar(50)
UserEmailAddress varchar(255) UNIQUE (this is the max size of an SMTP email address)

LoginData:
LoginID INT IDENTITY(1,1) PK
UserEmailAddress varchar(255) FK to User
LoginDate DATETIME

Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

|||Now, lets further complicate your issue by talking about adding some way to find "Last Login Date".

Normalization would say, you have it:

SELECT MAX(LoginData.LoginDate) FROM LoginData WHERE LoginID=@.LoginID

However, it is a terrible waste of time to scan 1 billion records.

FYI, in MS SQL (but no other engine) you could do:

SELECT TOP 1 LoginData.LoginDate FROM LoginData WHERE LoginID=@.LoginID ORDER BY LoginData.LoginDate DESC

So most people would just add "LastLoginDate datetime" to User and update it when the user logged in.

Yes, it violates normalization rules, but one field is nothing. Then it is 2, then 10, then 100 and then normalization is out the window. Smile

|||

Tom Phillips wrote:

They are not the "same thing". They are used for different purposes. The ONLY thing the same is they are both unique. They are not redundant.

You NEVER, EVER link tables on a user changeable data. In the MS example, I assume, "NationalID" is a user entered field (probably Social Security Number). Although, it needs to be unique, because 2 people cannot have the same number, the user MIGHT change it for many reasons (it was entered wrong or changed for some other reason). Therefore, you do not link tables on the NationalID field. You use the EmployeeID field, which is also unique, but NEVER changes after it is created by the software.

<skip>
Now if the email address changes, you need to change every record in LoginData AND you have violated normalization rules by duplicating USER DATA (UserEmailAddress) in your tables. Also, you have made your table linking take 10,000 times longer to search by using a varchar(255) rather than the int.

The fact that changing primary key (like object address in a program) is problematic does not mean that adding an extra reference you do no incur a redundancy.

Arnie Rowland wrote:

Valentin,

There are times when it is necessary to handle what you classsify as 'redundancy'. For example, in the USA, it is necessary to have an Employees TaxID number stored and available for government related reports and tax filings. However, that is 'protected' and sensitive data, and therefor 'should' NOT be used as a Primary Key for normal reports and operations. And we don't have 'mandated' NationalID number -yet.

But I'm addressing the case when the symbolic name is unique and mandatory. In our app, user logs in by email. In this case, email is not just unique. Since it is used to uniquely identify users, it is a key (nulls are not allowed). The int ID uses the same purpose. It duplicates the email field. The same situations I have in 'groups' table where human-manageable groups must have unique names but we refer the groups by ID. In addition to being redundant, the design overcomplicates the things because user normally wants to see the group (s)he belongs in text rather than an integer id of the group.

Though, the keys are duplicated, I cannot bring an inconsistency example, which proves the redundancy. But 3rd normal form requires the record fields to depend only on the primary key. Meantime, users in my database are identified/refered/addressed/pointed to by email. It is a natural primary key. Integer IDs were artificially introduced to do the same job.