Tuesday, February 14, 2012

Character Removal

Hello all,

I've been struggling with an interesting problem. I currently have a solution but it is very slow.

I will be cycling data through a table. Each cycle has 1 million records with 60 fields. One procedure I need to perform on this data is a character cleanse. I have a list of 12 characters that need to be removed.

Right now I have a stored procedure that pulls the characters from a table one at a time and feeds each one to a nested loop that replaces the character with nothing ('') in records that contain it (something like "update tbl1 set FIELD = replace(FIELD, '&', '') where FIELD like '%&%'"). This works... but it seems rather inefficient: it can take 10 minutes to process a 250,000-record table.
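To make the current approach concrete, the loop looks roughly like this (table and column names are simplified stand-ins; `CharList` is a hypothetical name for the table holding the 12 characters):

```sql
-- Sketch of the one-character-at-a-time approach described above.
-- Names (CharList, BadChar, tbl1, FIELD) are placeholders, not the real schema.
DECLARE @c varchar(1);

DECLARE char_cursor CURSOR FOR
    SELECT BadChar FROM CharList;   -- the 12 characters to remove

OPEN char_cursor;
FETCH NEXT FROM char_cursor INTO @c;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One UPDATE per character, repeated again for each of the 60 fields.
    -- Note: characters like % or _ would need an ESCAPE clause in the LIKE.
    UPDATE tbl1
    SET FIELD = REPLACE(FIELD, @c, '')
    WHERE FIELD LIKE '%' + @c + '%';

    FETCH NEXT FROM char_cursor INTO @c;
END

CLOSE char_cursor;
DEALLOCATE char_cursor;
```

With 12 characters and 60 fields, that is up to 720 separate scans and updates of the table per cycle, which is where the time goes.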

I have tried borrowing regular expressions from VBScript via COM objects. It worked and seemed more efficient at first, but then I threw a large file at it and it took half an hour to complete.

I'm running SQL Server 2005 on a dual-Xeon 3.4 GHz box with 2 GB of RAM.

Any advice would be greatly appreciated!!

~~~Thanks~~

The question is: is the time being spent searching for the characters, or actually updating the field?

If your bottleneck is the searching, the way to fix that is to create a full-text index.

Then, I would also change your statement to:

update tbl1
set field = replace(replace(replace(field,'&',''),'%',''),'$','') -- etc
where contains(field,' "&" OR "$" OR "%" ') -- etc
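If the table doesn't already have a full-text index, setting one up looks something like this (catalog and key-index names below are placeholders; a full-text index requires an existing unique, non-null key index on the table). One caveat worth testing first: full-text word breakers treat most punctuation as separators, so verify that CONTAINS actually matches characters like '&' in your data before committing to this route.

```sql
-- Placeholder names: CleanseCatalog and PK_tbl1 are illustrative only.
CREATE FULLTEXT CATALOG CleanseCatalog;

CREATE FULLTEXT INDEX ON tbl1 (field)
    KEY INDEX PK_tbl1              -- an existing unique index on tbl1
    ON CleanseCatalog
    WITH CHANGE_TRACKING AUTO;     -- index kept current as rows change
```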

|||

Thanks for the suggestions, I'll do some rewrite and see if it improves.

Question: wouldn't creating this index slow down imports into the table and updates (the two main reasons this database exists)? I guess what I'm worried about is simply shifting the performance problem to other procedures.

Thanks again

|||

Can you give me a bit more information: of the 60 fields, how many are you checking to update? Of the million records in each cycle, how many typically need updating? How many "bad" characters do you anticipate in 1 year? In 3 years?

|||

The table actually consists of 170 fields. We load files that have up to 60 fields requiring cleaning. In a typical file, about 10% of the cells need updating by the character-cleanse procedure. However, I would say that over the whole process every row is updated at least once, sometimes multiple times.

As for how many bad characters in 1 year, I would say billions of bad character instances.

The 60 fields that require updating have been narrowed down by a view from the 170. Right now the stored procedure steps through whatever columns are in the view.
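One way to collapse the per-character loop into a single UPDATE per column is to build the nested REPLACE expression dynamically from the character table, then execute it. This is only a sketch under assumed names (`CharList.BadChar`, `tbl1`, `SomeField` are placeholders); it relies on the common T-SQL idiom of concatenating across rows with `SELECT @var = @var + ...`:

```sql
-- Build REPLACE(REPLACE(...SomeField...,'&',''),'$','')... from the 12-row list,
-- then run one UPDATE instead of 12. All object names here are placeholders.
DECLARE @expr nvarchar(max) = N'SomeField';
DECLARE @sql  nvarchar(max);

SELECT @expr = N'REPLACE(' + @expr + N', N'''
             + REPLACE(BadChar, '''', '''''')   -- escape embedded quotes
             + N''', N'''')'
FROM CharList;

SET @sql = N'UPDATE tbl1 SET SomeField = ' + @expr + N';';
EXEC sp_executesql @sql;
```

The same loop could walk the columns of the view, generating one such statement per field, so the table is scanned 60 times instead of 720.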

|||

Complex math and string manipulation have always been SQL weak points. CLR integration was introduced in SQL Server 2005 to remedy that; since you're on 2005, you should definitely look into a CLR regex function.

This article should help.

http://msdn.microsoft.com/msdnmag/issues/07/02/SQLRegex/default.aspx
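Once a CLR regex assembly like the one in that article is compiled, wiring it up looks roughly like this. The path, assembly, class, and function names below are illustrative assumptions, not the article's exact ones, and the character class in the pattern is just an example of the 12 characters:

```sql
-- CLR is off by default in SQL Server 2005.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
-- Register the compiled assembly and expose a scalar function.
-- Path, assembly, class, and method names are placeholders.
CREATE ASSEMBLY SqlRegex FROM 'C:\clr\SqlRegex.dll'
    WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.RegexReplace
    (@input nvarchar(max), @pattern nvarchar(200), @replacement nvarchar(200))
RETURNS nvarchar(max)
AS EXTERNAL NAME SqlRegex.[UserDefinedFunctions].RegexReplace;
GO
-- Strip all 12 characters in one pass per column using a character class
-- (the class shown is an example; substitute your actual character list).
UPDATE tbl1
SET field = dbo.RegexReplace(field, N'[&$%#@!^*~`|]', N'')
WHERE field LIKE '%[&$%#@!^*~`|]%';
```

The WHERE clause uses LIKE's bracket set so only rows that actually contain one of the characters are touched, which keeps the write volume down.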
