Big Data Normalization

In our world of Big Data, not all data is formatted in our desired format.  Oftentimes the solution to normalizing data seems simple but would require significant manual effort and time.  In this post I will walk you through a SQL solution to normalizing a commonly found denormalized table. 

In the example below, we have multiple CustomerIDs associated with single Customer Names.  The CustomerIDs are separated by commas and also have spaces thrown in here and there.  The goal is to show this data so that a single CustomerID appears on each line with its corresponding Customer Name – and without superfluous spaces. 

This example would be easy to do manually, but imagine you have a table with hundreds or thousands of rows – it would take forever!

CustomerName CustomerID
Walmart 32221, 492
Microsoft 27346990, 472
Exxon 1235
Target 5234, 28372936
Bank of America 5231
BMW 2839264,28284

 

Step 1)  Create a table in SQL.  Be sure to add an extra field for rowID as shown below.

CREATE TABLE Table1 (
    CustomerName VARCHAR(255)
    ,CustomerID VARCHAR(255)
    ,rowID INT IDENTITY(1,1) PRIMARY KEY
)

Step 2) Import your table into the newly created ‘Table1’.

Step 3) Run the following procedure.  If your table is separated by something other than a comma, change the @Separator variable. 

The first ‘While’ loop puts separates the CustomerIDs into new rows.  The second ‘While’ loop removes superfluous spaces (use if necessary).

DECLARE @Separator VARCHAR(1) = NULL
IF @Separator IS NULL SET @Separator = ','
DECLARE @SeparatorCharNum INT
DECLARE @a INT= 1
DECLARE @Space VARCHAR(1) = ' '
DECLARE @b INT= (SELECT MAX(rowID) FROM Table1)

WHILE @a <= @b BEGIN SET @a=@a --If there's no comma, go to the next CustomerName IF ( SELECT CHARINDEX(@Separator, CustomerID) FROM Table1 WHERE rowID = @a ) = 0 BEGIN SET @a= @a + 1 END --If there is a comma, add a new row to the table with the same CustomerName, plus the extracted CustomerID. --Then, delete that CustomerID from the original CustomerName / CustomerID list. ELSE BEGIN SET @SeparatorCharNum = ( SELECT CHARINDEX(@Separator, CustomerID) FROM Table1 WHERE rowID = @a )

   INSERT INTO Table1
   SELECT CustomerName, LEFT(CustomerID, @SeparatorCharNum - 1) AS CustomerID
   FROM Table1
   WHERE rowID = @a

   UPDATE Table1
   SET CustomerID = SUBSTRING(CustomerID, @SeparatorCharNum + 1, 255)
   WHERE rowID = @a
END
SET @a = @a

END

SET @a = 1 WHILE @a <= @b BEGIN IF ( SELECT CHARINDEX(@Space, CustomerID) FROM Table1 WHERE rowID = @a ) = 0 BEGIN SET @a = @a + 1 END ELSE BEGIN UPDATE Table1 SET CustomerID = REPLACE(CustomerID, ' ', '') WHERE rowID = @a SET @a = @a END END

SELECT * FROM Table1 GO

Your final output will yield the following table:

CustomerName CustomerID rowID
Walmart 492 1
Microsoft 472 2
Exxon 1235 3
Target 28372936 4
Bank of America 5231 5
BMW 28284 6
Walmart 32221 7
Microsoft 27346990 8
Target 5234 9
BMW 2839264 10

 Michael Kovalsky

Published on 07/16/2013

Authored by admin