Random Sampling in MS SQL Server


There are different ways of sampling data. Random sampling is one of the most popular and most easiest way of sampling. But before using the data through random sampling we should check the quality of the data.

As a Data Scientist we can not do any modeling work without data sampling. When we want to build model we have to prepare training set, validation set, testing set. When we want to run any campaign we have to made target group, control group. We do this kind of job by data sampling technique.

I have work experience of playing large scale of data in telecom, insurance sector from 2009. In those two sector I got opportunity to access different databases like Oracle, DB2, MS SQL Server, MySQL etc. Sometime I did random sampling in database and sometimes I did random sampling in R. Today I’ll show you step by step random sampling script in MS SQL Server.

--Just consider total base customer count is 150,000
--and we want to take 5% as sample as control group
DECLARE @control_count bigint;
SET @control_count=150000*.05;--total_base*5%
--to see the number of customer of control group
SELECT @control_count control_base_count;

-to add a new column is_control in the base table
ALTER TABLE [dbo].[tblc_cross_sell]
ADD is_control int;
--updating is_control field/variable to zero. Default it is null
UPDATE [dbo].[tblc_cross_sell]
SET is_control=0;

--it is the main query to tag random 7500 customer as control group
UPDATE d
SET d.is_control=1
FROM
(SELECT id,is_control FROM [dbo].[tblc_cross_sell]) AS d
INNER JOIN
(
SELECT TOP 7500 id
FROM [dbo].[tblc_cross_sell]
ORDER BY NEWID()--random generated id
) AS s
ON
(d.id=s.id)
;

If you want to learn random sampling technique in Oracle it will be available very soon. Just keep in touch with us.

 

Add a Comment