Thursday, 17 July 2014

NEWID() vs ROW_NUMBER() to fetch random records in sql server

There are scenarios, when requirement is to fetch random records from the database table. This can be achieved using NEWID() or ROW_NUMBER() methods of sql server.

Let see what these methods does. NEWID() methods create a GUID for each row and then perform the sorting on that, Whereas ROW_NUMBER() create a integer incremented value (ex. 1,2,3,4...) for each rows by performing sorting on specified column.

Since NEWID() generates a GUID for rows, it might be costly for two reasons:
              a) uses of disk I/O operation
              b) Performance issue for performing sorting on GUID.

Where as, using ROW_NUMBER() could be a good choice but in some scenario may not. To check the performance I did some research and I got below results which is bit surprising for me.

I did a performance testing for both the approach in a table with 17 columns and 4100911 (4 million +) rows, below is the results:

Using NEWID()
--------------------------------------------------------------------------
1.) Fetch 10 random records.
     Query: select top 10 * from [dbo].[MyTable] order by NEWID()
     Execution Time: 01:23 minutes
2.) Fetch 10 random records but only specified column
     Query: select top 10 col1 from [dbo].[MyCommon] order by NEWID()
     Execution Time: 03:05 minutes.

Using ROW_NUMBER()
-------------------------------------------------------------------------
1.) Fetch 10 random records.
     Query: WITH tblrows AS(SELECT *, ROW_NUMBER() OVER (ORDER BY col1) AS  row     FROM [dbo].[MyTable] )
SELECT * FROM tblrows where row in (5, 12, 345, 3456, 37890, 56778, 222222, 123456, 222222, 333333)

     Execution Time: 16:35 minutes
2.) Fetch 10 random records but only specified column
     Query: WITH tblrows AS(SELECT col1, ROW_NUMBER() OVER (ORDER BY col1) AS  row     FROM [dbo].[MyTable] )
SELECT col1 FROM tblrows where row in (5, 12, 345, 3456, 37890, 56778, 222222, 123456, 222222, 333333)

     Execution Time: 00:17 minutes.

After looking into above results, it might be tricky to make a decision between NEWID() and ROW_NUMBER().

My purpose for this blog was to make you aware of behavior for these two methods.

Use below scripts to clean up sql server's buffer and cache, if you are testing query execution time.
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO


Feel free to give your feedback. Thank you for reading.

No comments:

Post a Comment