Showing posts with label host_name. Show all posts
Showing posts with label host_name. Show all posts

Monday, March 26, 2012

Question on using ROW_NUMBER to remove duplicate rows in SQL 2005

Hi all--Given a table with the following definition in SQL Server 2005 SP2:

[DBA_AUDIT_SESSION](

[INSTANCE_NAME] [varchar](16) NULL,

[HOST_NAME] [varchar](64) NULL,

[OS_USERNAME] [varchar](255) NULL,

[USERNAME] [varchar](30) NULL,

[USERHOST] [varchar](128) NULL,

[TERMINAL] [varchar](255) NULL,

[TIMESTAMP] [datetime] NOT NULL,

[ACTION_NAME] [varchar](28) NULL,

[LOGOFF_TIME] [datetime] NULL,

[LOGOFF_LREAD] [float] NULL,

[LOGOFF_PREAD] [float] NULL,

[LOGOFF_LWRITE] [float] NULL,

[LOGOFF_DLOCK] [varchar](40) NULL,

[SESSIONID] [float] NOT NULL,

[RETURNCODE] [float] NOT NULL,

[CLIENT_ID] [varchar](64) NULL,

[SESSION_CPU] [float] NULL,

[Import_Date] [datetime] NULL

I am trying to remove duplicate rows using the Import_Date and SESSIONID fields as my identifying fields. Since I am new to the ROW_NUMBER() function, I am having problems with aggregate querying while trying to build up to a CTE and resolve this issue. For example, when I query with this:

SELECT ROW_NUMBER() OVER (order BY SESSIONID asc) AS ROWID, instance_name, sessionid, Import_date

FROM dba_audit_session

group by sessionid, instance_name, import_date

HAVING COUNT(sessionid)>1

I get zero rows, or a complaint about fields not being coupled with an aggregate function or not in a GROUPED BY clause if I use an * with the ROW_NUMBER() function here. Any thoughts about how I should proceed?

Thanks in advance,

Jonathan

I am not saying it will fix your code but it is the place to start, all aggregate functions in SQL Server ignores NULLs except COUNT(*); it counts all rows of the table. Run a search for COUNT(*) in the BOL. Hope this helps.|||Hey,

I am a noob.. but.. have a look at this..

http://support.microsoft.com/kb/139444

NOx
|||

Hmmm, that query worked fine for me.

What do you get when you run this:

Code Snippet

SELECT instance_name, sessionid, Import_date, count(*) as tally

FROM dba_audit_session

group by sessionid, instance_name, import_date

HAVING COUNT(sessionid)>1

|||You can delete all but the latest (in Import_Date, SESSIONID order) for each instance_name this way:

Code Snippet


with T_ranked as (
select
*,
rank() over (
partition by instance_name
order by Import_Date desc, SESSIONID desc
) as rk

from dba_audit_session
)
delete from T_ranked
where rk > 1

Steve Kass
Drew University
http://www.stevekass.com
|||

Hi--Found this one, liked how it got to the basics, but wanted to try working with SQL Server 2005's row_number() function based on this article:

http://www.databasejournal.com/features/mssql/article.php/3572301

The author of this article makes it look easy, but looks at a specific row in a resultset instead of a range of data.

|||

Use the following query. It will help you.

Code Snippet

;WITH CTE

as

(

SELECT

ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID

, instance_name

, sessionid

, Import_date

FROM

dba_audit_session

)

--Select * From CTE

Delete From CTE Where ROWID > 1

|||What if we don't know which rows are duplicates, meaning.. imagine a table with say 10000 rows and by a simple query you find out there are about 100 duplicate records.

What's the approach you would take in deleting these records?

Thanks
|||

Thank you both, Steve and Manivannan, for the tips on how to handle these with CTEs. These are exactly what I was looking for, and more. I've got several instances' worth of data to play with, so the partitioning by sessionid and the ranking by instance_name in both examples will come in handy. I will try them both out with sample data and let you know how they work out.

- Jonathan

|||

Hi Steve--I populated the test table with two copies of each row from two data sources, so I could check how your formula works. I just tried your ranking formula, substituting "select *" instead of the DELETE FROM statement so I could see the final result set before deletion of duplicates. The result set returns over 2.4 million rows, looking like this sample of duplicate data:

InstanceOS_UsernameHostSessionidImport_Date Rank

dwsmithhost11272912007-07-10 15:18:45.267 2

dwsmithhost11272912007-07-10 15:10:42.867 3

From what I can tell, the CTE as originally written will remove all but one row from the ranked and sorted data. Is there a way to tweak this CTE so that it checks for the sessionid of the row preceding the currently checked row for a duplicate entry and then removing the duplicate entry if one is found? (I tried tweaking it, but started to get into trouble with aggregate query messages when I attempted to do selects where the delete statement is written.)

Thanks in advance--Jonathan

|||

Hi--I think we're halfway there. I looked at your code, ran it with a sample of my data, and came up with a conclusion that I have an extra condition to handle. Given rowid > 1, I have to look at the Import_date and sessionid, compare the values from both rows, and keep the row given the latest Import_date. So, for example, if I have data like this:

1 dw 1053976 2007-07-10 15:10:42.867
1 dw 1053976 2007-07-10 15:18:45.267
1 dw 1053977 2007-07-10 15:10:42.867
1 dw 1053977 2007-07-10 15:18:45.267

...

I have to find the latest entry using a query like this:

SELECT

ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID,

instance_name,

sessionid,

Import_date

FROM dba_audit_session

where Import_Date < (select max(a.Import_Date) from DBA_AUDIT_SESSION as a

where a.sessionid = DBA_AUDIT_SESSION.sessionid);

Am I looking at another CTE for this?

|||

Hi--I think this is a good start point, and found I have an extra condition to meet.Given data like this:

Rowid instance_name sessionidImport_date

1 dw 1053976 2007-07-10 15:10:42.867

1 dw 1053976 2007-07-10 15:18:45.267

1 dw 1053977 2007-07-10 15:10:42.867

1 dw 1053977 2007-07-10 15:18:45.267

I have to keep the latest Import_date and remove prior entries.Given this query, I can make the selection:

SELECT ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID,

instance_name,

sessionid,

Import_date

FROM dba_audit_session

where Import_Date < (select max(a.Import_Date) from DBA_AUDIT_SESSION as a

where a.sessionid = DBA_AUDIT_SESSION.sessionid)

Do I need to nest this into a second CTE?

|||

Hi--I think this is a good start point, and found I have an extra condition to meet.Given data like this:

Rowid instance_name sessionidImport_date

1 dw 1053976 2007-07-10 15:10:42.867

1 dw 1053976 2007-07-10 15:18:45.267

1 dw 1053977 2007-07-10 15:10:42.867

1 dw 1053977 2007-07-10 15:18:45.267

I have to keep the latest Import_date and remove prior entries.Given this query, I can make the selection:

SELECT ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID,

instance_name,

sessionid,

Import_date

FROM dba_audit_session

where Import_Date < (select max(a.Import_Date) from DBA_AUDIT_SESSION as a

where a.sessionid = DBA_AUDIT_SESSION.sessionid)

Do I need to nest this into a second CTE?

|||I think you're missing the point. With rank(), you don't need any of this < (select max...) business. Just set the partition by and order by lists correctly so that the rows you want to keep are exactly the ones that get ranked #1.

Here's a wizard:

1. Fill in the blank. I want to keep one row for each .
2. Fill in the blank. The row I want to keep is first if I order by __.

Answer to 1 is your partition by list; answer to 2 is your order by list. My guess now is you want one row for each (instance_name, sessionid) combination, so it might be this (sessionid was in the order by list in my first try)

with T_ranked as (
select
*,
rank() over (
partition by instance_name, sessionid
order by Import_Date desc
) as rk

from dba_audit_session
)
delete from T_ranked
where rk > 1


If this still isn't it, it would be a huge help if you gave your desired answer along with sample data from which you want that answer.

SK
sql