Hi all--Given a table with the following definition in SQL Server 2005 SP2:
[DBA_AUDIT_SESSION](
[INSTANCE_NAME] [varchar](16) NULL,
[HOST_NAME] [varchar](64) NULL,
[OS_USERNAME] [varchar](255) NULL,
[USERNAME] [varchar](30) NULL,
[USERHOST] [varchar](128) NULL,
[TERMINAL] [varchar](255) NULL,
[TIMESTAMP] [datetime] NOT NULL,
[ACTION_NAME] [varchar](28) NULL,
[LOGOFF_TIME] [datetime] NULL,
[LOGOFF_LREAD] [float] NULL,
[LOGOFF_PREAD] [float] NULL,
[LOGOFF_LWRITE] [float] NULL,
[LOGOFF_DLOCK] [varchar](40) NULL,
[SESSIONID] [float] NOT NULL,
[RETURNCODE] [float] NOT NULL,
[CLIENT_ID] [varchar](64) NULL,
[SESSION_CPU] [float] NULL,
[Import_Date] [datetime] NULL
I am trying to remove duplicate rows using the Import_Date and SESSIONID fields as my identifying fields. Since I am new to the ROW_NUMBER() function, I am having problems with aggregate querying while trying to build up to a CTE and resolve this issue. For example, when I query with this:
SELECT ROW_NUMBER() OVER (order BY SESSIONID asc) AS ROWID, instance_name, sessionid, Import_date
FROM dba_audit_session
group by sessionid, instance_name, import_date
HAVING COUNT(sessionid)>1
I get zero rows, or a complaint about fields not being coupled with an aggregate function or not in a GROUPED BY clause if I use an * with the ROW_NUMBER() function here. Any thoughts about how I should proceed?
Thanks in advance,
Jonathan
I am not saying it will fix your code but it is the place to start, all aggregate functions in SQL Server ignores NULLs except COUNT(*); it counts all rows of the table. Run a search for COUNT(*) in the BOL. Hope this helps.|||Hey,I am a noob.. but.. have a look at this..
http://support.microsoft.com/kb/139444
NOx
|||
Hmmm, that query worked fine for me.
What do you get when you run this:
Code Snippet
SELECT instance_name, sessionid, Import_date, count(*) as tally
FROM dba_audit_session
group by sessionid, instance_name, import_date
HAVING COUNT(sessionid)>1
|||You can delete all but the latest (in Import_Date, SESSIONID order) for each instance_name this way:
Code Snippet
with T_ranked as (
select
*,
rank() over (
partition by instance_name
order by Import_Date desc, SESSIONID desc
) as rk
from dba_audit_session
)
delete from T_ranked
where rk > 1
Steve Kass
Drew University
http://www.stevekass.com
|||
Hi--Found this one, liked how it got to the basics, but wanted to try working with SQL Server 2005's row_number() function based on this article:
http://www.databasejournal.com/features/mssql/article.php/3572301
The author of this article makes it look easy, but looks at a specific row in a resultset instead of a range of data.
|||Use the following query. It will help you.
Code Snippet
;WITH CTE
as
(
SELECT
ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID
, instance_name
, sessionid
, Import_date
FROM
dba_audit_session
)
--Select * From CTE
Delete From CTE Where ROWID > 1
|||What if we don't know which rows are duplicates, meaning.. imagine a table with say 10000 rows and by a simple query you find out there are about 100 duplicate records.What's the approach you would take in deleting these records?
Thanks
|||
Thank you both, Steve and Manivannan, for the tips on how to handle these with CTEs. These are exactly what I was looking for, and more. I've got several instances' worth of data to play with, so the partitioning by sessionid and the ranking by instance_name in both examples will come in handy. I will try them both out with sample data and let you know how they work out.
- Jonathan
|||Hi Steve--I populated the test table with two copies of each row from two data sources, so I could check how your formula works. I just tried your ranking formula, substituting "select *" instead of the DELETE FROM statement so I could see the final result set before deletion of duplicates. The result set returns over 2.4 million rows, looking like this sample of duplicate data:
InstanceOS_UsernameHostSessionidImport_Date Rank
dwsmithhost11272912007-07-10 15:18:45.267 2
dwsmithhost11272912007-07-10 15:10:42.867 3
…
From what I can tell, the CTE as originally written will remove all but one row from the ranked and sorted data. Is there a way to tweak this CTE so that it checks for the sessionid of the row preceding the currently checked row for a duplicate entry and then removing the duplicate entry if one is found? (I tried tweaking it, but started to get into trouble with aggregate query messages when I attempted to do selects where the delete statement is written.)
Thanks in advance--Jonathan
|||Hi--I think we're halfway there. I looked at your code, ran it with a sample of my data, and came up with a conclusion that I have an extra condition to handle. Given rowid > 1, I have to look at the Import_date and sessionid, compare the values from both rows, and keep the row given the latest Import_date. So, for example, if I have data like this:
1 dw 1053976 2007-07-10 15:10:42.867
1 dw 1053976 2007-07-10 15:18:45.267
1 dw 1053977 2007-07-10 15:10:42.867
1 dw 1053977 2007-07-10 15:18:45.267
...
I have to find the latest entry using a query like this:
SELECT
ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID,
instance_name,
sessionid,
Import_date
FROM dba_audit_session
where Import_Date < (select max(a.Import_Date) from DBA_AUDIT_SESSION as a
where a.sessionid = DBA_AUDIT_SESSION.sessionid);
Am I looking at another CTE for this?
|||Hi--I think this is a good start point, and found I have an extra condition to meet.Given data like this:
Rowid instance_name sessionidImport_date
1 dw 1053976 2007-07-10 15:10:42.867
1 dw 1053976 2007-07-10 15:18:45.267
1 dw 1053977 2007-07-10 15:10:42.867
1 dw 1053977 2007-07-10 15:18:45.267
I have to keep the latest Import_date and remove prior entries.Given this query, I can make the selection:
SELECT ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID,
instance_name,
sessionid,
Import_date
FROM dba_audit_session
where Import_Date < (select max(a.Import_Date) from DBA_AUDIT_SESSION as a
where a.sessionid = DBA_AUDIT_SESSION.sessionid)
Do I need to nest this into a second CTE?
|||
Hi--I think this is a good start point, and found I have an extra condition to meet.Given data like this:
Rowid instance_name sessionidImport_date
1 dw 1053976 2007-07-10 15:10:42.867
1 dw 1053976 2007-07-10 15:18:45.267
1 dw 1053977 2007-07-10 15:10:42.867
1 dw 1053977 2007-07-10 15:18:45.267
I have to keep the latest Import_date and remove prior entries.Given this query, I can make the selection:
SELECT ROW_NUMBER() OVER (Partition By sessionid, import_date Order BY sessionid) AS ROWID,
instance_name,
sessionid,
Import_date
FROM dba_audit_session
where Import_Date < (select max(a.Import_Date) from DBA_AUDIT_SESSION as a
where a.sessionid = DBA_AUDIT_SESSION.sessionid)
Do I need to nest this into a second CTE?
|||I think you're missing the point. With rank(), you don't need any of this < (select max...) business. Just set the partition by and order by lists correctly so that the rows you want to keep are exactly the ones that get ranked #1.Here's a wizard:
1. Fill in the blank. I want to keep one row for each .
2. Fill in the blank. The row I want to keep is first if I order by __.
Answer to 1 is your partition by list; answer to 2 is your order by list. My guess now is you want one row for each (instance_name, sessionid) combination, so it might be this (sessionid was in the order by list in my first try)
with T_ranked as (
select
*,
rank() over (
partition by instance_name, sessionid
order by Import_Date desc
) as rk
from dba_audit_session
)
delete from T_ranked
where rk > 1
If this still isn't it, it would be a huge help if you gave your desired answer along with sample data from which you want that answer.
SK
sql