Showing posts with label all-given. Show all posts

Monday, March 26, 2012

Question on using ROW_NUMBER to remove duplicate rows in SQL 2005

Hi all--Given a table with the following definition in SQL Server 2005 SP2:

CREATE TABLE [dbo].[DBA_AUDIT_SESSION](
	[INSTANCE_NAME] [varchar](16) NULL,
	[HOST_NAME] [varchar](64) NULL,
	[OS_USERNAME] [varchar](255) NULL,
	[USERNAME] [varchar](30) NULL,
	[USERHOST] [varchar](128) NULL,
	[TERMINAL] [varchar](255) NULL,
	[TIMESTAMP] [datetime] NOT NULL,
	[ACTION_NAME] [varchar](28) NULL,
	[LOGOFF_TIME] [datetime] NULL,
	[LOGOFF_LREAD] [float] NULL,
	[LOGOFF_PREAD] [float] NULL,
	[LOGOFF_LWRITE] [float] NULL,
	[LOGOFF_DLOCK] [varchar](40) NULL,
	[SESSIONID] [float] NOT NULL,
	[RETURNCODE] [float] NOT NULL,
	[CLIENT_ID] [varchar](64) NULL,
	[SESSION_CPU] [float] NULL,
	[Import_Date] [datetime] NULL
)

I am trying to remove duplicate rows using the Import_Date and SESSIONID fields as my identifying fields. Since I am new to the ROW_NUMBER() function, I am having problems with aggregate querying while trying to build up to a CTE and resolve this issue. For example, when I query with this:

SELECT ROW_NUMBER() OVER (ORDER BY SESSIONID ASC) AS ROWID,
       instance_name, sessionid, Import_date
FROM dba_audit_session
GROUP BY sessionid, instance_name, import_date
HAVING COUNT(sessionid) > 1

I get zero rows, or a complaint that fields are not coupled with an aggregate function or not in the GROUP BY clause if I use an * with the ROW_NUMBER() function here. Any thoughts about how I should proceed?

Thanks in advance,

Jonathan

I am not saying it will fix your code, but it is the place to start: all aggregate functions in SQL Server ignore NULLs except COUNT(*), which counts all rows of the table. Run a search for COUNT(*) in the BOL. Hope this helps.|||Hey,

I am a noob.. but.. have a look at this..

http://support.microsoft.com/kb/139444

NOx
|||

Hmmm, that query worked fine for me.

What do you get when you run this:

Code Snippet

SELECT instance_name, sessionid, Import_date, COUNT(*) AS tally
FROM dba_audit_session
GROUP BY sessionid, instance_name, import_date
HAVING COUNT(sessionid) > 1
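
Relating back to the earlier COUNT(*) note: a quick way to see the COUNT(*) vs COUNT(column) distinction is a throwaway temp table (a sketch; table and values are made up, kept 2005-compatible):

Code Snippet

-- COUNT(*) counts every row; COUNT(column) skips NULLs.
CREATE TABLE #t (sessionid float NULL);
INSERT INTO #t VALUES (1);
INSERT INTO #t VALUES (NULL);
INSERT INTO #t VALUES (1);
SELECT COUNT(*) AS all_rows,            -- 3
       COUNT(sessionid) AS non_null     -- 2
FROM #t;
DROP TABLE #t;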

|||You can delete all but the latest (in Import_Date, SESSIONID order) for each instance_name this way:

Code Snippet


with T_ranked as (
select
*,
rank() over (
partition by instance_name
order by Import_Date desc, SESSIONID desc
) as rk

from dba_audit_session
)
delete from T_ranked
where rk > 1
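
To preview which rows would go before running the DELETE, the same CTE works with a SELECT in place of the DELETE (a sketch of the same query):

Code Snippet

with T_ranked as (
select
*,
rank() over (
partition by instance_name
order by Import_Date desc, SESSIONID desc
) as rk
from dba_audit_session
)
select * from T_ranked
where rk > 1  -- the rows the DELETE would remove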

Steve Kass
Drew University
http://www.stevekass.com
|||

Hi--Found this one, liked how it got to the basics, but wanted to try working with SQL Server 2005's row_number() function based on this article:

http://www.databasejournal.com/features/mssql/article.php/3572301

The author of this article makes it look easy, but he looks at a specific row in a result set rather than a range of data.

|||

Use the following query. It will help you.

Code Snippet

;WITH CTE AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY sessionid, import_date ORDER BY sessionid) AS ROWID,
        instance_name,
        sessionid,
        Import_date
    FROM dba_audit_session
)
--SELECT * FROM CTE
DELETE FROM CTE WHERE ROWID > 1

|||What if we don't know which rows are duplicates? Imagine a table with, say, 10,000 rows, and a simple query shows there are about 100 duplicate records.

What's the approach you would take in deleting these records?
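
One generic first pass (a sketch against this thread's table; substitute whatever columns define a duplicate in yours) is to group on the candidate key and count, so only keys with more than one row come back:

Code Snippet

SELECT sessionid, Import_date, COUNT(*) AS copies
FROM dba_audit_session
GROUP BY sessionid, Import_date
HAVING COUNT(*) > 1

The keys returned here are the duplicated ones; the CTE-plus-ROW_NUMBER() pattern shown earlier in the thread then removes the extras.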

Thanks
|||

Thank you both, Steve and Manivannan, for the tips on how to handle these with CTEs. These are exactly what I was looking for, and more. I've got several instances' worth of data to play with, so the partitioning by sessionid and the ranking by instance_name in both examples will come in handy. I will try them both out with sample data and let you know how they work out.

- Jonathan

|||

Hi Steve--I populated the test table with two copies of each row from two data sources, so I could check how your formula works. I just tried your ranking formula, substituting "select *" instead of the DELETE FROM statement so I could see the final result set before deletion of duplicates. The result set returns over 2.4 million rows, looking like this sample of duplicate data:

Instance  OS_Username  Host   Sessionid  Import_Date              Rank
dw        smith        host1  127291     2007-07-10 15:18:45.267  2
dw        smith        host1  127291     2007-07-10 15:10:42.867  3

From what I can tell, the CTE as originally written will remove all but one row from the ranked and sorted data. Is there a way to tweak this CTE so that it checks the row preceding the current row for a duplicate sessionid and then removes the duplicate entry if one is found? (I tried tweaking it, but started to get into trouble with aggregate query messages when I attempted to do selects where the delete statement is written.)

Thanks in advance--Jonathan

|||

Hi--I think we're halfway there. I looked at your code, ran it with a sample of my data, and concluded that I have an extra condition to handle. Given rowid > 1, I have to look at the Import_date and sessionid, compare the values from both rows, and keep the row with the latest Import_date. So, for example, if I have data like this:

1 dw 1053976 2007-07-10 15:10:42.867
1 dw 1053976 2007-07-10 15:18:45.267
1 dw 1053977 2007-07-10 15:10:42.867
1 dw 1053977 2007-07-10 15:18:45.267

...

I have to find the latest entry using a query like this:

SELECT
    ROW_NUMBER() OVER (PARTITION BY sessionid, import_date ORDER BY sessionid) AS ROWID,
    instance_name,
    sessionid,
    Import_date
FROM dba_audit_session
WHERE Import_Date < (SELECT MAX(a.Import_Date)
                     FROM DBA_AUDIT_SESSION AS a
                     WHERE a.sessionid = DBA_AUDIT_SESSION.sessionid);

Am I looking at another CTE for this?

|||

Hi--I think this is a good starting point, and I found I have an extra condition to meet. Given data like this:

Rowid  instance_name  sessionid  Import_date

1 dw 1053976 2007-07-10 15:10:42.867

1 dw 1053976 2007-07-10 15:18:45.267

1 dw 1053977 2007-07-10 15:10:42.867

1 dw 1053977 2007-07-10 15:18:45.267

I have to keep the latest Import_date and remove prior entries. Given this query, I can make the selection:

SELECT
    ROW_NUMBER() OVER (PARTITION BY sessionid, import_date ORDER BY sessionid) AS ROWID,
    instance_name,
    sessionid,
    Import_date
FROM dba_audit_session
WHERE Import_Date < (SELECT MAX(a.Import_Date)
                     FROM DBA_AUDIT_SESSION AS a
                     WHERE a.sessionid = DBA_AUDIT_SESSION.sessionid)

Do I need to nest this into a second CTE?


|||I think you're missing the point. With rank(), you don't need any of this < (select max...) business. Just set the partition by and order by lists correctly so that the rows you want to keep are exactly the ones that get ranked #1.

Here's a wizard:

1. Fill in the blank. I want to keep one row for each __.
2. Fill in the blank. The row I want to keep is first if I order by __.

Answer to 1 is your partition by list; answer to 2 is your order by list. My guess now is that you want one row for each (instance_name, sessionid) combination, so it might be this (sessionid was in the order by list in my first try):

with T_ranked as (
select
*,
rank() over (
partition by instance_name, sessionid
order by Import_Date desc
) as rk

from dba_audit_session
)
delete from T_ranked
where rk > 1
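
Worth noting: rank() assigns tied rows the same rank, so if two rows are exact duplicates on every order by column, both get rk = 1 and neither is deleted. If exact duplicates are possible, row_number() breaks ties arbitrarily and guarantees a single survivor; a variant sketch against the same table:

Code Snippet

with T_ranked as (
select
*,
row_number() over (
partition by instance_name, sessionid
order by Import_Date desc
) as rn
from dba_audit_session
)
delete from T_ranked
where rn > 1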


If this still isn't it, it would be a huge help if you gave your desired answer along with sample data from which you want that answer.

SK

Friday, March 9, 2012

Question on index creation for large table

Hi all--Given a table with the current setup and structure, I would like to create a clustered index for a table that gets updated once per week, with values I would like to keep unique:

CREATE TABLE [dbo].[DBA_AUDIT_SESSION](
	[INSTANCE_NAME] [varchar](16) NULL,
	[HOST_NAME] [varchar](64) NULL,
	[OS_USERNAME] [varchar](255) NULL,
	[USERNAME] [varchar](30) NULL,
	[USERHOST] [varchar](128) NULL,
	[TERMINAL] [varchar](255) NULL,
	[TIMESTAMP] [datetime] NOT NULL,
	[ACTION_NAME] [varchar](28) NULL,
	[LOGOFF_TIME] [datetime] NULL,
	[LOGOFF_LREAD] [float] NULL,
	[LOGOFF_PREAD] [float] NULL,
	[LOGOFF_LWRITE] [float] NULL,
	[LOGOFF_DLOCK] [varchar](40) NULL,
	[SESSIONID] [float] NOT NULL,
	[RETURNCODE] [float] NOT NULL,
	[CLIENT_ID] [varchar](64) NULL,
	[SESSION_CPU] [float] NULL,
	[Import_Date] [datetime] NULL
)

This is for an existing table within a SQL Server 2005 SP1 database.

I was thinking about creating an IDENTITY column and creating that as a composite primary key with TIMESTAMP, but the following posting seems to suggest that's not a good idea for performance and not the best design:

http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=9&messageid=152846

Average size for this table is 1.75-2.5 million rows, and the TIMESTAMP field doesn't carry completely unique values. The data is sourced from an Oracle 9i and/or 10gR2 database, populated by an SSIS package. Suggestions on creating a good clustered index for this table? Would I gain value by using the NEWSEQUENTIALID function in SQL Server 2005 on a uniqueidentifier datatype column?

Thanks,

Jonathan

The primary question is why you think that this table needs a clustered index.

If you want to create a 'sequential' load, then clustering on an IDENTITY field is enough.

Typically, audit tables have a minimal need for indexing, and load sequence is rarely a primary factor.

|||

Hi Arnie--I'm renovating the entire repository to factor in a couple of changes:

1. The audience for this data (manager, security & network administrator) commented on how slowly queries returned data once three to four months' data accumulated in this repository.

2. I also have to store this on a VMware-based virtual server with only 1 GB of RAM. (A request for additional resources was turned down a few months ago.) Given that, I'm looking for the best possible indexing strategy for storing at least 3-6 months' data at a time, which leaves up to 11-12 million rows to query. That's an awful lot of data to sort through, so I'm also trying to build a long-term indexing strategy, without reinventing the wheel, for a server with fairly tight resource constraints.

Given that there's a large amount of data that doesn't change more than once per month, a clustered index seems the best candidate for this table and for another, similar table called 'dba_audit_trail', based on an Oracle-derived view. Thoughts?

- Jonathan

|||

I'd cluster index the [Timestamp] or [Import_Date] field (whichever one is capturing the datetime of the insert), and NOT bother with an IDENTITY field.

(Duplicate [Timestamp] values are not an issue for the clustered index.)

If all queries have a [Timestamp] criterion, I'd carefully examine the other filter criteria to determine indexing for the remaining fields.
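
In DDL, that suggestion might look like this (a sketch; the index name is hypothetical):

Code Snippet

-- Non-unique clustered index on the insert datetime; duplicate
-- [TIMESTAMP] values are fine because SQL Server adds a hidden
-- uniquifier to non-unique clustered keys.
CREATE CLUSTERED INDEX IX_DBA_AUDIT_SESSION_TIMESTAMP
ON [dbo].[DBA_AUDIT_SESSION] ([TIMESTAMP])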

|||OK--Thanks for the advice.