
Monday, March 26, 2012

Question on using ROW_NUMBER to remove duplicate rows in SQL 2005

Hi all--Given a table with the following definition in SQL Server 2005 SP2:

CREATE TABLE [DBA_AUDIT_SESSION](
    [INSTANCE_NAME] [varchar](16) NULL,
    [HOST_NAME] [varchar](64) NULL,
    [OS_USERNAME] [varchar](255) NULL,
    [USERNAME] [varchar](30) NULL,
    [USERHOST] [varchar](128) NULL,
    [TERMINAL] [varchar](255) NULL,
    [TIMESTAMP] [datetime] NOT NULL,
    [ACTION_NAME] [varchar](28) NULL,
    [LOGOFF_TIME] [datetime] NULL,
    [LOGOFF_LREAD] [float] NULL,
    [LOGOFF_PREAD] [float] NULL,
    [LOGOFF_LWRITE] [float] NULL,
    [LOGOFF_DLOCK] [varchar](40) NULL,
    [SESSIONID] [float] NOT NULL,
    [RETURNCODE] [float] NOT NULL,
    [CLIENT_ID] [varchar](64) NULL,
    [SESSION_CPU] [float] NULL,
    [Import_Date] [datetime] NULL
)

I am trying to remove duplicate rows using the Import_Date and SESSIONID fields as my identifying fields. Since I am new to the ROW_NUMBER() function, I am having problems with aggregate querying while trying to build up to a CTE and resolve this issue. For example, when I query with this:

SELECT ROW_NUMBER() OVER (ORDER BY sessionid ASC) AS ROWID,
       instance_name, sessionid, Import_date
FROM dba_audit_session
GROUP BY sessionid, instance_name, import_date
HAVING COUNT(sessionid) > 1

I get zero rows, or a complaint about fields not being included in either an aggregate function or the GROUP BY clause if I use * with the ROW_NUMBER() function here. Any thoughts about how I should proceed?

Thanks in advance,

Jonathan

I am not saying it will fix your code, but it is the place to start: all aggregate functions in SQL Server ignore NULLs except COUNT(*), which counts all rows of the table. Run a search for COUNT(*) in the BOL. Hope this helps.
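To see the difference concretely: COUNT(*) counts every row, while COUNT(column) skips NULLs. A minimal sketch against throwaway data (not the poster's table, where SESSIONID is NOT NULL):

Code Snippet

SELECT COUNT(*) AS all_rows,          -- 3: every row counted
       COUNT(sessionid) AS non_nulls  -- 2: the NULL is ignored
FROM (SELECT 1 AS sessionid
      UNION ALL SELECT NULL
      UNION ALL SELECT 2) AS t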
|||Hey,

I am a noob.. but.. have a look at this..

http://support.microsoft.com/kb/139444

NOx
|||

Hmmm, that query worked fine for me.

What do you get when you run this:

Code Snippet

SELECT instance_name, sessionid, Import_date, COUNT(*) AS tally
FROM dba_audit_session
GROUP BY sessionid, instance_name, import_date
HAVING COUNT(sessionid) > 1
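If that returns the duplicate groups, one way (a sketch; note the join will miss groups whose key columns are NULL) to pull back the full rows in each group, rather than just the group keys, is to join the table back to that aggregate:

Code Snippet

SELECT d.*
FROM dba_audit_session AS d
JOIN (SELECT sessionid, instance_name, import_date
      FROM dba_audit_session
      GROUP BY sessionid, instance_name, import_date
      HAVING COUNT(*) > 1) AS dup
  ON dup.sessionid = d.sessionid
 AND dup.instance_name = d.instance_name
 AND dup.import_date = d.import_date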

|||You can delete all but the latest (in Import_Date, SESSIONID order) for each instance_name this way:

Code Snippet


with T_ranked as (
select
*,
rank() over (
partition by instance_name
order by Import_Date desc, SESSIONID desc
) as rk

from dba_audit_session
)
delete from T_ranked
where rk > 1
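
To preview which rows that DELETE would remove before running it, the same CTE works with a SELECT in place of the DELETE; a sketch using the table above:

Code Snippet

with T_ranked as (
select
*,
rank() over (
partition by instance_name
order by Import_Date desc, SESSIONID desc
) as rk
from dba_audit_session
)
select * from T_ranked
where rk > 1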

Steve Kass
Drew University
http://www.stevekass.com
|||

Hi--Found this one, liked how it got to the basics, but wanted to try working with SQL Server 2005's row_number() function based on this article:

http://www.databasejournal.com/features/mssql/article.php/3572301

The author of this article makes it look easy, but his example targets a specific row in a result set rather than a range of data.

|||

Use the following query. It will help you.

Code Snippet

;WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, import_date
                              ORDER BY sessionid) AS ROWID,
           instance_name,
           sessionid,
           Import_date
    FROM dba_audit_session
)
--Select * From CTE
DELETE FROM CTE WHERE ROWID > 1
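
One contrast with the rank() approach above that is worth noting: ROW_NUMBER() numbers tied rows 1, 2, 3, so exactly one row per partition survives this DELETE, while RANK() gives ties the same rank and can leave several rows ranked 1. A minimal sketch with made-up values:

Code Snippet

SELECT sessionid,
       ROW_NUMBER() OVER (ORDER BY sessionid) AS row_num, -- 1, 2, 3: always unique
       RANK()       OVER (ORDER BY sessionid) AS rnk      -- 1, 1, 3: ties share a rank
FROM (SELECT 1053976 AS sessionid
      UNION ALL SELECT 1053976
      UNION ALL SELECT 1053977) AS v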

|||What if we don't know which rows are duplicates? Imagine a table with, say, 10,000 rows, where a simple query shows there are about 100 duplicate records.

What's the approach you would take in deleting these records?

Thanks
|||

Thank you both, Steve and Manivannan, for the tips on how to handle these with CTEs. These are exactly what I was looking for, and more. I've got several instances' worth of data to play with, so the partitioning by sessionid and the ranking by instance_name in both examples will come in handy. I will try them both out with sample data and let you know how they work out.

- Jonathan

|||

Hi Steve--I populated the test table with two copies of each row from two data sources, so I could check how your formula works. I just tried your ranking formula, substituting "select *" instead of the DELETE FROM statement so I could see the final result set before deletion of duplicates. The result set returns over 2.4 million rows, looking like this sample of duplicate data:

Instance  OS_Username  Host   Sessionid  Import_Date              Rank
dw        smith        host1  127291     2007-07-10 15:18:45.267  2
dw        smith        host1  127291     2007-07-10 15:10:42.867  3

From what I can tell, the CTE as originally written will remove all but one row from the ranked and sorted data. Is there a way to tweak this CTE so that it checks the row preceding the current row for a duplicate sessionid and removes the duplicate entry if one is found? (I tried tweaking it, but ran into aggregate-query errors when I attempted to do selects where the DELETE statement is written.)

Thanks in advance--Jonathan

|||

Hi--I think we're halfway there. I looked at your code, ran it with a sample of my data, and concluded that I have an extra condition to handle. Given rowid > 1, I have to look at the Import_date and sessionid, compare the values from both rows, and keep the row with the latest Import_date. So, for example, if I have data like this:

1 dw 1053976 2007-07-10 15:10:42.867
1 dw 1053976 2007-07-10 15:18:45.267
1 dw 1053977 2007-07-10 15:10:42.867
1 dw 1053977 2007-07-10 15:18:45.267

...

I have to find the latest entry using a query like this:

SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, import_date
                          ORDER BY sessionid) AS ROWID,
       instance_name,
       sessionid,
       Import_date
FROM dba_audit_session
WHERE Import_Date < (SELECT MAX(a.Import_Date)
                     FROM DBA_AUDIT_SESSION AS a
                     WHERE a.sessionid = DBA_AUDIT_SESSION.sessionid);

Am I looking at another CTE for this?

|||I think you're missing the point. With rank(), you don't need any of this < (select max...) business. Just set the partition by and order by lists correctly so that the rows you want to keep are exactly the ones that get ranked #1.

Here's a wizard:

1. Fill in the blank. I want to keep one row for each __.
2. Fill in the blank. The row I want to keep is first if I order by __.

Answer to 1 is your partition by list; answer to 2 is your order by list. My guess now is you want one row for each (instance_name, sessionid) combination, so it might be this (sessionid was in the order by list in my first try):

with T_ranked as (
select
*,
rank() over (
partition by instance_name, sessionid
order by Import_Date desc
) as rk

from dba_audit_session
)
delete from T_ranked
where rk > 1


If this still isn't it, it would be a huge help if you gave your desired answer along with sample data from which you want that answer.

SK

Friday, March 23, 2012

Question on Transaction Isolation Level.

Source:
SQL2K
SP3
Destination:
SQL2K5
SP2
I have requirements to import data from the source to the destination, and
once on the destination the data will be read only. I also have the
situation where the data owners (not me) may make schema changes on the
source whenever they want to, and do not need to tell me. Most of the time
this would only be widening an existing column, so I'm not all that worried.
This data is only to be grabbed once a night during a slow period.
Options:
Replication: This is out because the data owner will want to widen a column
without needing to consult with me first (I would need to drop replication
first). I'm aware of the changes to 2005 replication, but they won't be
upgrading anytime soon.
DTS/ SSIS: This is out because they will not want to tell me if they widen a
column. Just because they widen the source, that doesn't help my
destination.
Because of all these requirements/ circumstances, I want to just create a
Linked Server and do a Select...Into every night and drop/ re-create the
table on the destination. This will make schema changes on the source
transparent to me. Of course though, I don't want my Select...Into to block
users on the source while the import is occurring. That being said, I was
thinking about setting the Transaction Isolation Level (TIL) to Read
Uncommitted (RU) in the Select...Into. I don't really care if the
consistency is off a bit as it's only updated once a day anyways. This is
how it would look:
Set Transaction Isolation Level Read Uncommitted
select * into T48
from myLinkedServer.database.dbo.T48
So my question is: Will setting the TIL to RU like this actually set it for
the source (the Linked Server) or the destination (my server)? If it does in
fact set it for the source, would NOLOCK work?
TIA, ChrisR
|||1) Can't you use SSIS and force it to drop/create tables each night? That
would probably be most efficient mechanism.
2) You pick up performance if you set the database to ReadOnly when you are
done moving data.
3) I don't believe the set (or other hints) will translate over to the
linked server.
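For what it's worth, one workaround, an assumption on my part rather than something tested in this thread, is OPENQUERY: the quoted text executes on the linked server itself, so a NOLOCK hint inside it is applied at the source:

-- Sketch only; server, database, and table names are the ones from the post.
select * into T48
from openquery(myLinkedServer,
    'select * from database.dbo.T48 with (nolock)')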
"ChrisR" <ChrisR@.foo.com> wrote in message
news:Oet8c6dAIHA.1204@.TK2MSFTNGP03.phx.gbl...
> Source:
> SQL2K
> SP3
> Destination:
> SQL2K5
> SP2
> I have requirements to import data from the source to the destination, and
> once on the destination the data will be read only. I also have the
> situation where the data owners (not me) may make schema changes on the
> source whenever they want to, and do not need to tell me. Most of the time
> this would only be widening an existing column, so I'm not all that
> worried. This data is only to be grabbed once a night during a slow
> period.
> Options:
> Replication: This is out because the data owner will want to widen a
> column without needing to consult with me first (I would need to drop
> replication first). I'm aware of the changes to 2005 replication, but they
> won't be upgrading anytime soon.
> DTS/ SSIS: This is out because they will not want to tell me if they widen
> a column. Just because they widen the source, that doesn't help my
> destination.
>
> Because of all these requirements/ circumstances, I want to just create a
> Linked Server and do a Select...Into every night and drop/ re-create the
> table on the destination. This will make schema changes on the source
> transparent to me. Of course though, I don't want my Select...Into to
> block users on the source while the import is occurring. That being said,
> I was thinking about setting the Transaction Isolation Level (TIL) to Read
> Uncommitted (RU) in the Select...Into. I don't really care if the
> consistency is off a bit as it's only updated once a day anyways. This is
> how it would look:
> Set Transaction Isolation Level Read Uncommitted
> select * into T48
> from myLinkedServer.database.dbo.T48
> So my question is: Will setting the TIL to RU like this actually set it
> for the source (the Linked Server) or the destination (my server)? If it
> does in fact set it for the source, would NOLOCK work?
> TIA, ChrisR
>|||1) I don't think SSIS will widen the columns on the destination
automatically if they have been widened on the source (perhaps it can
somehow as I don't have a lot of SSIS experience?). Select...Into will take
care of this for me. Sure I can bury my Select...Into code in a SSIS
Package, but I don't think that's what you were getting at.
2) Not really an option, but thanks.
3) This is what I was afraid of.
Thanks!
"TheSQLGuru" <kgboles@.earthlink.net> wrote in message
news:13fq948dgolb0a5@.corp.supernews.com...
> 1) Can't you use SSIS and force it to drop/create tables each night? That
> would probably be most efficient mechanism.
> 2) You pick up performance if you set the database to ReadOnly when you
> are done moving data.
> 3) I don't believe the set (or other hints) will translate over to the
> linked server.
>
> "ChrisR" <ChrisR@.foo.com> wrote in message
> news:Oet8c6dAIHA.1204@.TK2MSFTNGP03.phx.gbl...
>> Source:
>> SQL2K
>> SP3
>> Destination:
>> SQL2K5
>> SP2
>> I have requirements to import data from the source to the destination,
>> and once on the destination the data will be read only. I also have the
>> situation where the data owners (not me) may make schema changes on the
>> source whenever they want to, and do not need to tell me. Most of the
>> time this would only be widening an existing column, so I'm not all that
>> worried. This data is only to be grabbed once a night during a slow
>> period.
>> Options:
>> Replication: This is out because the data owner will want to widen a
>> column without needing to consult with me first (I would need to drop
>> replication first). I'm aware of the changes to 2005 replication, but
>> they won't be upgrading anytime soon.
>> DTS/ SSIS: This is out because they will not want to tell me if they
>> widen a column. Just because they widen the source, that doesn't help my
>> destination.
>>
>> Because of all these requirements/ circumstances, I want to just create a
>> Linked Server and do a Select...Into every night and drop/ re-create the
>> table on the destination. This will make schema changes on the source
>> transparent to me. Of course though, I don't want my Select...Into to
>> block users on the source while the import is occurring. That being said,
>> I was thinking about setting the Transaction Isolation Level (TIL) to
>> Read Uncommitted (RU) in the Select...Into. I don't really care if the
>> consistency is off a bit as it's only updated once a day anyways. This is
>> how it would look:
>> Set Transaction Isolation Level Read Uncommitted
>> select * into T48
>> from myLinkedServer.database.dbo.T48
>> So my question is: Will setting the TIL to RU like this actually set it
>> for the source (the Linked Server) or the destination (my server)? If it
>> does in fact set it for the source, would NOLOCK work?
>> TIA, ChrisR
>|||One other approach would be to import into a permanent staging table -
perhaps in a staging database - that matches your data SOURCE which I
take it is not changing. Then use SQL to INSERT to the target table
from staging.
Roy Harvey
Beacon Falls, CT
On Fri, 28 Sep 2007 07:49:13 -0700, "ChrisR" <ChrisR@.foo.com> wrote:
> [original post quoted in full; snipped]
|||I think what you are saying is to first do a Select...Into a local table as
opposed to directly into the Linked Server? If so, I was already thinking
the same thing. If not, could you please elaborate?
"Roy Harvey (SQL Server MVP)" <roy_harvey@.snet.net> wrote in message
news:jvjqf39vh2kut6smef1rg2dj33crnkstij@.4ax.com...
> One other approach would be to import into a permanent staging table -
> perhaps in a staging database - that matches your data SOURCE which I
> take it is not changing. Then use SQL to INSERT to the target table
> from staging.
> Roy Harvey
> Beacon Falls, CT
> On Fri, 28 Sep 2007 07:49:13 -0700, "ChrisR" <ChrisR@.foo.com> wrote:
>>Source:
>>SQL2K
>>SP3
>>Destination:
>>SQL2K5
>>SP2
>>I have requirements to import data from the source to the destination, and
>>once on the destination the data will be read only. I also have the
>>situation where the data owners (not me) may make schema changes on the
>>source whenever they want to, and do not need to tell me. Most of the time
>>this would only be widening an existing column, so I'm not all that
>>worried.
>>This data is only to be grabbed once a night during a slow period.
>>Options:
>>Replication: This is out because the data owner will want to widen a
>>column
>>without needing to consult with me first (I would need to drop replication
>>first). I'm aware of the changes to 2005 replication, but they won't be
>>upgrading anytime soon.
>>DTS/ SSIS: This is out because they will not want to tell me if they widen
>>a
>>column. Just because they widen the source, that doesn't help my
>>destination.
>>
>>Because of all these requirements/ circumstances, I want to just create a
>>Linked Server and do a Select...Into every night and drop/ re-create the
>>table on the destination. This will make schema changes on the source
>>transparent to me. Of course though, I don't want my Select...Into to
>>block
>>users on the source while the import is occurring. That being said, I was
>>thinking about setting the Transaction Isolation Level (TIL) to Read
>>Uncommitted (RU) in the Select...Into. I don't really care if the
>>consistency is off a bit as it's only updated once a day anyways. This is
>>how it would look:
>>Set Transaction Isolation Level Read Uncommitted
>>select * into T48
>>from myLinkedServer.database.dbo.T48
>>So my question is: Will setting the TIL to RU like this actually set it
>>for
>>the source (the Linked Server) or the destination (my server)? If it does
>>in
>>fact set it for the source, would NOLOCK work?
>>TIA, ChrisR|||On Fri, 28 Sep 2007 13:03:23 -0700, "ChrisR" <ChrisR@.foo.com> wrote:
>I think what you are saying is to first do a Select...Into a local table as
>opposed to directly into the Linked Server? If so, I was already thinking
>the same thing. If not, could you please elaborate?
You have a source table somewhere on another server, call it Source.
You have a target table on the target server, call it Target. Target
is subject to change, but Source is not. Create a permanent table
Target_Imported, on the target server. Put it in the same database as
Target, or in a staging database. Make the columns match the Target
table in name and order, but make the column data types match the
related column in Source. Since the types match importing from Source
into Target_Imported using BCP, DTS, SSIS or even INSERT/SELECT from a
linked server should be straightforward. Because the column names
match, copying the data between Target_Imported and Target requires a
simple INSERT/SELECT. You could add WITH RECOMPILE to the stored
procedure that does the copy so that the sort of minor datatype
changes you describe are handled without intervention.
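A rough sketch of that flow, with Source, Target, and Target_Imported as the placeholder names used above:

-- Staging pattern sketch; names are the placeholders from this post.
truncate table Target_Imported

insert into Target_Imported
select * from myLinkedServer.database.dbo.Source -- or BCP/DTS/SSIS

-- Column names match Target, so the final copy is a plain
-- INSERT/SELECT; implicit conversion absorbs minor widening.
insert into Target
select * from Target_Imported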
Hope that is clearer. I don't know that it is any better than what
you proposed, just another idea.
Roy Harvey
Beacon Falls, CT|||Thanks, but you have it backwards. The source can change at any time,
without any notice. Therefore, if the source changes, I want the target to
change as well. Hence, Select...Into.
Thanks again!
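A sketch of that nightly refresh, using the names from the original post, so the rebuilt table inherits whatever schema the source has that night:

-- Drop and re-create; SELECT...INTO copies the source's
-- current column definitions.
if object_id('dbo.T48', 'U') is not null
    drop table dbo.T48

select * into T48
from myLinkedServer.database.dbo.T48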
"Roy Harvey (SQL Server MVP)" <roy_harvey@.snet.net> wrote in message
news:c2oqf3p37k0ieg5n6j34120j662a0e319a@.4ax.com...
> On Fri, 28 Sep 2007 13:03:23 -0700, "ChrisR" <ChrisR@.foo.com> wrote:
>>I think what you are saying is to first do a Select...Into a local table
>>as
>>opposed to directly into the Linked Server? If so, I was already thinking
>>the same thing. If not, could you please elaborate?
> You have a source table somewhere on another server, call it Source.
> You have a target table on the target server, call it Target. Target
> is subject to change, but Source is not. Create a permanent table
> Target_Imported, on the target server. Put it in the same database as
> Target, or in a staging database. Make the columns match the Target
> table in name and order, but make the column data types match the
> related column in Source. Since the types match importing from Source
> into Target_Import using BCP, DTS, SSIS or even INSERT/SELECT from a
> linked server should be straight forward. Because the column names
> match, copying the data between Target_Imported and Target requires a
> simple INSERT/SELECT. You could add WITH RECOMPILE to the stored
> procedure that does the copy so that the sort of minor datatype
> changes you describe are handled without intervention.
> Hope that is clearer. I don't know that it is any better than what
> you proposed, just another idea.
> Roy Harvey
> Beacon Falls, CT