large data - Page 456 - Developer IT

c# equivalent to John Lakos - Large Scale c++ software design?

- by David Archer

I own and have read most of this book, a while back. Now I'm about to embark on a pretty large project and was wanting a refresher. Anyone got a recommendation of something similar that has a focus on c#. I know the principles remain, but if only for my readability.. cheers

Read the article

How to extend a large website to an iPhone app?

- by xoail

I am trying to create an iPhone app for a large website (as big as amazon.com) and it involves using cookies and what not to get authenticated via the Apache intercepter and access the web services exposed by the main website. For that I am looking for strategies to go about developing it. I am new to iPhone development and I am mostly looking for some architectural guidance. Does anyone know how services like eBay and Amazon work seamlessly across the website and iPhone app?

Read the article

What regular expression(s) would I use to remove escaped html from large sets of data.

- by Elizabeth Buckwalter

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting, and how much filtering was already setup (WP-O-Matic Wordpress plugin using the SimplePie library). This plugin does some basic encoding before insertion using Wordpress's built in post insert function which also does some filtering. I've figured out most of the filters before insertion, but now I have whacko data that I need to remove. This is an example of whacko data that I have data in one field which the content I want in the front, but this part removed which is at the end: <img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img> <img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk" Notice how some of the images are escape and some aren't. I believe this has to do with the last part being cut off so as to be unrecognizable as an html tag, which then caused it to be html endcoded. Another field has only this which is now filtered before insertion, but I have to get rid of the others: <img src="http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg" alt="post_img" width="80" (all examples are on one line, but broken up for readability) Question: What is the best way to work with the above escaped html (or portion of an html tag)? I can do it in Perl, PHP, SQL, Ruby, and even Python. I believe Perl to be the best at text parsing, so that's why I used the Perl tag. And PHP times out on large database operations, so that's pretty much out unless I wanted to do batch processing and what not. PS One of the nice things about using Wordpress's insert post function, is that if you use php's strip_tags function to strip out all html, insert post function will insert <p> at the paragraph points. Let me know if there's anything more that I can answer. Some article that didn't quite answer my questions. (http://stackoverflow.com/questions/2016751/remove-text-from-within-a-database-text-field) (http://stackoverflow.com/questions/462831/regular-expression-to-escape-html-ampersands-while-respecting-cdata)

Read the article

How to extend a large website to iPhone app?

- by xoail

I am trying to create an iphone app for a large website (as big as amazon.com) and it involves using cookies and what not to get authenticated via the apache intercepter and access the web services exposed by the main website. For that I am looking for strategies to go about developing it. I am new to iphone dev and mostly looking for some architectural guidance. Does anyone know how does apps like ebay and amazon work seamlessly across the website and iphone?

Read the article

How large is the download for the .NET 4.0 client profile?

- by Ben Straub

I've seen numbers about the full install size of the framework, but my Google-fu has failed me when it comes to how large the upgrade download will be. For instance, for a user who has 2.0 installed, how much will they have to download to upgrade to 4.0CP?

Read the article

JPA: what is the proper pattern for iterating over large result sets?

- by Caffeine Coma

Let's say I have a table with millions of rows. Using JPA, what's the proper way to iterate over a query against that table, such that I don't have all an in-memory List with millions of objects? I suspect that the following will blow up if the table is large: List<Model> models = entityManager().createQuery("from Model m", Model.class).getResultList(); for (Model model : models) { // do something with model } Is pagination (looping and manually updating setFirstResult()/setMaxResult()) really the best solution?

Read the article

Which term to use when referring to functional data structures: persistent or immutable?

- by Bob

In the context of functional programming which is the correct term to use: persistent or immutable? When I Google "immutable data structures" I get a Wikipedia link to an article on "Persistent data structure" which even goes on to say: such data structures are effectively immutable Which further confuses things for me. Do functional programs rely on persistent data structures or immutable data structures? Or are they always the same thing?

Read the article

When do you decide to split up large projects into smaller projects?

- by Khou

When/where do you decide to split a large visual studio project into smaller multiple projects? When you split the project, do you ... -group database tables -group similar functionality -other..

Read the article

Problem with receiving data form serial port in c#?

- by moon

hello i have problem with receiving data from serial port in c# in am inserting a new line operator at the end of data buffer. then i send this data buffer on serial port, after this my c# GUI receiver will take this data via Readline() function but it always give me raw data not the actual one how to resolve this problem.

Read the article

Why Document DB ( like mongodb and couchdb ) are better for large amount of data ?

- by xRobot

I am very newbie to this world of document db. So... why this db are better than RDBMS ( like mysql or postgresql ) for very large amount of data ?

Read the article

Progress during large file copy (Copy-Item & Write-Progress?)

- by Jason Jarrett

Is there any way to copy a really large file (from one server to another) in Powershell AND display it's progress? There are solutions out there to use Write-Progress in conjunction with looping to copy many files and display progress. However I can't seem to find anything that would show progress of a single file. Any thoughts?

Read the article

How do I serve a large file using Pylons?

- by Chris R

I am writing a Pylons-based download gateway. The gateway's client will address files by ID: /file_gw/download/1 Internally, the file itself is accessed via HTTP from an internal file server: http://internal-srv/path/to/file_1.content The files may be quite large, so I want to stream the content. I store metadata about the file in a StoredFile model object: class StoredFile(Base): id = Column(Integer, primary_key=True) name = Column(String) size = Column(Integer) content_type = Column(String) url = Column(String) Given this, what's the best (ie: most architecturally-sound, performant, et al) way to write my file_gw controller?

Read the article

Solve:Value was either too large or too small for an Int32.

- by jithincool

I am passing query string and the url is as follows- http://localhost:1086/Web/EditMobile.aspx?sno=2. But when i try to enter the url as follows,localhost:1086/Web/EditMobile.aspx?sno=2*3424324423432424* , i get the following error-Value was either too large or too small for an Int32. How do i handle this error. I must get an error report like " the value is not found in the table or database"

Read the article

How to sequence events while uploading large files to Amazon S3?

- by coolpawan4u

I am seeing problems uploading large files e.g. 100MB. Files go to Amazon S3 successfully, but after that scripts are waiting for if($flag == 1) and do not execute the rest of the code. Can anyone tell me the solution for this? if($s3->putObjectFile($uploadDIR, $bucket, $name, S3::ACL)) { $flag = 1; } if($flag == 1) { //continue code }

Read the article

In SQL Server, how can I separate a large number of tsql statement into batches?

- by Yousui

Hi guys, In SQL Server, how can I separate a large number of tsql statement into batches? Should I use the GO statement in stored procedures or functions? Should I use the GO statement in explicit transaction management situation(between BEGIN TRANSACTION and ROLLBACK TRANSACTION or COMMIT TRANSACTION)? Are there some best practice about this topic? Great thanks in advance.

Read the article

Firefox all but freezes during large file upload; Ajax progress bar infeasible; IE6 works fine

- by Sean

I want to provide a progress bar for my users who upload very large files. I did some reading and implemented what should be a pretty straightforward solution: I have a <form> element that contains an file input element; its target is set to the ID of a hidden iframe. On the server side, there's some Spring magic that attaches an object to the user's session; the progress of the upload can be queried from this object. After submitting the form, I start a repeating Ajax call using setInterval that queries the server for the percent-complete using the aforementioned session object. The call repeats every half-second, skipping the Ajax call if the previous call has not yet completed. I use the data from the call to update the width of an onscreen element. When the server call reports that the upload is complete, I clear the interval timer. I created a 100-megabyte file and uploaded it using my interface. This is using Firefox 3.6.3. What I found is that although the upload takes 20-25 seconds, the progress bar doesn't get updated until the very end. Moreover, the entire browser is basically frozen until the upload completes. I assumed that my method must be flawed, but I tried the same page using IE6, and was utterly amazed when it behaved as I had designed it to--the progress bar got updated every half second, and the whole upload only took about 15 seconds, much faster than Firefox. I don't have many add-ons installed, but I tried disabling Firebug and restarting my browser. This marginally improved the performance--I got perhaps a single additional progress bar update mid-upload--but still far from acceptable. Can anyone tell me what I can do to bring Firefox's performance up to the level of IE6? Ugh, I can't believe I actually typed that. EDIT: I just tried uploading a large file from a Firefox 3.6.3 browser on a different machine than the one that's running my web server, and it worked fine. Huh.

Read the article

Indexing with pointer C/C++

- by Leavenotrace

Hey I'm trying to write a program to carry out newtons method and find the roots of the equation exp(-x)-(x^2)+3. It works in so far as finding the root, but I also want it to print out the root after each iteration but I can't get it to work, Could anyone point out my mistake I think its something to do with my indexing? Thanks a million :) #include <stdio.h> #include <math.h> #include <malloc.h> //Define Functions: double evalf(double x) { double answer=exp(-x)-(x*x)+3; return(answer); } double evalfprime(double x) { double answer=-exp(-x)-2*x; return(answer); } double *newton(double initialrt,double accuracy,double *data) { double root[102]; data=root; int maxit = 0; root[0] = initialrt; for (int i=1;i<102;i++) { *(data+i)=*(data+i-1)-evalf(*(data+i-1))/evalfprime(*(data+i-1)); if(fabs(*(data+i)-*(data+i-1))<accuracy) { maxit=i; break; } maxit=i; } if((maxit+1==102)&&(fabs(*(data+maxit)-*(data+maxit-1))>accuracy)) { printf("\nMax iteration reached, method terminated"); } else { printf("\nMethod successful"); printf("\nNumber of iterations: %d\nRoot Estimate: %lf\n",maxit+1,*(data+maxit)); } return(data); } int main() { double root,accuracy; double *data=(double*)malloc(sizeof(double)*102); printf("NEWTONS METHOD PROGRAMME:\nEquation: f(x)=exp(-x)-x^2+3=0\nMax No iterations=100\n\nEnter initial root estimate\n>> "); scanf("%lf",&root); _flushall(); printf("\nEnter accuracy required:\n>>"); scanf("%lf",&accuracy); *data= *newton(root,accuracy,data); printf("Iteration Root Error\n "); printf("%d %lf \n", 0,*(data)); for(int i=1;i<102;i++) { printf("%d %5.5lf %5.5lf\n", i,*(data+i),*(data+i)-*(data+i-1)); if(*(data+i*sizeof(double))-*(data+i*sizeof(double)-1)==0) { break; } } getchar(); getchar(); free(data); return(0); }

Read the article

I got the address of a large managed object in WinDbg, what next?

- by Mahen

I created a high memory utilization dump and using !dumpheap -stat and !dumpheap -mt I got the address of two large string generic list of 30 MB each. I want to know more about these lists. What they contain or which piece of code is using them. Is there a way to find them out? 0:000 !do 2b370038 Name: System.Object[] MethodTable: 71e240bc EEClass: 71c0da54 Size: 33554448(0x2000010) bytes Array: Rank 1, Number of elements 8388608, Type CLASS Element Type: System.Collections.Generic.List`1[[System.String, mscorlib]] Fields: None

Read the article

How to use an excel data-set for a multi-line ggplot in R?

- by user1299887

I have a data set in excel that I am trying to create a multiple line plot with on R. The data set contains 7 food groups and the calories consumed daily associated to the groups. As well, there is that set of data over 38 years (from 1970-2008) and I am attempting to use this data set to create a multiple line plot on R. I have tried for hours on end but can not seem to get R to recognize the variables within the data set.

Read the article

Configuring Multiple Instances of MySQL in Solaris 11

- by rajeshr

Recently someone asked me for steps to configure multiple instances of MySQL database in an Operating Platform. Coz of my familiarity with Solaris OE, I prepared some notes on configuring multiple instances of MySQL database on Solaris 11. Maybe it's useful for some: If you want to run Solaris Operating System (or any other OS of your choice) as a virtualized instance in desktop, consider using Virtual Box. To download Solaris Operating System, click here. Once you have your Solaris Operating System (Version 11) up and running and have Internet connectivity to gain access to the Image Packaging System (IPS), please follow the steps as mentioned below to install MySQL and configure multiple instances: 1. Install MySQL Database in Solaris 11 $ sudo pkg install mysql-51 2. Verify if the mysql is installed: $ svcs -a | grep mysql Note: Service FMRI will look similar to the one here: svc:/application/database/mysql:version_51 3. Prepare data file system for MySQL Instance 1 zfs create rpool/mysql zfs create rpool/mysql/data zfs set mountpoint=/mysql/data rpool/mysql/data 4. Prepare data file system for MySQL Instance 2 zfs create rpool/mysql/data2 zfs set mountpoint=/mysql/data rpool/mysql/data2 5. Change the mysql/datadir of the MySQL Service (SMF) to point to /mysql/data $ svcprop mysql:version_51 | grep mysql/data $ svccfg -s mysql:version_51 setprop mysql/data=/mysql/data 6. Create a new instance of MySQL 5.1 (a) Copy the manifest of the default instance to temporary directory: $ sudo cp /lib/svc/manifest/application/database/mysql_51.xml /var/tmp/mysql_51_2.xml (b) Make appropriate modifications on the XML file $ sudo vi /var/tmp/mysql_51_2.xml - Change the "instance name" section to a new value "version_51_2" - Change the value of property name "data" to point to the ZFS file system "/mysql/data2" 7. Import the manifest to the SMF repository: $ sudo svccfg import /var/tmp/mysql_51_2.xml 8. Before starting the service, copy the file /etc/mysql/my.cnf to the data directories /mysql/data & /mysql/data2. $ sudo cp /etc/mysql/my.cnf /mysql/data/ $ sudo cp /etc/mysql/my.cnf /mysql/data2/ 9. Make modifications to the my.cnf in each of the data directories as required: $ sudo vi /mysql/data/my.cnf Under the [client] section port=3306 socket=/tmp/mysql.sock ---- ---- Under the [mysqld] section port=3306 socket=/tmp/mysql.sock datadir=/mysql/data ----- ----- server-id=1 $ sudo vi /mysql/data2/my.cnf Under the [client] section port=3307 socket=/tmp/mysql2.sock ----- ----- Under the [mysqld] section port=3307 socket=/tmp/mysql2.sock datadir=/mysql/data2 ----- ----- server-id=2 10. Make appropriate modification to the startup script of MySQL (managed by SMF) to point to the appropriate my.cnf for each instance: $ sudo vi /lib/svc/method/mysql_51 Note: Search for all occurences of mysqld_safe command and modify it to include the --defaults-file option. An example entry would look as follows: ${MySQLBIN}/mysqld_safe --defaults-file=${MYSQLDATA}/my.cnf --user=mysql --datadir=${MYSQLDATA} --pid=file=${PIDFILE} 11. Start the service: $ sudo svcadm enable mysql:version_51_2 $ sudo svcadm enable mysql:version_51 12. Verify that the two services are running by using: $ svcs mysql 13. Verify the processes: $ ps -ef | grep mysqld 14. Connect to each mysqld instance and verify: $ mysql --defaults-file=/mysql/data/my.cnf -u root -p $ mysql --defaults-file=/mysql/data2/my.cnf -u root -p Some references for Solaris 11 newbies Taking your first steps with Solaris 11 Introducing the basics of Image Packaging System Service Management Facility How To Guide For a detailed list of official educational modules available on Solaris 11, please visit here For MySQL courses from Oracle University access this page.

Read the article

How would you gather client's data on Google App Engine without using Datastore/Backend Instances too much?

- by ruslan

I'm relatively new to StackExchange and not sure if it's appropriate place to ask design question. Site gives me a hint "The question you're asking appears subjective and is likely to be closed". Please let me know. Anyway.. One of the projects I'm working on is online survey engine. It's my first big commercial project on Google App Engine. I need your advice on how to collect stats and efficiently record them in DataStore without bankrupting me. Initial requirements are: After user finishes survey client sends list of pairs [ID (int) + PercentHit (double)]. This list shows how close answers of this user match predefined answers of reference answerers (which identified by IDs). I call them "target IDs". Creator of the survey wants to see aggregated % for given IDs for last hour, particular timeframe or from the beginning of the survey. Some surveys may have thousands of target/reference answerers. So I created entity public class HitsStatsDO implements Serializable { @Id transient private Long id; transient private Long version = (long) 0; transient private Long startDate; @Parent transient private Key parent; // fake parent which contains target id @Transient int targetId; private double avgPercent; private long hitCount; } But writing HitsStatsDO for each target from each user would give a lot of data. For instance I had a survey with 3000 targets which was answered by ~4 million people within one week with 300K people taking survey in first day. Even if we assume they were answering it evenly for 24 hours it would give us ~1040 writes/second. Obviously it hits concurrent writes limit of Datastore. I decided I'll collect data for one hour and save that, that's why there are avgPercent and hitCount in HitsStatsDO. GAE instances are stateless so I had to use dynamic backend instance. There I have something like this: // Contains stats for one hour private class Shard { ReadWriteLock lock = new ReentrantReadWriteLock(); Map<Integer, HitsStatsDO> map = new HashMap<Integer, HitsStatsDO>(); // Key is target ID public void saveToDatastore(); public void updateStats(Long startDate, Map<Integer, Double> hits); } and map with shard for current hour and previous hour (which doesn't stay here for long) private HashMap<Long, Shard> shards = new HashMap<Long, Shard>(); // Key is HitsStatsDO.startDate So once per hour I dump Shard for previous hour to Datastore. Plus I have class LifetimeStats which keeps Map<Integer, HitsStatsDO> in memcached where map-key is target ID. Also in my backend shutdown hook method I dump stats for unfinished hour to Datastore. There is only one major issue here - I have only ONE backend instance :) It raises following questions on which I'd like to hear your opinion: Can I do this without using backend instance ? What if one instance is not enough ? How can I split data between multiple dynamic backend instances? It hard because I don't know how many I have because Google creates new one as load increases. I know I can launch exact number of resident backend instances. But how many ? 2, 5, 10 ? What if I have no load at all for a week. Constantly running 10 backend instances is too expensive. What do I do with data from clients while backend instance is dead/restarting? Thank you very much in advance for your thoughts.

Read the article

How can I gather client's data on Google App Engine without using Datastore/Backend Instances too much?

- by ruslan

One of the projects I'm working on is online survey engine. It's my first big commercial project on Google App Engine. I need your advice on how to collect stats and efficiently record them in DataStore without bankrupting me. Initial requirements are: After user finishes survey client sends list of pairs [ID (int) + PercentHit (double)]. This list shows how close answers of this user match predefined answers of reference answerers (which identified by IDs). I call them "target IDs". Creator of the survey wants to see aggregated % for given IDs for last hour, particular timeframe or from the beginning of the survey. Some surveys may have thousands of target/reference answerers. So I created entity public class HitsStatsDO implements Serializable { @Id transient private Long id; transient private Long version = (long) 0; transient private Long startDate; @Parent transient private Key parent; // fake parent which contains target id @Transient int targetId; private double avgPercent; private long hitCount; } But writing HitsStatsDO for each target from each user would give a lot of data. For instance I had a survey with 3000 targets which was answered by ~4 million people within one week with 300K people taking survey in first day. Even if we assume they were answering it evenly for 24 hours it would give us ~1040 writes/second. Obviously it hits concurrent writes limit of Datastore. I decided I'll collect data for one hour and save that, that's why there are avgPercent and hitCount in HitsStatsDO. GAE instances are stateless so I had to use dynamic backend instance. There I have something like this: // Contains stats for one hour private class Shard { ReadWriteLock lock = new ReentrantReadWriteLock(); Map<Integer, HitsStatsDO> map = new HashMap<Integer, HitsStatsDO>(); // Key is target ID public void saveToDatastore(); public void updateStats(Long startDate, Map<Integer, Double> hits); } and map with shard for current hour and previous hour (which doesn't stay here for long) private HashMap<Long, Shard> shards = new HashMap<Long, Shard>(); // Key is HitsStatsDO.startDate So once per hour I dump Shard for previous hour to Datastore. Plus I have class LifetimeStats which keeps Map<Integer, HitsStatsDO> in memcached where map-key is target ID. Also in my backend shutdown hook method I dump stats for unfinished hour to Datastore. There is only one major issue here - I have only ONE backend instance :) It raises following questions on which I'd like to hear your opinion: Can I do this without using backend instance ? What if one instance is not enough ? How can I split data between multiple dynamic backend instances? It hard because I don't know how many I have because Google creates new one as load increases. I know I can launch exact number of resident backend instances. But how many ? 2, 5, 10 ? What if I have no load at all for a week. Constantly running 10 backend instances is too expensive. What do I do with data from clients while backend instance is dead/restarting?

Read the article

Are there any Microsoft Exchange Clients for iOS and Android that store their local data in an encrypted manner?

- by Zac B

I don't feel like this is a product recommendation question, more of a "does this tech even exist and is it feasible" question, but if I'm wrong, feel free to give this question the boot. Context: Our company has a bunch of traveling employees who access the company's Exchange server via thier iDevices or android phones, but because of the data protection laws in the state where our company is based (and the nature of the data our company works with), a recent security audit found that all mobile devices (laptops, phones, etc) operated by our company need to have all company correspondence and related data encrypted all the time. For laptops, that was easy: BitLocker or TrueCrypt, problem solved. For phones and tablets, however, I'm stumped. Sure, you can put lock screens/passwords on the phones, but the data is still accessible via external extraction, as law enforcement authorities already know. Question: Are there any clients for Microsoft Exchange that run on iOS or Android which store local data encrypted? The people using our mobile devices do a lot of their work while offline, so just giving them OWA access with SSL connection security isn't enough. Are there apps/technologies that present an additional login credential prompt to decrypt locally stored data in the app's storage area on the phone? My gut reaction when I started looking into this was "that doesn't sound like something Apple would allow into the App Store", but I've been wrong before...

Read the article

Security implications of adding www-data to /etc/sudoers to run php-cgi as a different user

- by BMiner

What I really want to do is allow the 'www-data' user to have the ability to launch php-cgi as another user. I just want to make sure that I fully understand the security implications. The server should support a shared hosting environment where various (possibly untrusted) users have chroot'ed FTP access to the server to store their HTML and PHP files. Then, since PHP scripts can be malicious and read/write others' files, I'd like to ensure that each users' PHP scripts run with the same user permissions for that user (instead of running as www-data). Long story short, I have added the following line to my /etc/sudoers file, and I wanted to run it past the community as a sanity check: www-data ALL = (%www-data) NOPASSWD: /usr/bin/php-cgi This line should only allow www-data to run a command like this (without a password prompt): sudo -u some_user /usr/bin/php-cgi ...where some_user is a user in the group www-data. What are the security implications of this? This should then allow me to modify my Lighttpd configuration like this: fastcgi.server += ( ".php" => (( "bin-path" => "sudo -u some_user /usr/bin/php-cgi", "socket" => "/tmp/php.socket", "max-procs" => 1, "bin-environment" => ( "PHP_FCGI_CHILDREN" => "4", "PHP_FCGI_MAX_REQUESTS" => "10000" ), "bin-copy-environment" => ( "PATH", "SHELL", "USER" ), "broken-scriptfilename" => "enable" )) ) ...allowing me to spawn new FastCGI server instances for each user.

Read the article

Beware Sneaky Reads with Unique Indexes

- by Paul White NZ

A few days ago, Sandra Mueller (twitter | blog) asked a question using twitter’s #sqlhelp hash tag: “Might SQL Server retrieve (out-of-row) LOB data from a table, even if the column isn’t referenced in the query?” Leaving aside trivial cases (like selecting a computed column that does reference the LOB data), one might be tempted to say that no, SQL Server does not read data you haven’t asked for. In general, that’s quite correct; however there are cases where SQL Server might sneakily retrieve a LOB column… Example Table Here’s a T-SQL script to create that table and populate it with 1,000 rows: CREATE TABLE dbo.LOBtest ( pk INTEGER IDENTITY NOT NULL, some_value INTEGER NULL, lob_data VARCHAR(MAX) NULL, another_column CHAR(5) NULL, CONSTRAINT [PK dbo.LOBtest pk] PRIMARY KEY CLUSTERED (pk ASC) ); GO DECLARE @Data VARCHAR(MAX); SET @Data = REPLICATE(CONVERT(VARCHAR(MAX), 'x'), 65540); WITH Numbers (n) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns C1, master.sys.columns C2 ) INSERT LOBtest WITH (TABLOCKX) ( some_value, lob_data ) SELECT TOP (1000) N.n, @Data FROM Numbers N WHERE N.n <= 1000; Test 1: A Simple Update Let’s run a query to subtract one from every value in the some_value column: UPDATE dbo.LOBtest WITH (TABLOCKX) SET some_value = some_value - 1; As you might expect, modifying this integer column in 1,000 rows doesn’t take very long, or use many resources. The STATITICS IO and TIME output shows a total of 9 logical reads, and 25ms elapsed time. The query plan is also very simple: Looking at the Clustered Index Scan, we can see that SQL Server only retrieves the pk and some_value columns during the scan: The pk column is needed by the Clustered Index Update operator to uniquely identify the row that is being changed. The some_value column is used by the Compute Scalar to calculate the new value. (In case you are wondering what the Top operator is for, it is used to enforce SET ROWCOUNT). Test 2: Simple Update with an Index Now let’s create a nonclustered index keyed on the some_value column, with lob_data as an included column: CREATE NONCLUSTERED INDEX [IX dbo.LOBtest some_value (lob_data)] ON dbo.LOBtest (some_value) INCLUDE ( lob_data ) WITH ( FILLFACTOR = 100, MAXDOP = 1, SORT_IN_TEMPDB = ON ); This is not a useful index for our simple update query; imagine that someone else created it for a different purpose. Let’s run our update query again: UPDATE dbo.LOBtest WITH (TABLOCKX) SET some_value = some_value - 1; We find that it now requires 4,014 logical reads and the elapsed query time has increased to around 100ms. The extra logical reads (4 per row) are an expected consequence of maintaining the nonclustered index. The query plan is very similar to before (click to enlarge): The Clustered Index Update operator picks up the extra work of maintaining the nonclustered index. The new Compute Scalar operators detect whether the value in the some_value column has actually been changed by the update. SQL Server may be able to skip maintaining the nonclustered index if the value hasn’t changed (see my previous post on non-updating updates for details). Our simple query does change the value of some_data in every row, so this optimization doesn’t add any value in this specific case. The output list of columns from the Clustered Index Scan hasn’t changed from the one shown previously: SQL Server still just reads the pk and some_data columns. Cool. Overall then, adding the nonclustered index hasn’t had any startling effects, and the LOB column data still isn’t being read from the table. Let’s see what happens if we make the nonclustered index unique. Test 3: Simple Update with a Unique Index Here’s the script to create a new unique index, and drop the old one: CREATE UNIQUE NONCLUSTERED INDEX [UQ dbo.LOBtest some_value (lob_data)] ON dbo.LOBtest (some_value) INCLUDE ( lob_data ) WITH ( FILLFACTOR = 100, MAXDOP = 1, SORT_IN_TEMPDB = ON ); GO DROP INDEX [IX dbo.LOBtest some_value (lob_data)] ON dbo.LOBtest; Remember that SQL Server only enforces uniqueness on index keys (the some_data column). The lob_data column is simply stored at the leaf-level of the non-clustered index. With that in mind, we might expect this change to make very little difference. Let’s see: UPDATE dbo.LOBtest WITH (TABLOCKX) SET some_value = some_value - 1; Whoa! Now look at the elapsed time and logical reads: Scan count 1, logical reads 2016, physical reads 0, read-ahead reads 0, lob logical reads 36015, lob physical reads 0, lob read-ahead reads 15992. CPU time = 172 ms, elapsed time = 16172 ms. Even with all the data and index pages in memory, the query took over 16 seconds to update just 1,000 rows, performing over 52,000 LOB logical reads (nearly 16,000 of those using read-ahead). Why on earth is SQL Server reading LOB data in a query that only updates a single integer column? The Query Plan The query plan for test 3 looks a bit more complex than before: In fact, the bottom level is exactly the same as we saw with the non-unique index. The top level has heaps of new stuff though, which I’ll come to in a moment. You might be expecting to find that the Clustered Index Scan is now reading the lob_data column (for some reason). After all, we need to explain where all the LOB logical reads are coming from. Sadly, when we look at the properties of the Clustered Index Scan, we see exactly the same as before: SQL Server is still only reading the pk and some_value columns – so what’s doing the LOB reads? Updates that Sneakily Read Data We have to go as far as the Clustered Index Update operator before we see LOB data in the output list: [Expr1020] is a bit flag added by an earlier Compute Scalar. It is set true if the some_value column has not been changed (part of the non-updating updates optimization I mentioned earlier). The Clustered Index Update operator adds two new columns: the lob_data column, and some_value_OLD. The some_value_OLD column, as the name suggests, is the pre-update value of the some_value column. At this point, the clustered index has already been updated with the new value, but we haven’t touched the nonclustered index yet. An interesting observation here is that the Clustered Index Update operator can read a column into the data flow as part of its update operation. SQL Server could have read the LOB data as part of the initial Clustered Index Scan, but that would mean carrying the data through all the operations that occur prior to the Clustered Index Update. The server knows it will have to go back to the clustered index row to update it, so it delays reading the LOB data until then. Sneaky! Why the LOB Data Is Needed This is all very interesting (I hope), but why is SQL Server reading the LOB data? For that matter, why does it need to pass the pre-update value of the some_value column out of the Clustered Index Update? The answer relates to the top row of the query plan for test 3. I’ll reproduce it here for convenience: Notice that this is a wide (per-index) update plan. SQL Server used a narrow (per-row) update plan in test 2, where the Clustered Index Update took care of maintaining the nonclustered index too. I’ll talk more about this difference shortly. The Split/Sort/Collapse combination is an optimization, which aims to make per-index update plans more efficient. It does this by breaking each update into a delete/insert pair, reordering the operations, removing any redundant operations, and finally applying the net effect of all the changes to the nonclustered index. Imagine we had a unique index which currently holds three rows with the values 1, 2, and 3. If we run a query that adds 1 to each row value, we would end up with values 2, 3, and 4. The net effect of all the changes is the same as if we simply deleted the value 1, and added a new value 4. By applying net changes, SQL Server can also avoid false unique-key violations. If we tried to immediately update the value 1 to a 2, it would conflict with the existing value 2 (which would soon be updated to 3 of course) and the query would fail. You might argue that SQL Server could avoid the uniqueness violation by starting with the highest value (3) and working down. That’s fine, but it’s not possible to generalize this logic to work with every possible update query. SQL Server has to use a wide update plan if it sees any risk of false uniqueness violations. It’s worth noting that the logic SQL Server uses to detect whether these violations are possible has definite limits. As a result, you will often receive a wide update plan, even when you can see that no violations are possible. Another benefit of this optimization is that it includes a sort on the index key as part of its work. Processing the index changes in index key order promotes sequential I/O against the nonclustered index. A side-effect of all this is that the net changes might include one or more inserts. In order to insert a new row in the index, SQL Server obviously needs all the columns – the key column and the included LOB column. This is the reason SQL Server reads the LOB data as part of the Clustered Index Update. In addition, the some_value_OLD column is required by the Split operator (it turns updates into delete/insert pairs). In order to generate the correct index key delete operation, it needs the old key value. The irony is that in this case the Split/Sort/Collapse optimization is anything but. Reading all that LOB data is extremely expensive, so it is sad that the current version of SQL Server has no way to avoid it. Finally, for completeness, I should mention that the Filter operator is there to filter out the non-updating updates. Beating the Set-Based Update with a Cursor One situation where SQL Server can see that false unique-key violations aren’t possible is where it can guarantee that only one row is being updated. Armed with this knowledge, we can write a cursor (or the WHILE-loop equivalent) that updates one row at a time, and so avoids reading the LOB data: SET NOCOUNT ON; SET STATISTICS XML, IO, TIME OFF; DECLARE @PK INTEGER, @StartTime DATETIME; SET @StartTime = GETUTCDATE(); DECLARE curUpdate CURSOR LOCAL FORWARD_ONLY KEYSET SCROLL_LOCKS FOR SELECT L.pk FROM LOBtest L ORDER BY L.pk ASC; OPEN curUpdate; WHILE (1 = 1) BEGIN FETCH NEXT FROM curUpdate INTO @PK; IF @@FETCH_STATUS = -1 BREAK; IF @@FETCH_STATUS = -2 CONTINUE; UPDATE dbo.LOBtest SET some_value = some_value - 1 WHERE CURRENT OF curUpdate; END; CLOSE curUpdate; DEALLOCATE curUpdate; SELECT DATEDIFF(MILLISECOND, @StartTime, GETUTCDATE()); That completes the update in 1280 milliseconds (remember test 3 took over 16 seconds!) I used the WHERE CURRENT OF syntax there and a KEYSET cursor, just for the fun of it. One could just as well use a WHERE clause that specified the primary key value instead. Clustered Indexes A clustered index is the ultimate index with included columns: all non-key columns are included columns in a clustered index. Let’s re-create the test table and data with an updatable primary key, and without any non-clustered indexes: IF OBJECT_ID(N'dbo.LOBtest', N'U') IS NOT NULL DROP TABLE dbo.LOBtest; GO CREATE TABLE dbo.LOBtest ( pk INTEGER NOT NULL, some_value INTEGER NULL, lob_data VARCHAR(MAX) NULL, another_column CHAR(5) NULL, CONSTRAINT [PK dbo.LOBtest pk] PRIMARY KEY CLUSTERED (pk ASC) ); GO DECLARE @Data VARCHAR(MAX); SET @Data = REPLICATE(CONVERT(VARCHAR(MAX), 'x'), 65540); WITH Numbers (n) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns C1, master.sys.columns C2 ) INSERT LOBtest WITH (TABLOCKX) ( pk, some_value, lob_data ) SELECT TOP (1000) N.n, N.n, @Data FROM Numbers N WHERE N.n <= 1000; Now here’s a query to modify the cluster keys: UPDATE dbo.LOBtest SET pk = pk + 1; The query plan is: As you can see, the Split/Sort/Collapse optimization is present, and we also gain an Eager Table Spool, for Halloween protection. In addition, SQL Server now has no choice but to read the LOB data in the Clustered Index Scan: The performance is not great, as you might expect (even though there is no non-clustered index to maintain): Table 'LOBtest'. Scan count 1, logical reads 2011, physical reads 0, read-ahead reads 0, lob logical reads 36015, lob physical reads 0, lob read-ahead reads 15992. Table 'Worktable'. Scan count 1, logical reads 2040, physical reads 0, read-ahead reads 0, lob logical reads 34000, lob physical reads 0, lob read-ahead reads 8000. SQL Server Execution Times: CPU time = 483 ms, elapsed time = 17884 ms. Notice how the LOB data is read twice: once from the Clustered Index Scan, and again from the work table in tempdb used by the Eager Spool. If you try the same test with a non-unique clustered index (rather than a primary key), you’ll get a much more efficient plan that just passes the cluster key (including uniqueifier) around (no LOB data or other non-key columns): A unique non-clustered index (on a heap) works well too: Both those queries complete in a few tens of milliseconds, with no LOB reads, and just a few thousand logical reads. (In fact the heap is rather more efficient). There are lots more fun combinations to try that I don’t have space for here. Final Thoughts The behaviour shown in this post is not limited to LOB data by any means. If the conditions are met, any unique index that has included columns can produce similar behaviour – something to bear in mind when adding large INCLUDE columns to achieve covering queries, perhaps. Paul White Email: [email protected] Twitter: @PaulWhiteNZ

Search Results

Search found 65558 results on 2623 pages for 'large data'.

Page 456/2623 | < Previous Page | 452 453 454 455 456 457 458 459 460 461 462 463 | Next Page >

- by David Archer

- by xoail

- by Elizabeth Buckwalter

- by xoail

- by Ben Straub

- by Caffeine Coma

- by Bob

- by Khou

- by moon

- by xRobot

- by Jason Jarrett

- by Chris R

- by jithincool

- by coolpawan4u

- by Yousui

- by Sean

- by Leavenotrace

- by Mahen

- by user1299887

- by rajeshr

- by ruslan

- by ruslan

- by Zac B

- by BMiner

- by Paul White NZ

< Previous Page | 452 453 454 455 456 457 458 459 460 461 462 463 | Next Page >