Search Results

Search found 1058 results on 43 pages for 'compute'.

Page 13/43 | < Previous Page | 9 10 11 12 13 14 15 16 17 18 19 20  | Next Page >

  • SSH Login to an EC2 instance failing with previously working keys...

    - by Matthew Savage
    We recently had an issues where I had rebooted our EC2 instance (Ubuntu x86_64, version 9.10 server) and due to an EC2 issue the instance needed to be stopped and was down for a few days. Now I have been able to bring the instance back online I cannot connect to SSH using the keypair which previously worked. Unfortunately SSH is the only way to get into this server, and while I have another system running in its place there are a number of things I would like to try and retrieve from the machine. Running SSH in verbose mode yields the following: [Broc-MBP.local]: Broc:~/.ssh ? ssh -i ~/.ssh/EC2Keypair.pem -l ubuntu ec2-xxx.compute-1.amazonaws.com -vvv OpenSSH_5.2p1, OpenSSL 0.9.8l 5 Nov 2009 debug1: Reading configuration data /Users/Broc/.ssh/config debug1: Reading configuration data /etc/ssh_config debug2: ssh_connect: needpriv 0 debug1: Connecting to ec2-xxx.compute-1.amazonaws.com [184.73.109.130] port 22. debug1: Connection established. debug3: Not a RSA1 key file /Users/Broc/.ssh/EC2Keypair.pem. debug2: key_type_from_name: unknown key type '-----BEGIN' debug3: key_read: missing keytype debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug2: key_type_from_name: unknown key type '-----END' debug3: key_read: missing keytype debug1: identity file /Users/Broc/.ssh/EC2Keypair.pem type -1 debug3: Not a RSA1 key file /Users/Broc/.ssh/id_rsa. debug2: key_type_from_name: unknown key type '-----BEGIN' debug3: key_read: missing keytype debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug2: key_type_from_name: unknown key type '-----END' debug3: key_read: missing keytype debug1: identity file /Users/Broc/.ssh/id_rsa type 1 debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1p1 Debian-6ubuntu2 debug1: match: OpenSSH_5.1p1 Debian-6ubuntu2 pat OpenSSH* debug1: Enabling compatibility mode for protocol 2.0 debug1: Local version string SSH-2.0-OpenSSH_5.2 debug2: fd 3 setting O_NONBLOCK debug1: SSH2_MSG_KEXINIT sent debug1: SSH2_MSG_KEXINIT received debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 debug2: kex_parse_kexinit: ssh-rsa,ssh-dss debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected] debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected] debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: none,[email protected],zlib debug2: kex_parse_kexinit: none,[email protected],zlib debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: first_kex_follows 0 debug2: kex_parse_kexinit: reserved 0 debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 debug2: kex_parse_kexinit: ssh-rsa,ssh-dss debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: none,[email protected] debug2: kex_parse_kexinit: none,[email protected] debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: first_kex_follows 0 debug2: kex_parse_kexinit: reserved 0 debug2: mac_setup: found hmac-md5 debug1: kex: server->client aes128-ctr hmac-md5 none debug2: mac_setup: found hmac-md5 debug1: kex: client->server aes128-ctr hmac-md5 none debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP debug2: dh_gen_key: priv key bits set: 123/256 debug2: bits set: 500/1024 debug1: SSH2_MSG_KEX_DH_GEX_INIT sent debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY debug3: check_host_in_hostfile: filename /Users/Broc/.ssh/known_hosts debug3: check_host_in_hostfile: match line 106 debug3: check_host_in_hostfile: filename /Users/Broc/.ssh/known_hosts debug3: check_host_in_hostfile: match line 106 debug1: Host 'ec2-xxx.compute-1.amazonaws.com' is known and matches the RSA host key. debug1: Found key in /Users/Broc/.ssh/known_hosts:106 debug2: bits set: 521/1024 debug1: ssh_rsa_verify: signature correct debug2: kex_derive_keys debug2: set_newkeys: mode 1 debug1: SSH2_MSG_NEWKEYS sent debug1: expecting SSH2_MSG_NEWKEYS debug2: set_newkeys: mode 0 debug1: SSH2_MSG_NEWKEYS received debug1: SSH2_MSG_SERVICE_REQUEST sent debug2: service_accept: ssh-userauth debug1: SSH2_MSG_SERVICE_ACCEPT received debug2: key: /Users/Broc/.ssh/id_rsa (0x100125f70) debug2: key: /Users/Broc/.ssh/EC2Keypair.pem (0x0) debug1: Authentications that can continue: publickey debug3: start over, passed a different list publickey debug3: preferred publickey,keyboard-interactive,password debug3: authmethod_lookup publickey debug3: remaining preferred: keyboard-interactive,password debug3: authmethod_is_enabled publickey debug1: Next authentication method: publickey debug1: Offering public key: /Users/Broc/.ssh/id_rsa debug3: send_pubkey_test debug2: we sent a publickey packet, wait for reply debug1: Authentications that can continue: publickey debug1: Trying private key: /Users/Broc/.ssh/EC2Keypair.pem debug1: read PEM private key done: type RSA debug3: sign_and_send_pubkey debug2: we sent a publickey packet, wait for reply debug1: Authentications that can continue: publickey debug2: we did not send a packet, disable method debug1: No more authentication methods to try. Permission denied (publickey). [Broc-MBP.local]: Broc:~/.ssh ? So, right now I'm really at a loss and not sure what to do. While I've already got another system taking the place of this one I'd really like to have access back :|

    Read the article

  • C# Neural Networks with Encog

    - by JoshReuben
    Neural Networks ·       I recently read a book Introduction to Neural Networks for C# , by Jeff Heaton. http://www.amazon.com/Introduction-Neural-Networks-C-2nd/dp/1604390093/ref=sr_1_2?ie=UTF8&s=books&qid=1296821004&sr=8-2-spell. Not the 1st ANN book I've perused, but a nice revision.   ·       Artificial Neural Networks (ANNs) are a mechanism of machine learning – see http://en.wikipedia.org/wiki/Artificial_neural_network , http://en.wikipedia.org/wiki/Category:Machine_learning ·       Problems Not Suited to a Neural Network Solution- Programs that are easily written out as flowcharts consisting of well-defined steps, program logic that is unlikely to change, problems in which you must know exactly how the solution was derived. ·       Problems Suited to a Neural Network – pattern recognition, classification, series prediction, and data mining. Pattern recognition - network attempts to determine if the input data matches a pattern that it has been trained to recognize. Classification - take input samples and classify them into fuzzy groups. ·       As far as machine learning approaches go, I thing SVMs are superior (see http://en.wikipedia.org/wiki/Support_vector_machine ) - a neural network has certain disadvantages in comparison: an ANN can be overtrained, different training sets can produce non-deterministic weights and it is not possible to discern the underlying decision function of an ANN from its weight matrix – they are black box. ·       In this post, I'm not going to go into internals (believe me I know them). An autoassociative network (e.g. a Hopfield network) will echo back a pattern if it is recognized. ·       Under the hood, there is very little maths. In a nutshell - Some simple matrix operations occur during training: the input array is processed (normalized into bipolar values of 1, -1) - transposed from input column vector into a row vector, these are subject to matrix multiplication and then subtraction of the identity matrix to get a contribution matrix. The dot product is taken against the weight matrix to yield a boolean match result. For backpropogation training, a derivative function is required. In learning, hill climbing mechanisms such as Genetic Algorithms and Simulated Annealing are used to escape local minima. For unsupervised training, such as found in Self Organizing Maps used for OCR, Hebbs rule is applied. ·       The purpose of this post is not to mire you in technical and conceptual details, but to show you how to leverage neural networks via an abstraction API - Encog   Encog ·       Encog is a neural network API ·       Links to Encog: http://www.encog.org , http://www.heatonresearch.com/encog, http://www.heatonresearch.com/forum ·       Encog requires .Net 3.5 or higher – there is also a Silverlight version. Third-Party Libraries – log4net and nunit. ·       Encog supports feedforward, recurrent, self-organizing maps, radial basis function and Hopfield neural networks. ·       Encog neural networks, and related data, can be stored in .EG XML files. ·       Encog Workbench allows you to edit, train and visualize neural networks. The Encog Workbench can generate code. Synapses and layers ·       the primary building blocks - Almost every neural network will have, at a minimum, an input and output layer. In some cases, the same layer will function as both input and output layer. ·       To adapt a problem to a neural network, you must determine how to feed the problem into the input layer of a neural network, and receive the solution through the output layer of a neural network. ·       The Input Layer - For each input neuron, one double value is stored. An array is passed as input to a layer. Encog uses the interface INeuralData to hold these arrays. The class BasicNeuralData implements the INeuralData interface. Once the neural network processes the input, an INeuralData based class will be returned from the neural network's output layer. ·       convert a double array into an INeuralData object : INeuralData data = new BasicNeuralData(= new double[10]); ·       the Output Layer- The neural network outputs an array of doubles, wraped in a class based on the INeuralData interface. ·        The real power of a neural network comes from its pattern recognition capabilities. The neural network should be able to produce the desired output even if the input has been slightly distorted. ·       Hidden Layers– optional. between the input and output layers. very much a “black box”. If the structure of the hidden layer is too simple it may not learn the problem. If the structure is too complex, it will learn the problem but will be very slow to train and execute. Some neural networks have no hidden layers. The input layer may be directly connected to the output layer. Further, some neural networks have only a single layer. A single layer neural network has the single layer self-connected. ·       connections, called synapses, contain individual weight matrixes. These values are changed as the neural network learns. Constructing a Neural Network ·       the XOR operator is a frequent “first example” -the “Hello World” application for neural networks. ·       The XOR Operator- only returns true when both inputs differ. 0 XOR 0 = 0 1 XOR 0 = 1 0 XOR 1 = 1 1 XOR 1 = 0 ·       Structuring a Neural Network for XOR  - two inputs to the XOR operator and one output. ·       input: 0.0,0.0 1.0,0.0 0.0,1.0 1.0,1.0 ·       Expected output: 0.0 1.0 1.0 0.0 ·       A Perceptron - a simple feedforward neural network to learn the XOR operator. ·       Because the XOR operator has two inputs and one output, the neural network will follow suit. Additionally, the neural network will have a single hidden layer, with two neurons to help process the data. The choice for 2 neurons in the hidden layer is arbitrary, and often comes down to trial and error. ·       Neuron Diagram for the XOR Network ·       ·       The Encog workbench displays neural networks on a layer-by-layer basis. ·       Encog Layer Diagram for the XOR Network:   ·       Create a BasicNetwork - Three layers are added to this network. the FinalizeStructure method must be called to inform the network that no more layers are to be added. The call to Reset randomizes the weights in the connections between these layers. var network = new BasicNetwork(); network.AddLayer(new BasicLayer(2)); network.AddLayer(new BasicLayer(2)); network.AddLayer(new BasicLayer(1)); network.Structure.FinalizeStructure(); network.Reset(); ·       Neural networks frequently start with a random weight matrix. This provides a starting point for the training methods. These random values will be tested and refined into an acceptable solution. However, sometimes the initial random values are too far off. Sometimes it may be necessary to reset the weights again, if training is ineffective. These weights make up the long-term memory of the neural network. Additionally, some layers have threshold values that also contribute to the long-term memory of the neural network. Some neural networks also contain context layers, which give the neural network a short-term memory as well. The neural network learns by modifying these weight and threshold values. ·       Now that the neural network has been created, it must be trained. Training a Neural Network ·       construct a INeuralDataSet object - contains the input array and the expected output array (of corresponding range). Even though there is only one output value, we must still use a two-dimensional array to represent the output. public static double[][] XOR_INPUT ={ new double[2] { 0.0, 0.0 }, new double[2] { 1.0, 0.0 }, new double[2] { 0.0, 1.0 }, new double[2] { 1.0, 1.0 } };   public static double[][] XOR_IDEAL = { new double[1] { 0.0 }, new double[1] { 1.0 }, new double[1] { 1.0 }, new double[1] { 0.0 } };   INeuralDataSet trainingSet = new BasicNeuralDataSet(XOR_INPUT, XOR_IDEAL); ·       Training is the process where the neural network's weights are adjusted to better produce the expected output. Training will continue for many iterations, until the error rate of the network is below an acceptable level. Encog supports many different types of training. Resilient Propagation (RPROP) - general-purpose training algorithm. All training classes implement the ITrain interface. The RPROP algorithm is implemented by the ResilientPropagation class. Training the neural network involves calling the Iteration method on the ITrain class until the error is below a specific value. The code loops through as many iterations, or epochs, as it takes to get the error rate for the neural network to be below 1%. Once the neural network has been trained, it is ready for use. ITrain train = new ResilientPropagation(network, trainingSet);   for (int epoch=0; epoch < 10000; epoch++) { train.Iteration(); Debug.Print("Epoch #" + epoch + " Error:" + train.Error); if (train.Error > 0.01) break; } Executing a Neural Network ·       Call the Compute method on the BasicNetwork class. Console.WriteLine("Neural Network Results:"); foreach (INeuralDataPair pair in trainingSet) { INeuralData output = network.Compute(pair.Input); Console.WriteLine(pair.Input[0] + "," + pair.Input[1] + ", actual=" + output[0] + ",ideal=" + pair.Ideal[0]); } ·       The Compute method accepts an INeuralData class and also returns a INeuralData object. Neural Network Results: 0.0,0.0, actual=0.002782538818034049,ideal=0.0 1.0,0.0, actual=0.9903741937121177,ideal=1.0 0.0,1.0, actual=0.9836807956566187,ideal=1.0 1.0,1.0, actual=0.0011646072586172778,ideal=0.0 ·       the network has not been trained to give the exact results. This is normal. Because the network was trained to 1% error, each of the results will also be within generally 1% of the expected value.

    Read the article

  • Operator of the week - Assert

    - by Fabiano Amorim
    Well my friends, I was wondering how to help you in a practical way to understand execution plans. So I think I'll talk about the Showplan Operators. Showplan Operators are used by the Query Optimizer (QO) to build the query plan in order to perform a specified operation. A query plan will consist of many physical operators. The Query Optimizer uses a simple language that represents each physical operation by an operator, and each operator is represented in the graphical execution plan by an icon. I'll try to talk about one operator every week, but so as to avoid having to continue to write about these operators for years, I'll mention only of those that are more common: The first being the Assert. The Assert is used to verify a certain condition, it validates a Constraint on every row to ensure that the condition was met. If, for example, our DDL includes a check constraint which specifies only two valid values for a column, the Assert will, for every row, validate the value passed to the column to ensure that input is consistent with the check constraint. Assert  and Check Constraints: Let's see where the SQL Server uses that information in practice. Take the following T-SQL: IF OBJECT_ID('Tab1') IS NOT NULL   DROP TABLE Tab1 GO CREATE TABLE Tab1(ID Integer, Gender CHAR(1))  GO  ALTER TABLE TAB1 ADD CONSTRAINT ck_Gender_M_F CHECK(Gender IN('M','F'))  GO INSERT INTO Tab1(ID, Gender) VALUES(1,'X') GO To the command above the SQL Server has generated the following execution plan: As we can see, the execution plan uses the Assert operator to check that the inserted value doesn't violate the Check Constraint. In this specific case, the Assert applies the rule, 'if the value is different to "F" and different to "M" than return 0 otherwise returns NULL'. The Assert operator is programmed to show an error if the returned value is not NULL; in other words, the returned value is not a "M" or "F". Assert checking Foreign Keys Now let's take a look at an example where the Assert is used to validate a foreign key constraint. Suppose we have this  query: ALTER TABLE Tab1 ADD ID_Genders INT GO  IF OBJECT_ID('Tab2') IS NOT NULL   DROP TABLE Tab2 GO CREATE TABLE Tab2(ID Integer PRIMARY KEY, Gender CHAR(1))  GO  INSERT INTO Tab2(ID, Gender) VALUES(1, 'F') INSERT INTO Tab2(ID, Gender) VALUES(2, 'M') INSERT INTO Tab2(ID, Gender) VALUES(3, 'N') GO  ALTER TABLE Tab1 ADD CONSTRAINT fk_Tab2 FOREIGN KEY (ID_Genders) REFERENCES Tab2(ID) GO  INSERT INTO Tab1(ID, ID_Genders, Gender) VALUES(1, 4, 'X') Let's look at the text execution plan to see what these Assert operators were doing. To see the text execution plan just execute SET SHOWPLAN_TEXT ON before run the insert command. |--Assert(WHERE:(CASE WHEN NOT [Pass1008] AND [Expr1007] IS NULL THEN (0) ELSE NULL END))      |--Nested Loops(Left Semi Join, PASSTHRU:([Tab1].[ID_Genders] IS NULL), OUTER REFERENCES:([Tab1].[ID_Genders]), DEFINE:([Expr1007] = [PROBE VALUE]))           |--Assert(WHERE:(CASE WHEN [Tab1].[Gender]<>'F' AND [Tab1].[Gender]<>'M' THEN (0) ELSE NULL END))           |    |--Clustered Index Insert(OBJECT:([Tab1].[PK]), SET:([Tab1].[ID] = RaiseIfNullInsert([@1]),[Tab1].[ID_Genders] = [@2],[Tab1].[Gender] = [Expr1003]), DEFINE:([Expr1003]=CONVERT_IMPLICIT(char(1),[@3],0)))           |--Clustered Index Seek(OBJECT:([Tab2].[PK]), SEEK:([Tab2].[ID]=[Tab1].[ID_Genders]) ORDERED FORWARD) Here we can see the Assert operator twice, first (looking down to up in the text plan and the right to left in the graphical plan) validating the Check Constraint. The same concept showed above is used, if the exit value is "0" than keep running the query, but if NULL is returned shows an exception. The second Assert is validating the result of the Tab1 and Tab2 join. It is interesting to see the "[Expr1007] IS NULL". To understand that you need to know what this Expr1007 is, look at the Probe Value (green text) in the text plan and you will see that it is the result of the join. If the value passed to the INSERT at the column ID_Gender exists in the table Tab2, then that probe will return the join value; otherwise it will return NULL. So the Assert is checking the value of the search at the Tab2; if the value that is passed to the INSERT is not found  then Assert will show one exception. If the value passed to the column ID_Genders is NULL than the SQL can't show a exception, in that case it returns "0" and keeps running the query. If you run the INSERT above, the SQL will show an exception because of the "X" value, but if you change the "X" to "F" and run again, it will show an exception because of the value "4". If you change the value "4" to NULL, 1, 2 or 3 the insert will be executed without any error. Assert checking a SubQuery: The Assert operator is also used to check one subquery. As we know, one scalar subquery can't validly return more than one value: Sometimes, however, a  mistake happens, and a subquery attempts to return more than one value . Here the Assert comes into play by validating the condition that a scalar subquery returns just one value. Take the following query: INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT ID_TipoSexo FROM Tab1), 'F')    INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT ID_TipoSexo FROM Tab1), 'F')    |--Assert(WHERE:(CASE WHEN NOT [Pass1016] AND [Expr1015] IS NULL THEN (0) ELSE NULL END))        |--Nested Loops(Left Semi Join, PASSTHRU:([tempdb].[dbo].[Tab1].[ID_TipoSexo] IS NULL), OUTER REFERENCES:([tempdb].[dbo].[Tab1].[ID_TipoSexo]), DEFINE:([Expr1015] = [PROBE VALUE]))              |--Assert(WHERE:([Expr1017]))             |    |--Compute Scalar(DEFINE:([Expr1017]=CASE WHEN [tempdb].[dbo].[Tab1].[Sexo]<>'F' AND [tempdb].[dbo].[Tab1].[Sexo]<>'M' THEN (0) ELSE NULL END))              |         |--Clustered Index Insert(OBJECT:([tempdb].[dbo].[Tab1].[PK__Tab1__3214EC277097A3C8]), SET:([tempdb].[dbo].[Tab1].[ID_TipoSexo] = [Expr1008],[tempdb].[dbo].[Tab1].[Sexo] = [Expr1009],[tempdb].[dbo].[Tab1].[ID] = [Expr1003]))              |              |--Top(TOP EXPRESSION:((1)))              |                   |--Compute Scalar(DEFINE:([Expr1008]=[Expr1014], [Expr1009]='F'))              |                        |--Nested Loops(Left Outer Join)              |                             |--Compute Scalar(DEFINE:([Expr1003]=getidentity((1856985942),(2),NULL)))              |                             |    |--Constant Scan              |                             |--Assert(WHERE:(CASE WHEN [Expr1013]>(1) THEN (0) ELSE NULL END))              |                                  |--Stream Aggregate(DEFINE:([Expr1013]=Count(*), [Expr1014]=ANY([tempdb].[dbo].[Tab1].[ID_TipoSexo])))             |                                       |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[Tab1].[PK__Tab1__3214EC277097A3C8]))              |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[Tab2].[PK__Tab2__3214EC27755C58E5]), SEEK:([tempdb].[dbo].[Tab2].[ID]=[tempdb].[dbo].[Tab1].[ID_TipoSexo]) ORDERED FORWARD)  You can see from this text showplan that SQL Server as generated a Stream Aggregate to count how many rows the SubQuery will return, This value is then passed to the Assert which then does its job by checking its validity. Is very interesting to see that  the Query Optimizer is smart enough be able to avoid using assert operators when they are not necessary. For instance: INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT ID_TipoSexo FROM Tab1 WHERE ID = 1), 'F') INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT TOP 1 ID_TipoSexo FROM Tab1), 'F')  For both these INSERTs, the Query Optimiser is smart enough to know that only one row will ever be returned, so there is no need to use the Assert. Well, that's all folks, I see you next week with more "Operators". Cheers, Fabiano

    Read the article

  • Windows Azure Evolution - Web Sites (aka Antares) Part 1

    - by Shaun
    This is the 3rd post of my Windows Azure Evolution series, focus on the new features and enhancement which was alone with the Windows Azure Platform Upgrade June 2012, announced at the MEET Windows Azure event on 7th June. In the first post I introduced the new preview developer portal and how to works for the existing features such as cloud services, storages and SQL databases. In the second one I talked about the Windows Azure .NET SDK 1.7 on the latest Visual Studio 2012 RC on Windows 8. From this one I will begin to introduce some new features. Now let’s have a look on the first one of them, Windows Azure Web Sites.   Overview Windows Azure Web Sites (WAWS), as known as Antares, was a new feature still in preview stage in this upgrade. It allows people to quickly and easily deploy websites to a highly scalable cloud environment, uses the languages and open source apps of the choice then deploy such as FTP, Git and TFS. It also can be integrated with Windows Azure services like SQL Database, Caching, CDN and Storage easily. After read its introduction we may have a question: since we can deploy a website from both cloud service web role and web site, what’s the different between them? So, let’s have a quick compare.   CLOUD SERVICE WEB SITE OS Windows Server Windows Server Virtualization Windows Azure Virtual Machine Windows Azure Virtual Machine Host IIS IIS Platform ASP.NET WebForm, ASP.NET MVC, WCF ASP.NET WebForm, ASP.NET MVC, PHP Language C#, VB.NET C#, VB.NET, PHP Database SQL Database SQL Database, MySQL Architecture Multi layered, background worker, message queuing, etc.. Simple website with backend database. VS Project Windows Azure Cloud Service ASP.NET Web Form, ASP.NET MVC, etc.. Out-of-box Gallery (none) Drupal, DotNetNuke, WordPress, etc.. Deployment Package upload, Visual Studio publish FTP, Git, TFS, WebMatrix Compute Mode Dedicate VM Shared Across VMs, Dedicate VM Scale Scale up, scale out Scale up, scale out As you can see, there are many difference between the cloud service and web site, but the main point is that, the cloud service focus on those complex architecture web application. For example, if you want to build a website with frontend layer, middle business layer and data access layer, with some background worker process connected through the message queue, then you should better use cloud service, since it provides full control of your code and application. But if you just want to build a personal blog or a  business portal, then you can use the web site. Since the web site have many galleries, you can create them even without any coding and configuration. David Pallmann have an awesome figure explains the benefits between the could service, web site and virtual machine.   Create a Personal Blog in Web Site from Gallery As I mentioned above, one of the big feature in WAWS is to build a website from an existing gallery, which means we don’t need to coding and configure. What we need to do is open the windows azure developer portal and click the NEW button, select WEB SITE and FROM GALLERY. In the popping up windows there are many websites we can choose to use. For example, for personal blog there are Orchard CMS, WordPress; for CMS there are DotNetNuke, Drupal 7, mojoPortal. Let’s select WordPress and click the next button. The next step is to configure the web site. We will need to specify the DNS name and select the subscription and region. Since the WordPress uses MySQL as its backend database, we also need to create a MySQL database as well. Windows Azure Web Sites utilize ClearDB to host the MySQL databases. You cannot create a MySQL database directly from SQL Databases section. Finally, since we selected to create a new MySQL database we need to specify the database name and region in the last step. Also we need to accept the ClearDB’s terms as well. Then windows azure platform will download the WordPress codes and deploy the MySQL database and website. Then it will be ready to use. Select the website and click the BROWSE button, the WordPress administration page will be shown. After configured the WordPress here is my personal web blog on the cloud. It took me no more than 10 minutes to establish without any coding.   Monitor, Configure, Scale and Linked Resources Let’s click into the website I had just created in the portal and have a look on what we can do. In the website details page where are five sections. - Dashboard The overall information about this website, such as the basic usage status, public URL, compute mode, FTP address, subscription and links that we can specify the deployment credentials, TFS and Git publish setting, etc.. - Monitor Some status information such as the CPU usage, memory usage etc., errors, etc.. We can add more metrics by clicking the ADD METRICS button and the bottom as well. - Configure Here we can set the configurations of our website such as the .NET and PHP runtime version, diagnostics settings, application settings and the IIS default documents. - Scale This is something interesting. In WAWS there are two compute mode or called web site mode. One is “shared”, which means our website will be shared with other web sites in a group of windows azure virtual machines. Each web site have its own process (w3wp.exe) with some sandbox technology to isolate from others. When we need to scaling-out our web site in shared mode, we actually increased the working process count. Hence in shared mode we cannot specify the virtual machine size since they are shared across all web sites. This is a little bit different than the scaling mode of the cloud service (hosted service web role and worker role). The other mode called “dedicate”, which means our web site will use the whole windows azure virtual machine. This is the same hosting behavior as cloud service web role. In web role it will be deployed on the virtual machines we specified and all of them are only used by us. In web sites dedicate mode, it’s the same. In this mode when we scaling-out our web site we will use more virtual machines, and each of them will only host our own website. And we can specify the virtual machine size in this mode. In the developer portal we can select which mode we are using from the scale section. In shared mode we can only specify the instance count, but in dedicate mode we can specify the instance size as well as the instance count. - Linked Resource The MySQL database created alone with the creation of our WordPress web site is a linked resource. We can add more linked resources in this section.   Pricing For the web site itself, since this feature is in preview period if you are using shared mode, then you will get free up to 10 web sites. But if you are using dedicate mode, the price would be the virtual machines you are using. For example, if you are using dedicate and configured two middle size virtual machines then you will pay $230.40 per month. If there is SQL Database linked to your web site then they will be charged separately based on the Pay-As-You-Go price. For example a 1GB web edition database costs $9.99 per month. And the bandwidth will be charged as well. For example 10GB outbound data transfer costs $1.20 per month. For more information about the pricing please have a look at the windows azure pricing page.   Summary Windows Azure Web Sites gives us easier and quicker way to create, develop and deploy website to window azure platform. Comparing with the cloud service web role, the WAWS have many out-of-box gallery we can use directly. So if you just want to build a blog, CMS or business portal you don’t need to learn ASP.NET, you don’t need to learn how to configure DotNetNuke, you don’t need to learn how to prepare PHP and MySQL. By using WAWS gallery you can establish a website within 10 minutes without any lines of code. But in some cases we do need to code by ourselves. We may need to tweak the layout of our pages, or we may have a traditional ASP.NET or PHP web application which needed to migrated to the cloud. Besides the gallery WAWS also provides many features to download, upload code. It also provides the feature to integrate with some version control services such as TFS and Git. And it also provides the deploy approaches through FTP and Web Deploy. In the next post I will demonstrate how to use WebMatrix to download and modify the website, and how to use TFS and Git to deploy automatically one our code changes committed.   Hope this helps, Shaun All documents and related graphics, codes are provided "AS IS" without warranty of any kind. Copyright © Shaun Ziyan Xu. This work is licensed under the Creative Commons License.

    Read the article

  • Why doesn't my implementation of ElGamal work for long text strings?

    - by angstrom91
    I'm playing with the El Gamal cryptosystem, and my goal is to be able to encipher and decipher long sequences of text. I have come up with a method that works for short sequences, but does not work for long sequences, and I cannot figure out why. El Gamal requires the plaintext to be an integer. I have turned my string into a byte[] using the .getBytes() method for Strings, and then created a BigInteger out of the byte[]. After encryption/decryption, I turn the BigInteger into a byte[] using the .toByteArray() method for BigIntegers, and then create a new String object from the byte[]. This works perfectly when i call ElGamalEncipher with strings up to 129 characters. With 130 or more characters, the output produced from ElGamalDecipher is garbled. Can someone suggest how to solve this issue? Is this an issue with my method of turning the string into a BigInteger? If so, is there a better way to turn my string of text into a BigInteger and back? Below is my encipher/decipher code with a program to demonstrate the problem. import java.math.BigInteger; public class Main { static BigInteger P = new BigInteger("15893293927989454301918026303382412" + "2586402937727056707057089173871237566896685250125642378268385842" + "6917261652781627945428519810052550093673226849059197769795219973" + "9423619267147615314847625134014485225178547696778149706043781174" + "2873134844164791938367765407368476144402513720666965545242487520" + "288928241768306844169"); static BigInteger G = new BigInteger("33234037774370419907086775226926852" + "1714093595439329931523707339920987838600777935381196897157489391" + "8360683761941170467795379762509619438720072694104701372808513985" + "2267495266642743136795903226571831274837537691982486936010899433" + "1742996138863988537349011363534657200181054004755211807985189183" + "22832092343085067869"); static BigInteger R = new BigInteger("72294619754760174015019300613282868" + "7219874058383991405961870844510501809885568825032608592198728334" + "7842806755320938980653857292210955880919036195738252708294945320" + "3969657021169134916999794791553544054426668823852291733234236693" + "4178738081619274342922698767296233937873073756955509269717272907" + "8566607940937442517"); static BigInteger A = new BigInteger("32189274574111378750865973746687106" + "3695160924347574569923113893643975328118502246784387874381928804" + "6865920942258286938666201264395694101012858796521485171319748255" + "4630425677084511454641229993833255506759834486100188932905136959" + "7287419551379203001848457730376230681693887924162381650252270090" + "28296990388507680954"); public static void main(String[] args) { FewChars(); System.out.println(); ManyChars(); } public static void FewChars() { //ElGamalEncipher(String plaintext, BigInteger p, BigInteger g, BigInteger r) BigInteger[] cipherText = ElGamal.ElGamalEncipher("This is a string " + "of 129 characters which works just fine . This is a string " + "of 129 characters which works just fine . This is a s", P, G, R); System.out.println("This is a string of 129 characters which works " + "just fine . This is a string of 129 characters which works " + "just fine . This is a s"); //ElGamalDecipher(BigInteger c, BigInteger d, BigInteger a, BigInteger p) System.out.println("The decrypted text is: " + ElGamal.ElGamalDecipher(cipherText[0], cipherText[1], A, P)); } public static void ManyChars() { //ElGamalEncipher(String plaintext, BigInteger p, BigInteger g, BigInteger r) BigInteger[] cipherText = ElGamal.ElGamalEncipher("This is a string " + "of 130 characters which doesn’t work! This is a string of " + "130 characters which doesn’t work! This is a string of ", P, G, R); System.out.println("This is a string of 130 characters which doesn’t " + "work! This is a string of 130 characters which doesn’t work!" + " This is a string of "); //ElGamalDecipher(BigInteger c, BigInteger d, BigInteger a, BigInteger p) System.out.println("The decrypted text is: " + ElGamal.ElGamalDecipher(cipherText[0], cipherText[1], A, P)); } } import java.math.BigInteger; import java.security.SecureRandom; public class ElGamal { public static BigInteger[] ElGamalEncipher(String plaintext, BigInteger p, BigInteger g, BigInteger r) { // returns a BigInteger[] cipherText // cipherText[0] is c // cipherText[1] is d SecureRandom sr = new SecureRandom(); BigInteger[] cipherText = new BigInteger[2]; BigInteger pText = new BigInteger(plaintext.getBytes()); // 1: select a random integer k such that 1 <= k <= p-2 BigInteger k = new BigInteger(p.bitLength() - 2, sr); // 2: Compute c = g^k(mod p) BigInteger c = g.modPow(k, p); // 3: Compute d= P*r^k = P(g^a)^k(mod p) BigInteger d = pText.multiply(r.modPow(k, p)).mod(p); // C =(c,d) is the ciphertext cipherText[0] = c; cipherText[1] = d; return cipherText; } public static String ElGamalDecipher(BigInteger c, BigInteger d, BigInteger a, BigInteger p) { //returns the plaintext enciphered as (c,d) // 1: use the private key a to compute the least non-negative residue // of an inverse of (c^a)' (mod p) BigInteger z = c.modPow(a, p).modInverse(p); BigInteger P = z.multiply(d).mod(p); byte[] plainTextArray = P.toByteArray(); return new String(plainTextArray); } }

    Read the article

  • concurrency::accelerator_view

    - by Daniel Moth
    Overview We saw previously that accelerator represents a target for our C++ AMP computation or memory allocation and that there is a notion of a default accelerator. We ended that post by introducing how one can obtain accelerator_view objects from an accelerator object through the accelerator class's default_view property and the create_view method. The accelerator_view objects can be thought of as handles to an accelerator. You can also construct an accelerator_view given another accelerator_view (through the copy constructor or the assignment operator overload). Speaking of operator overloading, you can also compare (for equality and inequality) two accelerator_view objects between them to determine if they refer to the same underlying accelerator. We'll see later that when we use concurrency::array objects, the allocation of data takes place on an accelerator at array construction time, so there is a constructor overload that accepts an accelerator_view object. We'll also see later that a new concurrency::parallel_for_each function overload can take an accelerator_view object, so it knows on what target to execute the computation (represented by a lambda that the parallel_for_each also accepts). Beyond normal usage, accelerator_view is a quality of service concept that offers isolation to multiple "consumers" of an accelerator. If in your code you are accessing the accelerator from multiple threads (or, in general, from different parts of your app), then you'll want to create separate accelerator_view objects for each thread. flush, wait, and queuing_mode When you create an accelerator_view via the create_view method of the accelerator, you pass in an option of immediate or deferred, which are the two members of the queuing_mode enum. At any point you can access this value from the queuing_mode property of the accelerator_view. When the queuing_mode value is immediate (which is the default), any commands sent to the device such as kernel invocations and data transfers (e.g. parallel_for_each and copy, as we'll see in future posts), will get submitted as soon as the runtime sees fit (that is the definition of immediate). When the value of queuing_mode is deferred, the commands will be batched up. To send all buffered commands to the device for execution, there is a non-blocking flush method that you can call. If you wish to block until all the commands have been sent, there is a wait method you can call. Deferring is a more advanced scenario aimed at performance gains when you are submitting many device commands and you want to avoid the tiny overhead of flushing/submitting each command separately. Querying information Just like accelerator, accelerator_view exposes the is_debug and version properties. In fact, you can always access the accelerator object from the accelerator property on the accelerator_view class to access the accelerator interface we looked at previously. Interop with D3D (aka DX) In a later post I'll show an example of an app that uses C++ AMP to compute data that is used in pixel shaders. In those scenarios, you can benefit by integrating C++ AMP into your graphics pipeline and one of the building blocks for that is being able to use the same device context from both the compute kernel and the other shaders. You can do that by going from accelerator_view to device context (and vice versa), through part of our interop API in amp.h: *get_device, create_accelerator_view. More on those in a later post. Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Triangle Line-Segment Intersection - detecting near misses

    - by Will
    A ray is a very poor approximation of a player! I think approximating a player with a sphere traveling a straight line each game tick will solve my problems of the player intersecting edges of scenery because their line segment missed it yet their own model is not infinitely thin... I have a 3D triangle and a line segment. I have the normal triangle-line-segment intersection code which I admit I have only a woolly grasp of. To model movement and compute collisions of the player I have to determine if a line passes within sphere-radius of a triangle. But I can find no convenient line near-miss intersection code! Here's the classic triangle intersection ### commented ### code with my starting assumptions: function triangle_ray_intersection(a,b,c,ray_origin,ray_dir,ray_radius) { // http://softsurfer.com/Archive/algorithm_0105/algorithm_0105.htm#intersect_RayTriangle%28%29 // get triangle edge vectors and plane normal var u = vec3_sub(b,a); var v = vec3_sub(c,a); var n = vec3_cross(u,v); if(n[0]==0 && n[1]==0 && n[2]==0) return null; // triangle is degenerate var w0 = vec3_sub(ray_origin,a); var j = vec3_dot(n,ray_dir); if(Math.abs(j) < 0.00000001) { //### if parallel, might still pass within ray_radius of it return null; // parallel, disjoint or on plane } var i = -vec3_dot(n,w0); // get intersect point of ray with triangle plane var k = i / j; if(k < 0.0) return null; // ray goes away from triangle //### as its a line segment, k > 1+ray_radius means no intersect var hit = vec3_add(ray_origin,vec3_scale(ray_dir,k)); // intersect point of ray and plane // is I inside T? //### here I'm a bit lost; this is presumably computing barycentric coordinates? var uu = vec3_dot(u,u); var uv = vec3_dot(u,v); var vv = vec3_dot(v,v); var w = vec3_sub(hit,a); var wu = vec3_dot(w,u); var wv = vec3_dot(w,v); var D = uv * uv - uu * vv; var s = (uv * wv - vv * wu) / D; //### therefore, compute if its within ray_radius scaled to the 0..1 of barycentric coordinates? if(s<0.0 || s>1.0) return null; // I is outside T var t = (uv * wu - uu * wv) / D; if(t<0.0 || (s+t)>1.0) return null; // I is outside T //### finally, if it passses a barycentric test it might still be too far //### to a point; must check that its distance from a corner is within ray_radius too if more than one barycentric coord is >1 //### so we have rounded corners... return [hit,n]; // I is in T } Given the distance between the point of plane intersection and each corner, I ought to be able to determine distance at world scale of how far beyond the edge - beyond 1.0 in barycentric coordinates for each axis - that point is... At this point my head explodes! Is this the right track? What's the actual code? UPDATE: you can earn 100 pts on SO if you answer this question there...! How can you determine if a line segment passes within some distance of a triangle?

    Read the article

  • Why do my pyramids fade black and then back to colour again

    - by geminiCoder
    I have the following vertecies and norms GLfloat verts[36] = { -0.5, 0, 0.5, 0, 0, -0.5, 0.5, 0, 0.5, 0, 0, -0.5, 0.5, 0, 0.5, 0, 1, 0, -0.5, 0, 0.5, 0, 0, -0.5, 0, 1, 0, 0.5, 0, 0.5, -0.5, 0, 0.5, 0, 1, 0 }; GLfloat norms[36] = { 0, -1, 0, 0, -1, 0, 0, -1, 0, -1, 0.25, 0.5, -1, 0.25, 0.5, -1, 0.25, 0.5, 1, 0.25, -0.5, 1, 0.25, -0.5, 1, 0.25, -0.5, 0, -0.5, -1, 0, -0.5, -1, 0, -0.5, -1 }; I am writing my fists Open GL game, But I need to know for sure if my Normals are correct as the colours aren't rendering correctly. my Pyramids are coloured then fade to black every half rotation then back again. My app so far is based on the boiler plate code provided by apple. heres my modified setUp Method [EAGLContext setCurrentContext:self.context]; [self loadShaders]; self.effect = [[GLKBaseEffect alloc] init]; self.effect.light0.enabled = GL_TRUE; self.effect.light0.diffuseColor = GLKVector4Make(1.0f, 0.4f, 0.4f, 1.0f); glEnable(GL_DEPTH_TEST); glGenVertexArraysOES(1, &_vertexArray); //create vertex array glBindVertexArrayOES(_vertexArray); glGenBuffers(1, &_vertexBuffer); glBindBuffer(GL_ARRAY_BUFFER, _vertexBuffer); glBufferData(GL_ARRAY_BUFFER, sizeof(verts) + sizeof(norms), NULL, GL_STATIC_DRAW); //create vertex buffer big enough for both verts and norms and pass NULL as data.. uint8_t *ptr = (uint8_t *)glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES); //map buffer to pass data to it memcpy(ptr, verts, sizeof(verts)); //copy verts memcpy(ptr+sizeof(verts), norms, sizeof(norms)); //copy norms to position after verts glUnmapBufferOES(GL_ARRAY_BUFFER); glEnableVertexAttribArray(GLKVertexAttribPosition); glVertexAttribPointer(GLKVertexAttribPosition, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(0)); //tell GL where verts are in buffer glEnableVertexAttribArray(GLKVertexAttribNormal); glVertexAttribPointer(GLKVertexAttribNormal, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(sizeof(verts))); //tell GL where norms are in buffer glBindVertexArrayOES(0); And the update method. - (void)update { float aspect = fabsf(self.view.bounds.size.width / self.view.bounds.size.height); GLKMatrix4 projectionMatrix = GLKMatrix4MakePerspective(GLKMathDegreesToRadians(65.0f), aspect, 0.1f, 100.0f); self.effect.transform.projectionMatrix = projectionMatrix; GLKMatrix4 baseModelViewMatrix = GLKMatrix4MakeTranslation(0.0f, 0.0f, -4.0f); baseModelViewMatrix = GLKMatrix4Rotate(baseModelViewMatrix, _rotation, 0.0f, 1.0f, 0.0f); // Compute the model view matrix for the object rendered with GLKit GLKMatrix4 modelViewMatrix = GLKMatrix4MakeTranslation(0.0f, 0.0f, -1.5f); modelViewMatrix = GLKMatrix4Rotate(modelViewMatrix, _rotation, 1.0f, 1.0f, 1.0f); modelViewMatrix = GLKMatrix4Multiply(baseModelViewMatrix, modelViewMatrix); self.effect.transform.modelviewMatrix = modelViewMatrix; // Compute the model view matrix for the object rendered with ES2 modelViewMatrix = GLKMatrix4MakeTranslation(0.0f, 0.0f, 1.5f); modelViewMatrix = GLKMatrix4Rotate(modelViewMatrix, _rotation, 1.0f, 1.0f, 1.0f); modelViewMatrix = GLKMatrix4Multiply(baseModelViewMatrix, modelViewMatrix); _normalMatrix = GLKMatrix3InvertAndTranspose(GLKMatrix4GetMatrix3(modelViewMatrix), NULL); _modelViewProjectionMatrix = GLKMatrix4Multiply(projectionMatrix, modelViewMatrix); _rotation += self.timeSinceLastUpdate * 0.5f; } But providing I understand this correct one pyramid is using the GLKit base effect shaders and the other the shaders which are included in the project. So for both of them to have the same error, I thought it would be the Norms?

    Read the article

  • In Java Concurrency In Practice by Brian Goetz, why is the Memoizer class not annotated with @ThreadSafe?

    - by dig_dug
    Java Concurrency In Practice by Brian Goetz provides an example of a efficient scalable cache for concurrent use. The final version of the example showing the implementation for class Memoizer (pg 108) shows such a cache. I am wondering why the class is not annotated with @ThreadSafe? The client, class Factorizer, of the cache is properly annotated with @ThreadSafe. The appendix states that if a class is not annotated with either @ThreadSafe or @Immutable that it should be assumed that it isn't thread safe. Memoizer seems thread-safe though. Here is the code for Memoizer: public class Memoizer<A, V> implements Computable<A, V> { private final ConcurrentMap<A, Future<V>> cache = new ConcurrentHashMap<A, Future<V>>(); private final Computable<A, V> c; public Memoizer(Computable<A, V> c) { this.c = c; } public V compute(final A arg) throws InterruptedException { while (true) { Future<V> f = cache.get(arg); if (f == null) { Callable<V> eval = new Callable<V>() { public V call() throws InterruptedException { return c.compute(arg); } }; FutureTask<V> ft = new FutureTask<V>(eval); f = cache.putIfAbsent(arg, ft); if (f == null) { f = ft; ft.run(); } } try { return f.get(); } catch (CancellationException e) { cache.remove(arg, f); } catch (ExecutionException e) { throw launderThrowable(e.getCause()); } } } }

    Read the article

  • Hadoop safemode recovery - taking lot of time

    - by Algorist
    Hi, We are running our cluster on Amazon EC2. we are using cloudera scripts to setup hadoop. On the master node, we start below services. 609 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode' 610 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode' 611 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker' 612 613 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait' On the slave machine, we run the below services. 625 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode' 626 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker' The main problem we are facing is, hdfs safemode recovery is taking more than an hour and this is causing delays in our job completion. Below are the main log messages. 1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO ipc.Client: Retrying connect to server: ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already tried 21 time(s). 2. The reported blocks 283634 needs additional 322258 blocks to reach the threshold 0.9990 of total blocks 606499. Safe mode will be turned off automatically. The first message is thrown in task trackers log because, job tracker is not started. job tracker didn't start because of hdfs safemode recovery. The second message is thrown during the recovery process. Is there something I am doing wrong? How much time does normal hdfs safemode recovery takes? Will there be any speedup, by not starting task trackers till job tracker is started? Are there any known hadoop problems on amazon cluster? Thanks for your help. Regards Bala Mudiam

    Read the article

  • Amazon EC2 and jbossws

    - by avjaz
    Hi - I've deployed a webservice to a Jboss instance running on Amazon EC2. The webservice works fine locally, but when I deploy on EC2, and go to the /jbossws/services page the Endpoint Address for the webservice is the private DNS of the ec2 instance (domU-X-X-X-X etc...), not the public dns (which I would like it to be). I've tried loading the wsdl by changing the private hostname to the public IP; that works, but when I try to call any of the operations I get a HostNotFoundException, I'm guessing due to the fact that the generated wsdl has the stanza: <service name='XXXService'> <port binding='tns:XXXBinding' name='XXXPort'> <soap:address location='http://domU-XX-XX-XX-XX-XX-XX.compute-1.internal:8080/xx/xx/xx'/> </port> </service> where http://domU-XX-XX-XX-XX-XX-XX.compute-1.internal is the internal dns of the ec2 instance. The wsdl is auto generated - Is there a JAXB annotation I can use so that I can force the generated wsdl to use the public dns of the EC2 instance? Many thanks -

    Read the article

  • Recursively adding threads to a Java thread pool

    - by Leith
    I am working on a tutorial for my Java concurrency course. The objective is to use thread pools to compute prime numbers in parallel. The design is based on the Sieve of Eratosthenes. It has an array of n bools, where n is the largest integer you are checking, and each element in the array represents one integer. True is prime, false is non prime, and the array is initially all true. A thread pool is used with a fixed number of threads (we are supposed to experiment with the number of threads in the pool and observe the performance). A thread is given a integer multiple to process. The thread then finds the first true element in the array that is not a multiple of thread's integer. The thread then creates a new thread on the thread pool which is given the found number. After a new thread is formed, the existing thread then continues to set all multiples of it's integer in the array to false. The main program thread starts the first thread with the integer '2', and then waits for all spawned threads to finish. It then spits out the prime numbers and the time taken to compute. The issue I have is that the more threads there are in the thread pool, the slower it takes with 1 thread being the fastest. It should be getting faster not slower! All the stuff on the internet about Java thread pools create n worker threads the main thread then wait for all threads to finish. The method I use is recursive as a worker can spawn more worker threads. I would like to know what is going wrong, and if Java thread pools can be used recursively.

    Read the article

  • How should I go about implementing a points-to analysis in Maude?

    - by reprogrammer
    I'm going to implement a points-to analysis algorithm. I'd like to implement this analysis mainly based on the algorithm by Whaley and Lam. Whaley and Lam use a BDD based implementation of Datalog to represent and compute the points-to analysis relations. The following lists some of the relations that are used in a typical points-to analysis. Note that D(w, z) :- A(w, x),B(x, y), C(y, z) means D(w, z) is true if A(w, x), B(x, y), and C(y, z) are all true. BDD is the data structure used to represent these relations. Relations input vP0 (variable : V, heap : H) input store (base : V, field : F, source : V) input load (base : V, field : F, dest : V) input assign (dest : V, source : V) output vP (variable : V, heap : H) output hP (base : H, field : F, target : H) Rules vP(v, h) :- vP0(v, h) vP(v1, h) :- assign(v1, v2), vP(v2, h) hP(h1, f,h2) :- store(v1, f, v2), vP(v1, h1), vP(v2, h2) vP(v2, h2) :- load(v1, f, v2), vP(v1, h1), hP(h1, f, h2) I need to understand if Maude is a good environment for implementing points-to analysis. I noticed that Maude uses a BDD library called BuDDy. But, it looks like that Maude uses BDDs for a different purpose, i.e. unification. So, I thought I might be able to use Maude instead of a Datalog engine to compute the relations of my points-to analysis. I assume Maude propagates independent information concurrently. And this concurrency could potentially make my points-to analysis faster than sequential processing of rules. But, I don't know the best way to represent my relations in Maude. Should I implement BDD in Maude myself, or Maude's internal unification based on BDD has the same effect?

    Read the article

  • Issue with getting 2 chars from string using indexer

    - by Learner
    I am facing an issue in reading char values. See my program below. I want to evaluate an infix expression. As you can see I want to read '10' , '*', '20' and then use them...but if I use string indexer s[0] will be '1' and not '10' and hence I am not able to get the expected result. Can you guys suggest me something? Code is in c# class Program { static void Main(string[] args) { string infix = "10*2+20-20+3"; float result = EvaluateInfix(infix); Console.WriteLine(result); Console.ReadKey(); } public static float EvaluateInfix(string s) { Stack<float> operand = new Stack<float>(); Stack<char> operator1 = new Stack<char>(); int len = s.Length; for (int i = 0; i < len; i++) { if (isOperator(s[i])) // I am having an issue here as s[i] gives each character and I want the number 10 operator1.Push(s[i]); else { operand.Push(s[i]); if (operand.Count == 2) Compute(operand, operator1); } } return operand.Pop(); } public static void Compute(Stack<float> operand, Stack<char> operator1) { float operand1 = operand.Pop(); float operand2 = operand.Pop(); char op = operator1.Pop(); if (op == '+') operand.Push(operand1 + operand2); else if(op=='-') operand.Push(operand1 - operand2); else if(op=='*') operand.Push(operand1 * operand2); else if(op=='/') operand.Push(operand1 / operand2); } public static bool isOperator(char c) { bool result = false; if (c == '+' || c == '-' || c == '*' || c == '/') result = true; return result; } } }

    Read the article

  • malloc works, cudaHostAlloc segfaults?

    - by Mikhail
    I am new to CUDA and I want to use cudaHostAlloc. I was able to isolate my problem to this following code. Using malloc for host allocation works, using cudaHostAlloc results in a segfault, possibly because the area allocated is invalid? When I dump the pointer in both cases it is not null, so cudaHostAlloc returns something... works in_h = (int*) malloc(length*sizeof(int)); //works for (int i = 0;i<length;i++) in_h[i]=2; doesn't work cudaHostAlloc((void**)&in_h,length*sizeof(int),cudaHostAllocDefault); for (int i = 0;i<length;i++) in_h[i]=2; //segfaults Standalone Code #include <stdio.h> void checkDevice() { cudaDeviceProp info; int deviceName; cudaGetDevice(&deviceName); cudaGetDeviceProperties(&info,deviceName); if (!info.deviceOverlap) { printf("Compute device can't use streams and should be discared."); exit(EXIT_FAILURE); } } int main() { checkDevice(); int *in_h; const int length = 10000; cudaHostAlloc((void**)&in_h,length*sizeof(int),cudaHostAllocDefault); printf("segfault comming %d\n",in_h); for (int i = 0;i<length;i++) { in_h[i]=2; } free(in_h); return EXIT_SUCCESS; } ~ Invocation [id129]$ nvcc fun.cu [id129]$ ./a.out segfault comming 327641824 Segmentation fault (core dumped) Details Program is run in interactive mode on a cluster. I was told that an invocation of the program from the compute node pushes it to the cluster. Have not had any trouble with other home made toy cuda codes.

    Read the article

  • OpenGL particle system

    - by allan
    I'm really new with OpenGL, so bear with me. I'm trying to simulate a particle system using OpenGl but I can't get it to work, this is what I have so far: #include <GL/glut.h> int main (int argc, char **argv){ // data allocation, various non opengl stuff ............ glutInit(&argc, argv); glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE ); glutInitWindowPosition(100,100); glutInitWindowSize(size, size); glPointSize (4); glutCreateWindow("test gl"); ............ // initial state, not opengl ............ glViewport(0,0,size,size); glutDisplayFunc(display); glutIdleFunc(compute); glutMainLoop(); } void compute (void) { // change state not opengl glutPostRedisplay(); } void display (void) { glClear(GL_COLOR_BUFFER_BIT); glBegin(GL_POINTS); for(i = 0; i<nparticles; i++) { // two types of particles if (TYPE(particle[i]) == 1) glColor3f(1,0,0); else glColor3f(0,0,1); glVertex2f(X(particle[i]),Y(particle[i])); } glEnd(); glFlush(); glutSwapBuffers(); } I get a black window after a couple of seconds (the window has just the title bar before that). Where do I go wrong? Any help would be very much appreciated. Thanks. LE: the x and y coordinates of each particle are within the interval (0,size)

    Read the article

  • CMYK + CMYK = ? CMYK / 2 = ?

    - by Pete
    Suppose there are two colors defined in CMYK: color1 = 30, 40, 50, 60 color2 = 50, 60, 70, 80 If they were to be printed what values would the resulting color have? color_new = min(cyan1 + cyan2, 100), min(magenta1 + magenta2, 100), min(yellow1 + yellow2, 100), min(black1 + black2, 100)? Suppose there is a color defined in CMYK: color = 40, 30, 30, 100 It is possible to print a color at partial intensity, i.e. as a tint. What values would have a 50% tint of that color? color_new = cyan / 2, magenta / 2, yellow / 2, black / 2? I'm asking this to better understand the "tintTransform" function in PDF Reference 1.7, 4.5.5 Special Color Spaces, DeviceN Color Spaces Update: To better clarify: I'm not entirely concerned with human perception or how the CMYK dyies react to the paper. If someone specifies 90% tint which, when printed, looks like full intensity colorant, that's ok. In other words, if I asking how to compute 50% of cmyk(40, 30, 30, 100) I'm asking how to compute the new values, regardless of whether the result looks half-dark or not. Update 2: I'm confused now. I checked this in InDesign and Acrobat. For example Pantone 3005 has CMYK 100, 34, 0, 2, and its 25% tint has CMYK 25, 8.5, 0, 0.5. Does it mean I can "monkey around in a linear way"?

    Read the article

  • mvn deploy to AWS (ssh via distributionManagement)

    - by Dexter
    I am working on deploying the WAR file to AWS using Maven. I am planning to use 'mvn deploy' for the same which would ssh the war file to AWS. I am following http://maven.apache.org/plugins/maven-deploy-plugin/examples/deploy-ssh-external.html. This is my POM file <project> ... <distributionManagement> <repository> <id>ssh-aws</id> <url>scpexe://<ec2 instance>.compute-1.amazonaws.com</url> </repository> </distributionManagement> <build> <extensions> <!-- Enabling the use of FTP --> <extension> <groupId>org.apache.maven.wagon</groupId> <artifactId>wagon-ssh-external</artifactId> <version>1.0-beta-6</version> </extension> </extensions> </build> .. </project> This is my settings.xml <server> <id>ssh-aws</id> <username>aws-user</username> </server> The only issue is that I am unable to figure out the url in distributionManagement node of pom.xml. I am able to ssh in the AWS server by the following. ssh -i ~/pemfile/pemfile-key.pem aws-user@<ec2 instance>.compute-1.amazonaws.com But when I run mvn clean deploy, I receive this.. Exit code: 1 - Permission denied (publickey). -> [Help 1] Thanks in advance.

    Read the article

  • Running an RMI registry on one computer and connecting to it from another

    - by Hippopotamus
    Hello. I am doing a student assignment, using java RMI. I've programmed a simple RMI-server application that provides method which return some strings to the client. When I start the server on localhost and connect to it by client on the same computer, everything goes well. However, I am not able to do this between two computers on home network. The computers both have no trouble of connecting by a simple C program with similar functionality, so I guess the problem is with JVM here. I am binding the class to rmiregistry with try{ ComputeImpl R = new ComputeImpl(); Naming.rebind("rmi://localhost/ComputeService",R); } catch(Exception e){ System.out.println("Trouble: " +e); } And I'm doing the lookup for RMI registry in the client application with providing the argument while launching the application: StringBuffer rmi_address = new StringBuffer(); rmi_address.append("rmi://").append(args[1]).append("/ComputeService"); Compute R = (Compute) Naming.lookup(rmi_address.toString()); Is the problem with my code or with JVM? Thanks in advance.

    Read the article

  • 2D platformer gravity physics with slow-motion

    - by DD
    Hi all, I fine tuned my 2d platformer physics and when I added slow-motion I realized that it is messed up. The problem I have is that for some reason the physics still depends on framerate. So when I scale down time elapsed, every force is scaled down as well. So the jump force is scaled down, meaning in slow-motion, character jumps vertically smaller height and gravity force is scaled down as well so the character goes further in the air without falling. I'm sending update function in hopes that someone can help me out here (I separated vertical (jump, gravity) and walking (arbitrary walking direction on a platform - platforms can be of any angle) vectors): characterUpdate:(float)dt { //Compute walking velocity walkingAcceleration = direction of platform * walking acceleration constant * dt; initialWalkingVelocity = walkingVelocity; if( isWalking ) { if( !isJumping ) walkingVelocity = walkingVelocity + walkingAcceleration; else walkingVelocity = walkingVelocity + Vector( walking acceleration constant * dt, 0 ); } // Compute jump/fall velocity if( !isOnPlatform ) { initialVerticalVelocity = verticalVelocity; verticalVelocity = verticalVelocity + verticalAcceleration * dt; } // Add walking velocity position = position + ( walkingVelocity + initialWalkingVelocity ) * 0.5 * dt; //Add jump/fall velocity if not on a platform if( !isOnPlatform ) position = position + ( verticalVelocity + initialVerticalVelocity ) * 0.5 * dt; verticalAcceleration.y = Gravity * dt; }

    Read the article

  • Instrumenting a string

    - by George Polevoy
    Somewhere in C++ era i have crafted a library, which enabled string representation of the computation history. Having a math expression like: TScalar Compute(TScalar a, TScalar b, TScalar c) { return ( a + b ) * c; } I could render it's string representation: r = Compute(VerbalScalar("a", 1), VerbalScalar("b", 2), VerbalScalar("c", 3)); Assert.AreEqual(9, r.Value); Assert.AreEqual("(a+b)*c==(1+2)*3", r.History ); C++ operator overloading allowed for substitution of a simple type with a complex self-tracking entity with an internal tree representation of everything happening with the objects. Now i would like to have the same possibility for NET strings, only instead of variable names i would like to see a stack traces of all the places in code which affected a string. And i want it to work with existing code, and existing compiled assemblies. Also i want all this to hook into visual studio debugger, so i could set a breakpoint, and see everything that happened with a string. Which technology would allow this kind of things? I know it sound like an utopia, but I think visual studio code coverage tools actually do the same kind of job while instrumenting the assemblies.

    Read the article

  • C++ variable to const expression

    - by user1344784
    template <typename Real> class A{ }; template <typename Real> class B{ }; //... a few dozen more similar template classes class Computer{ public slots: void setFrom(int from){ from_ = from; } void setTo(int to){ to_ = to; } private: template <int F, int T> void compute(){ using boost::fusion::vector; using boost::fusion::at_c; vector<A<float>, B<float>, ...> v; at_c<from_>(v).operator()(at_c<to_>(v)); //error; needs to be const-expression. }; This question isn't about Qt, but there is a line of Qt code in my example. The setFrom() and setTo() are functions that are called based on user selection via the GUI widget. The root of my problem is that 'from' and 'to' are variables. In my compute member function I need to pick a type (A, B, etc.) based on the values of 'from' and 'to'. The only way I know how to do what I need to do is to use switch statements, but that's extremely tedious in my case and I would like to avoid. Is there anyway to convert the error line to a constant-expression?

    Read the article

  • Computing, storing, and retrieving values to and from an N-Dimensional matrix

    - by Adam S
    This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge. Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints. Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long. In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table. The questions now, are: What format(s) might be acceptable for storing the data? What programs (MATLAB, C#, etc) might be best suited to compute the data? C# must be used to import the data on the device - is this possible given your answer to #1?

    Read the article

  • How to mock/stub a directory of files and their contents using RSpec?

    - by John Topley
    A while ago I asked "How to test obtaining a list of files within a directory using RSpec?" and although I got a couple of useful answers, I'm still stuck, hence a new question with some more detail about what I'm trying to do. I'm writing my first RubyGem. It has a module that contains a class method that returns an array containing a list of non-hidden files within a specified directory. Like this: files = Foo.bar :directory => './public' The array also contains an element that represents metadata about the files. This is actually a hash of hashes generated from the contents of the files, the idea being that changing even a single file changes the hash. I've written my pending RSpec examples, but I really have no idea how to implement them: it "should compute a hash of the files within the specified directory" it "shouldn't include hidden files or directories within the specified directory" it "should compute a different hash if the content of a file changes" I really don't want to have the tests dependent on real files acting as fixtures. How can I mock or stub the files and their contents? The gem implementation will use Find.find, but as one of the answers to my other question said, I don't need to test the library. I really have no idea how to write these specs, so any help much appreciated!

    Read the article

< Previous Page | 9 10 11 12 13 14 15 16 17 18 19 20  | Next Page >