Search Results

Search found 1725 results on 69 pages for 'compute shader'.

Page 31/69 | < Previous Page | 27 28 29 30 31 32 33 34 35 36 37 38  | Next Page >

  • Texture errors in CubeMap

    - by shade4159
    I am trying to apply this texture as a cubemap. This is my result: Clearly I am doing something with my texture coordinates, but I cannot for the life of me figure out what. I don't even see a pattern to the texture fragments. They just seem like a jumble of different faces. Can anyone shed some light on this? Vertex shader: #version 400 in vec4 vPosition; in vec3 inTexCoord; smooth out vec3 texCoord; uniform mat4 projMatrix; void main() { texCoord = inTexCoord; gl_Position = projMatrix * vPosition; } My fragment shader: #version 400 smooth in vec3 texCoord; out vec4 fColor; uniform samplerCube textures void main() { fColor = texture(textures,texCoord); } Vertices of cube: point4 worldVerts[8] = { vec4( 15, 15, 15, 1 ), vec4( -15, 15, 15, 1 ), vec4( -15, 15, -15, 1 ), vec4( 15, 15, -15, 1 ), vec4( -15, -15, 15, 1 ), vec4( 15, -15, 15, 1 ), vec4( 15, -15, -15, 1 ), vec4( -15, -15, -15, 1 ) }; Cube rendering: void worldCube(point4* verts, int& Index, point4* points, vec3* texVerts) { quadInv( verts[0], verts[1], verts[2], verts[3], 1, Index, points, texVerts); quadInv( verts[6], verts[3], verts[2], verts[7], 2, Index, points, texVerts); quadInv( verts[4], verts[5], verts[6], verts[7], 3, Index, points, texVerts); quadInv( verts[4], verts[1], verts[0], verts[5], 4, Index, points, texVerts); quadInv( verts[5], verts[0], verts[3], verts[6], 5, Index, points, texVerts); quadInv( verts[4], verts[7], verts[2], verts[1], 6, Index, points, texVerts); } Backface function (since this is the inside of the cube): void quadInv( const point4& a, const point4& b, const point4& c, const point4& d , int& Index, point4* points, vec3* texVerts) { quad( a, d, c, b, Index, points, texVerts, a.to_3(), b.to_3(), c.to_3(), d.to_3()); } And the quad drawing function: void quad( const point4& a, const point4& b, const point4& c, const point4& d, int& Index, point4* points, vec3* texVerts, const vec3& tex_a, const vec3& tex_b, const vec3& tex_c, const vec3& tex_d) { texVerts[Index] = tex_a.normalized(); points[Index] = a; Index++; texVerts[Index] = tex_b.normalized(); points[Index] = b; Index++; texVerts[Index] = tex_c.normalized(); points[Index] = c; Index++; texVerts[Index] = tex_a.normalized(); points[Index] = a; Index++; texVerts[Index] = tex_c.normalized(); points[Index] = c; Index++; texVerts[Index] = tex_d.normalized(); points[Index] = d; Index++; } Edit: I forgot to mention, in the image, the camera is pointed directly at the back face of the cube. You can kind of see the diagonals leading out of the corners, if you squint.

    Read the article

  • YUV Textures and Shaders

    - by Luca
    I've always used RGB textures. Now comes up the need of use of YUV textures (a set of three texture, specifying 1 luminance and 2 chrominance channels). Of course the YUV texture could be converted on CPU, getting the RGB texture usable as usual... but I need to get RGB pixel directly on GPU, to avoid unnecessary processor load... The problem became strange, since I require to specifyin the shader source, because a single texture, the following items: Three samplers uniforms, one for each channel Two integer uniforms, for specifying the chrominance channels sampling a mat3 uniform, for specific YUV to RGB conversion matrix. This should be done for each YUV texture... Is it possible to "compress" required uniforms, and getting RGB values quite easily? Actually i think this could aid: Texture sizes, including mipmaps, could be queried. With this, its possible to save the two integer uniforms, since the uniform values are derived the ratio between texture extents The mat3 uniforms could be collected as globals, and with preprocessor could be selected. But what design should I use for specify three (related) textures? Is it possible to use textures levels for accessing multiple textures? Texture arrays could be usable? And what about using rectangle textures, which doesn't supports mipmaps? Maybe a shader abstraction (struct definition and related function) could aid? Thank you.

    Read the article

  • OpenGL-ES: Change (multiply) color when using color arrays?

    - by arberg
    Following the ideas in OpenGL ES iPhone - drawing anti aliased lines, I am trying to draw stroked anti-aliased lines and I am successful so far. After line is draw by the finger, I wish to fade the path, that is I need to change the opacity (color) of the entire path. I have computed a large array of vertex positions, vertex colors, texture coordinates, and indices and then I give these to opengl but I would like reduce the opacity of all the drawn triangles without having to change each of the color coordinates. Normally I would use glColor4f(r,g,b,a) before calling drawElements, but it has no effect due to the color array. I am working on Android, but I believe it shouldn't make the big difference, as long as it is OpenGL-ES 1.1 (or 1.0). I have the following code : gl.glEnable(GL10.GL_BLEND); gl.glBlendFunc(GL10.GL_ONE, GL10.GL_ONE_MINUS_SRC_ALPHA); gl.glEnableClientState(GL10.GL_COLOR_ARRAY); gl.glShadeModel(GL10.GL_SMOOTH); gl.glEnableClientState(GL10.GL_VERTEX_ARRAY); gl.glEnableClientState(GL10.GL_TEXTURE_COORD_ARRAY); gl.glEnable(GL10.GL_TEXTURE_2D); // Should set rgb to greyish, and alpha to half-transparent, the greyish is // just there to make the question more general its the alpha i'm interested in gl.glColor4f(.75f, .75f, .75f, 0.5f); gl.glVertexPointer(mVertexSize, GL10.GL_FLOAT, 0, mVertexBuffer); gl.glColorPointer(4, GL10.GL_FLOAT, 0, mColorBuffer); gl.glTexCoordPointer(2, GL10.GL_FLOAT, 0, mTexCoordBuffer); gl.glDrawElements(GL10.GL_TRIANGLES, indexCount, GL10.GL_UNSIGNED_SHORT, mIndexBuffer.position(startIndex)); If I disable the color array gl.glEnableClientState(GL10.GL_COLOR_ARRAY);, then the glColor4f works, if I enable the color array it does nothing. Is there any way in OpenGl-ES to change the coloring without changing all the color coordinates? I think that in OpenGl one might use a fragment shader, but it seems OpenGL does not have a fragment shader (not that I know how to use one).

    Read the article

  • GPGPU programming with OpenGL ES 2.0

    - by Albus Dumbledore
    I am trying to do some image processing on the GPU, e.g. median, blur, brightness, etc. The general idea is to do something like this framework from GPU Gems 1. I am able to write the GLSL fragment shader for processing the pixels as I've been trying out different things in an effect designer app. I am not sure however how I should do the other part of the task. That is, I'd like to be working on the image in image coords and then outputting the result to a texture. I am aware of the gl_FragCoords variable. As far as I understand it it goes like that: I need to set up a view (an orthographic one maybe?) and a quad in such a way so that the pixel shader would be applied once to each pixel in the image and so that it would be rendering to a texture or something. But how can I achieve that considering there's depth that may make things somewhat awkward to me... I'd be very grateful if anyone could help me with this rather simple task as I am really frustrated with myself. UPDATE It seems I'll have to use an FBO, getting one like this: glBindFramebuffer(...)

    Read the article

  • How to modulate every texture unit in OpenGL ES 1.1?

    - by Jesse Beder
    I have two textures and a "blend factor", and I'd like to mix them, modulated by the current color; in effect, I want to use the following shader: gl_FragColor = gl_Color * mix(tex0, tex1, blendFactor); I'm using OpenGL ES 1.1, targeting all versions of the iPhone, so I can't use shaders, and I have two texture units. My best attempt is: // texture 0 glActiveTexture(GL_TEXTURE0); image1.Bind(); glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE); // texture 1 glActiveTexture(GL_TEXTURE1); image2.Bind(); glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE); glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB, GL_INTERPOLATE); glTexEnvi(GL_TEXTURE_ENV, GL_SRC0_RGB, GL_PREVIOUS); glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND0_RGB, GL_SRC_COLOR); glTexEnvi(GL_TEXTURE_ENV, GL_SRC1_RGB, GL_TEXTURE); glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND1_RGB, GL_SRC_COLOR); glTexEnvi(GL_TEXTURE_ENV, GL_SRC2_RGB, GL_CONSTANT); glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND2_RGB, GL_SRC_ALPHA); glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_ALPHA, GL_INTERPOLATE); glTexEnvi(GL_TEXTURE_ENV, GL_SRC0_ALPHA, GL_PREVIOUS); glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND0_ALPHA, GL_SRC_ALPHA); glTexEnvi(GL_TEXTURE_ENV, GL_SRC1_ALPHA, GL_TEXTURE); glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND1_ALPHA, GL_SRC_ALPHA); glTexEnvi(GL_TEXTURE_ENV, GL_SRC2_ALPHA, GL_CONSTANT); glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND2_ALPHA, GL_SRC_ALPHA); const float factor[] = { 0, 0, 0, blendFactor }; glTexEnvfv(GL_TEXTURE_ENV, GL_TEXTURE_ENV_COLOR, factor); This has the effect of the shader: gl_FragColor = mix(gl_Color * tex0, tex1, blendFactor); but I don't see how to module texture 1 by the color. Is there any way to have the color provided by a texture unit automatically modulated by the incoming primary color? Or any other way to do what I want that I'm missing? Multiple passes are definitely allowed, but they have to have the proper blend effect; I have glBlend(GL_ONE, GL_ONE_MINUS_SRC_ALPHA); enabled, so it can be tricky to get right with multiple passes.

    Read the article

  • Problem Loading multiple textures using multiple shaders with GLSL

    - by paj777
    I am trying to use multiple textures in the same scene but no matter what I try the same texture is loaded for each object. So this what I am doing at the moment, I initialise each shader: rightWall.SendShaders("wall.vert","wall.frag","brick3.bmp", "wallTex", 0); demoFloor.SendShaders("floor.vert","floor.frag","dirt1.bmp", "floorTex", 1); The code in SendShaders is: GLuint vert,frag; glEnable(GL_DEPTH_TEST); glEnable(GL_TEXTURE_2D); char *vs = NULL,*fs = NULL; vert = glCreateShader(GL_VERTEX_SHADER); frag = glCreateShader(GL_FRAGMENT_SHADER); vs = textFileRead(vertFile); fs = textFileRead(fragFile); const char * ff = fs; const char * vv = vs; glShaderSource(vert, 1, &vv, NULL); glShaderSource(frag, 1, &ff, NULL); free(vs); free(fs); glCompileShader(vert); glCompileShader(frag); program = glCreateProgram(); glAttachShader(program, frag); glAttachShader(program, vert); glLinkProgram(program); glUseProgram(program); LoadGLTexture(textureImage, texture); GLint location = glGetUniformLocation(program, textureName); glActiveTexture(GL_TEXTURE0); glBindTexture(GL_TEXTURE_2D, texture); glUniform1i(location, 0); And then in the main loop: rightWall.UseShader(); rightWall.Draw(); demoFloor.UseShader(); demoFloor.Draw(); Which ever shader is initialised last is the texture which is used for both objects. Thank you for your time and I appreciate any comments.

    Read the article

  • C++ Pointer member function with templates assignment with a member function of another class

    - by Agusti
    Hi, I have this class: class IShaderParam{ public: std::string name_value; }; template<class TParam> class TShaderParam:public IShaderParam{ public: void (TShaderParam::*send_to_shader)( const TParam&,const std::string&); TShaderParam():send_to_shader(NULL){} TParam value; void up_to_shader(); }; typedef TShaderParam<float> FloatShaderParam; typedef TShaderParam<D3DXVECTOR3> Vec3ShaderParam; In another class, I have a vector of IShaderParams* and functions that i want to send to "send_to_shader". I'm trying assign the reference of these functions like this: Vec3ShaderParam *_param = new Vec3ShaderParam; _param-send_to_shader = &TShader::setVector3; This is the function: void TShader::setVector3(const D3DXVECTOR3 &vec, const std::string &name){ //... } And this is the class with IshaderParams*: class TShader{ std::vector params; public: Shader effect; std::string technique_name; TShader(std::string& afilename):effect(NULL){}; ~TShader(); void setVector3(const D3DXVECTOR3 &vec, const std::string &name); When I compile the project with Visual Studio C++ Express 2008 I recieve this error: Error 2 error C2440: '=' :can't make the conversion 'void (__thiscall TShader::* )(const D3DXVECTOR3 &,const std::string &)' a 'void (__thiscall TShaderParam::* )(const TParam &,const std::string &)' c:\users\isagoras\documents\mcv\afoc\shader.cpp 127 Can I do the assignment? No? I don't know how :-S Yes, I know that I can achieve the same objective with other techniques, but I want to know how can I do this..

    Read the article

  • HLSL - Combining textures

    - by b34r
    Hi All, I'm trying to combine two textures in HLSL - specifically, I want to take the alpha values from a base image, and the color data from an overlay image. My pixel shader for this looks like this: float4 PixelShaderFunction(VertexOut input) : COLOR0 { float4 baseColor = tex2D( BaseSampler, input.baseCoords.xy ).rgba; float4 overlayColor = tex2D( OverlaySampler, input.overlayCoords.xy ).rgba; float4 color; color.r = overlayColor.r; color.g = overlayColor.g; color.b = overlayColor.b; color.a = baseColor.a; return color.rgba; } and my blend state looks like this: BlendState bs = new BlendState(); bs.AlphaSourceBlend = Blend.SourceAlpha; bs.AlphaDestinationBlend = Blend.DestinationAlpha; bs.ColorSourceBlend = Blend.SourceColor; bs.ColorDestinationBlend = Blend.DestinationColor; What this leaves me with is a washed out version of what should be the overlay color. I've tried numerous permutations of the BlendState settings, and played with the pixel shader math quite a bit, but to no avail. Can anyone point me in the right direction? Thanks in advance =)

    Read the article

  • SSH Login to an EC2 instance failing with previously working keys...

    - by Matthew Savage
    We recently had an issues where I had rebooted our EC2 instance (Ubuntu x86_64, version 9.10 server) and due to an EC2 issue the instance needed to be stopped and was down for a few days. Now I have been able to bring the instance back online I cannot connect to SSH using the keypair which previously worked. Unfortunately SSH is the only way to get into this server, and while I have another system running in its place there are a number of things I would like to try and retrieve from the machine. Running SSH in verbose mode yields the following: [Broc-MBP.local]: Broc:~/.ssh ? ssh -i ~/.ssh/EC2Keypair.pem -l ubuntu ec2-xxx.compute-1.amazonaws.com -vvv OpenSSH_5.2p1, OpenSSL 0.9.8l 5 Nov 2009 debug1: Reading configuration data /Users/Broc/.ssh/config debug1: Reading configuration data /etc/ssh_config debug2: ssh_connect: needpriv 0 debug1: Connecting to ec2-xxx.compute-1.amazonaws.com [184.73.109.130] port 22. debug1: Connection established. debug3: Not a RSA1 key file /Users/Broc/.ssh/EC2Keypair.pem. debug2: key_type_from_name: unknown key type '-----BEGIN' debug3: key_read: missing keytype debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug2: key_type_from_name: unknown key type '-----END' debug3: key_read: missing keytype debug1: identity file /Users/Broc/.ssh/EC2Keypair.pem type -1 debug3: Not a RSA1 key file /Users/Broc/.ssh/id_rsa. debug2: key_type_from_name: unknown key type '-----BEGIN' debug3: key_read: missing keytype debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug2: key_type_from_name: unknown key type '-----END' debug3: key_read: missing keytype debug1: identity file /Users/Broc/.ssh/id_rsa type 1 debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1p1 Debian-6ubuntu2 debug1: match: OpenSSH_5.1p1 Debian-6ubuntu2 pat OpenSSH* debug1: Enabling compatibility mode for protocol 2.0 debug1: Local version string SSH-2.0-OpenSSH_5.2 debug2: fd 3 setting O_NONBLOCK debug1: SSH2_MSG_KEXINIT sent debug1: SSH2_MSG_KEXINIT received debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 debug2: kex_parse_kexinit: ssh-rsa,ssh-dss debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected] debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected] debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: none,[email protected],zlib debug2: kex_parse_kexinit: none,[email protected],zlib debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: first_kex_follows 0 debug2: kex_parse_kexinit: reserved 0 debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 debug2: kex_parse_kexinit: ssh-rsa,ssh-dss debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: none,[email protected] debug2: kex_parse_kexinit: none,[email protected] debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: first_kex_follows 0 debug2: kex_parse_kexinit: reserved 0 debug2: mac_setup: found hmac-md5 debug1: kex: server->client aes128-ctr hmac-md5 none debug2: mac_setup: found hmac-md5 debug1: kex: client->server aes128-ctr hmac-md5 none debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP debug2: dh_gen_key: priv key bits set: 123/256 debug2: bits set: 500/1024 debug1: SSH2_MSG_KEX_DH_GEX_INIT sent debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY debug3: check_host_in_hostfile: filename /Users/Broc/.ssh/known_hosts debug3: check_host_in_hostfile: match line 106 debug3: check_host_in_hostfile: filename /Users/Broc/.ssh/known_hosts debug3: check_host_in_hostfile: match line 106 debug1: Host 'ec2-xxx.compute-1.amazonaws.com' is known and matches the RSA host key. debug1: Found key in /Users/Broc/.ssh/known_hosts:106 debug2: bits set: 521/1024 debug1: ssh_rsa_verify: signature correct debug2: kex_derive_keys debug2: set_newkeys: mode 1 debug1: SSH2_MSG_NEWKEYS sent debug1: expecting SSH2_MSG_NEWKEYS debug2: set_newkeys: mode 0 debug1: SSH2_MSG_NEWKEYS received debug1: SSH2_MSG_SERVICE_REQUEST sent debug2: service_accept: ssh-userauth debug1: SSH2_MSG_SERVICE_ACCEPT received debug2: key: /Users/Broc/.ssh/id_rsa (0x100125f70) debug2: key: /Users/Broc/.ssh/EC2Keypair.pem (0x0) debug1: Authentications that can continue: publickey debug3: start over, passed a different list publickey debug3: preferred publickey,keyboard-interactive,password debug3: authmethod_lookup publickey debug3: remaining preferred: keyboard-interactive,password debug3: authmethod_is_enabled publickey debug1: Next authentication method: publickey debug1: Offering public key: /Users/Broc/.ssh/id_rsa debug3: send_pubkey_test debug2: we sent a publickey packet, wait for reply debug1: Authentications that can continue: publickey debug1: Trying private key: /Users/Broc/.ssh/EC2Keypair.pem debug1: read PEM private key done: type RSA debug3: sign_and_send_pubkey debug2: we sent a publickey packet, wait for reply debug1: Authentications that can continue: publickey debug2: we did not send a packet, disable method debug1: No more authentication methods to try. Permission denied (publickey). [Broc-MBP.local]: Broc:~/.ssh ? So, right now I'm really at a loss and not sure what to do. While I've already got another system taking the place of this one I'd really like to have access back :|

    Read the article

  • C# Neural Networks with Encog

    - by JoshReuben
    Neural Networks ·       I recently read a book Introduction to Neural Networks for C# , by Jeff Heaton. http://www.amazon.com/Introduction-Neural-Networks-C-2nd/dp/1604390093/ref=sr_1_2?ie=UTF8&s=books&qid=1296821004&sr=8-2-spell. Not the 1st ANN book I've perused, but a nice revision.   ·       Artificial Neural Networks (ANNs) are a mechanism of machine learning – see http://en.wikipedia.org/wiki/Artificial_neural_network , http://en.wikipedia.org/wiki/Category:Machine_learning ·       Problems Not Suited to a Neural Network Solution- Programs that are easily written out as flowcharts consisting of well-defined steps, program logic that is unlikely to change, problems in which you must know exactly how the solution was derived. ·       Problems Suited to a Neural Network – pattern recognition, classification, series prediction, and data mining. Pattern recognition - network attempts to determine if the input data matches a pattern that it has been trained to recognize. Classification - take input samples and classify them into fuzzy groups. ·       As far as machine learning approaches go, I thing SVMs are superior (see http://en.wikipedia.org/wiki/Support_vector_machine ) - a neural network has certain disadvantages in comparison: an ANN can be overtrained, different training sets can produce non-deterministic weights and it is not possible to discern the underlying decision function of an ANN from its weight matrix – they are black box. ·       In this post, I'm not going to go into internals (believe me I know them). An autoassociative network (e.g. a Hopfield network) will echo back a pattern if it is recognized. ·       Under the hood, there is very little maths. In a nutshell - Some simple matrix operations occur during training: the input array is processed (normalized into bipolar values of 1, -1) - transposed from input column vector into a row vector, these are subject to matrix multiplication and then subtraction of the identity matrix to get a contribution matrix. The dot product is taken against the weight matrix to yield a boolean match result. For backpropogation training, a derivative function is required. In learning, hill climbing mechanisms such as Genetic Algorithms and Simulated Annealing are used to escape local minima. For unsupervised training, such as found in Self Organizing Maps used for OCR, Hebbs rule is applied. ·       The purpose of this post is not to mire you in technical and conceptual details, but to show you how to leverage neural networks via an abstraction API - Encog   Encog ·       Encog is a neural network API ·       Links to Encog: http://www.encog.org , http://www.heatonresearch.com/encog, http://www.heatonresearch.com/forum ·       Encog requires .Net 3.5 or higher – there is also a Silverlight version. Third-Party Libraries – log4net and nunit. ·       Encog supports feedforward, recurrent, self-organizing maps, radial basis function and Hopfield neural networks. ·       Encog neural networks, and related data, can be stored in .EG XML files. ·       Encog Workbench allows you to edit, train and visualize neural networks. The Encog Workbench can generate code. Synapses and layers ·       the primary building blocks - Almost every neural network will have, at a minimum, an input and output layer. In some cases, the same layer will function as both input and output layer. ·       To adapt a problem to a neural network, you must determine how to feed the problem into the input layer of a neural network, and receive the solution through the output layer of a neural network. ·       The Input Layer - For each input neuron, one double value is stored. An array is passed as input to a layer. Encog uses the interface INeuralData to hold these arrays. The class BasicNeuralData implements the INeuralData interface. Once the neural network processes the input, an INeuralData based class will be returned from the neural network's output layer. ·       convert a double array into an INeuralData object : INeuralData data = new BasicNeuralData(= new double[10]); ·       the Output Layer- The neural network outputs an array of doubles, wraped in a class based on the INeuralData interface. ·        The real power of a neural network comes from its pattern recognition capabilities. The neural network should be able to produce the desired output even if the input has been slightly distorted. ·       Hidden Layers– optional. between the input and output layers. very much a “black box”. If the structure of the hidden layer is too simple it may not learn the problem. If the structure is too complex, it will learn the problem but will be very slow to train and execute. Some neural networks have no hidden layers. The input layer may be directly connected to the output layer. Further, some neural networks have only a single layer. A single layer neural network has the single layer self-connected. ·       connections, called synapses, contain individual weight matrixes. These values are changed as the neural network learns. Constructing a Neural Network ·       the XOR operator is a frequent “first example” -the “Hello World” application for neural networks. ·       The XOR Operator- only returns true when both inputs differ. 0 XOR 0 = 0 1 XOR 0 = 1 0 XOR 1 = 1 1 XOR 1 = 0 ·       Structuring a Neural Network for XOR  - two inputs to the XOR operator and one output. ·       input: 0.0,0.0 1.0,0.0 0.0,1.0 1.0,1.0 ·       Expected output: 0.0 1.0 1.0 0.0 ·       A Perceptron - a simple feedforward neural network to learn the XOR operator. ·       Because the XOR operator has two inputs and one output, the neural network will follow suit. Additionally, the neural network will have a single hidden layer, with two neurons to help process the data. The choice for 2 neurons in the hidden layer is arbitrary, and often comes down to trial and error. ·       Neuron Diagram for the XOR Network ·       ·       The Encog workbench displays neural networks on a layer-by-layer basis. ·       Encog Layer Diagram for the XOR Network:   ·       Create a BasicNetwork - Three layers are added to this network. the FinalizeStructure method must be called to inform the network that no more layers are to be added. The call to Reset randomizes the weights in the connections between these layers. var network = new BasicNetwork(); network.AddLayer(new BasicLayer(2)); network.AddLayer(new BasicLayer(2)); network.AddLayer(new BasicLayer(1)); network.Structure.FinalizeStructure(); network.Reset(); ·       Neural networks frequently start with a random weight matrix. This provides a starting point for the training methods. These random values will be tested and refined into an acceptable solution. However, sometimes the initial random values are too far off. Sometimes it may be necessary to reset the weights again, if training is ineffective. These weights make up the long-term memory of the neural network. Additionally, some layers have threshold values that also contribute to the long-term memory of the neural network. Some neural networks also contain context layers, which give the neural network a short-term memory as well. The neural network learns by modifying these weight and threshold values. ·       Now that the neural network has been created, it must be trained. Training a Neural Network ·       construct a INeuralDataSet object - contains the input array and the expected output array (of corresponding range). Even though there is only one output value, we must still use a two-dimensional array to represent the output. public static double[][] XOR_INPUT ={ new double[2] { 0.0, 0.0 }, new double[2] { 1.0, 0.0 }, new double[2] { 0.0, 1.0 }, new double[2] { 1.0, 1.0 } };   public static double[][] XOR_IDEAL = { new double[1] { 0.0 }, new double[1] { 1.0 }, new double[1] { 1.0 }, new double[1] { 0.0 } };   INeuralDataSet trainingSet = new BasicNeuralDataSet(XOR_INPUT, XOR_IDEAL); ·       Training is the process where the neural network's weights are adjusted to better produce the expected output. Training will continue for many iterations, until the error rate of the network is below an acceptable level. Encog supports many different types of training. Resilient Propagation (RPROP) - general-purpose training algorithm. All training classes implement the ITrain interface. The RPROP algorithm is implemented by the ResilientPropagation class. Training the neural network involves calling the Iteration method on the ITrain class until the error is below a specific value. The code loops through as many iterations, or epochs, as it takes to get the error rate for the neural network to be below 1%. Once the neural network has been trained, it is ready for use. ITrain train = new ResilientPropagation(network, trainingSet);   for (int epoch=0; epoch < 10000; epoch++) { train.Iteration(); Debug.Print("Epoch #" + epoch + " Error:" + train.Error); if (train.Error > 0.01) break; } Executing a Neural Network ·       Call the Compute method on the BasicNetwork class. Console.WriteLine("Neural Network Results:"); foreach (INeuralDataPair pair in trainingSet) { INeuralData output = network.Compute(pair.Input); Console.WriteLine(pair.Input[0] + "," + pair.Input[1] + ", actual=" + output[0] + ",ideal=" + pair.Ideal[0]); } ·       The Compute method accepts an INeuralData class and also returns a INeuralData object. Neural Network Results: 0.0,0.0, actual=0.002782538818034049,ideal=0.0 1.0,0.0, actual=0.9903741937121177,ideal=1.0 0.0,1.0, actual=0.9836807956566187,ideal=1.0 1.0,1.0, actual=0.0011646072586172778,ideal=0.0 ·       the network has not been trained to give the exact results. This is normal. Because the network was trained to 1% error, each of the results will also be within generally 1% of the expected value.

    Read the article

  • Operator of the week - Assert

    - by Fabiano Amorim
    Well my friends, I was wondering how to help you in a practical way to understand execution plans. So I think I'll talk about the Showplan Operators. Showplan Operators are used by the Query Optimizer (QO) to build the query plan in order to perform a specified operation. A query plan will consist of many physical operators. The Query Optimizer uses a simple language that represents each physical operation by an operator, and each operator is represented in the graphical execution plan by an icon. I'll try to talk about one operator every week, but so as to avoid having to continue to write about these operators for years, I'll mention only of those that are more common: The first being the Assert. The Assert is used to verify a certain condition, it validates a Constraint on every row to ensure that the condition was met. If, for example, our DDL includes a check constraint which specifies only two valid values for a column, the Assert will, for every row, validate the value passed to the column to ensure that input is consistent with the check constraint. Assert  and Check Constraints: Let's see where the SQL Server uses that information in practice. Take the following T-SQL: IF OBJECT_ID('Tab1') IS NOT NULL   DROP TABLE Tab1 GO CREATE TABLE Tab1(ID Integer, Gender CHAR(1))  GO  ALTER TABLE TAB1 ADD CONSTRAINT ck_Gender_M_F CHECK(Gender IN('M','F'))  GO INSERT INTO Tab1(ID, Gender) VALUES(1,'X') GO To the command above the SQL Server has generated the following execution plan: As we can see, the execution plan uses the Assert operator to check that the inserted value doesn't violate the Check Constraint. In this specific case, the Assert applies the rule, 'if the value is different to "F" and different to "M" than return 0 otherwise returns NULL'. The Assert operator is programmed to show an error if the returned value is not NULL; in other words, the returned value is not a "M" or "F". Assert checking Foreign Keys Now let's take a look at an example where the Assert is used to validate a foreign key constraint. Suppose we have this  query: ALTER TABLE Tab1 ADD ID_Genders INT GO  IF OBJECT_ID('Tab2') IS NOT NULL   DROP TABLE Tab2 GO CREATE TABLE Tab2(ID Integer PRIMARY KEY, Gender CHAR(1))  GO  INSERT INTO Tab2(ID, Gender) VALUES(1, 'F') INSERT INTO Tab2(ID, Gender) VALUES(2, 'M') INSERT INTO Tab2(ID, Gender) VALUES(3, 'N') GO  ALTER TABLE Tab1 ADD CONSTRAINT fk_Tab2 FOREIGN KEY (ID_Genders) REFERENCES Tab2(ID) GO  INSERT INTO Tab1(ID, ID_Genders, Gender) VALUES(1, 4, 'X') Let's look at the text execution plan to see what these Assert operators were doing. To see the text execution plan just execute SET SHOWPLAN_TEXT ON before run the insert command. |--Assert(WHERE:(CASE WHEN NOT [Pass1008] AND [Expr1007] IS NULL THEN (0) ELSE NULL END))      |--Nested Loops(Left Semi Join, PASSTHRU:([Tab1].[ID_Genders] IS NULL), OUTER REFERENCES:([Tab1].[ID_Genders]), DEFINE:([Expr1007] = [PROBE VALUE]))           |--Assert(WHERE:(CASE WHEN [Tab1].[Gender]<>'F' AND [Tab1].[Gender]<>'M' THEN (0) ELSE NULL END))           |    |--Clustered Index Insert(OBJECT:([Tab1].[PK]), SET:([Tab1].[ID] = RaiseIfNullInsert([@1]),[Tab1].[ID_Genders] = [@2],[Tab1].[Gender] = [Expr1003]), DEFINE:([Expr1003]=CONVERT_IMPLICIT(char(1),[@3],0)))           |--Clustered Index Seek(OBJECT:([Tab2].[PK]), SEEK:([Tab2].[ID]=[Tab1].[ID_Genders]) ORDERED FORWARD) Here we can see the Assert operator twice, first (looking down to up in the text plan and the right to left in the graphical plan) validating the Check Constraint. The same concept showed above is used, if the exit value is "0" than keep running the query, but if NULL is returned shows an exception. The second Assert is validating the result of the Tab1 and Tab2 join. It is interesting to see the "[Expr1007] IS NULL". To understand that you need to know what this Expr1007 is, look at the Probe Value (green text) in the text plan and you will see that it is the result of the join. If the value passed to the INSERT at the column ID_Gender exists in the table Tab2, then that probe will return the join value; otherwise it will return NULL. So the Assert is checking the value of the search at the Tab2; if the value that is passed to the INSERT is not found  then Assert will show one exception. If the value passed to the column ID_Genders is NULL than the SQL can't show a exception, in that case it returns "0" and keeps running the query. If you run the INSERT above, the SQL will show an exception because of the "X" value, but if you change the "X" to "F" and run again, it will show an exception because of the value "4". If you change the value "4" to NULL, 1, 2 or 3 the insert will be executed without any error. Assert checking a SubQuery: The Assert operator is also used to check one subquery. As we know, one scalar subquery can't validly return more than one value: Sometimes, however, a  mistake happens, and a subquery attempts to return more than one value . Here the Assert comes into play by validating the condition that a scalar subquery returns just one value. Take the following query: INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT ID_TipoSexo FROM Tab1), 'F')    INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT ID_TipoSexo FROM Tab1), 'F')    |--Assert(WHERE:(CASE WHEN NOT [Pass1016] AND [Expr1015] IS NULL THEN (0) ELSE NULL END))        |--Nested Loops(Left Semi Join, PASSTHRU:([tempdb].[dbo].[Tab1].[ID_TipoSexo] IS NULL), OUTER REFERENCES:([tempdb].[dbo].[Tab1].[ID_TipoSexo]), DEFINE:([Expr1015] = [PROBE VALUE]))              |--Assert(WHERE:([Expr1017]))             |    |--Compute Scalar(DEFINE:([Expr1017]=CASE WHEN [tempdb].[dbo].[Tab1].[Sexo]<>'F' AND [tempdb].[dbo].[Tab1].[Sexo]<>'M' THEN (0) ELSE NULL END))              |         |--Clustered Index Insert(OBJECT:([tempdb].[dbo].[Tab1].[PK__Tab1__3214EC277097A3C8]), SET:([tempdb].[dbo].[Tab1].[ID_TipoSexo] = [Expr1008],[tempdb].[dbo].[Tab1].[Sexo] = [Expr1009],[tempdb].[dbo].[Tab1].[ID] = [Expr1003]))              |              |--Top(TOP EXPRESSION:((1)))              |                   |--Compute Scalar(DEFINE:([Expr1008]=[Expr1014], [Expr1009]='F'))              |                        |--Nested Loops(Left Outer Join)              |                             |--Compute Scalar(DEFINE:([Expr1003]=getidentity((1856985942),(2),NULL)))              |                             |    |--Constant Scan              |                             |--Assert(WHERE:(CASE WHEN [Expr1013]>(1) THEN (0) ELSE NULL END))              |                                  |--Stream Aggregate(DEFINE:([Expr1013]=Count(*), [Expr1014]=ANY([tempdb].[dbo].[Tab1].[ID_TipoSexo])))             |                                       |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[Tab1].[PK__Tab1__3214EC277097A3C8]))              |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[Tab2].[PK__Tab2__3214EC27755C58E5]), SEEK:([tempdb].[dbo].[Tab2].[ID]=[tempdb].[dbo].[Tab1].[ID_TipoSexo]) ORDERED FORWARD)  You can see from this text showplan that SQL Server as generated a Stream Aggregate to count how many rows the SubQuery will return, This value is then passed to the Assert which then does its job by checking its validity. Is very interesting to see that  the Query Optimizer is smart enough be able to avoid using assert operators when they are not necessary. For instance: INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT ID_TipoSexo FROM Tab1 WHERE ID = 1), 'F') INSERT INTO Tab1(ID_TipoSexo, Sexo) VALUES((SELECT TOP 1 ID_TipoSexo FROM Tab1), 'F')  For both these INSERTs, the Query Optimiser is smart enough to know that only one row will ever be returned, so there is no need to use the Assert. Well, that's all folks, I see you next week with more "Operators". Cheers, Fabiano

    Read the article

  • Windows Azure Evolution - Web Sites (aka Antares) Part 1

    - by Shaun
    This is the 3rd post of my Windows Azure Evolution series, focus on the new features and enhancement which was alone with the Windows Azure Platform Upgrade June 2012, announced at the MEET Windows Azure event on 7th June. In the first post I introduced the new preview developer portal and how to works for the existing features such as cloud services, storages and SQL databases. In the second one I talked about the Windows Azure .NET SDK 1.7 on the latest Visual Studio 2012 RC on Windows 8. From this one I will begin to introduce some new features. Now let’s have a look on the first one of them, Windows Azure Web Sites.   Overview Windows Azure Web Sites (WAWS), as known as Antares, was a new feature still in preview stage in this upgrade. It allows people to quickly and easily deploy websites to a highly scalable cloud environment, uses the languages and open source apps of the choice then deploy such as FTP, Git and TFS. It also can be integrated with Windows Azure services like SQL Database, Caching, CDN and Storage easily. After read its introduction we may have a question: since we can deploy a website from both cloud service web role and web site, what’s the different between them? So, let’s have a quick compare.   CLOUD SERVICE WEB SITE OS Windows Server Windows Server Virtualization Windows Azure Virtual Machine Windows Azure Virtual Machine Host IIS IIS Platform ASP.NET WebForm, ASP.NET MVC, WCF ASP.NET WebForm, ASP.NET MVC, PHP Language C#, VB.NET C#, VB.NET, PHP Database SQL Database SQL Database, MySQL Architecture Multi layered, background worker, message queuing, etc.. Simple website with backend database. VS Project Windows Azure Cloud Service ASP.NET Web Form, ASP.NET MVC, etc.. Out-of-box Gallery (none) Drupal, DotNetNuke, WordPress, etc.. Deployment Package upload, Visual Studio publish FTP, Git, TFS, WebMatrix Compute Mode Dedicate VM Shared Across VMs, Dedicate VM Scale Scale up, scale out Scale up, scale out As you can see, there are many difference between the cloud service and web site, but the main point is that, the cloud service focus on those complex architecture web application. For example, if you want to build a website with frontend layer, middle business layer and data access layer, with some background worker process connected through the message queue, then you should better use cloud service, since it provides full control of your code and application. But if you just want to build a personal blog or a  business portal, then you can use the web site. Since the web site have many galleries, you can create them even without any coding and configuration. David Pallmann have an awesome figure explains the benefits between the could service, web site and virtual machine.   Create a Personal Blog in Web Site from Gallery As I mentioned above, one of the big feature in WAWS is to build a website from an existing gallery, which means we don’t need to coding and configure. What we need to do is open the windows azure developer portal and click the NEW button, select WEB SITE and FROM GALLERY. In the popping up windows there are many websites we can choose to use. For example, for personal blog there are Orchard CMS, WordPress; for CMS there are DotNetNuke, Drupal 7, mojoPortal. Let’s select WordPress and click the next button. The next step is to configure the web site. We will need to specify the DNS name and select the subscription and region. Since the WordPress uses MySQL as its backend database, we also need to create a MySQL database as well. Windows Azure Web Sites utilize ClearDB to host the MySQL databases. You cannot create a MySQL database directly from SQL Databases section. Finally, since we selected to create a new MySQL database we need to specify the database name and region in the last step. Also we need to accept the ClearDB’s terms as well. Then windows azure platform will download the WordPress codes and deploy the MySQL database and website. Then it will be ready to use. Select the website and click the BROWSE button, the WordPress administration page will be shown. After configured the WordPress here is my personal web blog on the cloud. It took me no more than 10 minutes to establish without any coding.   Monitor, Configure, Scale and Linked Resources Let’s click into the website I had just created in the portal and have a look on what we can do. In the website details page where are five sections. - Dashboard The overall information about this website, such as the basic usage status, public URL, compute mode, FTP address, subscription and links that we can specify the deployment credentials, TFS and Git publish setting, etc.. - Monitor Some status information such as the CPU usage, memory usage etc., errors, etc.. We can add more metrics by clicking the ADD METRICS button and the bottom as well. - Configure Here we can set the configurations of our website such as the .NET and PHP runtime version, diagnostics settings, application settings and the IIS default documents. - Scale This is something interesting. In WAWS there are two compute mode or called web site mode. One is “shared”, which means our website will be shared with other web sites in a group of windows azure virtual machines. Each web site have its own process (w3wp.exe) with some sandbox technology to isolate from others. When we need to scaling-out our web site in shared mode, we actually increased the working process count. Hence in shared mode we cannot specify the virtual machine size since they are shared across all web sites. This is a little bit different than the scaling mode of the cloud service (hosted service web role and worker role). The other mode called “dedicate”, which means our web site will use the whole windows azure virtual machine. This is the same hosting behavior as cloud service web role. In web role it will be deployed on the virtual machines we specified and all of them are only used by us. In web sites dedicate mode, it’s the same. In this mode when we scaling-out our web site we will use more virtual machines, and each of them will only host our own website. And we can specify the virtual machine size in this mode. In the developer portal we can select which mode we are using from the scale section. In shared mode we can only specify the instance count, but in dedicate mode we can specify the instance size as well as the instance count. - Linked Resource The MySQL database created alone with the creation of our WordPress web site is a linked resource. We can add more linked resources in this section.   Pricing For the web site itself, since this feature is in preview period if you are using shared mode, then you will get free up to 10 web sites. But if you are using dedicate mode, the price would be the virtual machines you are using. For example, if you are using dedicate and configured two middle size virtual machines then you will pay $230.40 per month. If there is SQL Database linked to your web site then they will be charged separately based on the Pay-As-You-Go price. For example a 1GB web edition database costs $9.99 per month. And the bandwidth will be charged as well. For example 10GB outbound data transfer costs $1.20 per month. For more information about the pricing please have a look at the windows azure pricing page.   Summary Windows Azure Web Sites gives us easier and quicker way to create, develop and deploy website to window azure platform. Comparing with the cloud service web role, the WAWS have many out-of-box gallery we can use directly. So if you just want to build a blog, CMS or business portal you don’t need to learn ASP.NET, you don’t need to learn how to configure DotNetNuke, you don’t need to learn how to prepare PHP and MySQL. By using WAWS gallery you can establish a website within 10 minutes without any lines of code. But in some cases we do need to code by ourselves. We may need to tweak the layout of our pages, or we may have a traditional ASP.NET or PHP web application which needed to migrated to the cloud. Besides the gallery WAWS also provides many features to download, upload code. It also provides the feature to integrate with some version control services such as TFS and Git. And it also provides the deploy approaches through FTP and Web Deploy. In the next post I will demonstrate how to use WebMatrix to download and modify the website, and how to use TFS and Git to deploy automatically one our code changes committed.   Hope this helps, Shaun All documents and related graphics, codes are provided "AS IS" without warranty of any kind. Copyright © Shaun Ziyan Xu. This work is licensed under the Creative Commons License.

    Read the article

  • Why doesn't my implementation of ElGamal work for long text strings?

    - by angstrom91
    I'm playing with the El Gamal cryptosystem, and my goal is to be able to encipher and decipher long sequences of text. I have come up with a method that works for short sequences, but does not work for long sequences, and I cannot figure out why. El Gamal requires the plaintext to be an integer. I have turned my string into a byte[] using the .getBytes() method for Strings, and then created a BigInteger out of the byte[]. After encryption/decryption, I turn the BigInteger into a byte[] using the .toByteArray() method for BigIntegers, and then create a new String object from the byte[]. This works perfectly when i call ElGamalEncipher with strings up to 129 characters. With 130 or more characters, the output produced from ElGamalDecipher is garbled. Can someone suggest how to solve this issue? Is this an issue with my method of turning the string into a BigInteger? If so, is there a better way to turn my string of text into a BigInteger and back? Below is my encipher/decipher code with a program to demonstrate the problem. import java.math.BigInteger; public class Main { static BigInteger P = new BigInteger("15893293927989454301918026303382412" + "2586402937727056707057089173871237566896685250125642378268385842" + "6917261652781627945428519810052550093673226849059197769795219973" + "9423619267147615314847625134014485225178547696778149706043781174" + "2873134844164791938367765407368476144402513720666965545242487520" + "288928241768306844169"); static BigInteger G = new BigInteger("33234037774370419907086775226926852" + "1714093595439329931523707339920987838600777935381196897157489391" + "8360683761941170467795379762509619438720072694104701372808513985" + "2267495266642743136795903226571831274837537691982486936010899433" + "1742996138863988537349011363534657200181054004755211807985189183" + "22832092343085067869"); static BigInteger R = new BigInteger("72294619754760174015019300613282868" + "7219874058383991405961870844510501809885568825032608592198728334" + "7842806755320938980653857292210955880919036195738252708294945320" + "3969657021169134916999794791553544054426668823852291733234236693" + "4178738081619274342922698767296233937873073756955509269717272907" + "8566607940937442517"); static BigInteger A = new BigInteger("32189274574111378750865973746687106" + "3695160924347574569923113893643975328118502246784387874381928804" + "6865920942258286938666201264395694101012858796521485171319748255" + "4630425677084511454641229993833255506759834486100188932905136959" + "7287419551379203001848457730376230681693887924162381650252270090" + "28296990388507680954"); public static void main(String[] args) { FewChars(); System.out.println(); ManyChars(); } public static void FewChars() { //ElGamalEncipher(String plaintext, BigInteger p, BigInteger g, BigInteger r) BigInteger[] cipherText = ElGamal.ElGamalEncipher("This is a string " + "of 129 characters which works just fine . This is a string " + "of 129 characters which works just fine . This is a s", P, G, R); System.out.println("This is a string of 129 characters which works " + "just fine . This is a string of 129 characters which works " + "just fine . This is a s"); //ElGamalDecipher(BigInteger c, BigInteger d, BigInteger a, BigInteger p) System.out.println("The decrypted text is: " + ElGamal.ElGamalDecipher(cipherText[0], cipherText[1], A, P)); } public static void ManyChars() { //ElGamalEncipher(String plaintext, BigInteger p, BigInteger g, BigInteger r) BigInteger[] cipherText = ElGamal.ElGamalEncipher("This is a string " + "of 130 characters which doesn’t work! This is a string of " + "130 characters which doesn’t work! This is a string of ", P, G, R); System.out.println("This is a string of 130 characters which doesn’t " + "work! This is a string of 130 characters which doesn’t work!" + " This is a string of "); //ElGamalDecipher(BigInteger c, BigInteger d, BigInteger a, BigInteger p) System.out.println("The decrypted text is: " + ElGamal.ElGamalDecipher(cipherText[0], cipherText[1], A, P)); } } import java.math.BigInteger; import java.security.SecureRandom; public class ElGamal { public static BigInteger[] ElGamalEncipher(String plaintext, BigInteger p, BigInteger g, BigInteger r) { // returns a BigInteger[] cipherText // cipherText[0] is c // cipherText[1] is d SecureRandom sr = new SecureRandom(); BigInteger[] cipherText = new BigInteger[2]; BigInteger pText = new BigInteger(plaintext.getBytes()); // 1: select a random integer k such that 1 <= k <= p-2 BigInteger k = new BigInteger(p.bitLength() - 2, sr); // 2: Compute c = g^k(mod p) BigInteger c = g.modPow(k, p); // 3: Compute d= P*r^k = P(g^a)^k(mod p) BigInteger d = pText.multiply(r.modPow(k, p)).mod(p); // C =(c,d) is the ciphertext cipherText[0] = c; cipherText[1] = d; return cipherText; } public static String ElGamalDecipher(BigInteger c, BigInteger d, BigInteger a, BigInteger p) { //returns the plaintext enciphered as (c,d) // 1: use the private key a to compute the least non-negative residue // of an inverse of (c^a)' (mod p) BigInteger z = c.modPow(a, p).modInverse(p); BigInteger P = z.multiply(d).mod(p); byte[] plainTextArray = P.toByteArray(); return new String(plainTextArray); } }

    Read the article

  • Turn Windows 7 desktop into grayscale

    - by Axarydax
    Is there a universal way to turn Windows 7 Aero desktop, all windows, etc into grayscale? I had an option to do this in Intel video drivers on my old laptop, now I have NVIDIA and don't know how to replicate this effect. There might be a shader or filter that would cause Aero to render stuff into grayscale, but I was unable to find anything. Or must it be explicitly supported by video driver as a bonus?

    Read the article

  • concurrency::accelerator_view

    - by Daniel Moth
    Overview We saw previously that accelerator represents a target for our C++ AMP computation or memory allocation and that there is a notion of a default accelerator. We ended that post by introducing how one can obtain accelerator_view objects from an accelerator object through the accelerator class's default_view property and the create_view method. The accelerator_view objects can be thought of as handles to an accelerator. You can also construct an accelerator_view given another accelerator_view (through the copy constructor or the assignment operator overload). Speaking of operator overloading, you can also compare (for equality and inequality) two accelerator_view objects between them to determine if they refer to the same underlying accelerator. We'll see later that when we use concurrency::array objects, the allocation of data takes place on an accelerator at array construction time, so there is a constructor overload that accepts an accelerator_view object. We'll also see later that a new concurrency::parallel_for_each function overload can take an accelerator_view object, so it knows on what target to execute the computation (represented by a lambda that the parallel_for_each also accepts). Beyond normal usage, accelerator_view is a quality of service concept that offers isolation to multiple "consumers" of an accelerator. If in your code you are accessing the accelerator from multiple threads (or, in general, from different parts of your app), then you'll want to create separate accelerator_view objects for each thread. flush, wait, and queuing_mode When you create an accelerator_view via the create_view method of the accelerator, you pass in an option of immediate or deferred, which are the two members of the queuing_mode enum. At any point you can access this value from the queuing_mode property of the accelerator_view. When the queuing_mode value is immediate (which is the default), any commands sent to the device such as kernel invocations and data transfers (e.g. parallel_for_each and copy, as we'll see in future posts), will get submitted as soon as the runtime sees fit (that is the definition of immediate). When the value of queuing_mode is deferred, the commands will be batched up. To send all buffered commands to the device for execution, there is a non-blocking flush method that you can call. If you wish to block until all the commands have been sent, there is a wait method you can call. Deferring is a more advanced scenario aimed at performance gains when you are submitting many device commands and you want to avoid the tiny overhead of flushing/submitting each command separately. Querying information Just like accelerator, accelerator_view exposes the is_debug and version properties. In fact, you can always access the accelerator object from the accelerator property on the accelerator_view class to access the accelerator interface we looked at previously. Interop with D3D (aka DX) In a later post I'll show an example of an app that uses C++ AMP to compute data that is used in pixel shaders. In those scenarios, you can benefit by integrating C++ AMP into your graphics pipeline and one of the building blocks for that is being able to use the same device context from both the compute kernel and the other shaders. You can do that by going from accelerator_view to device context (and vice versa), through part of our interop API in amp.h: *get_device, create_accelerator_view. More on those in a later post. Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Triangle Line-Segment Intersection - detecting near misses

    - by Will
    A ray is a very poor approximation of a player! I think approximating a player with a sphere traveling a straight line each game tick will solve my problems of the player intersecting edges of scenery because their line segment missed it yet their own model is not infinitely thin... I have a 3D triangle and a line segment. I have the normal triangle-line-segment intersection code which I admit I have only a woolly grasp of. To model movement and compute collisions of the player I have to determine if a line passes within sphere-radius of a triangle. But I can find no convenient line near-miss intersection code! Here's the classic triangle intersection ### commented ### code with my starting assumptions: function triangle_ray_intersection(a,b,c,ray_origin,ray_dir,ray_radius) { // http://softsurfer.com/Archive/algorithm_0105/algorithm_0105.htm#intersect_RayTriangle%28%29 // get triangle edge vectors and plane normal var u = vec3_sub(b,a); var v = vec3_sub(c,a); var n = vec3_cross(u,v); if(n[0]==0 && n[1]==0 && n[2]==0) return null; // triangle is degenerate var w0 = vec3_sub(ray_origin,a); var j = vec3_dot(n,ray_dir); if(Math.abs(j) < 0.00000001) { //### if parallel, might still pass within ray_radius of it return null; // parallel, disjoint or on plane } var i = -vec3_dot(n,w0); // get intersect point of ray with triangle plane var k = i / j; if(k < 0.0) return null; // ray goes away from triangle //### as its a line segment, k > 1+ray_radius means no intersect var hit = vec3_add(ray_origin,vec3_scale(ray_dir,k)); // intersect point of ray and plane // is I inside T? //### here I'm a bit lost; this is presumably computing barycentric coordinates? var uu = vec3_dot(u,u); var uv = vec3_dot(u,v); var vv = vec3_dot(v,v); var w = vec3_sub(hit,a); var wu = vec3_dot(w,u); var wv = vec3_dot(w,v); var D = uv * uv - uu * vv; var s = (uv * wv - vv * wu) / D; //### therefore, compute if its within ray_radius scaled to the 0..1 of barycentric coordinates? if(s<0.0 || s>1.0) return null; // I is outside T var t = (uv * wu - uu * wv) / D; if(t<0.0 || (s+t)>1.0) return null; // I is outside T //### finally, if it passses a barycentric test it might still be too far //### to a point; must check that its distance from a corner is within ray_radius too if more than one barycentric coord is >1 //### so we have rounded corners... return [hit,n]; // I is in T } Given the distance between the point of plane intersection and each corner, I ought to be able to determine distance at world scale of how far beyond the edge - beyond 1.0 in barycentric coordinates for each axis - that point is... At this point my head explodes! Is this the right track? What's the actual code? UPDATE: you can earn 100 pts on SO if you answer this question there...! How can you determine if a line segment passes within some distance of a triangle?

    Read the article

  • Why do my pyramids fade black and then back to colour again

    - by geminiCoder
    I have the following vertecies and norms GLfloat verts[36] = { -0.5, 0, 0.5, 0, 0, -0.5, 0.5, 0, 0.5, 0, 0, -0.5, 0.5, 0, 0.5, 0, 1, 0, -0.5, 0, 0.5, 0, 0, -0.5, 0, 1, 0, 0.5, 0, 0.5, -0.5, 0, 0.5, 0, 1, 0 }; GLfloat norms[36] = { 0, -1, 0, 0, -1, 0, 0, -1, 0, -1, 0.25, 0.5, -1, 0.25, 0.5, -1, 0.25, 0.5, 1, 0.25, -0.5, 1, 0.25, -0.5, 1, 0.25, -0.5, 0, -0.5, -1, 0, -0.5, -1, 0, -0.5, -1 }; I am writing my fists Open GL game, But I need to know for sure if my Normals are correct as the colours aren't rendering correctly. my Pyramids are coloured then fade to black every half rotation then back again. My app so far is based on the boiler plate code provided by apple. heres my modified setUp Method [EAGLContext setCurrentContext:self.context]; [self loadShaders]; self.effect = [[GLKBaseEffect alloc] init]; self.effect.light0.enabled = GL_TRUE; self.effect.light0.diffuseColor = GLKVector4Make(1.0f, 0.4f, 0.4f, 1.0f); glEnable(GL_DEPTH_TEST); glGenVertexArraysOES(1, &_vertexArray); //create vertex array glBindVertexArrayOES(_vertexArray); glGenBuffers(1, &_vertexBuffer); glBindBuffer(GL_ARRAY_BUFFER, _vertexBuffer); glBufferData(GL_ARRAY_BUFFER, sizeof(verts) + sizeof(norms), NULL, GL_STATIC_DRAW); //create vertex buffer big enough for both verts and norms and pass NULL as data.. uint8_t *ptr = (uint8_t *)glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES); //map buffer to pass data to it memcpy(ptr, verts, sizeof(verts)); //copy verts memcpy(ptr+sizeof(verts), norms, sizeof(norms)); //copy norms to position after verts glUnmapBufferOES(GL_ARRAY_BUFFER); glEnableVertexAttribArray(GLKVertexAttribPosition); glVertexAttribPointer(GLKVertexAttribPosition, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(0)); //tell GL where verts are in buffer glEnableVertexAttribArray(GLKVertexAttribNormal); glVertexAttribPointer(GLKVertexAttribNormal, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(sizeof(verts))); //tell GL where norms are in buffer glBindVertexArrayOES(0); And the update method. - (void)update { float aspect = fabsf(self.view.bounds.size.width / self.view.bounds.size.height); GLKMatrix4 projectionMatrix = GLKMatrix4MakePerspective(GLKMathDegreesToRadians(65.0f), aspect, 0.1f, 100.0f); self.effect.transform.projectionMatrix = projectionMatrix; GLKMatrix4 baseModelViewMatrix = GLKMatrix4MakeTranslation(0.0f, 0.0f, -4.0f); baseModelViewMatrix = GLKMatrix4Rotate(baseModelViewMatrix, _rotation, 0.0f, 1.0f, 0.0f); // Compute the model view matrix for the object rendered with GLKit GLKMatrix4 modelViewMatrix = GLKMatrix4MakeTranslation(0.0f, 0.0f, -1.5f); modelViewMatrix = GLKMatrix4Rotate(modelViewMatrix, _rotation, 1.0f, 1.0f, 1.0f); modelViewMatrix = GLKMatrix4Multiply(baseModelViewMatrix, modelViewMatrix); self.effect.transform.modelviewMatrix = modelViewMatrix; // Compute the model view matrix for the object rendered with ES2 modelViewMatrix = GLKMatrix4MakeTranslation(0.0f, 0.0f, 1.5f); modelViewMatrix = GLKMatrix4Rotate(modelViewMatrix, _rotation, 1.0f, 1.0f, 1.0f); modelViewMatrix = GLKMatrix4Multiply(baseModelViewMatrix, modelViewMatrix); _normalMatrix = GLKMatrix3InvertAndTranspose(GLKMatrix4GetMatrix3(modelViewMatrix), NULL); _modelViewProjectionMatrix = GLKMatrix4Multiply(projectionMatrix, modelViewMatrix); _rotation += self.timeSinceLastUpdate * 0.5f; } But providing I understand this correct one pyramid is using the GLKit base effect shaders and the other the shaders which are included in the project. So for both of them to have the same error, I thought it would be the Norms?

    Read the article

  • OpenGL or OpenGL ES

    - by zxspectrum
    What should I learn? OpenGL 4.1 or OpenGL ES 2.0? I will be developing desktop applications using Qt but I may start developing mobile applications in a few months, too. I don't know anything about 3D, 3D math, etc and I'd rather spend 100 bucks in a good book than 1 week digging websites and going through trial and error. One problem I see with OpenGL 4.1 is as far as I know there is no book yet (the most recent ones are for OpenGL 3.3 or 4.0), while there are books on OpenGL ES 2.0. On the other hand, from my naive point of view, OpenGL 4.1 seems like OpenGL ES 2.0 + additions, so it looks like it would be easier/better to first learn OpenGL ES 2.0, then go for the shader language, etc Please, don't tell me to use NeHe (it's generally agreed it's full of bad/old practices), the Durian tutorial, etc. Thanks

    Read the article

  • Create a trailing, ghosting effect of a sprite

    - by Neeko
    I want to create a trailing, ghosting like effect of a sprite that's moving fast. Something very similar to this image of Sonic (apologies of bad quality, it's the only example I could find of the effect I'm looking to achieve) However, I don't want to do this at the sprite sheet level, to avoid having to essentially double (or possibly quadruple) the amount of sprites in my atlas. It's also very labor intensive. So is there any other way to achieve this effect? Possibly by some shader voodoo magic? I am using Unity and 2D Toolkit, if that helps.

    Read the article

  • Where are some good resources to learn Game Development with OpenGL ES 2.X

    - by Mahbubur R Aaman
    Background: From http://www.khronos.org/opengles/2_X/ OpenGL ES 2.0 combines a version of the OpenGL Shading Language for programming vertex and fragment shaders that has been adapted for embedded platforms, together with a streamlined API from OpenGL ES 1.1 that has removed any fixed functionality that can be easily replaced by shader programs, to minimize the cost and power consumption of advanced programmable graphics subsystems. Related Resources The OpenGL ES 2.0 specification, header files, and optional extension specifications The OpenGL ES 2.0 Online Manual Pages The OpenGL ES 3.0 Shading LanguageOnline Reference Pages The OpenGL ES 2.0 Quick Reference Card OpenGL ES 1.X OpenGL ES 2.0 From http://www.cocos2d-iphone.org/archives/2003 Cocos2d Version 2 released and one of primary key point noted as OpenGL ES 2.0 support From http://www.h-online.com/open/news/item/Compiz-now-supports-OpenGL-ES-2-0-1674605.html Compiz now supports OpenGL ES 2.0 My Question : Being as a Game Developer ( I have to work with several game engine Cocos2d, Unity). I need several resources to cope up with OpenGL ES 2.X for better outcome while developing games?

    Read the article

  • Opengl glVertexAttrib4fv doesn't work?

    - by Naor
    This is my vertex shader: static const GLchar * vertex_shader_source[] = { "#version 430 core \n" "layout (location = 0) in vec4 offset; \n" "void main(void) \n" "{ \n" " const vec4 vertices[3] = vec4[3](vec4( 0.25, -0.25, 0.5, 1.0),\n" " vec4(-0.25, -0.25, 0.5, 1.0), \n" " vec4( 0.25, 0.25, 0.5, 1.0)); \n" " gl_Position = vertices[gl_VertexID] + offset; \n" "} \n" }; and this is what im trying to do: glUseProgram(rendering_program); GLfloat attrib[] = { (float)sin(currentTime) * 0.5f, (float)cos(currentTime) * 0.6f, 0.0f, 0.0f }; glVertexAttrib4fv(0, attrib); glDrawArrays(GL_TRIANGLES, 0, 3); currentTime - The number in seconds since the program has started. Expected result - Triangle moving around the window. Its from the SuperBible book (sixth edition), this is the full code:http://pastebin.com/xA3eCKz1 The triangle should move across the screen but it doesn't.

    Read the article

  • Google I/O 2011: 3D Graphics on Android: Lessons learned from Google Body

    Google I/O 2011: 3D Graphics on Android: Lessons learned from Google Body Nico Weber Google originally built Google Body, a 3D application that renders the human body in incredible detail, for WebGL-capable browsers running on high-end bPCs. To bring the app to Android at a high resolution and frame rate, Nico Weber and Won Chun had a close encounter with Android's graphics stack. In this session Nico will present their findings as best practices for high-end 3D graphics using OpenGL ES 2.0 on Android. The covered topics range from getting accelerated pixels on the screen to fast resource loading, performance guidelines, texture compression, mipmapping, recommended vertex attribute formats, and shader handling. The talk also touches on related topics such as SDK vs NDK, picking, and resource loading. From: GoogleDevelopers Views: 6077 29 ratings Time: 56:09 More in Science & Technology

    Read the article

  • Modelling photo-realistic grass in realtime

    - by sebf
    Hello, I see a number of tutorials on how to create good looking grasses when creating 3D renders but can't think how to model it for realtime/use in a game's scenery. Sure simple models with alpha cutouts can be used to create plants and trees in really awesome scenery but what about a lawn? Are there any good tricks to achieve this effect? I tried with a simple 4 sided box and a small texture and the number of objects needed for a decent appearance made Max crawl to a halt. (I am thinking it may be possible with a shader but that is a whole other area so thought I would just ask about anyones experience with modelling it here) Thanks!

    Read the article

  • Adding VFACE semantic causes overlapping output semantics error

    - by user1423893
    My pixel shader input is a follows struct VertexShaderOut { float4 Position : POSITION0; float2 TextureCoordinates : TEXCOORD0; float4 PositionClone : TEXCOORD1; // Final position values must be cloned to be used in PS calculations float3 Normal : TEXCOORD2; //float3x3 TBN : TEXCOORD3; float CullFace : VFACE; // A negative value faces backwards (-1), while a positive value (+1) faces the camera (requires ps_3_0) }; I'm using ps_3_0 and I wish to utilise the VFACE semantic for correct lighting of normals depending on the cull mode. If I add the VFACE semantic then I get the following errors: error X5639: dcl usage+index: position,0 has already been specified for an output register error X4504: overlapping output semantics Why would this occur? I can't see why there would be too much data.

    Read the article

< Previous Page | 27 28 29 30 31 32 33 34 35 36 37 38  | Next Page >