large - Page 44 - Developer IT

XML streaming with XProc.

- by Pierre

Hi all, I'm playing with xproc, the XML pipeline language and http://xmlcalabash.com/. I'd like to find an example for streaming large xml documents. for example, given the following huge xml document: <Books> <Book> <title>Book-1</title> </Book> <Book> <title>Book-2</title> </Book> <Book> <title>Book-3</title> </Book>  <Book> <title>Book-N</title> </Book> </Books> How should I proceed to loop (streaming) over x-N documents like <Books> <Book> <title>Book-x</title> </Book> </Books> and treat each document with a xslt ? is it possible with xproc ?

Read the article

Unable to load huge XML document (incorrectly suppose it's due to the XSLT processing)

- by krisvandenbergh

I'm trying to match certain elements using XSLT. My input document is very large and the source XML fails to load after processing the following code (consider especially the first line). <xsl:template match="XMI/XMI.content/Model_Management.Model/Foundation.Core.Namespace.ownedElement/Model_Management.Package/Foundation.Core.Namespace.ownedElement"> <rdf:RDF> <rdf:Description rdf:about=""> <xsl:for-each select="Foundation.Core.Class"> <xsl:for-each select="Foundation.Core.ModelElement.name"> <owl:Class rdf:ID="@Foundation.Core.ModelElement.name" /> </xsl:for-each> </xsl:for-each> </rdf:Description> </rdf:RDF> </xsl:template> Apparently the XSLT fails to load after "Model_Management.Model". The PHP code is as follows: if ($xml->loadXML($source_xml) == false) { die('Failed to load source XML: ' . $http_file); } It then fails to perform loadXML and immediately dies. I think there are two options now. 1) I should set a maximum executing time. Frankly, I don't know how that I do this for the built-in PHP 5 XSLT processor. 2) Think about another way to match. What would be the best way to deal with this? The input document can be found at http://krisvandenbergh.be/uml_pricing.xml Any help would be appreciated! Thanks.

Read the article

Getting started with massive data

- by Max

I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from? Hadoop/MapReduce is one obvious start. Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?) I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with? Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale. Any other suggestions on things to learn would be great!

Read the article

Copying a foreign Subversion repository to keep under dependencies

- by Jonathan Sternberg

I want to keep dependencies for my project in our own repository, that way we have consistent libraries for the entire team to work with. For example, I want our project to use the Boost libraries. I've seen this done in the past with putting dependencies under a "vendor" or "dependencies" folder. But I still want to be able to update these dependencies. If a new feature appears in a library and we need it, I want to just be able to update that repository within our own repository. I don't want to have to recopy it and put it under version control again. I'd also like for us to have the ability to change dependencies if a small change is needed without stopping us from ever updating the library. I want the ability to do something like 'svn cp', then be able to 'svn merge' in the future. I just tried this with the boost trunk, but I'm not able to get any history using 'svn log' on the copy I made. How do I do this? What is usually done for large projects with dependencies?

Read the article

Database structure for ecommerce site

- by imanc

Hey Guys, I have been tasked with designing an ecommerce solution. The aspect that is causing me the most problems is the database. Currently the site consists of 10+ country based shops each with their own database (all residing on the same mysql instance). For the new site I'd rather all these shop databases be merged into one database so that all tables (products, orders, customers etc.) have a shop_id field. From a programming perspective this seems to make the most sense as we won't have to manage data across multiple databases. Currently the entire site generates about 120k orders a year, but is experiencing fairly heavy growth and we need to design a solution that will scale. In 5 years there may be more than a million orders per year and a database that contains 5 years order history (archiving maybe a solution here). The question is - do we use a single database, or do we keep the database-per-shop structure? I am currently trying to find supporting evidence for either avenue. The company I am designing the solution for prefer the per-shop database structure because they believe it will allow the sites to scale. But my argument is that the shop's database probably won't get that busy over the next few years that they exceed the capacity of a mysql database and a "no expenses spared" hardware set-up. I am wondering if anyone has any advice either way? Does anyone have experience with websites / ecommerce sites that have tables containing millions of records? I know there is probably not a clear answer here, but at what stage do we have too many records or too large table files to have a fast loading site? Also, if anyone has any advice on sources of information - books, websites, etc. where I can do further research, it would be highly appreciated! Cheers, imanc

Read the article

Read/Write/Find/Replace huge csv file

- by notapipe

I have a huge (4,5 GB) csv file.. I need to perform basic cut and paste, replace operations for some columns.. the data is pretty well organized.. the only problem is I cannot play with it with Excel because of the size (2000 rows, 550000 columns). here is some part of the data: ID,Affection,Sex,DRB1_1,DRB1_2,SENum,SEStatus,AntiCCP,RFUW,rs3094315,rs12562034,rs3934834,rs9442372,rs3737728 D0024949,0,F,0101,0401,SS,yes,?,?,A_A,A_A,G_G,G_G D0024302,0,F,0101,7,SN,yes,?,?,A_A,G_G,A_G,?_? D0023151,0,F,0101,11,SN,yes,?,?,A_A,G_G,G_G,G_G I need to remove 4th, 5th, 6th, 7th, 8th and 9th columns; I need to find every _ character from column 10 onwards and replace it with a space ( ) character; I need to replace every ? with zero (0); I need to replace every comma with a tab; I need to remove first row (that has column names; I need to replace every 0 with 1, every 1 with 2 and every ? with 0 in 2nd column; I need to replace F with 2, M with 1 and ? with 0 in 3rd column; so that in the resulting file the output reads: D0024949 1 2 A A A A G G G G D0024302 1 2 A A G G A G 0 0 D0023151 1 2 A A G G G G G G (both input and output should read one line per row, ne extra blank row) Is there a memory efficient way of doing that with java(and I need a code to do that) or a usable tool for playing with this large data so that I can easily apply Excel functionality..

Read the article

Improving File Read Performance (single file, C++, Windows)

- by david

I have large (hundreds of MB or more) files that I need to read blocks from using C++ on Windows. Currently the relevant functions are: errorType LargeFile::read( void* data_out, __int64 start_position, __int64 size_bytes ) const { if( !m_open ) { // return error } else { seekPosition( start_position ); DWORD bytes_read; BOOL result = ReadFile( m_file, data_out, DWORD( size_bytes ), &bytes_read, NULL ); if( size_bytes != bytes_read || result != TRUE ) { // return error } } // return no error } void LargeFile::seekPosition( __int64 position ) const { LARGE_INTEGER target; target.QuadPart = LONGLONG( position ); SetFilePointerEx( m_file, target, NULL, FILE_BEGIN ); } The performance of the above does not seem to be very good. Reads are on 4K blocks of the file. Some reads are coherent, most are not. A couple questions: Is there a good way to profile the reads? What things might improve the performance? For example, would sector-aligning the data be useful? I'm relatively new to file i/o optimization, so suggestions or pointers to articles/tutorials would be helpful.

Read the article

MySQL - What is wrong with this query or my database? Terrible performance.

- by Moss

SELECT * from `employees` a LEFT JOIN (SELECT phone1 p1, count(*) c, FROM `employees` GROUP BY phone1) b ON a.phone1 = b.p1; I'm not sure if it is this query in particular that has the problem. I have been getting terrible performance in general with this database. The table in question has 120,000 rows. I have tried this particular query remotely and locally with the MyISAM and InnoDB engines, with different types of joins, and with and without an index on phone1. I can get this to complete in about 4 minutes on a 10,000 row table successfully but performance drops exponentially with larger tables. Remotely it will lose connection to the server and locally it brings my system to its knees and seems to go on forever. This query is only a smaller step I was trying to do when a larger query couldn't complete. Maybe I should explain the whole scenario. I have one big flat ugly table that lists a bunch of people and their contact info and the info of the companies they work for. I'm trying to normalize the database and intelligently determine which phone numbers apply to individual people and which apply to an office location. My reasoning is that if a phone number occurs multiple times and the number of occurrence equals the number of times that the street address it is attached to occurs then it must be an office number. So the first step is to count each phone number grouping by phone number. Normally if you just use COUNT()...GROUP BY it will only list the first record it finds in that group so I figured I have to join the full table to the count table where the phone number matches. This does work but as I said I can't successfully complete it on any table much larger than 10,000 rows. This seems pathetic and this doesn't seem like a crazy query to do. Is there a better way to achieve what I want or do I have to break my large table into 12 pieces or is there something wrong with the table or db?

Read the article

Given a trace of packets, how would you group them into flows?

- by zxcvbnm

I've tried it these ways so far: 1) Make a hash with the source IP/port and destination IP/port as keys. Each position in the hash is a list of packets. The hash is then saved in a file, with each flow separated by some special characters/line. Problem: Not enough memory for large traces. 2) Make a hash with the same key as above, but only keep in memory the file handles. Each packet is then put into the hash[key] that points to the right file. Problems: Too many flows/files (~200k) and it might run out of memory as well. 3) Hash the source IP/port and destination IP/port, then put the info inside a file. The difference between 2 and 3 is that here the files are opened and closed for each operation, so I don't have to worry about running out of memory because I opened too many at the same time. Problems: WAY too slow, same number of files as 2 so also impractical. 4) Make a hash of the source IP/port pairs and then iterate over the whole trace for each flow. Take the packets that are part of that flow and place them into the output file. Problem: Suppose I have a 60 MB trace that has 200k flows. This way, I would process, say, a 60 MB file 200k times. Maybe removing the packets as I iterate would make it not so painful, but so far I'm not sure this would be a good solution. 5) Split them by IP source/destination and then create a single file for each one, separating the flows by special characters. Still too many files (+50k). Right now I'm using Ruby to do it, which might've been a bad idea, I guess. Currently I've filtered the traces with tshark so that they only have relevant info, so I can't really make them any smaller. I thought about loading everything in memory as described in 1) using C#/Java/C++, but I was wondering if there wouldn't be a better approach here, especially since I might also run out of memory later on even with a more efficient language if I have to use larger traces. In summary, the problem I'm facing is that I either have too many files or that I run out of memory. I've also tried searching for some tool to filter the info, but I don't think there is one. The ones I've found only return some statistics and wouldn't scan for every flow as I need.

Read the article

Windows Azure Service Bus Splitter and Aggregator

- by Alan Smith

This article will cover basic implementations of the Splitter and Aggregator patterns using the Windows Azure Service Bus. The content will be included in the next release of the “Windows Azure Service Bus Developer Guide”, along with some other patterns I am working on. I’ve taken the pattern descriptions from the book “Enterprise Integration Patterns” by Gregor Hohpe. I bought a copy of the book in 2004, and recently dusted it off when I started to look at implementing the patterns on the Windows Azure Service Bus. Gregor has also presented an session in 2011 “Enterprise Integration Patterns: Past, Present and Future” which is well worth a look. I’ll be covering more patterns in the coming weeks, I’m currently working on Wire-Tap and Scatter-Gather. There will no doubt be a section on implementing these patterns in my “SOA, Connectivity and Integration using the Windows Azure Service Bus” course. There are a number of scenarios where a message needs to be divided into a number of sub messages, and also where a number of sub messages need to be combined to form one message. The splitter and aggregator patterns provide a definition of how this can be achieved. This section will focus on the implementation of basic splitter and aggregator patens using the Windows Azure Service Bus direct programming model. In BizTalk Server receive pipelines are typically used to implement the splitter patterns, with sequential convoy orchestrations often used to aggregate messages. In the current release of the Service Bus, there is no functionality in the direct programming model that implements these patterns, so it is up to the developer to implement them in the applications that send and receive messages. Splitter A message splitter takes a message and spits the message into a number of sub messages. As there are different scenarios for how a message can be split into sub messages, message splitters are implemented using different algorithms. The Enterprise Integration Patterns book describes the splatter pattern as follows: How can we process a message if it contains multiple elements, each of which may have to be processed in a different way? Use a Splitter to break out the composite message into a series of individual messages, each containing data related to one item. The Enterprise Integration Patterns website provides a description of the Splitter pattern here. In some scenarios a batch message could be split into the sub messages that are contained in the batch. The splitting of a message could be based on the message type of sub-message, or the trading partner that the sub message is to be sent to. Aggregator An aggregator takes a stream or related messages and combines them together to form one message. The Enterprise Integration Patterns book describes the aggregator pattern as follows: How do we combine the results of individual, but related messages so that they can be processed as a whole? Use a stateful filter, an Aggregator, to collect and store individual messages until a complete set of related messages has been received. Then, the Aggregator publishes a single message distilled from the individual messages. The Enterprise Integration Patterns website provides a description of the Aggregator pattern here. A common example of the need for an aggregator is in scenarios where a stream of messages needs to be combined into a daily batch to be sent to a legacy line-of-business application. The BizTalk Server EDI functionality provides support for batching messages in this way using a sequential convoy orchestration. Scenario The scenario for this implementation of the splitter and aggregator patterns is the sending and receiving of large messages using a Service Bus queue. In the current release, the Windows Azure Service Bus currently supports a maximum message size of 256 KB, with a maximum header size of 64 KB. This leaves a safe maximum body size of 192 KB. The BrokeredMessage class will support messages larger than 256 KB; in fact the Size property is of type long, implying that very large messages may be supported at some point in the future. The 256 KB size restriction is set in the service bus components that are deployed in the Windows Azure data centers. One of the ways of working around this size restriction is to split large messages into a sequence of smaller sub messages in the sending application, send them via a queue, and then reassemble them in the receiving application. This scenario will be used to demonstrate the pattern implementations. Implementation The splitter and aggregator will be used to provide functionality to send and receive large messages over the Windows Azure Service Bus. In order to make the implementations generic and reusable they will be implemented as a class library. The splitter will be implemented in the LargeMessageSender class and the aggregator in the LargeMessageReceiver class. A class diagram showing the two classes is shown below. Implementing the Splitter The splitter will take a large brokered message, and split the messages into a sequence of smaller sub-messages that can be transmitted over the service bus messaging entities. The LargeMessageSender class provides a Send method that takes a large brokered message as a parameter. The implementation of the class is shown below; console output has been added to provide details of the splitting operation. public class LargeMessageSender { private static int SubMessageBodySize = 192 * 1024; private QueueClient m_QueueClient; public LargeMessageSender(QueueClient queueClient) { m_QueueClient = queueClient; } public void Send(BrokeredMessage message) { // Calculate the number of sub messages required. long messageBodySize = message.Size; int nrSubMessages = (int)(messageBodySize / SubMessageBodySize); if (messageBodySize % SubMessageBodySize != 0) { nrSubMessages++; } // Create a unique session Id. string sessionId = Guid.NewGuid().ToString(); Console.WriteLine("Message session Id: " + sessionId); Console.Write("Sending {0} sub-messages", nrSubMessages); Stream bodyStream = message.GetBody<Stream>(); for (int streamOffest = 0; streamOffest < messageBodySize; streamOffest += SubMessageBodySize) { // Get the stream chunk from the large message long arraySize = (messageBodySize - streamOffest) > SubMessageBodySize ? SubMessageBodySize : messageBodySize - streamOffest; byte[] subMessageBytes = new byte[arraySize]; int result = bodyStream.Read(subMessageBytes, 0, (int)arraySize); MemoryStream subMessageStream = new MemoryStream(subMessageBytes); // Create a new message BrokeredMessage subMessage = new BrokeredMessage(subMessageStream, true); subMessage.SessionId = sessionId; // Send the message m_QueueClient.Send(subMessage); Console.Write("."); } Console.WriteLine("Done!"); }} The LargeMessageSender class is initialized with a QueueClient that is created by the sending application. When the large message is sent, the number of sub messages is calculated based on the size of the body of the large message. A unique session Id is created to allow the sub messages to be sent as a message session, this session Id will be used for correlation in the aggregator. A for loop in then used to create the sequence of sub messages by creating chunks of data from the stream of the large message. The sub messages are then sent to the queue using the QueueClient. As sessions are used to correlate the messages, the queue used for message exchange must be created with the RequiresSession property set to true. Implementing the Aggregator The aggregator will receive the sub messages in the message session that was created by the splitter, and combine them to form a single, large message. The aggregator is implemented in the LargeMessageReceiver class, with a Receive method that returns a BrokeredMessage. The implementation of the class is shown below; console output has been added to provide details of the splitting operation. public class LargeMessageReceiver { private QueueClient m_QueueClient; public LargeMessageReceiver(QueueClient queueClient) { m_QueueClient = queueClient; } public BrokeredMessage Receive() { // Create a memory stream to store the large message body. MemoryStream largeMessageStream = new MemoryStream(); // Accept a message session from the queue. MessageSession session = m_QueueClient.AcceptMessageSession(); Console.WriteLine("Message session Id: " + session.SessionId); Console.Write("Receiving sub messages"); while (true) { // Receive a sub message BrokeredMessage subMessage = session.Receive(TimeSpan.FromSeconds(5)); if (subMessage != null) { // Copy the sub message body to the large message stream. Stream subMessageStream = subMessage.GetBody<Stream>(); subMessageStream.CopyTo(largeMessageStream); // Mark the message as complete. subMessage.Complete(); Console.Write("."); } else { // The last message in the sequence is our completeness criteria. Console.WriteLine("Done!"); break; } } // Create an aggregated message from the large message stream. BrokeredMessage largeMessage = new BrokeredMessage(largeMessageStream, true); return largeMessage; } } The LargeMessageReceiver initialized using a QueueClient that is created by the receiving application. The receive method creates a memory stream that will be used to aggregate the large message body. The AcceptMessageSession method on the QueueClient is then called, which will wait for the first message in a message session to become available on the queue. As the AcceptMessageSession can throw a timeout exception if no message is available on the queue after 60 seconds, a real-world implementation should handle this accordingly. Once the message session as accepted, the sub messages in the session are received, and their message body streams copied to the memory stream. Once all the messages have been received, the memory stream is used to create a large message, that is then returned to the receiving application. Testing the Implementation The splitter and aggregator are tested by creating a message sender and message receiver application. The payload for the large message will be one of the webcast video files from http://www.cloudcasts.net/, the file size is 9,697 KB, well over the 256 KB threshold imposed by the Service Bus. As the splitter and aggregator are implemented in a separate class library, the code used in the sender and receiver console is fairly basic. The implementation of the main method of the sending application is shown below. static void Main(string[] args) { // Create a token provider with the relevant credentials. TokenProvider credentials = TokenProvider.CreateSharedSecretTokenProvider (AccountDetails.Name, AccountDetails.Key); // Create a URI for the serivce bus. Uri serviceBusUri = ServiceBusEnvironment.CreateServiceUri ("sb", AccountDetails.Namespace, string.Empty); // Create the MessagingFactory MessagingFactory factory = MessagingFactory.Create(serviceBusUri, credentials); // Use the MessagingFactory to create a queue client QueueClient queueClient = factory.CreateQueueClient(AccountDetails.QueueName); // Open the input file. FileStream fileStream = new FileStream(AccountDetails.TestFile, FileMode.Open); // Create a BrokeredMessage for the file. BrokeredMessage largeMessage = new BrokeredMessage(fileStream, true); Console.WriteLine("Sending: " + AccountDetails.TestFile); Console.WriteLine("Message body size: " + largeMessage.Size); Console.WriteLine(); // Send the message with a LargeMessageSender LargeMessageSender sender = new LargeMessageSender(queueClient); sender.Send(largeMessage); // Close the messaging facory. factory.Close(); } The implementation of the main method of the receiving application is shown below. static void Main(string[] args) { // Create a token provider with the relevant credentials. TokenProvider credentials = TokenProvider.CreateSharedSecretTokenProvider (AccountDetails.Name, AccountDetails.Key); // Create a URI for the serivce bus. Uri serviceBusUri = ServiceBusEnvironment.CreateServiceUri ("sb", AccountDetails.Namespace, string.Empty); // Create the MessagingFactory MessagingFactory factory = MessagingFactory.Create(serviceBusUri, credentials); // Use the MessagingFactory to create a queue client QueueClient queueClient = factory.CreateQueueClient(AccountDetails.QueueName); // Create a LargeMessageReceiver and receive the message. LargeMessageReceiver receiver = new LargeMessageReceiver(queueClient); BrokeredMessage largeMessage = receiver.Receive(); Console.WriteLine("Received message"); Console.WriteLine("Message body size: " + largeMessage.Size); string testFile = AccountDetails.TestFile.Replace(@"\In\", @"\Out\"); Console.WriteLine("Saving file: " + testFile); // Save the message body as a file. Stream largeMessageStream = largeMessage.GetBody<Stream>(); largeMessageStream.Seek(0, SeekOrigin.Begin); FileStream fileOut = new FileStream(testFile, FileMode.Create); largeMessageStream.CopyTo(fileOut); fileOut.Close(); Console.WriteLine("Done!"); } In order to test the application, the sending application is executed, which will use the LargeMessageSender class to split the message and place it on the queue. The output of the sender console is shown below. The console shows that the body size of the large message was 9,929,365 bytes, and the message was sent as a sequence of 51 sub messages. When the receiving application is executed the results are shown below. The console application shows that the aggregator has received the 51 messages from the message sequence that was creating in the sending application. The messages have been aggregated to form a massage with a body of 9,929,365 bytes, which is the same as the original large message. The message body is then saved as a file. Improvements to the Implementation The splitter and aggregator patterns in this implementation were created in order to show the usage of the patterns in a demo, which they do quite well. When implementing these patterns in a real-world scenario there are a number of improvements that could be made to the design. Copying Message Header Properties When sending a large message using these classes, it would be great if the message header properties in the message that was received were copied from the message that was sent. The sending application may well add information to the message context that will be required in the receiving application. When the sub messages are created in the splitter, the header properties in the first message could be set to the values in the original large message. The aggregator could then used the values from this first sub message to set the properties in the message header of the large message during the aggregation process. Using Asynchronous Methods The current implementation uses the synchronous send and receive methods of the QueueClient class. It would be much more performant to use the asynchronous methods, however doing so may well affect the sequence in which the sub messages are enqueued, which would require the implementation of a resequencer in the aggregator to restore the correct message sequence. Handling Exceptions In order to keep the code readable no exception handling was added to the implementations. In a real-world scenario exceptions should be handled accordingly.

Read the article

How does I/O work for large graph databases?

- by tjb1982

I should preface this by saying that I'm mostly a front end web developer, trained as a musician, but over the past few years I've been getting more and more into computer science. So one idea I have as a fun toy project to learn about data structures and C programming was to design and implement my own very simple database that would manage an adjacency list of posts. I don't want SQL (maybe I'll do my own query language? I'm just having fun). It should support ACID. It should be capable of storing 1TB let's say. So with that, I was trying to think of how a database even stores data, without regard to data structures necessarily. I'm working on linux, and I've read that in that world "everything is a file," including hardware (like /dev/*), so I think that that obviously has to apply to a database, too, and it clearly does--whether it's MySQL or PostgreSQL or Neo4j, the database itself is a collection of files you can see in the filesystem. That said, there would come a point in scale where loading the entire database into primary memory just wouldn't work, so it doesn't make sense to design it with that mindset (I assume). However, reading from secondary memory would be much slower and regardless some portion of the database has to be in primary memory in order for you to be able to do anything with it. I read this post: Why use a database instead of just saving your data to disk? And I found it difficult to understand how other databases, like SQLite or Neo4j, read and write from secondary memory and are still very fast (faster, it would seem, than simply writing files to the filesystem as the above question suggests). It seems the key is indexing. But even indexes need to be stored in secondary memory. They are inherently smaller than the database itself, but indexes in a very large database might be prohibitively large, too. So my question is how is I/O generally done with large databases like the one I described above that would be at least 1TB storing a big adjacency list? If indexing is more or less the answer, how exactly does indexing work--what data structures should be involved?

Read the article

Why do large IT projects tend to fail or have big cost/schedule overruns?

- by Pratik

I always read about large scale transformation or integration project that are total or almost total disaster. Even if they somehow manage to succeed the cost and schedule blow out is enormous. What is the real reason behind large projects being more prone to failure. Can agile be used in these sort of projects or traditional approach is still the best. One example from Australia is the Queensland Payroll project where they changed test success criteria to deliver the project. See some more failed projects in this SO question Have you got any personal experience to share?

Read the article

How can I create multiple identical AWS EC2 server instances with large amounts of persistent data?

- by mojones

I have a CPU-intensive data-processing application that I want to run across many (~100,000) input files. The application needs a large (~20GB) data file in order to run. What I would like to do is create an EC2 machine image that has my application and associated data files installed boot up a large number (e.g. 100) of instances of this image split my input files up into 100 batches and send one batch to be processed on each instance I am having trouble figuring out the best way to ensure that each instance has access to the large data file. The data file is too big to fit on the root filesystem of an AMI. I could use Block Storage, but a given Block Storage volume can only be attached to a single instance, so I would need 100 clones. Is there some way to create a custom image that has more space on the root filsystem so that I can include my large data file? Or is there a better way to tackle this problem?

Read the article

Breaking up a large PHP object used to abstract the database. Best practices?

- by John Kershaw

Two years ago it was thought a single object with functions such as $database->get_user_from_id($ID) would be a good idea. The functions return objects (not arrays), and the front-end code never worries about the database. This was great, until we started growing the database. There's now 30+ tables, and around 150 functions in the database object. It's getting impractical and unmanageable and I'm going to be breaking it up. What is a good solution to this problem? The project is large, so there's a limit to the extent I can change things. My current plan is to extend the current object for each table, then have the database object contain these. So, the above example would turn into (assume "user" is a table) $database->user->get_user_from_id($ID). Instead of one large file, we would have a file for every table.

Read the article

Can I copy large files faster without using the file cache?

- by Veazer

After adding the preload package, my applications seem to speed up but if I copy a large file, the file cache grows by more than double the size of the file. By transferring a single 3-4 GB virtualbox image or video file to an external drive, this huge cache seems to remove all the preloaded applications from memory, leading to increased load times and general performance drops. Is there a way to copy large, multi-gigabyte files without caching them (i.e. bypassing the file cache)? Or a way to whitelist or blacklist specific folders from being cached?

Read the article

Why use C++ when C works for large projects equally well?

- by Karl

Before I start, please DO NOT make this into a C vs C++ flamewar. This question has nothing to do with which language is better or not. Period. I have read that C++ is said to be fit for large projects. After all, it makes managing code easier. OO and other features, for example the STL. But then why use C++ when C works equally well for large projects? Take the example of the Linux kernel. Or GNOME. Or even Windows I guess, it is written in C right? So why bother at all with the complexity of C++ (templates and all that), when C works well and this is not just a statement, but proper examples have been quoted. If it works for projects of magnitude of the kernel, why is C++ preferred or why is C not used for almost all projects?

Read the article

What techniques can I use to render very large numbers of objects more efficiently in OpenGL?

- by Luke

You can think of my application as drawing a very large ball-and-stick diagram (or graph). At times, this graph can get very large, where the number of elements even outnumbers the pixels on the screen. Currently I am simply passing all of my textures (as GL_POINTS) and lines to the graphics card using VBO's. When the number of elements outnumbers the number of pixels, is this the most efficient way to do this? Or should I do some calculations on the CPU side before handing everything over to the GPU? If it matters, I do use GL_DEPTH_TEST and GL_ALPHA_TEST. I do some alpha blending, but probably not enough to make a huge performance difference. My scene can be static at times, but the user has control over a typical arc-ball camera and can pan, rotate, or zoom. It is during these operations that performance degradation is noticeable.

Read the article

What are the best tools to help work with large ant files.

- by klfox

I just started working at a company that has a very large ant build file that imports lots of other large/small ant files. Needless to say it's giving me a headache trying to figure out what is going on. What are the best tools out there for: Getting some kind of concise answer on what is happening Visualizing the various targets Seeing performance on tasks Can be multiple tools. Any other tips/suggestions? I tagged this as java since I don't have the reputation to create an ant tag.

Read the article

Is CodeIgniter PHP Framework suitable for large ERP or Business Application?

- by adietan63

Is CodeIgniter is recommended for a large web based ERP or Business Application? I want to use CodeIgniter for my future Project and I'm so confused whether to use it or not. Im so worried about in the long term process or lifetime of the application that it may crashed or produce a bug or error. I also worried about the performance of the framework when the data becomes larger and containing millions of records. I searched on the internet the answer but there is no exactly answer that will satisfy me. I think this question is important for the programmers like me who wanted to use PHP Framework for their large business application. I need an advice from you guys in order to decide whether to use it or not. thank you very much!

Read the article

Python: How to read huge text file into memory

- by asmaier

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file $ ls -l links.csv; file links.csv; tail links.csv -rw-r--r-- 1 user user 469904280 30 Nov 22:42 links.csv links.csv: ASCII text, with CRLF line terminators 4757187,59883 4757187,99822 4757187,66546 4757187,638452 4757187,4627959 4757187,312826 4757187,6143 4757187,6141 4757187,3081726 4757187,58197 So each line in the file consists of a tuple of two comma separated integer values. I want to read in the whole file and sort it according to the second column. I know, that I could do the sorting without reading the whole file into memory. But I thought for a file of 500MB I should still be able to do it in memory since I have 1GB available. However when I try to read in the file, Python seems to allocate a lot more memory than is needed by the file on disk. So even with 1GB of RAM I'm not able to read in the 500MB file into memory. My Python code for reading the file and printing some information about the memory consumption is: #!/usr/bin/python # -*- coding: utf-8 -*- import sys infile=open("links.csv", "r") edges=[] count=0 #count the total number of lines in the file for line in infile: count=count+1 total=count print "Total number of lines: ",total infile.seek(0) count=0 for line in infile: edge=tuple(map(int,line.strip().split(","))) edges.append(edge) count=count+1 # for every million lines print memory consumption if count%1000000==0: print "Position: ", edge print "Read ",float(count)/float(total)*100,"%." mem=sys.getsizeof(edges) for edge in edges: mem=mem+sys.getsizeof(edge) for node in edge: mem=mem+sys.getsizeof(node) print "Memory (Bytes): ", mem The output I got was: Total number of lines: 30609720 Position: (9745, 2994) Read 3.26693612356 %. Memory (Bytes): 64348736 Position: (38857, 103574) Read 6.53387224712 %. Memory (Bytes): 128816320 Position: (83609, 63498) Read 9.80080837067 %. Memory (Bytes): 192553000 Position: (139692, 1078610) Read 13.0677444942 %. Memory (Bytes): 257873392 Position: (205067, 153705) Read 16.3346806178 %. Memory (Bytes): 320107588 Position: (283371, 253064) Read 19.6016167413 %. Memory (Bytes): 385448716 Position: (354601, 377328) Read 22.8685528649 %. Memory (Bytes): 448629828 Position: (441109, 3024112) Read 26.1354889885 %. Memory (Bytes): 512208580 Already after reading only 25% of the 500MB file, Python consumes 500MB. So it seem that storing the content of the file as a list of tuples of ints is not very memory efficient. Is there a better way to do it, so that I can read in my 500MB file into my 1GB of memory?

Read the article

Using Hibernate's ScrollableResults to slowly read 90 million records

- by at

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate: ScrollableResults results = session.createQuery("SELECT person FROM Person person") .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY); while (results.next()) storeInFile(results.get()[0]); The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(. So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not). I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row... UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes. UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form: select * from person where id > <offset> and <other_conditions> limit 1 Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..

Read the article

Storing varchar(max) & varbinary(max) together - Problem?

- by Tony Basallo

I have an app that will have entries of both varchar(max) and varbinary(max) data types. I was considering putting these both in a separate table, together, even if only one of the two will be used at any given time. The question is whether storing them together has any impact on performance. Considering that they are stored in the heap, I'm thinking that having them together will not be a problem. However, the varchar(max) column will be probably have the text in row table option set. I couldn't find any performance testing or profiling while "googling bing," probably too specific a question? The SQL Server 2008 table looks like this: Id ParentId Version VersionDate StringContent - varchar(max) BinaryContent - varbinary(max) The app will decide which of the two columns to select for when the data is queried. The string column will much used much more frequently than the binary column - will this have any impact on performance?

Read the article

Importing wikipedia database dumb - kills navicat - anyone got any ideas?

- by Ali

Ok guys I've downloaded the wikipedia xml dump and its a whopping 12 GB of data :\ for one table and I wanted to import it into mysql databse on my localhost - however its a humongous file 12GB and obviously navicats taking its sweet time in importing it or its more likely its hanged :(. Is there a way to include this dump or atleast partially at most you know bit by bit. Let me correct that its 21 GB of data - not that it helps :\ - does any one have any idea of importing humongous files like this into MySQL database.

Read the article

How would you handle making an array or list that would have more entries than the standard implemen

- by faceless1_14

I am trying to create an array or list that could handle in theory, given adequate hardware and such, as many as 100^100 BigInteger entries. The problem with using an array or standard list is that they can only hold Integer.MAX_VALUE number of entries. How would you work around this limitations? A whole new class/interface? A wrapper for list? another data type entirely?

Read the article

Binary search in a sorted (memory-mapped ?) file in Java

- by sds

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search (essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...) I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files? Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

Search Results

Search found 10417 results on 417 pages for 'large'.

Page 44/417 | < Previous Page | 40 41 42 43 44 45 46 47 48 49 50 51 | Next Page >

- by Pierre

- by krisvandenbergh

- by Max

- by Jonathan Sternberg

- by imanc

- by notapipe

- by david

- by Moss

- by zxcvbnm

- by Alan Smith

- by tjb1982

- by Pratik

- by mojones

- by John Kershaw

- by Veazer

- by Karl

- by Luke

- by klfox

- by adietan63

- by asmaier

- by at

- by Tony Basallo

- by Ali

- by faceless1_14

- by sds

< Previous Page | 40 41 42 43 44 45 46 47 48 49 50 51 | Next Page >