Search Results

Search found 11409 results on 457 pages for 'large teams'.

Page 47/457 | < Previous Page | 43 44 45 46 47 48 49 50 51 52 53 54 | Next Page >

How can I create multiple identical AWS EC2 server instances with large amounts of persistent data?

- by mojones

I have a CPU-intensive data-processing application that I want to run across many (~100,000) input files. The application needs a large (~20GB) data file in order to run. What I would like to do is create an EC2 machine image that has my application and associated data files installed boot up a large number (e.g. 100) of instances of this image split my input files up into 100 batches and send one batch to be processed on each instance I am having trouble figuring out the best way to ensure that each instance has access to the large data file. The data file is too big to fit on the root filesystem of an AMI. I could use Block Storage, but a given Block Storage volume can only be attached to a single instance, so I would need 100 clones. Is there some way to create a custom image that has more space on the root filsystem so that I can include my large data file? Or is there a better way to tackle this problem?

Read the article
Can I copy large files faster without using the file cache?

- by Veazer

After adding the preload package, my applications seem to speed up but if I copy a large file, the file cache grows by more than double the size of the file. By transferring a single 3-4 GB virtualbox image or video file to an external drive, this huge cache seems to remove all the preloaded applications from memory, leading to increased load times and general performance drops. Is there a way to copy large, multi-gigabyte files without caching them (i.e. bypassing the file cache)? Or a way to whitelist or blacklist specific folders from being cached?

Read the article
Why use C++ when C works for large projects equally well?

- by Karl

Before I start, please DO NOT make this into a C vs C++ flamewar. This question has nothing to do with which language is better or not. Period. I have read that C++ is said to be fit for large projects. After all, it makes managing code easier. OO and other features, for example the STL. But then why use C++ when C works equally well for large projects? Take the example of the Linux kernel. Or GNOME. Or even Windows I guess, it is written in C right? So why bother at all with the complexity of C++ (templates and all that), when C works well and this is not just a statement, but proper examples have been quoted. If it works for projects of magnitude of the kernel, why is C++ preferred or why is C not used for almost all projects?

Read the article
Business Has Killed IT With Overspecialization

<b>Enterprise Networking Planet:</b> "What happened to the old "sysadmin" of just a few years ago? We've split what used to be the sysadmin into application teams, server teams, storage teams, and network teams."

Read the article
What techniques can I use to render very large numbers of objects more efficiently in OpenGL?

- by Luke

You can think of my application as drawing a very large ball-and-stick diagram (or graph). At times, this graph can get very large, where the number of elements even outnumbers the pixels on the screen. Currently I am simply passing all of my textures (as GL_POINTS) and lines to the graphics card using VBO's. When the number of elements outnumbers the number of pixels, is this the most efficient way to do this? Or should I do some calculations on the CPU side before handing everything over to the GPU? If it matters, I do use GL_DEPTH_TEST and GL_ALPHA_TEST. I do some alpha blending, but probably not enough to make a huge performance difference. My scene can be static at times, but the user has control over a typical arc-ball camera and can pan, rotate, or zoom. It is during these operations that performance degradation is noticeable.

Read the article
What are the best tools to help work with large ant files.

- by klfox

I just started working at a company that has a very large ant build file that imports lots of other large/small ant files. Needless to say it's giving me a headache trying to figure out what is going on. What are the best tools out there for: Getting some kind of concise answer on what is happening Visualizing the various targets Seeing performance on tasks Can be multiple tools. Any other tips/suggestions? I tagged this as java since I don't have the reputation to create an ant tag.

Read the article
Is CodeIgniter PHP Framework suitable for large ERP or Business Application?

- by adietan63

Is CodeIgniter is recommended for a large web based ERP or Business Application? I want to use CodeIgniter for my future Project and I'm so confused whether to use it or not. Im so worried about in the long term process or lifetime of the application that it may crashed or produce a bug or error. I also worried about the performance of the framework when the data becomes larger and containing millions of records. I searched on the internet the answer but there is no exactly answer that will satisfy me. I think this question is important for the programmers like me who wanted to use PHP Framework for their large business application. I need an advice from you guys in order to decide whether to use it or not. thank you very much!

Read the article
Python: How to read huge text file into memory

- by asmaier

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file $ ls -l links.csv; file links.csv; tail links.csv -rw-r--r-- 1 user user 469904280 30 Nov 22:42 links.csv links.csv: ASCII text, with CRLF line terminators 4757187,59883 4757187,99822 4757187,66546 4757187,638452 4757187,4627959 4757187,312826 4757187,6143 4757187,6141 4757187,3081726 4757187,58197 So each line in the file consists of a tuple of two comma separated integer values. I want to read in the whole file and sort it according to the second column. I know, that I could do the sorting without reading the whole file into memory. But I thought for a file of 500MB I should still be able to do it in memory since I have 1GB available. However when I try to read in the file, Python seems to allocate a lot more memory than is needed by the file on disk. So even with 1GB of RAM I'm not able to read in the 500MB file into memory. My Python code for reading the file and printing some information about the memory consumption is: #!/usr/bin/python # -*- coding: utf-8 -*- import sys infile=open("links.csv", "r") edges=[] count=0 #count the total number of lines in the file for line in infile: count=count+1 total=count print "Total number of lines: ",total infile.seek(0) count=0 for line in infile: edge=tuple(map(int,line.strip().split(","))) edges.append(edge) count=count+1 # for every million lines print memory consumption if count%1000000==0: print "Position: ", edge print "Read ",float(count)/float(total)*100,"%." mem=sys.getsizeof(edges) for edge in edges: mem=mem+sys.getsizeof(edge) for node in edge: mem=mem+sys.getsizeof(node) print "Memory (Bytes): ", mem The output I got was: Total number of lines: 30609720 Position: (9745, 2994) Read 3.26693612356 %. Memory (Bytes): 64348736 Position: (38857, 103574) Read 6.53387224712 %. Memory (Bytes): 128816320 Position: (83609, 63498) Read 9.80080837067 %. Memory (Bytes): 192553000 Position: (139692, 1078610) Read 13.0677444942 %. Memory (Bytes): 257873392 Position: (205067, 153705) Read 16.3346806178 %. Memory (Bytes): 320107588 Position: (283371, 253064) Read 19.6016167413 %. Memory (Bytes): 385448716 Position: (354601, 377328) Read 22.8685528649 %. Memory (Bytes): 448629828 Position: (441109, 3024112) Read 26.1354889885 %. Memory (Bytes): 512208580 Already after reading only 25% of the 500MB file, Python consumes 500MB. So it seem that storing the content of the file as a list of tuples of ints is not very memory efficient. Is there a better way to do it, so that I can read in my 500MB file into my 1GB of memory?

Read the article
Using Hibernate's ScrollableResults to slowly read 90 million records

- by at

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate: ScrollableResults results = session.createQuery("SELECT person FROM Person person") .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY); while (results.next()) storeInFile(results.get()[0]); The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(. So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not). I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row... UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes. UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form: select * from person where id > <offset> and <other_conditions> limit 1 Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..

Read the article
Storing varchar(max) & varbinary(max) together - Problem?

- by Tony Basallo

I have an app that will have entries of both varchar(max) and varbinary(max) data types. I was considering putting these both in a separate table, together, even if only one of the two will be used at any given time. The question is whether storing them together has any impact on performance. Considering that they are stored in the heap, I'm thinking that having them together will not be a problem. However, the varchar(max) column will be probably have the text in row table option set. I couldn't find any performance testing or profiling while "googling bing," probably too specific a question? The SQL Server 2008 table looks like this: Id ParentId Version VersionDate StringContent - varchar(max) BinaryContent - varbinary(max) The app will decide which of the two columns to select for when the data is queried. The string column will much used much more frequently than the binary column - will this have any impact on performance?

Read the article
Importing wikipedia database dumb - kills navicat - anyone got any ideas?

- by Ali

Ok guys I've downloaded the wikipedia xml dump and its a whopping 12 GB of data :\ for one table and I wanted to import it into mysql databse on my localhost - however its a humongous file 12GB and obviously navicats taking its sweet time in importing it or its more likely its hanged :(. Is there a way to include this dump or atleast partially at most you know bit by bit. Let me correct that its 21 GB of data - not that it helps :\ - does any one have any idea of importing humongous files like this into MySQL database.

Read the article
How would you handle making an array or list that would have more entries than the standard implemen

- by faceless1_14

I am trying to create an array or list that could handle in theory, given adequate hardware and such, as many as 100^100 BigInteger entries. The problem with using an array or standard list is that they can only hold Integer.MAX_VALUE number of entries. How would you work around this limitations? A whole new class/interface? A wrapper for list? another data type entirely?

Read the article
Binary search in a sorted (memory-mapped ?) file in Java

- by sds

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search (essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...) I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files? Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

Read the article
expat parser: memory consumption

- by sameer karjatkar

Hi, I am using expat parser to parse an XML file of around 15 GB . The problem is it throws an "Out of Memory" error and the program aborts . I want to know has any body faced a similar issue with the expat parser or is it a known bug and has been rectified in later versions ?

Read the article
search & replace on 3000 row, 25 column spreadsheet

- by Deca

I'm attempting to clean up data in this (old) spreadsheet and need to remove things like single and double quotes, HTML tags and so on. Trouble is, it's a 3000 row file with 25 columns and every spreadsheet app I've tried (NeoOffice, MS Excel, Apple Numbers) chokes on it. Hard. Any ideas on how else I can clean this thing up for import to MySQL? Clearly I could go through each record manually, row by row, but would like to avoid that if at all possible. Likewise, I could write a PHP script to handle it on import, but don't want to put the server into a death spiral either.

Read the article
How do quickly search through a .csv file in Python

- by Baldur

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry. Are there any tricks to search the entire file? Should you read the whole thing into a dictionary or should you perform a search every time? I tried loading it into a dictionary but that took ages so I'm currently searching through the whole file every time which seems wasteful. Could I possibly utilize that the list is alphabetically ordered? (e.g. if the search word starts with "b" I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b") I'm using import csv. (a side question: it is possible to make csv go to a specific line in the file? I want to make the program start at a random line) Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?

Read the article
Performing Aggregate Functions on Multi-Million Row Tables

- by Daniel Short

I'm having some serious performance issues with a multi-million row table that I feel I should be able to get results from fairly quick. Here's a run down of what I have, how I'm querying it, and how long it's taking: I'm running SQL Server 2008 Standard, so Partitioning isn't currently an option I'm attempting to aggregate all views for all inventory for a specific account over the last 30 days. All views are stored in the following table: CREATE TABLE [dbo].[LogInvSearches_Daily]( [ID] [bigint] IDENTITY(1,1) NOT NULL, [Inv_ID] [int] NOT NULL, [Site_ID] [int] NOT NULL, [LogCount] [int] NOT NULL, [LogDay] [smalldatetime] NOT NULL, CONSTRAINT [PK_LogInvSearches_Daily] PRIMARY KEY CLUSTERED ( [ID] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY] ) ON [PRIMARY] This table has 132,000,000 records, and is over 4 gigs. A sample of 10 rows from the table: ID Inv_ID Site_ID LogCount LogDay -------------------- ----------- ----------- ----------- ----------------------- 1 486752 48 14 2009-07-21 00:00:00 2 119314 51 16 2009-07-21 00:00:00 3 313678 48 25 2009-07-21 00:00:00 4 298863 0 1 2009-07-21 00:00:00 5 119996 0 2 2009-07-21 00:00:00 6 463777 534 7 2009-07-21 00:00:00 7 339976 503 2 2009-07-21 00:00:00 8 333501 570 4 2009-07-21 00:00:00 9 453955 0 12 2009-07-21 00:00:00 10 443291 0 4 2009-07-21 00:00:00 (10 row(s) affected) I have the following index on LogInvSearches_Daily: /****** Object: Index [IX_LogInvSearches_Daily_LogDay] Script Date: 05/12/2010 11:08:22 ******/ CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_LogDay] ON [dbo].[LogInvSearches_Daily] ( [LogDay] ASC ) INCLUDE ( [Inv_ID], [LogCount]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] I need to pull inventory only from the Inventory for a specific account id. I have an index on the Inventory as well. I'm using the following query to aggregate the data and give me the top 5 records. This query is currently taking 24 seconds to return the 5 rows: StmtText ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SELECT TOP 5 Sum(LogCount) AS Views , DENSE_RANK() OVER(ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank , Inv_ID FROM LogInvSearches_Daily D (NOLOCK) WHERE LogDay DateAdd(d, -30, getdate()) AND EXISTS( SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID ) GROUP BY Inv_ID (1 row(s) affected) StmtText ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |--Top(TOP EXPRESSION:((5))) |--Sequence Project(DEFINE:([Expr1007]=dense_rank)) |--Segment |--Segment |--Sort(ORDER BY:([Expr1006] DESC, [D].[Inv_ID] DESC)) |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount]))) |--Sort(ORDER BY:([D].[Inv_ID] ASC)) |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID])) |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1011], [Expr1012], [Expr1010])) | |--Compute Scalar(DEFINE:(([Expr1011],[Expr1012],[Expr1010])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6)))) | | |--Constant Scan | |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1011] AND [D].[LogDay] < [Expr1012]) ORDERED FORWARD) |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOA (13 row(s) affected) I tried using a CTE to pick up the rows first and aggregate them, but that didn't run any faster, and gives me essentially the same execution plan. (1 row(s) affected) StmtText ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --SET SHOWPLAN_TEXT ON; WITH getSearches AS ( SELECT LogCount -- , DENSE_RANK() OVER(ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank , D.Inv_ID FROM LogInvSearches_Daily D (NOLOCK) INNER JOIN propertyControlCenter.dbo.Inventory I (NOLOCK) ON Acct_ID = 18731 AND I.Inv_ID = D.Inv_ID WHERE LogDay DateAdd(d, -30, getdate()) -- GROUP BY Inv_ID ) SELECT Sum(LogCount) AS Views, Inv_ID FROM getSearches GROUP BY Inv_ID (1 row(s) affected) StmtText ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1004]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount]))) |--Sort(ORDER BY:([D].[Inv_ID] ASC)) |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID])) |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1008], [Expr1009], [Expr1007])) | |--Compute Scalar(DEFINE:(([Expr1008],[Expr1009],[Expr1007])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6)))) | | |--Constant Scan | |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1008] AND [D].[LogDay] < [Expr1009]) ORDERED FORWARD) |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID] AS [I]), SEEK:([I].[Acct_ID]=(18731) AND [I].[Inv_ID]=[LOALogs].[dbo].[LogInvSearches_Daily].[Inv_ID] as [D].[Inv_ID]) ORDERED FORWARD) (8 row(s) affected) (1 row(s) affected) So given that I'm getting good Index Seeks in my execution plan, what can I do to get this running faster? Thanks, Dan

Read the article
What method should be used for searching this mysql dataset?

- by GeoffreyF67

I've got a mysql dataset that contains 86 million rows. I need to have a relatively fast search through this data. The data I'll be searching through is all strings. I also need to do partial matches. Now, if I have 'foobar' and search for '%oob%' I know it'll be really slow - it has to look at every row to see if there is a match. What methods can be used to speed queries like this up? G-Man

Read the article
Random access gzip stream

- by jkff

I'd like to be able to do random access into a gzipped file. I can afford to do some preprocessing on it (say, build some kind of index), provided that the result of the preprocessing is much smaller than the file itself. Any advice? My thoughts were: Hack on an existing gzip implementation and serialize its decompressor state every, say, 1 megabyte of compressed data. Then to do random access, deserialize the decompressor state and read from the megabyte boundary. This seems hard, especially since I'm working with Java and I couldn't find a pure-java gzip implementation :( Re-compress the file in chunks of 1Mb and do same as above. This has the disadvantage of doubling the required disk space. Write a simple parser of the gzip format that doesn't do any decompressing and only detects and indexes block boundaries (if there even are any blocks: I haven't yet read the gzip format description)

Read the article
How may I scroll with vim into a big file ?

- by Luc M

Hello, I have a big file with thousands of lines of thousands of characters. I move the cursor to 3000th character. If I use PageDown or <CTRL>-D, the file will scroll but the cursor will come back to the first no-space character. There's is an option to set to keep the cursor in the same column after a such scroll ? I have the behavior with gvim on Window, vim on OpenVMS and Cygwin. Regards

Read the article
Database over 2GB in MongoDB

- by configurator

We've got a file-based program we want to convert to use a document database, specifically MongoDB. Problem is, MongoDB is limited to 2GB on 32-bit machines (according to http://www.mongodb.org/display/DOCS/FAQ#FAQ-Whatarethe32bitlimitations%3F), and a lot of our users will have over 2GB of data. Is there a way to have MongoDB use more than one file somehow? I thought perhaps I could implement sharding on a single machine, meaning I'd run more than one mongod on the same machine and they'd somehow communicate. Could that work?

Read the article
Process xml-like log file queue

- by Zsolt Botykai

Hi all, first of all: I'm not a programmer, never was, although had learn a lot during my professional carreer as a support consultant. Now my task is to process - and create some statistics about a constantly written and rapidly growing XML like log file. It's not valid XML, because it does not have a proper <root> element, e.g. the log looks like this: <log itemdate="somedate"> <field id="0" /> ... </log> <log itemdate="somedate+1"> <field id="0" /> ... </log> <log itemdate="somedate+n"> <field id="0" /> ... </log> E.g. I have to count all the items with field id=0. But most of the solutions I had found (e.g. using XPath) reports an error about the garbage after the first closing </log>. Most probably I can use python (2.6, although I can compile 3.x as well), or some really old perl version (5.6.x), and recently compiled xmlstarlet which really looks promising - I was able to create the statistics for a certain period after copying the file, and pre- & appending the opening and closing root element. But this is a huge file and copying takes time as well. Isn't there a better solution? Thanks in advance!

Read the article
How to map a Entity Data Model conceptual model property to a storage model column using the "Serial

- by codekaizen

I have a conceptual model in EDM where one of the entities has a property which is essentially a big value object whose properties aren't really useful as columns in the datamodel. I'd like to apply the Serialized LOB pattern to it so that I can fit it into a 192 byte binary column. How do I map this in the EDM v4? Is it even possible at this time? Actually, is it possible in any ORM?

Read the article
Mysql: create index on 1.4 billion records

- by SiLent SoNG

I have a table with 1.4 billion records. The table structure is as follows: CREATE TABLE text_page ( text VARCHAR(255), page_id INT UNSIGNED ) ENGINE=MYISAM DEFAULT CHARSET=ascii The requirement is to create an index over the column text. The table size is about 34G. I have tried to create the index by the following statement: ALTER TABLE text_page ADD KEY ix_text (text) After 10 hours' waiting I finally give up this approach. Is there any workable solution on this problem? UPDATE: the table is unlikely to be updated or inserted or deleted. The reason why to create index on the column text is because this kind of sql query would be frequently executed: SELECT page_id FROM text_page WHERE text = ?

Read the article
Cloud HUGE data storage options?

- by ToughPal

Hi, Does anyone have a good suggestion on how to do video recording? We have a camera that can record and then stream live video to a server. So this means we can have 1000's of cameras sending data 24X7 for recording. We will store data for over 7 / 14 / 30 days depending on the package. Per day if a camera is sending data to the server then it will store 1.5GB. So that means there is a traffic of 1.5GB / day / camera Total monthly 45GB / month / camera (Data + bandwidth for one camera) Please let me know the most cost effective way to get this data stored? Thanks!

Read the article

< Previous Page | 43 44 45 46 47 48 49 50 51 52 53 54 | Next Page >