Search Results

Search found 1124 results on 45 pages for 'indexing'.

Page 41/45

  • Lucene wildcard queries

    - by Javi
    Hello, I have this question relating to Lucene. I have a form and I get a text from it, and I want to perform a full text search in several fields. Suppose I get from the input the text "textToLook". I have a Lucene Analyzer with several filters. One of them is lowerCaseFilter, so when I create the index, words will be lowercased. Imagine I want to search into two fields field1 and field2, so the Lucene query would be something like this (note that 'textToLook' now is 'texttolook'): field1:texttolook* field2:texttolook* In my class I have something like this to create the query. It works when there is no wildcard. String text = "textToLook"; String[] fields = {"field1", "field2"}; //analyser is the same as the one used for indexing Analyzer analyzer = fullTextEntityManager.getSearchFactory().getAnalyzer("customAnalyzer"); MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer); org.apache.lucene.search.Query queryTextoLibre = parser.parse(text); With this code the query would be: field1:texttolook field2:texttolook but if I set text to "textToLook*" I get field1:textToLook* field2:textToLook* which won't match correctly, as the indexed terms are in lowercase. I have read this on the Lucene website: "Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing." My problem cannot be solved by making the behaviour case-insensitive, because my analyzer has other filters which, for example, remove some suffixes of words. I think I can solve the problem by getting how the text would look after going through the filters of my analyzer; then I could add the "*" and build the query with MultiFieldQueryParser. So in this example I would take "textToLook", and after it has been passed through these filters I could get "texttolook". After this I could make "texttolook*". But is there any way to get the value of my text variable after going through all my analyzer's filters? How can I get all the filters of my analyzer? Is this possible? Thanks
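
    A rough sketch of the idea the question is driving at (assuming a Lucene 3.x-style TokenStream API; attribute class names differ between Lucene versions): run the raw input through the same analyzer used at index time, take the resulting token, and only then append the "*" before building the prefix/wildcard query.

        import java.io.StringReader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class AnalyzeText {
            // Runs a string through an Analyzer and returns the first token it produces,
            // so a trailing "*" can be appended afterwards for a prefix query.
            static String analyzeFirstToken(Analyzer analyzer, String field, String text) throws Exception {
                TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                String result = null;
                if (ts.incrementToken()) {
                    result = term.toString();   // e.g. "texttolook" for input "textToLook"
                }
                ts.end();
                ts.close();
                return result;
            }
        }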

    Read the article

  • Ideas Related to Subset Sum with 2,3 and more integers

    - by rolandbishop
    I've been struggling with this problem just like everyone else, and I'm quite sure there have been more than enough posts to explain this problem. However, in terms of understanding it fully, I wanted to share my thoughts and get more efficient solutions from all the great people in here related to the Subset Sum problem. I've searched it over the Internet and there are actually a lot of sources, but I'm really willing to re-implement an algorithm or find my own in order to understand it fully. The key thing I'm struggling with is efficiency, considering the set size will be large. (I do not have a limit, just conceptually large.) The two phases I'm trying to implement ideas on are finding two numbers that add up to a given integer T, finding three numbers, and eventually K numbers. Some ideas I've thought of: For the two-integer part I'm thinking of basically sorting the array, O(nlogn), and for each element in the array searching for its negative value (i.e. if the array element is 3, searching for -3). Maybe a hash table inclusion could be better, providing O(1) lookup of the element? For three or more integers I've found an amazing blog post: http://www.skorks.com/2011/02/algorithms-a-dropbox-challenge-and-dynamic-programming/. However, even the author himself states that it is not applicable for large numbers. So I was wondering, for 2, 3 and more integers, what ideas could be applied to the subset sum problem. I'm struggling with setting up a dynamic programming method that will be efficient for large inputs as well.
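
    For the pair case, the hash-table idea mentioned above looks roughly like the sketch below (not from the original post; here the target T is an explicit parameter rather than an implied sum of zero). For three numbers, one common extension is to fix one element and run the same pair check against T minus that element, giving O(n^2) overall.

        import java.util.HashSet;
        import java.util.Set;

        public class TwoSum {
            // Returns true if any two distinct elements of a sum to target.
            // One pass with a hash set: O(n) expected time, O(n) space.
            static boolean hasPairWithSum(int[] a, int target) {
                Set<Integer> seen = new HashSet<>();
                for (int x : a) {
                    if (seen.contains(target - x)) return true;
                    seen.add(x);
                }
                return false;
            }

            public static void main(String[] args) {
                int[] a = {8, -3, 5, 11, 2};
                System.out.println(hasPairWithSum(a, 7));   // true  (5 + 2)
                System.out.println(hasPairWithSum(a, 100)); // false
            }
        }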

    Read the article

  • Why am I getting a permission denied error on my public folder?

    - by Robin Fisher
    Hi all, This one has got me stumped. I'm deploying a Rails 3 app to Slicehost running Apache 2 and Passenger. My server is running Ruby 1.9.1 using RVM. I am receiving a permission denied error on the "public" folder in my app. My Virtual Host is set up as follows: <VirtualHost *:80> ServerName sharerplane.com ServerAlias www.sharerplane.com ServerAlias *.sharerplane.com DocumentRoot /home/robinjfisher/public_html/sharerplane.com/current/public/ <Directory "/home/robinjfisher/public_html/sharerplane.com/public/"> AllowOverride all Options -MultiViews Order allow,deny Allow from all </Directory> PassengerDefaultUser robinjfisher </VirtualHost> I've tried the following things: trailing slash on public; no trailing slash on public; PassengerUserSwitching on and off; PassengerDefaultUser set and not set; with and without the Directory block. The public folder is owned by robinjfisher:www-data and Passenger is running as robinjfisher, so I can't see why there are permission issues. Does anybody have any thoughts? Thanks Robin PS. I have disabled the site for the time being to avoid indexing, so what is there currently is not the site in question.

    Read the article

  • how to design a schema where the columns of a table are not fixed

    - by hIpPy
    I am trying to design a schema where the columns of a table are not fixed. Ex: I have an Employee table where the columns of the table are not fixed and vary (the attributes of Employee are not fixed and vary). The options I see are: (1) Nullable columns in the Employee table itself, i.e. no normalization. (2) Instead of adding nullable columns, separate those columns out into their own tables, ex: if Address is a column to be added, then create table Address[EmployeeId, AddressValue]. (3) Create tables ExtensionColumnName [EmployeeId, ColumnName] and ExtensionColumnValue [EmployeeId, ColumnValue]. ExtensionColumnName would have ColumnName as "Address" and ExtensionColumnValue would have ColumnValue as the address value. The tables would be: Employee table: EmployeeId, Name. ExtensionColumnName table: ColumnNameId, EmployeeId, ColumnName. ExtensionColumnValue table: EmployeeId, ColumnNameId, ColumnValue. There is a drawback in the first two ways, as the schema changes with every new attribute. Note that adding a new attribute is frequent. I am not sure if this is a good or bad design. If someone has had a similar decision to make, please give some insight on things like foreign keys / data integrity, indexing, performance, reporting etc.

    Read the article

  • Commercial web application--scalable database design

    - by Rob Campbell
    I'm designing a set of web apps to track scientific laboratory data. Each laboratory has several members, each of whom will access both their own data and that of their laboratory as a whole. Many typical queries will thus be expected to return records of multiple members (e.g. my mouse, joe's mouse and sally's mouse). I think I have the database fairly well normalized. I'm now wondering how to ensure that users can efficiently access both their own data and their lab's data set when it is mixed among (hopefully) a whole ton of records from other labs. What I've come up with so far is that most tables will end with two fields: user_id and labgroup_id. The WHERE clause of any SELECT statement will include the appropriate reference to one of the id fields ("...WHERE labgroup_id=n..." or "...WHERE user_id=n..."). My questions are: Is this an approach that will scale to 10^6 or more records? If so, what's the best way to use these fields in a query so that it most efficiently searches the relevant subset of the database? e.g. Should the first step in querying be to create a temporary table containing just the labgroup's data? Or will indexing using some combination of the id, user_id, and labgroup_id fields be sufficient at that scale? I thank any responders very much in advance.

    Read the article

  • Pairs from single list

    - by Apalala
    Often enough, I've found the need to process a list by pairs. I was wondering which would be the pythonic and efficient way to do it, and found this on Google: pairs = zip(t[::2], t[1::2]) I thought that was pythonic enough, but after a recent discussion involving idioms versus efficiency, I decided to do some tests: import time from itertools import islice, izip def pairs_1(t): return zip(t[::2], t[1::2]) def pairs_2(t): return izip(t[::2], t[1::2]) def pairs_3(t): return izip(islice(t,None,None,2), islice(t,1,None,2)) A = range(10000) B = xrange(len(A)) def pairs_4(t): # ignore value of t! t = B return izip(islice(t,None,None,2), islice(t,1,None,2)) for f in pairs_1, pairs_2, pairs_3, pairs_4: # time the pairing s = time.time() for i in range(1000): p = f(A) t1 = time.time() - s # time using the pairs s = time.time() for i in range(1000): p = f(A) for a, b in p: pass t2 = time.time() - s print t1, t2, t2-t1 These were the results on my computer: 1.48668909073 2.63187503815 1.14518594742 0.105381965637 1.35109519958 1.24571323395 0.00257992744446 1.46182489395 1.45924496651 0.00251388549805 1.70076990128 1.69825601578 If I'm interpreting them correctly, that should mean that the implementation of lists, list indexing, and list slicing in Python is very efficient. It's a result both comforting and unexpected. Is there another, "better" way of traversing a list in pairs? Note that if the list has an odd number of elements then the last one will not be in any of the pairs. Which would be the right way to ensure that all elements are included? I added these two suggestions from the answers to the tests: def pairwise(t): it = iter(t) return izip(it, it) def chunkwise(t, size=2): it = iter(t) return izip(*[it]*size) These are the results: 0.00159502029419 1.25745987892 1.25586485863 0.00222492218018 1.23795199394 1.23572707176 Results so far Most pythonic and very efficient: pairs = izip(t[::2], t[1::2]) Most efficient and very pythonic: pairs = izip(*[iter(t)]*2) It took me a moment to grok that the first answer uses two iterators while the second uses a single one. To deal with sequences with an odd number of elements, the suggestion has been to augment the original sequence adding one element (None) that gets paired with the previous last element, something that can be achieved with itertools.izip_longest().

    Read the article

  • Matlab - Trace contour line between two different points

    - by Graham
    Hi, I have a set of points represented as a 2-row by n-column matrix. These points make up a connected boundary or edge. I require a function that traces this contour from a start point P1 and stops at an end point P2. It also needs to be able to trace the contour in a clockwise or anti-clockwise direction. I was wondering if this can be achieved by using some of MATLAB's functions. I have tried to write my own function, but it was riddled with bugs. I have also tried using bwtraceboundary and indexing; however, this has problematic results, as the points within the matrix are not in the order that creates the contour. Thank you in advance for any help. Btw, I have included a link to a plot of the set of points. It is half the outline of a hand. The function would ideally trace the contour from either the red star or the green triangle to the other, returning the points in order of traversal.

    Read the article

  • How to retrieve view of MultiIndex DataFrame

    - by Henry S. Harrison
    This question was inspired by this question. I had the same problem, updating a MultiIndex DataFrame by selection. The drop_level=False solution in Pandas 0.13 will allow me to achieve the same result, but I am still wondering why I cannot get a view from the MultiIndex DataFrame. In other words, why does this not work?: >>> sat = d.xs('sat', level='day', copy=False) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2248, in xs raise ValueError('Cannot retrieve view (copy=False)') ValueError: Cannot retrieve view (copy=False) Of course it could be only because it is not implemented, but is there a reason? Is it somehow ambiguous or impossible to implement? Returning a view is more intuitive to me than returning a copy then later updating the original. I looked through the source and it seems this situation is checked explicitly to raise an error. Alternatively, is it possible to get the same sort of view from any of the other indexing methods? I've experimented but have not been successful. [edit] Some potential implementations are discussed here. I guess with the last question above I'm wondering what the current best solution is to index into arbitrary multiindex slices and cross-sections.

    Read the article

  • Java - Removing duplicates in an ArrayList

    - by Will
    I'm working on a program that uses an ArrayList to store Strings. The program prompts the user with a menu and allows the user to choose an operation to perform. Such operations are adding Strings to the list, printing the entries, etc. What I want to be able to do is create a method called removeDuplicates(). This method will search the ArrayList and remove any duplicated values. I want to leave one instance of the duplicated value(s) within the list. I also want this method to return the total number of duplicates removed. I've been trying to use nested loops to accomplish this, but I've been running into trouble because when entries get deleted, the indexing of the ArrayList gets altered and things don't work as they should. I know conceptually what I need to do, but I'm having trouble implementing this idea in code. Here is some pseudocode: start with the first entry; check each subsequent entry in the list and see if it matches the first entry; remove each subsequent entry in the list that matches the first entry; after all entries have been examined, move on to the second entry; check each entry in the list and see if it matches the second entry; remove each entry in the list that matches the second entry; repeat for each entry in the list. Here's the code I have so far: public int removeDuplicates() { int duplicates = 0; for ( int i = 0; i < strings.size(); i++ ) { for ( int j = 0; j < strings.size(); j++ ) { if ( i == j ) { // i & j refer to same entry so do nothing } else if ( strings.get( j ).equals( strings.get( i ) ) ) { strings.remove( j ); duplicates++; } } } return duplicates; }
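
    One way around the index-shifting problem (a sketch of a common approach, not the original poster's code) is to avoid index bookkeeping entirely: walk the list once with an Iterator and a Set of values already seen, removing an element whenever it has been seen before.

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.Iterator;
        import java.util.List;
        import java.util.Set;

        public class DedupExample {
            // Removes later duplicates, keeps the first occurrence of each value,
            // and returns how many entries were removed.
            static int removeDuplicates(List<String> strings) {
                Set<String> seen = new HashSet<>();
                int removed = 0;
                Iterator<String> it = strings.iterator();
                while (it.hasNext()) {
                    String s = it.next();
                    if (!seen.add(s)) {   // add() returns false if the value was already present
                        it.remove();      // safe removal while iterating
                        removed++;
                    }
                }
                return removed;
            }

            public static void main(String[] args) {
                List<String> strings = new ArrayList<>(Arrays.asList("a", "b", "a", "c", "b"));
                System.out.println(removeDuplicates(strings)); // 2
                System.out.println(strings);                   // [a, b, c]
            }
        }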

    Read the article

  • mysql: storing arbitrary data

    - by Hailwood
    Background: I was asking a question on stack overflow regarding creating tables on the fly where this conversation ensued: This smells like a terrible idea! In fact, it smells just like this one. What in the world do you want to use this for? – deceze @deceze: very true, However, How else would you store the contents of these CSV files. They must be stored in mysql for indexing. The only solid fact about them is that they all have a mobile column with a standard format. The CSV can have an arbitrary amount of columns with an arbitrary amount of rows. They can (with no exaggeration) range from a single row, 35 column csv to an 80k row single column CSV. I am open to other ideas. – Hailwood There are many solutions for this, from attribute-value schemas to JSON storage and NoSQL storage. Open a new question about it. Whatever you do though, don't dynamically create tables! – deceze Question: So my question is, What would you say is the best way to store this data? Are you in agreement with deceze about not creating dynamic tables?

    Read the article

  • Hibernate Search + Spring

    - by Zane
    I'm trying to integrate Hibernate Search with Spring, but I can't seem to index anything. I was able to get Hibernate Search to work without Spring, but I'm having a problem integrating it with Spring. Any help would be much appreciated. Below is my springmvc-servlet.xml: <bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager"> <property name="entityManagerFactory" ref="entityManagerFactory" /> </bean> <bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalEntityManagerFactoryBean"> <property name="persistenceUnitName" value="enewsclipsPersistenceUnit" /> </bean> And here is my DAO class: @Repository public class SearchDaoImpl implements SearchDao { JpaTemplate jpaTemplate; @Autowired public SearchDaoImpl(EntityManagerFactory entityManagerFactory) { this.jpaTemplate = new JpaTemplate(entityManagerFactory); } @SuppressWarnings("unchecked") public void updateSearchIndex() { /* Implement the callback method */ jpaTemplate.execute(new JpaCallback() { public Object doInJpa(EntityManager em) throws PersistenceException { List<Article> articles = em.createQuery("select a from Article a").getResultList(); FullTextEntityManager ftEm = Search.getFullTextEntityManager(em); ftEm.getTransaction().begin(); for(Article article : articles) { System.out.println("Indexing Item " + article.getTitle()); ftEm.index(article); } ftEm.getTransaction().commit(); return null; } }); } } I think that it may have to do with the transactions but I'm not exactly sure. If you could just point me in the right direction, that would be helpful too! Thank you.
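
    One common culprit in this kind of setup is transaction handling: calling getTransaction() on an EntityManager that Spring is meant to manage can leave Hibernate Search's queued index work unflushed. Purely as an assumption (not a confirmed diagnosis), a sketch of the Spring-managed-transaction variant might look like this, assuming <tx:annotation-driven/> is enabled and the Article entity and SearchDao interface are as in the question:

        import javax.persistence.EntityManager;
        import javax.persistence.PersistenceContext;

        import org.hibernate.search.jpa.FullTextEntityManager;
        import org.hibernate.search.jpa.Search;
        import org.springframework.stereotype.Repository;
        import org.springframework.transaction.annotation.Transactional;

        @Repository
        public class SearchDaoImpl implements SearchDao {

            @PersistenceContext
            private EntityManager em;

            // Spring's JpaTransactionManager owns the transaction; Hibernate Search
            // flushes the queued index work when that transaction commits.
            @Transactional
            public void updateSearchIndex() {
                FullTextEntityManager ftEm = Search.getFullTextEntityManager(em);
                for (Object article : em.createQuery("select a from Article a").getResultList()) {
                    ftEm.index(article);
                }
            }
        }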

    Read the article

  • Statistical analysis on large data set to be published on the web

    - by dassouki
    I have a non-computer-related data logger that collects data from the field. This data is stored as text files, and I manually lump the files together and organize them. The current format is one CSV file per year per logger. Each file is around 4,000,000 lines x 7 loggers x 5 years = a lot of data. Some of the data is organized as bins (item_type, item_class, item_dimension_class), and other data is more unique, such as item_weight, item_color, date_collected, and so on... Currently, I do statistical analysis on the data using a python/numpy/matplotlib program I wrote. It works fine, but the problem is, I'm the only one who can use it, since it and the data live on my computer. I'd like to publish the data on the web using a Postgres db; however, I need to find or implement a statistical tool that'll take a large Postgres table and return statistical results within an adequate time frame. I'm not familiar with Python for the web; however, I'm proficient with PHP on the web side, and Python on the offline side. Users should be allowed to create their own histograms and data analyses. For example, one user could search for all items that are blue and shipped between week x and week y, while another user could sort the weight distribution of all items by hour across the whole year. I was thinking of creating and indexing my own statistical tools, or automating the process somehow to emulate most queries. This seems inefficient. I'm looking forward to hearing your ideas. Thanks

    Read the article

  • Lucene.Net: How can I add a date filter to my search results?

    - by rockinthesixstring
    I've got my searcher working really well, however it does tend to return results that are obsolete. My site is much like NerdDinner whereby events in the past become irrelevant. I'm currently indexing like this Public Function AddIndex(ByVal searchableEvent As [Event]) As Boolean Implements ILuceneService.AddIndex Dim writer As New IndexWriter(luceneDirectory, New StandardAnalyzer(), False) Dim doc As Document = New Document doc.Add(New Field("id", searchableEvent.ID, Field.Store.YES, Field.Index.UN_TOKENIZED)) doc.Add(New Field("fullText", FullTextBuilder(searchableEvent), Field.Store.YES, Field.Index.TOKENIZED)) doc.Add(New Field("user", If(searchableEvent.User.UserName = Nothing, "User" & searchableEvent.User.ID, searchableEvent.User.UserName), Field.Store.YES, Field.Index.TOKENIZED)) doc.Add(New Field("title", searchableEvent.Title, Field.Store.YES, Field.Index.TOKENIZED)) doc.Add(New Field("location", searchableEvent.Location.Name, Field.Store.YES, Field.Index.TOKENIZED)) doc.Add(New Field("date", searchableEvent.EventDate, Field.Store.YES, Field.Index.UN_TOKENIZED)) writer.AddDocument(doc) writer.Optimize() writer.Close() Return True End Function Notice how I have a "date" index that stores the event date. My search then looks like this ''# code omitted Dim reader As IndexReader = IndexReader.Open(luceneDirectory) Dim searcher As IndexSearcher = New IndexSearcher(reader) Dim parser As QueryParser = New QueryParser("fullText", New StandardAnalyzer()) Dim query As Query = parser.Parse(q.ToLower) ''# We're using 10,000 as the maximum number of results to return ''# because I have a feeling that we'll never reach that full amount ''# anyways. And if we do, who in their right mind is going to page ''# through all of the results? Dim topDocs As TopDocs = searcher.Search(query, Nothing, 10000) Dim doc As Document = Nothing ''# loop through the topDocs and grab the appropriate 10 results based ''# on the submitted page number While i <= last AndAlso i < topDocs.totalHits doc = searcher.Doc(topDocs.scoreDocs(i).doc) IDList.Add(doc.[Get]("id")) i += 1 End While ''# code omitted I did try the following, but it was to no avail (threw a NullReferenceException). While i <= last AndAlso i < topDocs.totalHits If Date.Parse(doc.[Get]("date")) >= Date.Today Then doc = searcher.Doc(topDocs.scoreDocs(i).doc) IDList.Add(doc.[Get]("id")) i += 1 End If End While I also found the following documentation, but I can't make heads or tails of it http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/search/DateFilter.html
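
    Rather than post-filtering the hits (note also that the failed loop above reads doc before assigning it, which explains the NullReferenceException), one option is to make the date part of the search itself. A hedged sketch in Java Lucene terms (Lucene.Net mirrors this API closely; the class names assume a 2.9/3.x-era version): store the date as a sortable string via DateTools at index time, then apply a range filter at query time.

        import java.util.Date;

        import org.apache.lucene.document.DateTools;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermRangeFilter;
        import org.apache.lucene.search.TopDocs;

        public class UpcomingEvents {
            // At index time, store the event date as a sortable, untokenized string, e.g.
            //   DateTools.dateToString(eventDate, DateTools.Resolution.DAY)
            // Then restrict searches to today or later instead of filtering afterwards.
            static TopDocs searchUpcoming(IndexSearcher searcher, Query query, int maxHits) throws Exception {
                String today = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
                // Lower bound = today, open-ended upper bound.
                TermRangeFilter notInPast = new TermRangeFilter("date", today, null, true, false);
                return searcher.search(query, notInPast, maxHits);
            }
        }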

    Read the article

  • Using Lucene to index private data, should I have a separate index for each user or a single index

    - by Nathan Bayles
    I am developing an Azure based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content such as Word documents, etc. would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that if anyone is kind enough to answer, they will have a better idea of what I am trying to do.) My questions revolve around performance and security. Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient? Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index. Any insight would be greatly appreciated. Regards, Nate
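
    For the single-index variant, the usual pattern is to constrain every ad-hoc query with a required owner term. A sketch against the pre-4.x BooleanQuery API (newer Lucene versions use BooleanQuery.Builder instead, and the "ownerId" field name here is just an assumption):

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanClause;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermQuery;

        public class OwnerScopedQuery {
            // Wraps the user's ad-hoc query so it can only match that user's documents.
            // Assumes each document was indexed with an untokenized "ownerId" field.
            static Query scopeToOwner(Query userQuery, String ownerId) {
                BooleanQuery scoped = new BooleanQuery();
                scoped.add(userQuery, BooleanClause.Occur.MUST);
                scoped.add(new TermQuery(new Term("ownerId", ownerId)), BooleanClause.Occur.MUST);
                return scoped;
            }
        }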

    Read the article

  • Is there a recommended way to communicate scientific/engineering programming to C developers?

    - by ggkmath
    Hi, I have a lot of MATLAB code that needs to get ported to C (execution speed is critical for this work) as part of a back-end process for a web application. When I attempt to outsource this code to a C developer, I assume (correct me if I'm wrong) few C developers also understand MATLAB code (things like indexing and memory management are different, etc.). I wonder if there are any C developers out there that can recommend a procedure for me to follow to best communicate what the code does? For example, should I provide the MATLAB code and explain what it's doing line by line? Or, should I just provide the math/algorithm, explain it in plain English, and let the C developer implement it with this understanding in his/her own way (e.g. can I assume the developer understands how to work with complex math (i.e. imaginary numbers), how to generate histograms, perform an FFT, etc.)? Or, is there a better method? I expect I'm not the first to need to do this, so I wonder if any C developers out there ran into this situation and can share any conventional wisdom how they'd like this task to be transferred? Thanks in advance for any comments.

    Read the article

  • Out-of-memory algorithms for addressing large arrays

    - by reve_etrange
    I am trying to deal with a very large dataset. I have k = ~4200 matrices (varying sizes) which must be compared combinatorially, skipping non-unique and self comparisons. Each of k(k-1)/2 comparisons produces a matrix, which must be indexed against its parents (i.e. can find out where it came from). The convenient way to do this is to (triangularly) fill a k-by-k cell array with the result of each comparison. These are ~100 X ~100 matrices, on average. Using single precision floats, it works out to 400 GB overall. I need to 1) generate the cell array or pieces of it without trying to place the whole thing in memory and 2) access its elements (and their elements) in like fashion. My attempts have been inefficient due to reliance on MATLAB's eval() as well as save and clear occurring in loops. for i=1:k [~,m] = size(data{i}); cur_var = ['H' int2str(i)]; %# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it. eval([cur_var ' = cell(1,k-i);']); for j=i+1:k [~,n] = size(data{j}); eval([cur_var '{i,j} = zeros(m,n,''single'');']); eval([cur_var '{i,j} = compare(data{i},data{j});']); end save(cur_var,cur_var); %# Add '-append' when using a single MAT file. clear(cur_var); end The other thing I have done is to perform the split when mod((i+j-1)/2,max(factor(k(k-1)/2))) == 0. This divides the result into the largest number of same-size pieces, which seems logical. The indexing is a little more complicated, but not too bad because a linear index could be used. Does anyone know/see a better way?

    Read the article

  • Extended slice that goes to beginning of sequence with negative stride

    - by recursive
    Bear with me while I explain my question. Skip down to the bold heading if you already understand extended slice list indexing. In python, you can index lists using slice notation. Here's an example: >>> A = list(range(10)) >>> A[0:5] [0, 1, 2, 3, 4] You can also include a stride, which acts like a "step": >>> A[0:5:2] [0, 2, 4] The stride is also allowed to be negative, meaning the elements are retrieved in reverse order: >>> A[5:0:-1] [5, 4, 3, 2, 1] But wait! I wanted to see [4, 3, 2, 1, 0]. Oh, I see, I need to decrement the start and end indices: >>> A[4:-1:-1] [] What happened? It's interpreting -1 as being at the end of the array, not the beginning. I know you can achieve this as follows: >>> A[4::-1] [4, 3, 2, 1, 0] But you can't use this in all cases. For example, in a method that's been passed indices. My question is: Is there any good pythonic way of using extended slices with negative strides and explicit start and end indices that include the first element of a sequence? This is what I've come up with so far, but it seems unsatisfying. >>> A[0:5][::-1] [4, 3, 2, 1, 0]

    Read the article

  • Grails - ElasticSearch - QueryParsingException[[index] No query registered for [query]]; with elasticSearchHelper; JSON via curl works fine though

    - by v1p
    I have been working on a Grails project, clubbed with ElasticSearch ( v 20.6 ), with a custom build of elasticsearch-grails-plugin(to support geo_point indexing : v.20.6) have been trying to do a filtered Search, while using script_fields (to calculate distance). Following is Closure & the generated JSON from the GXContentBuilder : Closure records = Domain.search(searchType:'dfs_query_and_fetch'){ query { filtered = { query = { if(queryTxt){ query_string(query: queryTxt) }else{ match_all {} } } filter = { geo_distance = { distance = "${userDistance}km" "location"{ lat = latlon[0]?:0.00 lon = latlon[1]?:0.00 } } } } } script_fields = { distance = { script = "doc['location'].arcDistanceInKm($latlon)" } } fields = ["_source"] } GXContentBuilder generated query JSON : { "query": { "filtered": { "query": { "match_all": {} }, "filter": { "geo_distance": { "distance": "5km", "location": { "lat": "37.752258", "lon": "-121.949886" } } } } }, "script_fields": { "distance": { "script": "doc['location'].arcDistanceInKm(37.752258, -121.949886)" } }, "fields": ["_source"] } The JSON query, using curl-way, works perfectly fine. But when I try to execute it from Groovy Code, I mean with this : elasticSearchHelper.withElasticSearch { Client client -> def response = client.search(request).actionGet() } It throws following error : Failed to execute phase [dfs], total failure; shardFailures {[1][index][3]: SearchParseException[[index][3]: from[0],size[60]: Parse Failure [Failed to parse source [{"from":0,"size":60,"query_binary":"eyJxdWVyeSI6eyJmaWx0ZXJlZCI6eyJxdWVyeSI6eyJtYXRjaF9hbGwiOnt9fSwiZmlsdGVyIjp7Imdlb19kaXN0YW5jZSI6eyJkaXN0YW5jZSI6IjVrbSIsImNvbXBhbnkuYWRkcmVzcy5sb2NhdGlvbiI6eyJsYXQiOiIzNy43NTIyNTgiLCJsb24iOiItMTIxLjk0OTg4NiJ9fX19fSwic2NyaXB0X2ZpZWxkcyI6eyJkaXN0YW5jZSI6eyJzY3JpcHQiOiJkb2NbJ2NvbXBhbnkuYWRkcmVzcy5sb2NhdGlvbiddLmFyY0Rpc3RhbmNlSW5LbSgzNy43NTIyNTgsIC0xMjEuOTQ5ODg2KSJ9fSwiZmllbGRzIjpbIl9zb3VyY2UiXX0=","explain":true}]]]; nested: QueryParsingException[[index] No query registered for [query]]; } The above Closure works if I only use filtered = { ... } script_fields = { ... } but it doesn't return the calculated distance. Anyone had any similar problem ? Thanks in advance :) It's possible that I might have been dim to point out the obvious here :P

    Read the article

  • finding long repeated substrings in a massive string

    - by Will
    I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for. I have a really, really long string (hundreds of megabytes). I have about 1 GB of RAM. This is why building a suffix trie with counting data is too inefficient space-wise to work for me. To quote Wikipedia's Suffix tree: "storing a string's suffix tree typically requires significantly more space than storing the string itself. The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures." And those are Wikipedia's comments on the suffix tree, not the trie. How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)? (Some Wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )
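
    For inputs that do fit comfortably in memory, the suffix-array route the Wikipedia quote alludes to can be sketched as below (a naive O(n^2 log n) construction for illustration only; the hundreds-of-megabytes case in the question needs a proper linear-time or external-memory construction):

        import java.util.Arrays;

        public class LongestRepeat {
            // Longest repeated substring via a naive suffix array: sort all suffixes,
            // then the answer is the longest common prefix of some adjacent pair.
            // Fine for strings up to a few MB; not suitable for hundreds of MB.
            static String longestRepeated(String s) {
                int n = s.length();
                Integer[] sa = new Integer[n];
                for (int i = 0; i < n; i++) sa[i] = i;
                Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));
                String best = "";
                for (int i = 1; i < n; i++) {
                    int len = commonPrefix(s, sa[i - 1], sa[i]);
                    if (len > best.length()) best = s.substring(sa[i], sa[i] + len);
                }
                return best;
            }

            static int commonPrefix(String s, int a, int b) {
                int len = 0, n = s.length();
                while (a + len < n && b + len < n && s.charAt(a + len) == s.charAt(b + len)) len++;
                return len;
            }

            public static void main(String[] args) {
                System.out.println(longestRepeated("banana")); // prints "ana"
            }
        }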

    Read the article

  • Seeking good practice advice: multisite in Drupal

    - by deanloh
    I'm using multisite to host my client sites. During the development stage, I use a subdomain to host the staging site, e.g. client1.mydomain.com. And here's how it looks under the SITES folder: /sites/client1.mydomain.com When the site is completed and ready to go live, I create another folder for the actual domain, e.g. client1.com. Hence: /sites/client1.com Next, I create symlinks under client1.com for FILES and SETTINGS.PHP that point to the subdomain, i.e.: /sites/client1.com/settings.php --> /sites/client1.mydomain.com/settings.php /sites/client1.com/files --> /sites/client1.mydomain.com/files Finally, to prevent Google from indexing both the subdomain and the actual domain, I created a rule in .htaccess to rewrite client1.mydomain.com to client1.com; therefore, should anyone try to access the subdomain, he will be redirected to the actual domain. The above arrangement works perfectly fine, but I somehow feel there is a better way to achieve it in a simpler manner. Please feel free to share your views; all advice is much appreciated.

    Read the article

  • Dealing with large number of text strings

    - by Fadrian
    My project, when it is running, will collect a large number of string text blocks (about 20K of them, and the largest I have seen is about 200K of them) in a short span of time and store them in a relational database. Each of the string texts is relatively small and the average would be about 15 short lines (about 300 characters). The current implementation is in C# (VS2008), .NET 3.5, and the backend DBMS is MS SQL Server 2005. Performance and storage are both important concerns of the project, but the priority will be performance first, then storage. I am looking for answers to these: Should I compress the text before storing it in the DB, or let SQL Server worry about compacting the storage? Do you know what would be the best compression algorithm/library to use for this context that gives me the best performance? Currently I just use the standard GZip in the .NET framework. Do you know any best practices to deal with this? I welcome outside-the-box suggestions as long as they are implementable in the .NET framework (it is a big project and this requirement is only a small part of it). EDITED: I will keep adding to this to clarify points raised. I don't need text indexing or searching on these texts. I just need to be able to retrieve them at a later stage for display as a text block using the primary key. I have a working solution implemented as above and SQL Server has no issue at all handling it. This program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can do will help.

    Read the article

  • Is there a Java data structure that is effectively an ArrayList with double indices and built-in interpolation?

    - by Bob Cross
    I am looking for a pre-built Java data structure with the following characteristics: It should look something like an ArrayList but should allow indexing via double-precision values rather than integers. Note that this means that it's likely that you'll see indices that don't line up with the original data points (i.e., asking for the value that corresponds to key "1.5"). As a consequence, the value returned will likely be interpolated. For example, if the key is 1.5, the value returned could be the average of the value at key 1.0 and the value at key 2.0. The keys will be sorted but the values are not ensured to be monotonically increasing. In fact, there's no assurance that the first derivative of the values will be continuous (making it a poor fit for certain types of splines). Freely available code only, please. For clarity, I know how to write such a thing. In fact, we already have an implementation of this and some related data structures in legacy code that I want to replace due to some performance and coding issues. What I'm trying to avoid is spending a lot of time rolling my own solution when there might already be such a thing in the JDK, Apache Commons or another standard library. Frankly, that's exactly the approach that got this legacy code into the situation that it's in right now.... Is there such a thing out there in a freely available library?
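
    There isn't such a structure in the JDK itself as far as I know, but a thin wrapper over TreeMap gets close; a minimal sketch with linear interpolation between neighbouring keys (class and method names here are illustrative, not from any library):

        import java.util.Map;
        import java.util.TreeMap;

        public class InterpolatingSeries {
            private final TreeMap<Double, Double> points = new TreeMap<>();

            public void put(double key, double value) { points.put(key, value); }

            // Linear interpolation between the nearest keys on either side of the request.
            public double get(double key) {
                Map.Entry<Double, Double> lo = points.floorEntry(key);
                Map.Entry<Double, Double> hi = points.ceilingEntry(key);
                if (lo == null && hi == null) throw new IllegalStateException("no data points");
                if (lo == null) return hi.getValue();                      // below the first key
                if (hi == null) return lo.getValue();                      // above the last key
                if (lo.getKey().equals(hi.getKey())) return lo.getValue(); // exact hit
                double t = (key - lo.getKey()) / (hi.getKey() - lo.getKey());
                return lo.getValue() + t * (hi.getValue() - lo.getValue());
            }

            public static void main(String[] args) {
                InterpolatingSeries s = new InterpolatingSeries();
                s.put(1.0, 10.0);
                s.put(2.0, 20.0);
                System.out.println(s.get(1.5)); // 15.0
            }
        }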

    Read the article

  • SQLAuthority News – Speaking Sessions at TechEd India – 3 Sessions – 1 Panel Discussion

    - by pinaldave
    Microsoft Tech-Ed India 2010 is considered the major technology event of the year for IT professionals and developers. This event will feature a comprehensive forum in which to learn, connect, explore, and evolve the current technologies we have today. I would recommend this event to you since here you will learn about today’s cutting-edge trends, thereby enhancing your work profile and getting ahead of the rest. But the most important benefit of all might be the networking opportunity that you can attain by attending the forum. You can build personal connections with various Microsoft experts and peers that will last even far beyond this event! It also feels good to let you know that I will be speaking at this year’s event! So, here are the sessions that await you in this mega-forum. Session 1: True Lies of SQL Server – SQL Myth Buster. Date: April 12, 2010. Time: 11:15pm – 11:45pm. In this 30-minute demo session, I am going to briefly demonstrate a few SQL Server myths and their resolutions, backed up with demos. This demo session is a must-attend for all developers and administrators who come to the event. This is going to be a very quick yet fun session. Session 2: Master Data Services in Microsoft SQL Server 2008 R2. Date: April 12, 2010. Time: 2:30pm – 3:30pm. SQL Server Master Data Services will ship with SQL Server 2008 R2 and will improve Microsoft’s platform appeal. This session provides an in-depth demonstration of MDS features and highlights important usage scenarios. Master Data Services enables consistent decision making by allowing you to create, manage and propagate changes from a single master view of your business entities. Also, the MDS master data hub, which is the vital component, helps ensure reporting consistency across systems and delivers faster, more accurate results across the enterprise. We will talk about establishing the basis for a centralized approach to defining, deploying, and managing master data in the enterprise. Session 3: Developing with SQL Server Spatial and Deep Dive into Spatial Indexing. Date: April 14, 2010. Time: 5:00pm – 6:00pm. Microsoft SQL Server 2008 delivers new spatial data types that enable you to consume, use, and extend location-based data through spatial-enabled applications. Attend this session to learn how to use the spatial functionality in the next version of SQL Server to build and optimize spatial queries. This session outlines how to use the new geography data type to store geodetic spatial data and perform operations on it, use the new geometry data type to store planar spatial data and perform operations on it, take advantage of new spatial indexes for high-performance queries, use the new spatial results tab to quickly and easily view spatial query results directly from within Management Studio, extend spatial data capabilities by building or integrating location-enabled applications through support for spatial standards and specifications, and much more. Panel Discussion: Harness the Power of the Web – SEO and Technical Blogging. Date: April 12, 2010. Time: 5:00pm – 6:00pm. Here you will learn lots of tricks and tips about SEO and technical blogging from various industry technical blogging experts. This event will surely be one of the most important tech conventions of 2010. TechEd is going to be a very busy time for tech developers and enthusiasts, since every evening there will be a fun session to attend.
If you are interested in any of the above topics, I suggest that you attend those sessions, as you will learn a great deal about each topic discussed. Reference: Pinal Dave (http://blog.SQLAuthority.com)

    Read the article

  • SQL SERVER – Video – Beginning Performance Tuning with SQL Server Execution Plan

    - by pinaldave
    Traveling can be the most interesting or the most exhausting experience. However, traveling is always the most enlightening experience one can have. When going on a long journey, one has to prepare a lot of things: pack the necessary travel gear, clothes and medicines. However, the most essential part of travel is the journey to the destination. There are many variations one may prefer, but the ultimate goal is to have a delightful experience during the journey. Here is a video which explains how to get started with SQL Server execution plans. Performance Tuning is a Journey. Performance tuning is just like a long journey. The goal of performance tuning is efficient query execution that consumes the least resources and returns accurate results. Just as maps are the most essential aspect of a journey, execution plans are essentially maps for SQL Server to reach the result set. The goal of the execution plan is to find the most efficient path, which translates to the least usage of resources (CPU, memory, IO etc). Execution Plans are like Maps. When online maps were invented (e.g. Bing, Google, MapQuest etc), initially it was not possible to customize them. They gave a single route to reach the destination. As time evolved, it is now possible to give various hints to the maps, for example ‘via public transport’, ‘walking’, ‘fastest route’, ‘shortest route’, ‘avoid highway’. There are places where we manually drag the route and make it appropriate to our needs. The same situation applies to SQL Server execution plans: if we want to tune queries, we need to understand execution plans and their internals. We need to understand the smallest details which relate to the execution plan when our destination is optimal queries. Understanding Execution Plans. The biggest challenge with maps is figuring out the optimal path. In the same way, the most common challenge with execution plans is where to start from and which precise route to take. Here is a quick list of the frequently asked questions related to execution plans: Should I read execution plans bottom up or top down? Are execution plans read left to right or right to left? What is the relationship between the actual execution plan and the estimated execution plan? When I mouse over an operator I see CPU and IO but not memory, why? Sometimes I run the query multiple times and I get different execution plans, why? How do I cache the query execution plan and data? I created an optimal index but the query is not using it. What should I change – query, index or provide hints? What are the tools available which help to quickly debug performance problems? Etc… Honestly the list is quite big and it is humanly impossible to write everything in words. SQL Server Performance: Introduction to Query Tuning. My friend Vinod Kumar and I have created a video learning course on beginning performance tuning. We have covered a plethora of subjects in the course. Here is a quick list of the topics: Execution Plan Basics, Essential Indexing Techniques, Query Design for Performance, Performance Tuning Tools, Tips and Tricks, Checklist: Performance Tuning. We believe we have covered a lot in this four-hour course, and we encourage you to go over the video course if you are interested in beginning SQL Server performance tuning and query tuning.
Reference: Pinal Dave (http://blog.SQLAuthority.com)

    Read the article

  • Collaborate10 &ndash; THEconference

    - by jean-pierre.dijcks
    After spending a few days in Mandalay Bay's THEHotel, I guess I now call everything THE... Seriously, they even tag their toilet paper with THEtp... I guess the brand builders in Vegas thought that once you are on to something you keep on doing it, and granted, it is a nice hotel with nice rooms. THEanalytics. Most of my collab10 experience was in a room called Reef C, where the BIWA bootcamp was held. Two solid days of BI, warehousing and analytics organized by the BIWA SIG at IOUG. I didn't get to see all sessions, but what struck me was the high interest in analytics. Marty Gubar's OLAP session was full and he did some very nice things with the OLAP option. The cool bit was that he actually gets all the advanced calculations in OLAP to show up in OBI EE without any effort. It was nice to see that the idea from OWB, where you generate an RPD, is now also in AWM. I think it makes life so much simpler to generate these RPDs from your data model. Even if the end RPD needs some tweaking, it is all a lot less effort to get something going. You can see this stuff for yourself in this demo (click here). OBI EE uses just SQL to get to the calculations, and so, if you prefer APEX, you can build your application there and get the same nice calculations in an APEX application. Marty also showed the Simba MDX driver used with Excel. I guess we should call that THEcoolone... and it is very slick and wonderfully useful for all of you who actually know Excel. The nice thing is that you leverage pure Excel for all operations (no plug-ins). That means no new tools to learn, no new controls, all just pure Excel. THEdatabasemachine. I got some very good questions in my "what makes Exadata fast" session and overall, the interest in Exadata is overwhelming. One of the things that I did try to do in my session is to get people to think in new patterns rather than in patterns based on Oracle 9i running on some random hardware configuration. We talked a little bit about the frequent over-indexing and how everyone has to unlearn all of that on Exadata. The main thing, however, is that everyone needs to get used to the sheer size of some of the components in a Database Machine V2. 5 TB of flash cache is a lot of very fast data storage, and half a TB of memory gets quite interesting as well. So what I did there was really focus on some of the content in these earlier posts on Upward ILM and In-Memory processing. In short, I do believe these newer media point to a trend. In-memory and other fast media will get cheaper and will see more use. Some of that we do automatically by adding new functionality, but in some cases I think the end user of the system needs to start thinking about how to leverage all this new hardware. I think most people got very excited about these new capabilities and opportunities. THEcoolkids. One of the cool things about the BIWA track was the hands-on track. Very cool to see big crowds for both the OLAP and OWB hands-on sessions. Also quite nice to see that the folks at RittmanMead spent so much time preparing for that session. While all of them put down cool stuff, none was more cool than seeing Data Mining on an Apple iPad... it all just looks great on an iPad! Very disappointing to see that Mark Rittman still wasn't showing OWB on his iPad ;-) THEend. All in all this was a great set of sessions in the BIWA track. Lots of value to our guests (we hope) and we hope they all come again next year!

    Read the article
