Search Results

Search found 883 results on 36 pages for 'subset'.

Page 9/36 | < Previous Page | 5 6 7 8 9 10 11 12 13 14 15 16 | Next Page >

SQL INSTR() using CSV. Need exact match rather than part

- by Alastair Pitts

This is a follow up issue relating to the answer for http://stackoverflow.com/questions/2445029/sql-placeholder-in-where-in-issue-inserted-strings-fail Quick background: We have a SQL query that uses a placeholder value to accept a string, which represents a unique tag/id. Usually, this is only a single tag, but we needed the ability to use a csv string for multiple tags, returning a combined result. In the answer we received from the vendor, they suggested the use of the INSTR function, ala: select * from pitotal where tag IN (SELECT tag from pipoint WHERE INSTR(?, tag) <> 0) and time between 'y' and 't' This works perfectly well 99% of the time, the issue is when the tag is also a subset of 2 parts of the CSV string. Eg the placeholder value is: 'northdom,southdom,eastdom,westdom' and possible tags include: north or northdom What happens, as north is a subset of northdom, is that the two tags are return instead of just northdom, which is actually what we want. I'm not strong on SQL so I couldn't work out how to set it as exact, or split the csv string, so help would be appreciated. Is there a way to split the csv string or make it look for an exact match?

Read the article
What is the correct way to implement a massive hierarchical, geographical search for news?

- by Philip Brocoum

The company I work for is in the business of sending press releases. We want to make it possible for interested parties to search for press releases based on a number of criteria, the most important being location. For example, someone might search for all news sent to New York City, Massachusetts, or ZIP code 89134, sent from a governmental institution, under the topic of "traffic". Or whatever. The problem is, we've sent, literally, hundreds of thousands of press releases. Searching is slow and complex. For example, a press release sent to Queens, NY should show up in the search I mentioned above even though it wasn't specifically sent to New York City, because Queens is a subset of New York City. We may also want to implement "and" and "or" and negation and text search to the query to create complex searches. These searches also have to be fast enough to function as dynamic RSS feeds. I really don't know anything about search theory, or how it's properly done. The way we are getting by right now is using a data mart to store the locations the releases were sent to in a single table. However, because of the subset thing mentioned above, the data mart is gigantic with millions of rows. And we haven't even implemented cities yet, and there are about 50,000 cities in the United States, which will exponentially increase the size of the data mart by so much I'm afraid it just won't work anymore. Anyway, I realize this is not a simple question and there won't be a "do this" answer. However, I'm hoping one of you can point me in the right direction where I can learn about how massive searches are done? Because I really know nothing about it. And such a search engine is turning out to be incredibly difficult to make. Thanks! I know there must be a way because if Google can search the entire internet we must be able to search our own database :-)

Read the article
I cannot grok MVC, what it is, and what it is not?

- by Hao

I cannot grok what MVC is, what mindset or programming model should I acquire so MVC stuff can instantly "lightbulb" on my head? If not instantly, what simple programs/projects should I try to do first so I can apply the neat things MVC brings to programming. OOP is intuitive and easier, object is all around us, and the benefits of code reuse using OOP-paradigm instantly click to anyone. You can probably talk to anybody about OOP in a few minutes and lecture some examples and they would get it. While OOP somehow raise the intuitiveness aspect of programming, MVC seems to do the opposite. I'm getting negative thoughts that some future employers(or even clients) would look down upon me for not using MVC technology. Though I probably get the skinnable aspect of MVC, but when I try to apply it to my own project, I don't know where to start. And also some programmers even have diverging views on how to accomplish MVC properly. Take this for instance from Jeff's post about MVC: The view is simply how you lay the data out, how it is displayed. If you want a subset of some data, for example, my opinion is that is a responsibility of the model. So maybe some programmers use MVC, but they somehow inadvertently use the View or the Controller to extract a subset of data. Why we can't have a definitive definition of what and how to accomplish MVC properly? And also, when I search for MVC .NET programs, most of it applies to web programs, not desktop apps, this intrigue me further. My guess is, this is most advantageous to web apps, there's not much problem about intermixed view(html) and controller(program code) in desktop apps.

Read the article
Is there a standard lexer/parser tool for Python?

- by Salim Fadhley

A volunteer job requires us to convert a large number of LaTeX documents into ePub format. It's a series of open-source fiction book which has so far only been produced only on paper via a print on demand service. We'd like to be able to offer the book to users of book-reader devices (such as Kindle) which require the ePub format for best results. Fortunately, ePub is a very simple format, however there's no trivial way for LaTeX to produce the XHTML outut required. We experimented with alternative LaTeX compilers (e.g. plastex) but in the end we figured that it would probably be a lot easier to simply write our own compiler which understands a tiny subset of the LaTeX language and compiles directly to XHTML / ePub. Previously I used a tool on Windows called GOLD. This allowed me to go directly from BNF grammars to a stub parser. It also alllowed me to implement the parser in any language I liked. (I'd choose Python). This product has to work on Linux, so I'm wondering if there's an equivalent toolchain that works as well under Ubutnu / Eclipse / Python. The idea is that we will take the grammar of TeX and just implement a teeny subset of that, but we do not want to spend a huge amount of time worrying about grammar and parsing. A parser generator would obviously save us a great deal of time. Sal UPDATE 1: Bonus marks for a solution with excellent documentation or tutorials.

Read the article
Execute code on assembly load

- by Dmitriy Matveev

I'm working on wrapper for some huge unmanaged library. Almost every of it's functions can call some error handler deeply inside. The default error handler writes error to console and calls abort() function. This behavior is undesirable for managed library, so I want to replace the default error handler with my own which will just throw some exception and let program continue normal execution after handling of this exception. The error handler must be changed before any of the wrapped functions will be called. The wrapper library is written in managed c++ with static linkage to wrapped library, so nothing like "a type with hundreds of dll imports" is present. I also can't find a single type which is used by everything inside wrapper library. So I can't solve that problem by defining static constructor in one single type which will execute code I need. I currently see two ways of solving that problem: Define some static method like Library.Initialize() which must be called one time by client before his code will use any part of the wrapper library. Find the most minimal subset of types which is used by every top-level function (I think the size of this subset will be something like 25-50 types) and add static constructors calling Library.Initialize (which will be internal in that scenario) to every of these types. I've read this and this questions, but they didn't helped me. Is there any proper ways of solving that problem? Maybe some nice hacks available?

Read the article
Fixing color in scatter plots in matplotlib

- by ajhall

Hi guys, I'm going to have to come back and add some examples if you need them, which you might. But, here's the skinny- I'm plotting scatter plots of lab data for my research. I need to be able to visually compare the scatter plots from one plot to the next, so I want to fix the color range on the scatter plots and add in a colorbar to each plot (which will be the same in each figure). Essentially, I'm fixing all aspects of the axes and colorspace etc. so that the plots are directly comparable by eye. For the life of me, I can't seem to get my scatter() command to properly set the color limits in the colorspace (default)... i.e., I figure out my total data's min and total data's max, then apply them to vmin, vmax, for the subset of data, and the color still does not come out properly in both plots. This must come up here and there, I can't be the only one that wants to compare various subsets of data amongst plots... so, how do you fix the colors so that each data keeps it's color between plots and doesn't get remapped to a different color due to the change in max/min of the subset -v- the whole set? I greatly appreciate all your thoughts!!! A mountain-dew and fiery-hot cheetos to all! -Allen

Read the article
Issue with XSLT Processing on PHP

- by monksy

I'm getting a few errors from XSLTProcessor: XSLTProcessor::transformToDoc() [<a href='function.XSLTProcessor-transformToDoc'>function.XSLTProcessor-transformToDoc</a>]: Invalid or inclomplete context XSLTProcessor::transformToDoc() [<a href='function.XSLTProcessor-transformToDoc'>function.XSLTProcessor-transformToDoc</a>]: XSLTProcessor::transformToDoc() [<a href='function.XSLTProcessor-transformToDoc'>function.XSLTProcessor-transformToDoc</a>]: xsltValueOf: text copy failed in Which is parsing this XSLT Line: <xsl:apply-templates select="page/sections/section" mode="subset"/> The section is: <xsl:template match="page/sections/section" mode="subset"> <a href="#{shorttitle}"> <xsl:value-of select="title"/> </a> <xsl:if test="position() != last()"> | </xsl:if> </xsl:template> The XML that the section is parsing is: <shorttitle>About</shorttitle> <title>#~ About</title> The PHP XSLT Code is: $xslt = new XSLTProcessor(); $XSL = new DOMDocument(); $XSL->load( $xsltFile, LIBXML_NOCDATA); $xslt->importStylesheet( $XSL ); print $xslt->transformToXML( $XML ); My suspicion about the the errors is due to content. I'm not getting these errors with Firefox's XSLT rendering, nor am I getting an invalid XML document on the backend.I'm not getting errors on the load, its just on the transformToXML function. Does anyone have a clue on how to solve this? This is with PHP5.

Read the article
Cannot use await in Portable Class Library for Win 8 and Win Phone 8

- by Harry Len

I'm attempting to create a Portable Class Library in Visual Studio 2012 to be used for a Windows 8 Store app and a Windows Phone 8 app. I'm getting the following error: 'await' requires that the type 'Windows.Foundation.IAsyncOperation' have a suitable GetAwaiter method. Are you missing a using directive for 'System'? At this line of code: StorageFolder guidesInstallFolder = await Package.Current.InstalledLocation.GetFolderAsync(guidesFolder); My Portable Class Library is targeted at .NET Framework 4.5, Windows Phone 8 and .NET for Windows Store apps. I don't get this error for this line of code in a pure Windows Phone 8 project, and I don't get it in a Windows Store app either so I don't understand why it won't work in my PCL. The GetAwaiter is an extension method in the class WindowsRuntimeSystemExtensions which is in System.Runtime.WindowsRuntime.dll. Using the Object Browser I can see this dll is available in the .NET for Windows Store apps component set and in the Windows Phone 8 component set but not in the .NET Portable Subset. I just don't understand why it wouldn't be in the Portable Subset if it's available in both my targeted platforms.

Read the article
Performing calculations by subsets of data in R

- by Vivi

I want to perform calculations for each company number in the column PERMNO of my data frame, the summary of which can be seen here: > summary(companydataRETS) PERMNO RET Min. :10000 Min. :-0.971698 1st Qu.:32716 1st Qu.:-0.011905 Median :61735 Median : 0.000000 Mean :56788 Mean : 0.000799 3rd Qu.:80280 3rd Qu.: 0.010989 Max. :93436 Max. :19.000000 My solution so far was to create a variable with all possible company numbers compns <- companydataRETS[!duplicated(companydataRETS[,"PERMNO"]),"PERMNO"] And then use a foreach loop using parallel computing which calls my function get.rho() which in turn perform the desired calculations rhos <- foreach (i=1:length(compns), .combine=rbind) %dopar% get.rho(subset(companydataRETS[,"RET"],companydataRETS$PERMNO == compns[i])) I tested it for a subset of my data and it all works. The problem is that I have 72 million observations, and even after leaving the computer working overnight, it still didn't finish. I am new in R, so I imagine my code structure can be improved upon and there is a better (quicker, less computationally intensive) way to perform this same task (perhaps using apply or with, both of which I don't understand). Any suggestions?

Read the article
Hierarchy of meaning

- by asldkncvas

I am looking for a method to build a hierarchy of words. Background: I am a "amateur" natural language processing enthusiast and right now one of the problems that I am interested in is determining the hierarchy of word semantics from a group of words. For example, if I have the set which contains a "super" representation of others, i.e. [cat, dog, monkey, animal, bird, ... ] I am interested to use any technique which would allow me to extract the word 'animal' which has the most meaningful and accurate representation of the other words inside this set. Note: they are NOT the same in meaning. cat != dog != monkey != animal BUT cat is a subset of animal and dog is a subset of animal. I know by now a lot of you will be telling me to use wordnet. Well, I will try to but I am actually interested in doing a very domain specific area which WordNet doesn't apply because: 1) Most words are not found in Wordnet 2) All the words are in another language; translation is possible but is to limited effect. another example would be: [ noise reduction, focal length, flash, functionality, .. ] so functionality includes everything in this set. I have also tried crawling wikipedia pages and applying some techniques on td-idf etc but wikipedia pages doesn't really do much either. Can someone possibly enlighten me as to what direction my research should go towards? (I could use anything)

Read the article
Tools to document/visualize call graph?

- by Dave Griffiths

Hi, having recently joined a project with a vast amount of code to get to grips with, I would like to start documenting and visualizing some of the flows through the call graph to give me a better understanding of how everything fits together. This is what I would like to see in my ideal tool: every node is a function/method nodes are connected if one function can call another optional square box in between detailing conditions under which call is made (or a label icon you can hover over like a tooltip) also icon on edge describing parameters hover over node and description is displayed optional icons for node to display pseudo code scenario/domain view - display subset of complete diagram for particular use-case slide view mode - for each frame, the currently executing function is highlighted plenty of options over what to display to reduce on-screen clutter the interactive use of such a tool is key, I'm not looking for a Graphviz type solution because there would be too much clutter. The ability to form a view of a subset of the entire graph would be very handy (maybe with the unimportant clutter greyed out). Don't need automatic generation from source code, happy to enter it manually. Almost like a mind-map. Does that make sense? If you are not aware of such a tool, do you also think it would be useful? (Just in case I decide to go and scratch that itch one day!)

Read the article
Altering an embedded truetype font so it will be useable by Windows GDI

- by Ritsaert Hornstra

I am trying to render PDF content to a GDI device context (a 24bit bitmap to be exact). Parsing the PDF stream into PDF objects and rendering the PDF commands from the content dictionary works well, inclduing font rendering. Embedded fonts are decompressed from their FontFile streams and "loaded" using AddFontMemResourceEx. Now some embedded fonts remove some TrueType tables that are needed by GDI, like the NAME table. Because of this, I tried to modify the font by parsing the TrueType subset font into it's tables and modify those tables that have data missing / missing tables are regenerated with as correct information as possible. I use the Microsoft Font Validator tool to see how "correct" the generated font is. I still get a few errors, like for the maxp table the max values are usually too large (it is a subset) or The xAvgCharWidth field does not equal the calculated value of the OS/2 table is not correct but this does not stop other embedded fonts to be useable.The fonts embedded using PDFCreator are the ones that are problematic. Question: - How can I determine what I need to change to the font file in order for GDI to be able to use it? - Are there any other font validation tools that might give me insight into what is still wrong with the fontfile? If needed: I can make an original fontfile and an altered fontfile available for download somewhere.

Read the article
how to remove subsets form given text file

- by user324887

i have a problem like this 10 20 30 40 70 20 30 70 30 40 10 20 29 70 80 90 20 30 40 40 45 65 10 20 80 45 65 20 I want to remove all subset transaction from this file. output file should be like follows 10 20 30 40 70 29 70 80 90 20 30 40 40 45 65 10 20 80 Where records like 20 30 70 30 40 10 20 45 65 20 are removed because of they are subset of other records. i AM using set for this but i am not able to create one set for one line can anybody know how to do this please help me here i am sending you my code include include include using namespace std; using namespace std; set s1; int main() { FILE fp = fopen ( "abc.txt", "r" ); if ( fp != NULL ) { char line [ 128 ]; / or other suitable maximum line size */ while ( fgets ( line, sizeof line, fp ) != NULL ) /* read a line */ { istringstream iss(line); do { string sub; iss >> sub; s1.insert(sub); } while (iss); for (set<string>::const_iterator p = s1.begin( );p != s1.end( ); ++p) cout << *p << endl; } } }

Read the article
Partitioning data set in r based on multiple classes of observations

- by Danny

I'm trying to partition a data set that I have in R, 2/3 for training and 1/3 for testing. I have one classification variable, and seven numerical variables. Each observation is classified as either A, B, C, or D. For simplicity's sake, let's say that the classification variable, cl, is A for the first 100 observations, B for observations 101 to 200, C till 300, and D till 400. I'm trying to get a partition that has 2/3 of the observations for each of A, B, C, and D (as opposed to simply getting 2/3 of the observations for the entire data set since it will likely not have equal amounts of each classification). When I try to sample from a subset of the data, such as sample(subset(data, cl=='A')), the columns are reordered instead of the rows. To summarize, my goal is to have 67 random observations from each of A, B, C, and D as my training data, and store the remaining 33 observations for each of A, B, C, and D as testing data. I have found a very similar question to mine, but it did not factor in multiple variables. I feel silly asking this question because it seems so simple, but I'm stumped. Also, this is my first question on this site, so I apologize in advance for any faux pas on my part.

Read the article
What's the most efficient way to load data from a file to a collection on-demand?

- by Dan

I'm working on a java project that will allows users to parse multiple files with potentially thousands of lines. The information parsed will be stored in different objects, which then will be added to a collection. Since the GUI won't require to load ALL these objects at once and keep them in memory, I'm looking for an efficient way to load/unload data from files, so that data is only loaded into the collection when a user requests it. I'm just evaluation options right now. I've also thought of the case where, after loading a subset of the data into the collection, and presenting it on the GUI, the best way to reload the previously observed data. Re-run the parser/Populate collection/Populate GUI? or probably find a way to keep the collection into memory, or serialize/deserialize the collection itself? I know that loading/unloading subsets of data can get tricky if some sort of data filtering is performed. Let's say that I filter on ID, so my new subset will contain data from two previous analyzed subsets. This would be no problem is I keep a master copy of the whole data in memory. I've read that google-collections are good and efficient when handling big amounts of data, and offer methods that simplify lots of things so this might offer an alternative to allow me to keep the collection in memory. This is just general talking. The question on what collection to use is a separate and complex thing. Do you know what's the general recommendation on this type of task? I'd like to hear what you've done with similar scenarios. I can provide more specifics if needed.

Read the article
Are Thread.stop and friends ever safe in Java?

- by Stephen C

The stop(), suspend(), and resume() in java.lang.Thread are deprecated because they are unsafe. The Sun recommended work around is to use Thread.interrupt(), but that approach doesn't work in all cases. For example, if you are call a library method that doesn't explicitly or implicitly check the interrupted flag, you have no choice but to wait for the call to finish. So, I'm wondering if it is possible to characterize situations where it is (provably) safe to call stop() on a Thread. For example, would it be safe to stop() a thread that did nothing but call find(...) or match(...) on a java.util.regex.Matcher? (If there are any Sun engineers reading this ... a definitive answer would be really appreciated.) EDIT: Answers that simply restate the mantra that you should not call stop() because it is deprecated, unsafe, whatever are missing the point of this question. I know that that it is genuinely unsafe in the majority of cases, and that if there is a viable alternative you should always use that instead. This question is about the subset cases where it is safe. Specifically, what is that subset?

Read the article
Point covering problem

- by Sean

I recently had this problem on a test: given a set of points m (all on the x-axis) and a set n of lines with endpoints [l, r] (again on the x-axis), find the minimum subset of n such that all points are covered by a line. Prove that your solution always finds the minimum subset. The algorithm I wrote for it was something to the effect of: (say lines are stored as arrays with the left endpoint in position 0 and the right in position 1) algorithm coverPoints(set[] m, set[][] n): chosenLines = [] while m is not empty: minX = min(m) bestLine = n[0] for i=1 to length of n: if n[i][0] <= m and n[i][1] > bestLine[1] then bestLine = n[i] add bestLine to chosenLines for i=0 to length of m: if m <= bestLine[1] then delete m[i] from m return chosenLines I'm just not sure if this always finds the minimum solution. It's a simple greedy algorithm so my gut tells me it won't, but one of my friends who is much better than me at this says that for this problem a greedy algorithm like this always finds the minimal solution. For proving mine always finds the minimal solution I did a very hand wavy proof by contradiction where I made an assumption that probably isn't true at all. I forget exactly what I did. If this isn't a minimal solution, is there a way to do it in less than something like O(n!) time? Thanks

Read the article
Using list() to extract a data.table inside of a function

- by Nathan VanHoudnos

I must admit that the data.table J syntax confuses me. I am attempting to use list() to extract a subset of a data.table as a data.table object as described in Section 1.4 of the data.table FAQ, but I can't get this behavior to work inside of a function. An example: require(data.table) ## Setup some test data set.seed(1) test.data <- data.table( X = rnorm(10), Y = rnorm(10), Z = rnorm(10) ) setkey(test.data, X) ## Notice that I can subset the data table easily with literal names test.data[, list(X,Y)] ## X Y ## 1: -0.8356286 -0.62124058 ## 2: -0.8204684 -0.04493361 ## 3: -0.6264538 1.51178117 ## 4: -0.3053884 0.59390132 ## 5: 0.1836433 0.38984324 ## 6: 0.3295078 1.12493092 ## 7: 0.4874291 -0.01619026 ## 8: 0.5757814 0.82122120 ## 9: 0.7383247 0.94383621 ## 10: 1.5952808 -2.21469989 I can even write a function that will return a column of the data.table as a vector when passed the name of a column as a character vector: get.a.vector <- function( my.dt, my.column ) { ## Step 1: Convert my.column to an expression column.exp <- parse(text=my.column) ## Step 2: Return the vector return( my.dt[, eval(column.exp)] ) } get.a.vector( test.data, 'X') ## [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884 0.1836433 0.3295078 ## [7] 0.4874291 0.5757814 0.7383247 1.5952808 But I cannot pull a similar trick for list(). The inline comments are the output from the interactive browser() session. get.a.dt <- function( my.dt, my.column ) { ## Step 1: Convert my.column to an expression column.exp <- parse(text=my.column) ## Step 2: Enter the browser to play around browser() ## Step 3: Verity that a literal X works: my.dt[, list(X)] ## << not shown >> ## Step 4: Attempt to evaluate the parsed experssion my.dt[, list( eval(column.exp)] ## Error in `rownames<-`(`*tmp*`, value = paste(format(rn, right = TRUE), (from data.table.example.R@1032mCJ#7) : ## length of 'dimnames' [1] not equal to array extent return( my.dt[, list(eval(column.exp))] ) } get.a.dt( test.data, "X" ) What am I missing? Update: Due to some confusion as to why I would want to do this I wanted to clarify. My use case is when I need to access a data.table column when when I generate the name. Something like this: set.seed(2) test.data[, X.1 := rnorm(10)] which.column <- 'X' new.column <- paste(which.column, '.1', sep="") get.a.dt( test.data, new.column ) Hopefully that helps.

Read the article
How do I do a table join on two fields in my second table?

- by Cannonade

I have two tables: Messages - Amongst other things, has a to_id and a from_id field. People - Has a corresponding person_id I am trying to figure out how to do the following in a single linq query: Give me all messages that have been sent to and from person x (idself). I had a couple of cracks at this. Not quite right MsgPeople = (from p in db.people join m in db.messages on p.person_id equals m.from_id where (m.from_id == idself || m.to_id == idself) orderby p.name descending select p).Distinct(); This almost works, except I think it misses one case: "people who have never received a message, just sent one to me" How this works in my head So what I really need is something like: join m in db.messages on (p.people_id equals m.from_id or p.people_id equals m.to_id) Gets me a subset of the people I am after It seems you can't do that. I have tried a few other options, like doing two joins: MsgPeople = (from p in db.people join m in AllMessages on p.person_id equals m.from_id join m2 in AllMessages on p.person_id equals m2.to_id where (m2.from_id == idself || m.to_id == idself) orderby p.name descending select p).Distinct(); but this gives me a subset of the results I need, I guess something to do with the order the joins are resolved. My understanding of LINQ (and perhaps even database theory) is embarrassingly superficial and I look forward to having some light shed on my problem.

Read the article
C#/.NET Little Wonders: The Useful But Overlooked Sets

- by James Michael Hare

Once again we consider some of the lesser known classes and keywords of C#. Today we will be looking at two set implementations in the System.Collections.Generic namespace: HashSet<T> and SortedSet<T>. Even though most people think of sets as mathematical constructs, they are actually very useful classes that can be used to help make your application more performant if used appropriately. A Background From Math In mathematical terms, a set is an unordered collection of unique items. In other words, the set {2,3,5} is identical to the set {3,5,2}. In addition, the set {2, 2, 4, 1} would be invalid because it would have a duplicate item (2). In addition, you can perform set arithmetic on sets such as: Intersections: The intersection of two sets is the collection of elements common to both. Example: The intersection of {1,2,5} and {2,4,9} is the set {2}. Unions: The union of two sets is the collection of unique items present in either or both set. Example: The union of {1,2,5} and {2,4,9} is {1,2,4,5,9}. Differences: The difference of two sets is the removal of all items from the first set that are common between the sets. Example: The difference of {1,2,5} and {2,4,9} is {1,5}. Supersets: One set is a superset of a second set if it contains all elements that are in the second set. Example: The set {1,2,5} is a superset of {1,5}. Subsets: One set is a subset of a second set if all the elements of that set are contained in the first set. Example: The set {1,5} is a subset of {1,2,5}. If We’re Not Doing Math, Why Do We Care? Now, you may be thinking: why bother with the set classes in C# if you have no need for mathematical set manipulation? The answer is simple: they are extremely efficient ways to determine ownership in a collection. For example, let’s say you are designing an order system that tracks the price of a particular equity, and once it reaches a certain point will trigger an order. Now, since there’s tens of thousands of equities on the markets, you don’t want to track market data for every ticker as that would be a waste of time and processing power for symbols you don’t have orders for. Thus, we just want to subscribe to the stock symbol for an equity order only if it is a symbol we are not already subscribed to. Every time a new order comes in, we will check the list of subscriptions to see if the new order’s stock symbol is in that list. If it is, great, we already have that market data feed! If not, then and only then should we subscribe to the feed for that symbol. So far so good, we have a collection of symbols and we want to see if a symbol is present in that collection and if not, add it. This really is the essence of set processing, but for the sake of comparison, let’s say you do a list instead: 1: // class that handles are order processing service 2: public sealed class OrderProcessor 3: { 4: // contains list of all symbols we are currently subscribed to 5: private readonly List<string> _subscriptions = new List<string>(); 6: 7: ... 8: } Now whenever you are adding a new order, it would look something like: 1: public PlaceOrderResponse PlaceOrder(Order newOrder) 2: { 3: // do some validation, of course... 4: 5: // check to see if already subscribed, if not add a subscription 6: if (!_subscriptions.Contains(newOrder.Symbol)) 7: { 8: // add the symbol to the list 9: _subscriptions.Add(newOrder.Symbol); 10: 11: // do whatever magic is needed to start a subscription for the symbol 12: } 13: 14: // place the order logic! 15: } What’s wrong with this? In short: performance! Finding an item inside a List<T> is a linear - O(n) – operation, which is not a very performant way to find if an item exists in a collection. (I used to teach algorithms and data structures in my spare time at a local university, and when you began talking about big-O notation you could immediately begin to see eyes glossing over as if it was pure, useless theory that would not apply in the real world, but I did and still do believe it is something worth understanding well to make the best choices in computer science). Let’s think about this: a linear operation means that as the number of items increases, the time that it takes to perform the operation tends to increase in a linear fashion. Put crudely, this means if you double the collection size, you might expect the operation to take something like the order of twice as long. Linear operations tend to be bad for performance because they mean that to perform some operation on a collection, you must potentially “visit” every item in the collection. Consider finding an item in a List<T>: if you want to see if the list has an item, you must potentially check every item in the list before you find it or determine it’s not found. Now, we could of course sort our list and then perform a binary search on it, but sorting is typically a linear-logarithmic complexity – O(n * log n) - and could involve temporary storage. So performing a sort after each add would probably add more time. As an alternative, we could use a SortedList<TKey, TValue> which sorts the list on every Add(), but this has a similar level of complexity to move the items and also requires a key and value, and in our case the key is the value. This is why sets tend to be the best choice for this type of processing: they don’t rely on separate keys and values for ordering – so they save space – and they typically don’t care about ordering – so they tend to be extremely performant. The .NET BCL (Base Class Library) has had the HashSet<T> since .NET 3.5, but at that time it did not implement the ISet<T> interface. As of .NET 4.0, HashSet<T> implements ISet<T> and a new set, the SortedSet<T> was added that gives you a set with ordering. HashSet<T> – For Unordered Storage of Sets When used right, HashSet<T> is a beautiful collection, you can think of it as a simplified Dictionary<T,T>. That is, a Dictionary where the TKey and TValue refer to the same object. This is really an oversimplification, but logically it makes sense. I’ve actually seen people code a Dictionary<T,T> where they store the same thing in the key and the value, and that’s just inefficient because of the extra storage to hold both the key and the value. As it’s name implies, the HashSet<T> uses a hashing algorithm to find the items in the set, which means it does take up some additional space, but it has lightning fast lookups! Compare the times below between HashSet<T> and List<T>: Operation HashSet<T> List<T> Add() O(1) O(1) at end O(n) in middle Remove() O(1) O(n) Contains() O(1) O(n) Now, these times are amortized and represent the typical case. In the very worst case, the operations could be linear if they involve a resizing of the collection – but this is true for both the List and HashSet so that’s a less of an issue when comparing the two. The key thing to note is that in the general case, HashSet is constant time for adds, removes, and contains! This means that no matter how large the collection is, it takes roughly the exact same amount of time to find an item or determine if it’s not in the collection. Compare this to the List where almost any add or remove must rearrange potentially all the elements! And to find an item in the list (if unsorted) you must search every item in the List. So as you can see, if you want to create an unordered collection and have very fast lookup and manipulation, the HashSet is a great collection. And since HashSet<T> implements ICollection<T> and IEnumerable<T>, it supports nearly all the same basic operations as the List<T> and can use the System.Linq extension methods as well. All we have to do to switch from a List<T> to a HashSet<T> is change our declaration. Since List and HashSet support many of the same members, chances are we won’t need to change much else. 1: public sealed class OrderProcessor 2: { 3: private readonly HashSet<string> _subscriptions = new HashSet<string>(); 4: 5: // ... 6: 7: public PlaceOrderResponse PlaceOrder(Order newOrder) 8: { 9: // do some validation, of course... 10: 11: // check to see if already subscribed, if not add a subscription 12: if (!_subscriptions.Contains(newOrder.Symbol)) 13: { 14: // add the symbol to the list 15: _subscriptions.Add(newOrder.Symbol); 16: 17: // do whatever magic is needed to start a subscription for the symbol 18: } 19: 20: // place the order logic! 21: } 22: 23: // ... 24: } 25: Notice, we didn’t change any code other than the declaration for _subscriptions to be a HashSet<T>. Thus, we can pick up the performance improvements in this case with minimal code changes. SortedSet<T> – Ordered Storage of Sets Just like HashSet<T> is logically similar to Dictionary<T,T>, the SortedSet<T> is logically similar to the SortedDictionary<T,T>. The SortedSet can be used when you want to do set operations on a collection, but you want to maintain that collection in sorted order. Now, this is not necessarily mathematically relevant, but if your collection needs do include order, this is the set to use. So the SortedSet seems to be implemented as a binary tree (possibly a red-black tree) internally. Since binary trees are dynamic structures and non-contiguous (unlike List and SortedList) this means that inserts and deletes do not involve rearranging elements, or changing the linking of the nodes. There is some overhead in keeping the nodes in order, but it is much smaller than a contiguous storage collection like a List<T>. Let’s compare the three: Operation HashSet<T> SortedSet<T> List<T> Add() O(1) O(log n) O(1) at end O(n) in middle Remove() O(1) O(log n) O(n) Contains() O(1) O(log n) O(n) The MSDN documentation seems to indicate that operations on SortedSet are O(1), but this seems to be inconsistent with its implementation and seems to be a documentation error. There’s actually a separate MSDN document (here) on SortedSet that indicates that it is, in fact, logarithmic in complexity. Let’s put it in layman’s terms: logarithmic means you can double the collection size and typically you only add a single extra “visit” to an item in the collection. Take that in contrast to List<T>’s linear operation where if you double the size of the collection you double the “visits” to items in the collection. This is very good performance! It’s still not as performant as HashSet<T> where it always just visits one item (amortized), but for the addition of sorting this is a good thing. Consider the following table, now this is just illustrative data of the relative complexities, but it’s enough to get the point: Collection Size O(1) Visits O(log n) Visits O(n) Visits 1 1 1 1 10 1 4 10 100 1 7 100 1000 1 10 1000 Notice that the logarithmic – O(log n) – visit count goes up very slowly compare to the linear – O(n) – visit count. This is because since the list is sorted, it can do one check in the middle of the list, determine which half of the collection the data is in, and discard the other half (binary search). So, if you need your set to be sorted, you can use the SortedSet<T> just like the HashSet<T> and gain sorting for a small performance hit, but it’s still faster than a List<T>. Unique Set Operations Now, if you do want to perform more set-like operations, both implementations of ISet<T> support the following, which play back towards the mathematical set operations described before: IntersectWith() – Performs the set intersection of two sets. Modifies the current set so that it only contains elements also in the second set. UnionWith() – Performs a set union of two sets. Modifies the current set so it contains all elements present both in the current set and the second set. ExceptWith() – Performs a set difference of two sets. Modifies the current set so that it removes all elements present in the second set. IsSupersetOf() – Checks if the current set is a superset of the second set. IsSubsetOf() – Checks if the current set is a subset of the second set. For more information on the set operations themselves, see the MSDN description of ISet<T> (here). What Sets Don’t Do Don’t get me wrong, sets are not silver bullets. You don’t really want to use a set when you want separate key to value lookups, that’s what the IDictionary implementations are best for. Also sets don’t store temporal add-order. That is, if you are adding items to the end of a list all the time, your list is ordered in terms of when items were added to it. This is something the sets don’t do naturally (though you could use a SortedSet with an IComparer with a DateTime but that’s overkill) but List<T> can. Also, List<T> allows indexing which is a blazingly fast way to iterate through items in the collection. Iterating over all the items in a List<T> is generally much, much faster than iterating over a set. Summary Sets are an excellent tool for maintaining a lookup table where the item is both the key and the value. In addition, if you have need for the mathematical set operations, the C# sets support those as well. The HashSet<T> is the set of choice if you want the fastest possible lookups but don’t care about order. In contrast the SortedSet<T> will give you a sorted collection at a slight reduction in performance. Technorati Tags: C#,.Net,Little Wonders,BlackRabbitCoder,ISet,HashSet,SortedSet

Read the article
How can I filter packets from a port monitor?

- by engineerchuan

I have some data going from Point A to Point B. I have a SPAN monitor set up to a monitoring device C. To recreate some real world scenarios, I want to filter out all traffic which is a certain type (H.323 VoIP Signaling Packets) so that C sees a subset of the information that is flowing from A to B. What would the easiest way to do this be? I assume I would need a computer with 2 NIC cards and some software to examine each packet and chuck out the H.323 VoIP packets? Thanks!

Read the article
Cropping videos in Windows Live Movie Maker

- by RCIX

I want to crop (as in, take a subset of each frame, like when you crop a photo) a video in Windows Live Movie maker. Is there a way to do this, and failing that is there another free tool i can use to do that?

Read the article
SunOne case-insensitive URLs

- by RoToRa

It it possible to configure a SunOne web server to automatically redirect all URLs with capital letters to the corresponding lower case URLs? For example, redirect /Example, /eXamPle and /EXAMPLE all to /example. This would have to be for all URLs (or at least a subset excluding a specific prefix) I normally have nothing to do with web server configuration (especially not SunOne). I just need to now if it is generally possible and be pointed to the right direction on how to do this. Thanks.

Read the article
Alternative to povray ?

- by Stefano Borini

I need a raytracer, but I have some trouble with povray. is there a free, cross platform alternative which accepts a simple subset of povray syntax ? need to run it from command line. no graphics.

Read the article
What is a quick way to report login/logout times on Windows 2003?

- by blueberryfields

I have about a dozen servers, and I am looking to quickly find out all of the login/logout times, for a subset of users, for all servers, during January. Is there a quick, easy way to get this information (faster and easier than manually combing through the security logs)? I would rather not replicate any work - are there any publicly posted tools or scripts that already implement a solution to this problem?

Read the article

< Previous Page | 5 6 7 8 9 10 11 12 13 14 15 16 | Next Page >