Search Results

Search found 5 results on 1 pages for 'levenstein'.

Page 1/1 | 1 

  • Generate a set of strings with maximum edit distance

    - by Kevin Jacobs
    Problem 1: I'd like to generate a set of n strings of fixed length m from alphabet s such that the minimum Levenshtein distance (edit distance) between any two strings is greater than some constant c. Obviously, I can use randomization methods (e.g., a genetic algorithm), but was hoping that this may be a well-studied problem in computer science or mathematics with some informative literature and an efficient algorithm or three. Problem 2: Same as above except that adjacent characters cannot repeat; the i'th character in each string may not be equal to the i+1'th character. E.g., 'CAT', 'AGA' and 'TAG' are allowed, 'GAA', 'AAT', and 'AAA' are not. Background: The basis for this problem is bioinformatic and involves designing unique DNA tags that can be attached to biologically derived DNA fragments and then sequenced using a fancy second generation sequencer. The goal is to be able to recognize each tag, allowing for random insertion, deletion, and substitution errors. The specific DNA sequencing technology has a relatively low error rate per base (~1%), but is less precise when a single base is repeated 2 or more times (motivating the additional constraints imposed in problem 2).

    Read the article

  • How can I optimize retrieving lowest edit distance from a large table in SQL?

    - by Matt
    Hey, I'm having troubles optimizing this Levenshtein Distance calculation I'm doing. I need to do the following: Get the record with the minimum distance for the source string as well as a trimmed version of the source string Pick the record with the minimum distance If the min distances are equal (original vs trimmed), choose the trimmed one with the lowest distance If there are still multiple records that fall under the above two categories, pick the one with the highest frequency Here's my working version: DECLARE @Results TABLE ( ID int, [Name] nvarchar(200), Distance int, Frequency int, Trimmed bit ) INSERT INTO @Results SELECT ID, [Name], (dbo.Levenshtein(@Source, [Name])) As Distance, Frequency, 'False' As Trimmed FROM MyTable INSERT INTO @Results SELECT ID, [Name], (dbo.Levenshtein(@SourceTrimmed, [Name])) As Distance, Frequency, 'True' As Trimmed FROM MyTable SET @ResultID = (SELECT TOP 1 ID FROM @Results ORDER BY Distance, Trimmed, Frequency) SET @Result = (SELECT TOP 1 [Name] FROM @Results ORDER BY Distance, Trimmed, Frequency) SET @ResultDist = (SELECT TOP 1 Distance FROM @Results ORDER BY Distance, Trimmed, Frequency) SET @ResultTrimmed = (SELECT TOP 1 Trimmed FROM @Results ORDER BY Distance, Trimmed, Frequency) I believe what I need to do here is to.. Not dumb the results to a temporary table Do only 1 select from `MyTable` Setting the results right in the select from the initial select statement. (Since select will set variables and you can set multiple variables in one select statement) I know there has to be a good implementation to this but I can't figure it out... this is as far as I got: SELECT top 1 @ResultID = ID, @Result = [Name], (dbo.Levenshtein(@Source, [Name])) As distOrig, (dbo.Levenshtein(@SourceTrimmed, [Name])) As distTrimmed, Frequency FROM MyTable WHERE /* ... yeah I'm lost */ ORDER BY distOrig, distTrimmed, Frequency Any ideas?

    Read the article

  • Writing a post search algorithm.

    - by MdaG
    I'm trying to write a free text search algorithm for finding specific posts on a wall (similar kind of wall as Facebook uses). A user is suppose to be able to write some words in a search field and get hits on posts that contain the words; with the best match on top and then other posts in decreasing order according to match score. I'm using the edit distance (Levenshtein) "e(x, y) = e" to calculate the score for each post when compared to the query word "x" and post word "y" according to: score(x, y) = 2^(2 - e)(1 - min(e, |x|) / |x|) Each word in a post contributes to the total score for that specific post. This approach seems to work well when the posts are of roughly the same size, but sometime certain large posts manages to rack up score solely on having a lot of words in them while in practice not being relevant to the query. Am I approaching this problem in the wrong way or is there some way to normalize the score that I haven't thought of?

    Read the article

  • Algorithm to compare people names to detect identicalness

    - by Pentium10
    I am working on address book synchronization algorithm. I would like to reuse some code if there exists, but couldn't find one yet. Does someone know about an algorithm that will tell me in numbers/float/procent how much two names are identical. Levenstein distance is not good in this approach, as names and our adddress books are matching the begining of each of the name sections. John Smith should match Smith Jon, Jonathan Smith, Johnny Smith

    Read the article

  • Find recipes that can be cooked from provided ingridients

    - by skaurus
    Sorry for bad English :( Suppose i can preliminary organize recipes and ingredients data in any way. How can i effectively conduct search of recipes by user-provided ingredients, preferably sorted by max match - so, first going recipes that use maximum of provided ingridients and do not contain any other ingrs, after them recipes that uses less of provided set and still not any other ingrs, after them recipes with minimum additional requirements and so on? All i can think about is represent recipe ingridients like bitmasks, and compare required bitmask with all recipes, but it is obviously a bad way to go. And related things like Levenstein distance i don't see how to use here. I believe it should be quite common task...

    Read the article

1