couchdb lucene - Page 6

Indexing different type of Entities/Objects with Solr Lucene

- by Yos

Let's say I want to index my shop using Solr Lucene. I have many types of entities : Products, Product Reviews, Articles How do I get my Lucene to index those types, but each type with different Schema ?

Read the article

How get the offset of term in Lucene ?

- by Mehdi Amrollahi

I want to get the offset of one term in the Lucene . How can i get it ? I vectored my content as Field.TermVector.WITH_POSITIONS_OFFSETS Is there any method in Lucene that give me offset of the term in one Document ?

Read the article

Couchdb conflict resolution

- by Sundar

How does CouchDB handles conflicts while doing bi-directional replication? For example: Lets say there are two address book databases (in server A and B). There is a document for Jack which contains contact details of Jack. Server A and B are replicated and both have the same version of Jack document. In server A, Jack's mobile no is updated. In server B, Jack's address is updated. Now when we do bi-directional replication there is a conflict. How does couchDB handles it? If we initiate replication in a Java program, is there a way to know whether there were any conflicts from the java program?

Read the article

CouchDB Versioning / Auditing

- by Cory

I'm attempting to use CouchDB for a system that requires full auditing of all data operations. Because of its built in revision-tracking, couch seemed like an ideal choice. But then I read in the O'Reilly textbook that "CouchDB does not guarantee that older versions are kept around." I can't seem to find much more documentation on this point, or how couch deals with its revision-tracking internally. Is there any way to configure couch either on a per-database, or per-document level to keep all versions around forever? If so, how?

Read the article

Read huge free text docs in one file for lucene indexing

- by Jun

I have heaps of free text news docs in one big file. The structure of each news doc is like: (Header line) Category, Doc1, Date (day, month, year) (body text) ... ... ... (Header line) Category, Doc2, Date (day, month, year) (body text) ... ... ... If I extract each doc from the big file, it costs too much time and not efficient. Therefore, I decide to read the file line by line and feed information to lucene the same time. I write c# code to index each doc to lucene like: Streamreader sr = new Streamreader(file); string line = ""; while((line = sr.ReadLine()) != null) { How can I tell this line is a doc header line from text line and get the metadata and all the text lines of a doc for lucene to index. Also, the text is read by OCR which can not give correct line-separating. Captions are mixed with content text iterate the process till the end of the file } with thanks

Read the article

How do i implement tag searching with lucene?

- by acidzombie24

I havent used lucene. Last time i ask (many months ago, maybe a year) people suggested lucene. As am example say there are 3 items tag like this apples carrots apples carrots apple banana if a user search apples i dont care if there is any preference from 1,2 and 4. However i seen many forums do this which i hated is when a user search apple carrots 2 and 3 are get high results while 1 is hard to find even though it matches my search more closely. I HATED this in forums. Also i would like the ability to do search carrots -apples which will only get me 3. I am not sure what should happen if i search carrots banana but anyways as long as more 2 and 3 results are lower priority then 1 when i search apples carrots i'll be happy. Can lucene do this? and where do i start? i see a lot of classes and many of them talk about docs. What should i use for tagging?

Read the article

Luke like tool in C#

- by Kapil

Is there are Luke like tool for viewing lucene indexes in C# using the Lucene.NET api?

Read the article

Building a case for solr

- by Midhat

Our product consists of multiple applications, All using Lucene. 2 of the applications I am involved with have Lucene indexes of about 3 GB and 12GB. Another team is building an application, for which they estimate the LUCENE INDEX size to be close to 1 Terabyte. New documents are added to the indexes every 15 days approx. We do not have any apparent performance issues with the current applications. So my question is SHould we be using Solr now? When should one stop using Lucene and graduate to Solr? Any disadvantages/problems for using Solr? The client applications are made in ASP.Net, but I assume they will be able to use a solr server using solrnet

Read the article

Apache CouchDB and Java

Accessing the CouchDB database for data storage and retrieval using Java CouchDB - Java - Programming - Languages - FAQs Help and Tutorials

Read the article

Which of CouchDB or MongoDB suits my needs?

- by vonconrad

Where I work, we use Ruby on Rails to create both backend and frontend applications. Usually, these applications interact with the same MySQL database. It works great for a majority of our data, but we have one situation which I would like to move to a NoSQL environment. We have clients, and our clients have what we call "inventories"--one or more of them. An inventory can have many thousands of items. This is currently done through two relational database tables, inventories and inventory_items. The problems start when two different inventories have different parameters: # Inventory item from inventory 1, televisions { inventory_id: 1 sku: 12345 name: Samsung LCD 40 inches model: 582903-4 brand: Samsung screen_size: 40 type: LCD price: 999.95 } # Inventory item from inventory 2, accomodation { inventory_id: 2 sku: 48cab23fa name: New York Hilton accomodation_type: hotel star_rating: 5 price_per_night: 395 } Since we obviously can't use brand or star_rating as the column name in inventory_items, our solution so far has been to use generic column names such as text_a, text_b, float_a, int_a, etc, and introduce a third table, inventory_schemas. The tables now look like this: # Inventory schema for inventory 1, televisions { inventory_id: 1 int_a: sku text_a: name text_b: model text_c: brand int_b: screen_size text_d: type float_a: price } # Inventory item from inventory 1, televisions { inventory_id: 1 int_a: 12345 text_a: Samsung LCD 40 inches text_b: 582903-4 text_c: Samsung int_a: 40 text_d: LCD float_a: 999.95 } This has worked well... up to a point. It's clunky, it's unintuitive and it lacks scalability. We have to devote resources to set up inventory schemas. Using separate tables is not an option. Enter NoSQL. With it, we could let each and every item have their own parameters and still store them together. From the research I've done, it certainly seems like a great alterative for this situation. Specifically, I've looked at CouchDB and MongoDB. Both look great. However, there are a few other bits and pieces we need to be able to do with our inventory: We need to be able to select items from only one (or several) inventories. We need to be able to filter items based on its parameters (eg. get all items from inventory 2 where type is 'hotel'). We need to be able to group items based on parameters (eg. get the lowest price from items in inventory 1 where brand is 'Samsung'). We need to (potentially) be able to retrieve thousands of items at a time. We need to be able to access the data from multiple applications; both backend (to process data) and frontend (to display data). Rapid bulk insertion is desired, though not required. Based on the structure, and the requirements, are either CouchDB or MongoDB suitable for us? If so, which one will be the best fit? Thanks for reading, and thanks in advance for answers. EDIT: One of the reasons I like CouchDB is that it would be possible for us in the frontend application to request data via JavaScript directly from the server after page load, and display the results without having to use any backend code whatsoever. This would lead to better page load and less server strain, as the fetching/processing of the data would be done client-side.

Read the article

Why do my CouchDB databases grow so fast?

- by konrad

I was wondering why my CouchDB database was growing to fast so I wrote a little test script. This script changes an attributed of a CouchDB document 1200 times and takes the size of the database after each change. After performing these 1200 writing steps the database is doing a compaction step and the db size is measured again. In the end the script plots the databases size against the revision numbers. The benchmarking is run twice: The first time the default number of document revision (=1000) is used (_revs_limit). The second time the number of document revisions is set to 1. The first run produces the following plot The second run produces this plot For me this is quite an unexpected behavior. In the first run I would have expected a linear growth as every change produces a new revision. When the 1000 revisions are reached the size value should be constant as the older revisions are discarded. After the compaction the size should fall significantly. In the second run the first revision should result in certain database size that is then keeps during the following writing steps as every new revision leads to the deletion of the previous one. I could understand if there is a little bit of overhead needed to manage the changes but this growth behavior seems weird to me. Can anybody explain this phenomenon or correct my assumptions that lead to the wrong expectations?

Read the article

How to reference other documents in a couchDB view (joining like functionality)

- by Surfrdan

We have a CouchDB representation of an XML database which we use to power a javascript based frontend for manipulating the XML documents. The basic structure is a simple 3 level hierachy. i.e. A - B - C A: Parent doucument (type A) B: any number of child documents of parent type A C: any number of child documents of parent type B We represent these 3 document types in CouchDB with a 'type' attribute: e.g. { "_id":"llgc-id:433", "_rev":"1-3760f3e01d7752a7508b047e0d094301", "type":"A", "label":"Top Level A document", "logicalMap":{ "issues":{ "1":{ "URL":"http://hdl.handle.net/10107/434-0", "FILE":"llgc-id:434" }, "2":{ "URL":"http://hdl.handle.net/10107/467-0", "FILE":"llgc-id:467" etc... } } } } { "_id":"llgc-id:433", "_rev":"1-3760f3e01d7752a7508b047e0d094301", "type":"B", "label":"a B document", } What I want to do is produce a view which returns documents just like the A type but includes the label attribute from the B document within the logicalMap list e.g. { "_id":"llgc-id:433", "_rev":"1-3760f3e01d7752a7508b047e0d094301", "type":"A", "label":"Top Level A document", "logicalMap":{ "issues":{ "1":{ "URL":"http://hdl.handle.net/10107/434-0", "FILE":"llgc-id:434", "LABEL":"a B document" }, "2":{ "URL":"http://hdl.handle.net/10107/467-0", "FILE":"llgc-id:467", "LABEL":"another B document" etc... } } } } I'm struggling to get my head around the best way to perform this. It looks like it should be fairly simple though!

Read the article

Need Explanation of couchdb reduce function

- by Alan

From http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views The couchdb reduce function is defined as function (key, values, rereduce) { return sum(values); } key will be an array whose elements are arrays of the form [key,id] values will be an array of the values emitted for the respective elements in keys i.e. reduce([ [key1,id1], [key2,id2], [key3,id3] ], [value1,value2,value3], false) I am having trouble understanding when/why the array of keys would contain different key values. If the array of keys does contain different key values, how would I deal with it? As an example, assume that my database contains movements between accounts of the form. {"amount":100, "CreditAccount":"account_number", "DebitAccount":"account_number"} I want a view that gives the balance of an account. My map function does: emit( doc.CreditAccount, doc.amount ) emit( doc.DebitAccount, -doc.amount ) My reduce function does: return sum(values); I seem to get the expected results, however I can't reconcile this with the possibility that my reduce function gets different key values. Is my reduce function supposed to group key values first? What kind of result would I return in that case?

Read the article

Tokenizing Twitter Posts in Lucene

- by Amaç Herdagdelen

Hello, My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene? More detailed version: I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag? My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters by other alphanumeric strings. For example, String newText = newText.replaceAll("#", "hashtag"); newText = newText.replaceAll("@", "addresstag"); Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense? Thanks in advance! Amaç

Read the article

Lucene Error While Reading binary block : java.io.EOFException

- by tushar Khairnar

Hi, I am getting java.io.EOFException while reading a binary block from lucene index. I am storing java object as byte-array in lucene index field and reading it when hit occurs. Here is stack trace : Caused by: java.io.EOFException at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2281) at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2750) at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:780) at java.io.ObjectInputStream.(ObjectInputStream.java:280) at org.terracotta.modules.searchable.util.SerializationUtil$OIS.(SerializationUtil.java:20) I have some background threads which write into index. But i buffer them and then write them at once like 1000. Occasionally I also issue optimize() on index. When I write, I am re-opening IndexReader. Does this is happening because of IndexReader re-opening call? Thanks. Regards Tushar

Read the article

Lucene.NET search index approach

- by Tim Peel

Hi, I am trying to put together a test case for using Lucene.NET on one of our websites. I'd like to do the following: Index in a single unique id. Index across a comma delimitered string of terms or tags. For example. Item 1: Id = 1 Tags = Something,Separated-Term I will then be structuring the search so I can look for documents against tag i.e. tags:something OR tags:separate-term I need to maintain the exact term value in order to search against it. I have something running, and the search query is being parsed as expected, but I am not seeing any results. Here's some code. My parser (_luceneAnalyzer is passed into my indexing service): var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_CURRENT, "Tags", _luceneAnalyzer); parser.SetDefaultOperator(QueryParser.Operator.AND); return parser; My Lucene.NET document creation: var doc = new Document(); var id = new Field( "Id", NumericUtils.IntToPrefixCoded(indexObject.id), Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO); var tags = new Field( "Tags", string.Join(",", indexObject.Tags.ToArray()), Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES); doc.Add(id); doc.Add(tags); return doc; My search: var parser = BuildQueryParser(); var query = parser.Parse(searchQuery); var searcher = Searcher; TopDocs hits = searcher.Search(query, null, max); IList<SearchResult> result = new List<SearchResult>(); float scoreNorm = 1.0f / hits.GetMaxScore(); for (int i = 0; i < hits.scoreDocs.Length; i++) { float score = hits.scoreDocs[i].score * scoreNorm; result.Add(CreateSearchResult(searcher.Doc(hits.scoreDocs[i].doc), score)); } return result; I have two documents in my index, one with the tag "Something" and one with the tags "Something" and "Separated-Term". It's important for the - to remain in the terms as I want an exact match on the full value. When I search with "tags:Something" I do not get any results. Question What Analyzer should I be using to achieve the search index I am after? Are there any pointers for putting together a search such as this? Why is my current search not returning any results? Many thanks

Read the article

Lucene numDocs and doqFreq on custom similarity class

- by David A

Hi All, im doing an aplication with Lucene (im a noob with it) and im facing some problems. My aplication uses the Lucene 2.4.0 library with a custom similaraty implementation (the jar is imported) In my app im calculating doqFreq and numDocs manually (im adding the values of all indexes and then i calculate a global value in order to use it on every query) and i want to use that values on a custom similarity implementation in order to calculate a new IDF. The problem is that I dont know how to use (or send) the new doqFreq and numDocs values from my app on that new similarty implementation as I dont want to change lucene´s code apart from this extra class. Any suggestions or examples? I read the docs but i dont now how to aproach this :s Thanks

Read the article

Lucene 2.2 arabic analyzer

- by Mustafa Zidan

Is it possible to modify Lucene 2.2 to add Arabic analyzer and if anyone have done this already where can I get source/jar

Read the article

Get highest frequency terms from Lucene index

- by Julia

Hello! i need to extract terms with highest frequencies from several lucene indexes, to use them for some semantic analysis. So, I want to get maybe top 30 most occuring terms(still did not decide on threshold, i will analyze results) and their per-index counts. I am aware that I might lose some precision because of potentionally dropped duplicates, but for now, lets say i am ok with that. So for the proposed solutions, (needless to say maybe) speed is not important, since I would do static analysis, I would put accent on simplicity of implementation because im not so skilled with Lucene (not the programming guru too :/ ) and cant wrap my mind around many concepts of it.. I can not find any code samples from something similar, so all concrete advices (code, pseudocode, links to code samples...) I will apretiate very much!!! Thank you!

Read the article

Lucene real-time indexing?

- by Kip

What is the best way to achieve Lucene real-time indexing?

Read the article

Lucene Search for japanese characters

- by Pranali Desai

Hi All, I have implemented lucene for my application and it works very well unless you have introduced something like japanese characters. The problem is that if I have japanese string ?????????????? and I search with ? that is the first character than it works well whereas if I use more than one japanese character(????)in search token search fails and there is no document found. Are japanese characters supported in lucene? what are the settings to be done to get it working?

Read the article

Exception when indexing text documents with Lucene, using SnowballAnalyzer for cleaning up

- by Julia

Hello!!! I am indexing the documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from text .. I keep getting the following error :( IllegalAccessError: tried to access method org.apache.lucene.analysis.Tokenizer.(Ljava/io/Reader;)V from class org.apache.lucene.analysis.snowball.SnowballAnalyzer Here is the code, I would very much appreciate help!!!! I am new with this.. public class Indexer { private Indexer(){}; private String[] stopWords = {....}; private String indexName; private IndexWriter iWriter; private static String FILES_TO_INDEX = "/Users/ssi/forindexing"; public static void main(String[] args) throws Exception { Indexer m = new Indexer(); m.index("./newindex"); } public void index(String indexName) throws Exception { this.indexName = indexName; final File docDir = new File(FILES_TO_INDEX); if(!docDir.exists() || !docDir.canRead()){ System.err.println("Something wrong... " + docDir.getPath()); System.exit(1); } Date start = new Date(); PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new SimpleAnalyzer()); analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords)); Directory directory = FSDirectory.open(new File(this.indexName)); IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED; iWriter = new IndexWriter(directory, analyzers, true, maxLength); System.out.println("Indexing to dir..........." + indexName); if(docDir.isDirectory()){ File[] files = docDir.listFiles(); if(files != null){ for (int i = 0; i < files.length; i++) { try { indexDocument(files[i]); }catch (FileNotFoundException fnfe){ fnfe.printStackTrace(); } } } } System.out.println("Optimizing...... "); iWriter.optimize(); iWriter.close(); Date end = new Date(); System.out.println("Time to index was" + (end.getTime()-start.getTime()) + "miliseconds"); } private void indexDocument(File someDoc) throws IOException { Document doc = new Document(); Field name = new Field("name", someDoc.getName(), Field.Store.YES, Field.Index.ANALYZED); Field text = new Field("text", new FileReader(someDoc), Field.TermVector.WITH_POSITIONS_OFFSETS); doc.add(name); doc.add(text); iWriter.addDocument(doc); } }

Read the article

Search Lucene with precise edit distances

- by askullhead

I would like to search a Lucene index with edit distances. For example, say, there is a document with a field FIRST_NAME; I want all documents with first names that are 1 edit distance away from, say, 'john'. I know that Lucene supports fuzzy searches (FIRST_NAME:john~) and takes a number between 0 and 1 to control the fuzziness. The problem (for me) is this number does not directly translate to an edit distance. And when the values in the documents are short strings (less than 3 characters) the fuzzy search has difficulty finding them. For example if there is a document with FIRST_NAME 'J' and I search for FIRST_NAME:I~0.0 I don't get anything back.

Read the article

lucene get matched terms in query

- by iamrohitbanga

what is the best way to find out which terms in a query matched against a given document returned as a hit in lucene? I have tried a weird method involving hit highlighting package in lucene contrib and also a method that searches for every word in the query against the top most document ("docId: xy AND description: each_word_in_query"). Do not get satisfactory results? hit highlighting does not report some of the words that matched for a document other than the first one. i am not sure if the second approach is the best alternative.

Read the article

How to make Lucene.NET Query '#' and '+' characters ?

- by Yoann. B

How to make Lucene.NET Query '#' and '+' characters ? Like "C#" and "C++" Note : i use NHibernate.Search

Search Results

Search found 631 results on 26 pages for 'couchdb lucene'.

Page 6/26 | < Previous Page | 2 3 4 5 6 7 8 9 10 11 12 13 | Next Page >

- by Yos

- by Mehdi Amrollahi

- by Sundar

- by Cory

- by Jun

- by acidzombie24

- by Kapil

- by Midhat

- by vonconrad

- by konrad

- by Surfrdan

- by Alan

- by Amaç Herdagdelen

- by tushar Khairnar

- by Tim Peel

- by David A

- by Mustafa Zidan

- by Julia

- by Kip

- by Pranali Desai

- by Julia

- by askullhead

- by iamrohitbanga

- by Yoann. B

< Previous Page | 2 3 4 5 6 7 8 9 10 11 12 13 | Next Page >