Search Results

Search found 4479 results on 180 pages for 'pdf scraping'.

Page 106/180 | < Previous Page | 102 103 104 105 106 107 108 109 110 111 112 113  | Next Page >

  • Cache consistency & spawning a thread

    - by Dave Keck
    Background: I've been reading through various books and articles to learn about processor caches, cache consistency, and memory barriers in the context of concurrent execution. So far, though, I have been unable to determine whether a common coding practice of mine is safe in the strictest sense.

    Assumptions: The following pseudo-code is executed on a two-processor machine:

        int sharedVar = 0;

        myThread()
        {
            print(sharedVar);
        }

        main()
        {
            sharedVar = 1;
            spawnThread(myThread);
            sleep(-1);
        }

    main() executes on processor 1 (P1), while myThread() executes on P2. Initially, sharedVar exists in the caches of both P1 and P2 with the initial value of 0 (due to some "warm-up code" that isn't shown above).

    Question: Strictly speaking – preferably without assuming any particular CPU – is myThread() guaranteed to print 1? With my newfound knowledge of processor caches, it seems entirely possible that at the time of the print() statement, P2 may not have received the invalidation request for sharedVar caused by P1's assignment in main(). Therefore, it seems possible that myThread() could print 0.

    References: These are the related articles and books I've been reading. (It wouldn't allow me to format these as links because I'm a new user - sorry.)

        Shared Memory Consistency Models: A Tutorial
        hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

        Memory Barriers: a Hardware View for Software Hackers
        rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf

        Linux Kernel Memory Barriers
        kernel.org/doc/Documentation/memory-barriers.txt

        Computer Architecture: A Quantitative Approach
        amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901/ref=dp_ob_title_bk
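    In practice, thread-creation APIs settle this particular case: in POSIX threads, Java, and C#, spawning a thread is itself a synchronization point, so writes made before the spawn are visible to the new thread without an explicit barrier. A minimal sketch of the same shape in Python, assuming the standard threading module gives the usual happens-before guarantee on start():

        import threading

        shared_var = 0

        def my_thread():
            # start() below synchronizes-with the new thread's first action,
            # so the parent's earlier write must be visible here.
            print(shared_var)  # prints 1

        shared_var = 1
        t = threading.Thread(target=my_thread)
        t.start()  # acts as the release side for all prior writes
        t.join()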

    Read the article

  • ZIP Numerous Blob Files

    - by Michael
    I have a database table that contains numerous PDF blob files. I am attempting to combine all of the files into a single ZIP file that I can download and then print. Please help!

        <?php
        include 'config.php';
        include 'connect.php';

        $session = $_GET[session];
        $query = "SELECT $tbl_uploads.username, $tbl_uploads.description,
                         $tbl_uploads.type, $tbl_uploads.size, $tbl_uploads.content,
                         $tbl_members.session
                  FROM $tbl_uploads
                  LEFT JOIN $tbl_members ON $tbl_uploads.username = $tbl_members.username
                  WHERE $tbl_members.session = '$session'";
        $result = mysql_query($query) or die('Error, query failed');

        while (list($username, $description, $type, $size, $content) = mysql_fetch_array($result)) {
            header("Content-length: $size");
            header("Content-type: $type");
            header("Content-Disposition: inline; filename=$username-$description.pdf");
            echo $content;
        }

        $files = array('File 1 from database', 'File 2 from database');
        $zip = new ZipArchive;
        $zip->open('file.zip', ZipArchive::CREATE);
        foreach ($files as $file) {
            $zip->addFile($file);
        }
        $zip->close();

        header('Content-Type: application/zip');
        header('Content-disposition: attachment; filename=filename.zip');
        header('Content-Length: ' . filesize($zipfilename));
        readfile($zipname);
        mysql_close($link);
        exit;
        ?>
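    As posted, the loop streams every PDF inline before the ZIP is even built, and the final headers reference $zipfilename and $zipname, which are never set. A sketch of the intended flow - build the archive first, then send it once - in Python, with table and column names assumed from the query above:

        import io
        import sqlite3
        import zipfile

        def zip_pdf_blobs(conn, session):
            # Collect each PDF blob into an in-memory ZIP; only afterwards
            # is there a single payload to serve with one set of headers.
            rows = conn.execute(
                "SELECT u.username, u.description, u.content "
                "FROM uploads u JOIN members m ON u.username = m.username "
                "WHERE m.session = ?", (session,))
            buf = io.BytesIO()
            with zipfile.ZipFile(buf, "w") as zf:
                for username, description, content in rows:
                    zf.writestr(f"{username}-{description}.pdf", content)
            return buf.getvalue()  # bytes to send as application/zip

        # usage sketch: data = zip_pdf_blobs(sqlite3.connect("site.db"), session_id)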

    Read the article

  • Single Responsibility Principle usage - how can I call a sub-method correctly?

    - by Phsika
    I'm trying to learn the SOLID principles, and I've written the same example in two styles. In the first (Single_Responsibility_Principle_2.cs), every instance in Main is created through an interface reference; in the second (Single_Responsibility_Principle_3.cs), the instances are created as concrete classes. My question: which usage is correct, and which should I prefer?

        namespace Single_Responsibility_Principle_2
        {
            class Program
            {
                static void Main(string[] args)
                {
                    IReportManager raporcu = new ReportManager();
                    IReport wordraporu = new WordRaporu();
                    raporcu.RaporHazirla(wordraporu, "data");
                    Console.ReadKey();
                }
            }

            interface IReportManager { void RaporHazirla(IReport rapor, string bilgi); }

            class ReportManager : IReportManager
            {
                public void RaporHazirla(IReport rapor, string bilgi) { rapor.RaporYarat(bilgi); }
            }

            interface IReport { void RaporYarat(string bilgi); }

            class WordRaporu : IReport
            {
                public void RaporYarat(string bilgi) { Console.WriteLine("Word Raporu yaratildi:{0}", bilgi); }
            }

            class ExcellRaporu : IReport
            {
                public void RaporYarat(string bilgi) { Console.WriteLine("Excell raporu yaratildi:{0}", bilgi); }
            }

            class PdfRaporu : IReport
            {
                public void RaporYarat(string bilgi) { Console.WriteLine("pdf raporu yaratildi:{0}", bilgi); }
            }
        }

    In the second version, every instance is created from a concrete class:

        namespace Single_Responsibility_Principle_3
        {
            class Program
            {
                static void Main(string[] args)
                {
                    WordRaporu word = new WordRaporu();
                    ReportManager manager = new ReportManager();
                    manager.RaporHazirla(word, "test");
                }
            }

            // IReportManager, ReportManager, IReport, WordRaporu, ExcellRaporu
            // and PdfRaporu are identical to the version above.
        }

    Read the article

  • Scraped HTML is not written at the beginning of the text file.

    - by karikari
    Currently, I'm scraping the HTML code of a page, and writing it to a text file. My problem is, why must there be empty spaces or empty lines at the beginning? The HTML codes written to the txt file do not seem to start at the beginning of the text file. This means that the '<' is not located at the position 0 of the txt file. After a few runs, my HTML is always written a few lines down inside the text file. Can anyone tell me why?
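    One hedged guess at the cause: whatever precedes the scraped markup - a byte-order mark, earlier print output, or whitespace captured along with the response - gets written first. A minimal guard in Python, assuming the page source is already held in html:

        # Strip a UTF-8 BOM and any leading whitespace so '<' lands at offset 0.
        with open("page.txt", "w", encoding="utf-8") as f:
            f.write(html.lstrip("\ufeff \t\r\n"))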

    Read the article

  • A database of questions with unambiguous numeric answers.

    - by dreeves
    I (and co-hackers) are building a sort of trivia game inspired by this blog post: http://messymatters.com/calibration. The idea is to give confidence intervals and learn how to be calibrated (when you're "90% sure" you should be right 90% of the time). We're thus looking for, ideally, thousands of questions with unambiguous numerical answers. Also, they shouldn't be too boring. There are a lot of random statistics out there - e.g., enclosed water area in different countries - that would make the game mind-numbing. Things like release dates of classic movies are more interesting (to most people). Other interesting ones we've found include Olympic records, median incomes for different professions, dates of famous inventions, and celebrity ages. Scraping things like the above, by the way, was my reason for asking this question: http://stackoverflow.com/questions/2611418/scrape-html-tables So, if you know of other sources of interesting numerical facts (in a parsable form), I'm eager for pointers to them. Thanks!

    Read the article

  • OpenOffice in Java

    - by user156073
    I need a Java interface to OpenOffice document conversion that is equivalent to a manual Open... and then Save As..., as well as access to PDF generation. The following are examples of what I want to achieve:

    1) Open one type of document (e.g. an OpenOffice Writer document or a Microsoft Office document) into memory and save it in another supported format.

    2) Open one type of document into memory and export it as PDF.

    I have gone through http://api.openoffice.org but didn't find any material that could help me get started. I also tried JODConverter, but it requires me to run OpenOffice as a service. I would prefer to include all the core functionality of OpenOffice in one JAR file, so that the user can use my application without installing or running OpenOffice on their PC. Where can I find tutorials or code snippets to get started? Do I have to add some JAR files? Will the application work even if I don't have OpenOffice installed on my PC?
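    For what it's worth, OpenOffice's core is not realistically bundleable as a single JAR; JODConverter-style setups always drive an installed office process. Independent of Java, the conversion itself can also be scripted against the office binary. A sketch, assuming a local install whose soffice binary supports headless conversion (LibreOffice does; older OpenOffice builds may need unoconv instead):

        import subprocess

        def convert(path, fmt, outdir="."):
            # Requires OpenOffice/LibreOffice on the machine; --headless
            # avoids keeping it running as a long-lived GUI service.
            subprocess.run(["soffice", "--headless", "--convert-to", fmt,
                            "--outdir", outdir, path], check=True)

        convert("report.odt", "pdf")  # Writer document -> PDF
        convert("report.odt", "doc")  # Writer document -> Word format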

    Read the article

  • How Can I Create Reports in a Custom C#.NET Windows Application? - General Question

    - by user311509
    Assume I have a custom Windows application written in C#. This application has only the following functions: add, edit, delete and view. For example, a user can add a sale, change a sales record, delete a sales record or view the whole sales record. I need to add some reporting functionality, e.g. I want a user to print the sales of a certain customer from 2008 to 2009 to PDF, list all products a certain customer has purchased from us, and so on. I will only include the basic common report requests that are usually needed in the office. Any other kind of report that is requested inconsistently, I would run manually at the back end and send the results to the requester. What I would do is: if a user wants more info on a certain customer, a special window opens for that customer. This window has controls that allow the user to request more info, such as printing the customer's purchases from ..... to ..... (the user chooses the dates), and the user views the results as PDF or similar. Of course, behind the scenes I would write an appropriate SQL query with parameters for each function. Is this how it should be done? I have heard about SQL Reporting; I don't know anything about it yet, but I will check it out. Anyhow, your suggestions won't harm. I'm still a student, so I don't have practical experience yet. I hope my question is clear enough. Thank you.
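    The "SQL query with parameters" part of the plan is sound regardless of which reporting tool sits on top. A sketch of that piece (schema and column names are assumptions):

        import sqlite3

        def customer_sales(conn, customer_id, start, end):
            # Parameter placeholders keep the user-chosen dates out of the
            # SQL string itself (no injection, proper typing).
            cur = conn.execute(
                "SELECT sale_date, product, total FROM sales "
                "WHERE customer_id = ? AND sale_date BETWEEN ? AND ? "
                "ORDER BY sale_date",
                (customer_id, start, end))
            return cur.fetchall()

        # usage sketch: rows = customer_sales(conn, 42, "2008-01-01", "2009-12-31")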

    Read the article

  • Need a tool to search large structured text documents for words, phrases and related phrases

    - by pitosalas
    I have to keep up with structured documents containing things such as requests for proposals, government program reports, threat models and all kinds of things like that. They are in techno-legalese, as I would call it: highly structured, with section numbering and 3, 4 and 5 levels of nesting. All in English. I need a more efficient way to locate those paragraphs of nuggets that matter to me. So what I'd like is kind of a local document index/repository that would allow me to have some standing queries and easily locate sections in documents that talk about my queries. Here's an example: I'd like to load in 10 large PDF files, each of say 100 pages. Each PDF contains English text, formatted very nicely into paragraphs and sections. I'd like to specify that I am interested in "blogging platforms", "weaknesses in Ruby", "localization and internationalization". Ideally I'd then look at a list showing the section of text, the name of the document, and other information that seemed to be related to and/or include the words and phrases I specified. I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.
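    Full tools in this space (Lucene/Solr, desktop indexers such as Recoll) add ranking and stemming, but the core "standing queries over extracted paragraphs" idea is small enough to sketch, assuming the PDF text has already been extracted:

        def match_standing_queries(paragraphs, queries):
            # paragraphs: (doc_name, section_no, text) tuples in document order
            # queries:    standing phrases such as "weaknesses in Ruby"
            hits = []
            for doc, section, text in paragraphs:
                lowered = text.lower()
                for q in queries:
                    if all(word in lowered for word in q.lower().split()):
                        hits.append((q, doc, section, text[:100]))
            return hits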

    Read the article

  • Free (as in beer) Reverse Image Search API/Library/Service

    - by Bauer
    TinEye provides a great way to "reverse" search by image (i.e. upload/transload an image and have multiple possible sources of that image returned as results.) Since screen-scraping is messy and unreliable, I'm looking for a free API/library/web-service that could offer the same (or similar) reverse-image search function. At present, TinEye offers a commercial API, but since I'll only be using the service for small personal projects, it's hard to justify the cost of the service (the lowest being 1,000 searches for $70 USD). Is anyone aware of such a free service? Or is there a simpler way to approach this (programmatic solution; any language)? I understand that this is a tall order, and submitting the question is really only a last resort in the hope that there is some solution. Example image search is 99designs' StackOverflow logo competition entry by wolv
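    If matching against your own corpus of images would be enough, perceptual hashing is a common free building block: near-duplicate images hash to values only a few bits apart. A difference-hash sketch using Pillow:

        from PIL import Image

        def dhash(path, size=8):
            # Shrink and grayscale, then record whether each pixel is brighter
            # than its right-hand neighbour: a compact 64-bit fingerprint.
            img = Image.open(path).convert("L").resize((size + 1, size))
            px = list(img.getdata())
            bits = [px[r * (size + 1) + c] > px[r * (size + 1) + c + 1]
                    for r in range(size) for c in range(size)]
            return sum(1 << i for i, b in enumerate(bits) if b)

        def hamming(a, b):
            return bin(a ^ b).count("1")  # small distance => likely the same image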

    Read the article

  • how to install python-spidermonkey on windows

    - by paul
    Hello all, I'm writing a script with Python mechanize. One problem is that it is really hard to find a JavaScript-capable web client library for scraping or crawling. I did find some, such as python-spidermonkey and PyKHTML, but most of them are only supported on Linux. I want to package my Python script as an .exe file, so I definitely have to install these on the Windows platform. My question is: is there any way to install python-spidermonkey or PyKHTML on Windows? I really need Windows support. If anyone can give a hint or help, I'd really appreciate it! Thanks in advance, Paul

    Read the article

  • PHP regex help with preg_match_all

    - by Steve
    Hi, I am trying to refine a preg_match_all by finding the second occurrence of a period followed by a space:

        <?php
        $str = "East Winds 20 knots. Gusts to 25 knots. Waters a moderate chop. Slight chance of showers.";
        preg_match_all('/(^)((.|\n)+?)(\.\s{2})/', $str, $matches);
        $dataarray = $matches[2];
        foreach ($dataarray as $value) {
            echo $value;
        }
        ?>

    But it does not work: the {2} occurrence is incorrect. I have to use preg_match_all because I am scraping dynamic HTML. I want to capture this from the string: East Winds 20 knots. Gusts to 25 knots. Any ideas? Thx
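    The catch is that \s{2} means "two whitespace characters", not "the second occurrence". One way to express "everything up to the second period-plus-space" is two lazy period-terminated groups; the same idea sketched in Python (the equivalent PCRE would be '/^(.*?\.\s.*?\.)\s/'):

        import re

        s = ("East Winds 20 knots. Gusts to 25 knots. "
             "Waters a moderate chop. Slight chance of showers.")
        # Lazily match up to the first period-plus-space, then up to the second.
        m = re.match(r'(.*?\.\s.*?\.)\s', s)
        if m:
            print(m.group(1))  # East Winds 20 knots. Gusts to 25 knots.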

    Read the article

  • Web framework with JasperReports integration?

    - by Dave Jarvis
    What web development frameworks natively support JasperReports? Consider the following form as an example:

        <form name="report" method="post">
          <input type="hidden" name="REPORT_PATH" value="reports/Names" />
          <input type="hidden" name="REPORT_FILE" value="List" />
          <input type="hidden" name="REPORT_FORMAT" value="pdf" />
          <input type="hidden" name="REPORT_EMBED" value="false" />
          Name: <input type="text" name="report_Name" value="" /><br />
          Date: <input type="text" name="report_Date" value="" /><br />
          <input type="submit" name="View" value="View" />
        </form>

    The framework would pass the report_ parameters to JasperReports, which in turn runs reports/Names/List.jasper, and then sends a PDF attachment to the browser. In general:

        - Ability to configure the report (i.e., the hidden REPORT_ variables)
        - Web form for setting report parameters (i.e., the report_ variables)
        - Framework handles configuring the database connection, report execution, etc.

    I don't care about the technical minutia on how the integration works. The example above is just one possibility of many.

    Read the article

  • jQuery and IE not playing nice

    - by Neurofluxation
    Hello, I have this section of code:

        function createDownload() {
            var category, format, specification, download;
            $('#submitform').click(function() {
                category = $('#cate').val();
                format = $('#form').val();
                specification = $('#spec').val();
                if (category == "NULL" || format == "NULL" || specification == "NULL") {
                    alert("Please select all options.");
                    return false;
                } else {
                    download = "pdfs/" + specification + format + category + ".pdf";
                    window.open(download);
                }
            });
        }

    Now... In Internet Explorer it says there is an "Error on the page" - Message: 'return' statement outside of function - and I have to click the button again. In Firefox, Chrome and Safari, I have to click the button twice to get the PDF to appear... (and no errors)... Now why could that be?! As per request, my form declaration:

        <form method="post" action="javascript: return false;" onSubmit="createDownload();">

    Read the article

  • Automate download of BusinessObjecs Web Intelligence reports

    - by Daren Thomas
    I'm tasked with automating the retrieval of a couple of BusinessObjects Web Intelligence reports and further processing thereof. I have no other means of access to this data (this was the first avenue I followed), so I will have to do some screen scraping. Alas, the interface seems user-only. Grr! Has anyone done this before? Like to share? Also, does anyone know of a good library for automating the web browser? I know there is a Python thingy out there that can be used for testing web applications - I need something in .NET though... What is your favorite? PS: I have also checked this thread (automate getting report from webpage), but am hoping for a Web Intelligence specific solution.
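    The Python testing tool alluded to is possibly Selenium, which also ships .NET bindings (WebDriver) exposing much the same API. A hedged sketch of the browser-automation shape, in Python, with the URL and element id purely hypothetical:

        from selenium import webdriver
        from selenium.webdriver.common.by import By

        driver = webdriver.Firefox()                       # drives a real browser
        driver.get("https://bo-server/InfoView")           # hypothetical report URL
        driver.find_element(By.ID, "export-pdf").click()   # hypothetical element id
        driver.quit()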

    Read the article

  • Python - Things one MUST avoid

    - by Anurag Uniyal
    Today I was bitten again by "mutable default arguments" after many years. I usually don't use mutable default arguments unless needed, but I think with time I forgot about that, and today in the application I added tocElements=[] to a PDF generation function's argument list - and now the 'Table of Contents' gets longer and longer after each invocation of "generate pdf" :) My question is: what other things should I add to my list of things one MUST avoid?

    1. Mutable default arguments.

    2. Importing modules in inconsistent ways, e.g. 'from y import x' and 'import x' are totally different things; they are actually treated as different modules. See http://stackoverflow.com/questions/1459236/module-reimported-if-imported-from-different-path

    3. Do not use range in place of lists, because range() will become an iterator anyway, so things like this will fail; wrap it in list:

           myIndexList = [0, 1, 3]
           isListSorted = myIndexList == range(3)        # will fail in 3.0
           isListSorted = myIndexList == list(range(3))  # will not

       The same mistake can be made with xrange, e.g. myIndexList == xrange(3).

    4. Catching multiple exceptions:

           try:
               raise KeyError("hmm bug")
           except KeyError, TypeError:
               print TypeError

       This prints "hmm bug", though it is not a bug. It looks like we are catching exceptions of type KeyError and TypeError, but instead we are catching KeyError only, bound to the variable TypeError. Instead use:

           try:
               raise KeyError("hmm bug")
           except (KeyError, TypeError):
               print TypeError
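    Item 1 in miniature - a runnable sketch of the tocElements trap and the usual None-default fix (function names hypothetical):

        def generate_pdf(toc_elements=[]):  # one shared list, created at def time
            toc_elements.append("chapter")
            return len(toc_elements)

        print(generate_pdf())  # 1
        print(generate_pdf())  # 2  <- the ToC grows across invocations

        def generate_pdf_fixed(toc_elements=None):
            toc_elements = [] if toc_elements is None else toc_elements
            toc_elements.append("chapter")
            return len(toc_elements)

        print(generate_pdf_fixed())  # 1
        print(generate_pdf_fixed())  # 1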

    Read the article

  • How to loop over nodes with xmlfeed using scrapy python

    - by Kour ipm
    Hi, I'm working on Scrapy and trying XML feeds for the first time. Below is my code:

        class TestxmlItemSpider(XMLFeedSpider):
            name = "TestxmlItem"
            allowed_domains = {"http://www.nasinteractive.com"}
            start_urls = [
                "http://www.nasinteractive.com/jobexport/advance/hcantexasexport.xml"
            ]
            iterator = 'iternodes'
            itertag = 'job'

            def parse_node(self, response, node):
                title = node.select('title/text()').extract()
                job_code = node.select('job-code/text()').extract()
                detail_url = node.select('detail-url/text()').extract()
                category = node.select('job-category/text()').extract()
                print title, ";;;;;;;;;;;;;;;;;;;;;"
                print job_code, ";;;;;;;;;;;;;;;;;;;;;"
                item = TestxmlItem()
                item['title'] = node.select('title/text()').extract()
                .......
                return item

    The result:

        File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/item.py", line 56, in __setitem__
            (self.__class__.__name__, key))
        exceptions.KeyError: 'TestxmlItem does not support field: title'

    There are 200+ items in total, so I need to loop over them and assign each node's text to an item, but here all the results are displayed at once when printed. How can I loop over the nodes when scraping XML files with XMLFeedSpider?
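    The KeyError itself is separate from the looping: a Scrapy Item only accepts keys declared as fields, and XMLFeedSpider already calls parse_node once per <job> node, so the per-item loop comes for free. A minimal items.py sketch matching the assignments above:

        from scrapy.item import Item, Field

        class TestxmlItem(Item):
            title = Field()
            job_code = Field()
            detail_url = Field()
            category = Field()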

    Read the article

  • How to download a file from a UNC mapped share via IIS and ASP

    - by helgeg
    I am writing an ASP.NET application that will serve files to clients through the browser. The files are located on a file server that is available from the machine IIS is running on via a UNC path (\\server\some\path). I want to use something like the code below to serve the file. Serving files that are local to the machine IIS is running on works well with this method; my trouble is being able to serve files from the UNC mapped share:

        // Set the appropriate ContentType.
        Response.ContentType = "Application/pdf";
        // Get the physical path to the file.
        string FilePath = MapPath("acrobat.pdf");
        // Write the file directly to the HTTP content output stream.
        Response.WriteFile(FilePath);
        Response.End();

    My question is how I can specify a UNC path for the file name. Also, to access the file share I need to connect with a specific username/password. I would appreciate some pointers on how I can achieve this (either using the approach above or by other means).

    Read the article

  • Vehicle License Plate Detection

    - by Ash
    Hey all. Basically, for my final project at university I'm developing a vehicle license plate detection application. Now, I consider myself an intermediate programmer, however my mathematics knowledge lacks anything above secondary school level, therefore producing detection formulae is basically impossible. I've spent a good amount of time looking up academic papers such as:

        http://www.scribd.com/doc/266575/Detecting-Vehicle-License-Plates-in-Images
        http://www.cic.unb.br/~mylene/PI_2010_2/ICIP10/pdfs/0003945.pdf
        http://www.eurasip.org/Proceedings/Eusipco/Eusipco2007/Papers/d3l-b05.pdf

    When it comes to the maths, I'm lost. Because of this, testing against various graphic images proved productive (example before/after images omitted here). However, this approach is only catered to that particular image, and if the same techniques were applied to different images, I'm sure a different, most likely poorer, conversion would occur. I've read about a formula called the bottom-hat morphology transform, which according to the first paper does the following: "Basically, the transformation keeps all the dark details of the picture, and eliminates everything else (including bigger dark regions and light regions)." Sadly I can't find much information on this, however the image within the documentation near the end of the report shows its effectiveness. I'm aware this is complicated and vast; I'd just appreciate a little advice, even in terms of which transformation techniques I should focus on developing, or which edge detection or pixel detection algorithms. A few things I need to add:

        - Developing in C#
        - Confining the project to UK registration plates only
        - I can basically choose the images to convert as a demonstration

    Thanks
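    On the bottom-hat transform specifically: it is the morphological closing of the image minus the image itself, and imaging libraries ship it ready-made, so no formula derivation is needed. A sketch with OpenCV's Python bindings (OpenCV calls the operation "black hat"; the kernel size is a tuning guess):

        import cv2

        img = cv2.imread("plate.jpg", cv2.IMREAD_GRAYSCALE)
        # A wide, short rectangle roughly matches plate-character geometry.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))
        # Black-hat = closing(img) - img: keeps small dark details,
        # suppresses larger dark regions and light regions.
        blackhat = cv2.morphologyEx(img, cv2.MORPH_BLACKHAT, kernel)
        cv2.imwrite("blackhat.jpg", blackhat)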

    Read the article

  • Problem executing trackPageview with Google Analytics.

    - by dmrnj
    I'm trying to capture the clicks of certain download links and track them in Google Analytics. Here's my code:

        var links = document.getElementsByTagName("a");
        for (var i = 0; i < links.length; i++) {
            linkpath = links[i].pathname;
            if (linkpath.match(/\.(pdf|xls|ppt|doc|zip|txt)$/)
                    || links[i].href.indexOf("mode=pdf") >= 0) {
                // this matches our search
                addClickTracker(links[i]);
            }
        }

        function addClickTracker(obj) {
            if (obj.addEventListener) {
                obj.addEventListener('click', track, true);
            } else if (obj.attachEvent) {
                obj.attachEvent("on" + 'click', track);
            }
        }

        function track(e) {
            linkhref = (e.srcElement) ? e.srcElement.pathname : this.pathname;
            pageTracker._trackPageview(linkhref);
        }

    Everything up until the pageTracker._trackPageview() call works. In my debugging, linkhref is being passed fine as a string. No abnormal characters, nothing. The issue is that, watching my HTTP requests, Google never makes a second call to the tracking gif (as it does if you call this function in an "onclick" property). Calling the tracker from my JS console also works as expected. It's only in my listener. Could it be that my listener is not deferring the default action (loading the new page) before it has a chance to contact Google's servers? I've seen other tracking scripts that do a similar thing without any deferral.

    Read the article

  • Which RDFa parser for Java supports currently used RDFa attributes?

    - by lennyks
    I am building an app in Java using Jena for semantic information scraping. I am looking for an RDFa parser that would allow me to correctly extract all the RDFa statements. Specifically, one that extracts info about the namespaces used and, presuming that the RDFa tags in the page are correct, produces correct triples - ones that distinguish between object and data properties. I went through all the Java RDFa parsers from the site http://rdfa.info/wiki/Consume. They all struggle to extract any RDFa statements, and if they do not crash, Jena's RDFa parser shows plenty of errors and then dies a terrible death; the data is of little use as it is incorrectly processed and generally mixed up. I am a newbie in this area, so please be gentle :) I was also thinking of using a library written in a different language, but then again I don't really know how to plug it into Java code. Any suggestions?

    Read the article

  • Why do firefox/chrome show a different page than IE8?

    - by Edward Tanguay
    When I look at this published Google Docs document, I see the latest version with Firefox and Chrome, but an older version with IE8. Also, screen-scraping it via PHP/Curl gives me an older version. I've tried CTRL-Refresh in IE8 but I can't get it to show me the newest version. No matter what headers I try to change in PHP/Curl, I can't get it to show me the newest version. What am I not understanding about browsers/headers/caching here? How can it be that different browsers show different contents of one page?

    Read the article

  • Find and Replace with Notepad++

    - by Levi
    I have a document that was converted from PDF to HTML for use on a company website, to be referenced and indexed for search. I'm attempting to format the converted document to meet my needs, and in doing so I am attempting to clean up some of the junk that was pulled over from when it was a PDF, such as page numbers, headers, and footers. Luckily, all of the lines that need to be removed come in blocks of four lines; unfortunately they are not exactly the same, and therefore cannot be removed with a simple literal replace. The lines contain numbers which are incremental, as they correlate with the pages. How can I remove the following example from my HTML file?

        Title<br>
        10<br>
        <hr>
        <A name=11></a>Footer<br>

    I've tried many different regular expression attempts, but as my skill in that area is limited I can't find the proper syntax. I'm sure I'm missing something fairly easy, as it would seem all I need is a wildcard replace for the two numbers in the code and the rest is literal. Any help is appreciated.
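    In Notepad++'s Replace dialog with "Regular expression" mode selected, \d+ can stand in for both changing numbers. The same pattern, sketched in Python for checking it against the file first (this assumes the literal Title/Footer text really is fixed across blocks):

        import re

        block = re.compile(
            r'Title<br>\s*\d+<br>\s*<hr>\s*<A name=\d+></a>Footer<br>\s*')
        cleaned = block.sub('', html)  # html holds the converted document's source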

    Read the article

  • treeview dynamically populated

    - by Laziale
    Hello everyone - I have this treeview control where I want to put uploaded files on the server. I want to be able to create the nodes and the child nodes dynamically from the database. I am using this query for getting the data from the DB:

        SELECT c.Category, d.DocumentName
        FROM Categories c
        INNER JOIN DocumentUserFile d ON c.ID = d.CategoryId
        WHERE d.UserId = '9rge333a-91b5-4521-b3e6-dfb49b45237c'

    The result from that query is this one:

        Agendas    transactions.pdf
        Minutes    accounts.pdf

    I want to have the treeview sorted that way too. I am trying with this piece of code:

        TreeNode tn = new TreeNode();
        TreeNode tnSub = new TreeNode();
        foreach (DataRow dt in tblTreeView.Rows)
        {
            tn.Text = dt[0].ToString();
            tn.Value = dt[0].ToString();
            tnSub.Text = dt[1].ToString();
            tnSub.NavigateUrl = "../downloading.aspx?file=" + dt[1].ToString() + "&user=" + userID;
            tn.ChildNodes.Add(tnSub);
            tvDocuments.Nodes.Add(tn);
        }

    I am getting the treeview populated nicely for the 1st category and the document under that category, but I can't get it to work when I want to show more documents under that category, or, even more complicated, to show a new category beneath the 1st one with documents from that category. How can I solve this? I appreciate the answers a lot. Thanks, Laziale

    Read the article

  • Rendering blocks side-by-side with FOP

    - by Rolf
    I need to generate a PDF from XML data using Apache FOP. The problem is that FOP doesn't support fo:float, and I really need to have items (boxes of rendered data) side by side in the PDF. More precisely, I need them in a 4x4 grid on each page. In HTML, I would simply render these as left-floated divs with appropriate widths and heights. My data looks something like this:

        <item id="1">
          <a>foo</a>
          <b>bar</b>
          <c>baz</c>
        </item>
        <item id="2">...</item>
        ...
        <item id="n">...</item>

    I considered using a two-column region-body, but then the order of items would be 1, 3, 2, 4 (reading from left to right), since they would be rendered tb-lr instead of lr-tb, and I need them to be in the correct order (id in the above XML). I suppose I could try using a table, but I'm not quite sure how to group the items into table rows. So, some kind of workaround for the lack of fo:float would be greatly appreciated.
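    The grouping step for the table route is just "chunk the item list into rows of four before emitting fo:table-row elements"; in a preprocessing pass it could look like this sketch (emit_table_row is a hypothetical stand-in for whatever writes the FO output):

        def rows_of(items, per_row=4):
            # Items stay in document order; each yielded chunk becomes one
            # fo:table-row, so left-to-right, top-to-bottom order is kept.
            for i in range(0, len(items), per_row):
                yield items[i:i + per_row]

        # usage sketch: for row in rows_of(item_elements): emit_table_row(row)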

    Read the article
