Search Results

Search found 4479 results on 180 pages for 'pdf scraping'.


  • Do search engines crawl PDFs, and if so, are there any rules to follow when making them?

    - by RandomBen
    The website I am working on has a few hundred PDFs in it. I don't think I have ever seen any of them come back in a search, but they are linked to directly from our site. They are also full of keywords, because they are product documents. Is there anything special we need to do to get Google or other search engines to crawl them? Are there any hard and fast rules for making PDFs to help Google like them more? For instance, should I run them through Ghostscript to clean up broken PDF tags that Adobe creates during generation?


  • How do I read my PDFs and watch videos properly again?

    - by asp
    Whenever I watch videos, either online or locally via Totem, or open PDF files using Evince, the system gets really, really bogged down. All apps get really slow, menus take forever to display, switching windows gives me time to make a coffee, etc. I have a couple of bug reports open on this, but what do I need to do to really troubleshoot the issue? I've purged Adobe Flash from the system, but YouTube HTML5 videos still have the issue. A bunch of PDFs saved locally trigger the problem. And to (temporarily) remove the slowness, I need to shut down the computer, breathe for a few minutes, then restart; a simple reboot does not do the trick. How can I identify the cause? This only started on 13.04. I've had Ubuntu on this machine for a year without a problem until "upgrading" to 13.04. I am not a programmer, but I suspect an issue with the Intel video driver.


  • How to track opens and pageviews in PDFs?

    - by Osvaldo
    I know how to track clicks on links to PDFs and PDF downloads. But I need to track how many times a PDF is opened after being downloaded and, if possible, how many times certain pages are shown to users. Tracking has to be done without warnings that personal information is being sent somewhere. I do not want readers' personal information, just to know how many opens happened, so such warnings would be inaccurate. Can anyone help by pointing to a tutorial or an example? If you are sure this can't be done, can you please point to documentation that explains why?


  • scraping website with javascript cookie with c#

    - by erwin
    Hi all, I want to scrape some things from the following site: http://www.conrad.nl/modelspoor. This is my function:

        public string SreenScrape(string urlBase, string urlPath)
        {
            CookieContainer cookieContainer = new CookieContainer();
            HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(urlBase + urlPath);
            httpWebRequest.CookieContainer = cookieContainer;
            httpWebRequest.UserAgent = "Mozilla/6.0 (Windows; U; Windows NT 7.0; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.9 (.NET CLR 3.5.30729)";
            WebResponse webResponse = httpWebRequest.GetResponse();
            string result = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
            webResponse.Close();
            if (result.Contains("<frame src="))
            {
                // Pull frame URLs out of the page and fetch the first one, reusing the cookies
                Regex metaregex = new Regex("http:[a-z:/._0-9!?=A-Z&]*", RegexOptions.Multiline);
                result = result.Replace("\r\n", "");
                Match m = metaregex.Match(result);
                string key = m.Groups[0].Value;
                foreach (Match match in metaregex.Matches(result))
                {
                    HttpWebRequest redirectHttpWebRequest = (HttpWebRequest)WebRequest.Create(key);
                    redirectHttpWebRequest.CookieContainer = cookieContainer;
                    webResponse = redirectHttpWebRequest.GetResponse();
                    string redirectResponse = new System.IO.StreamReader(webResponse.GetResponseStream(), Encoding.Default).ReadToEnd();
                    webResponse.Close();
                    return redirectResponse; // returns after the first frame
                }
            }
            return result;
        }

    But when I do this I get a string with an error from the website saying that it requires JavaScript. Does anybody know how to fix this? Kind regards, Erwin

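    A JavaScript error page usually means the content is rendered client-side, so a plain HttpWebRequest never sees it. One common workaround is to drive a real browser and read the DOM after scripts have run; a minimal sketch in Python with Selenium (assuming Selenium and a Firefox driver are installed; this illustrates the approach, it is not a drop-in fix):

        from selenium import webdriver

        # Launch a real browser so the site's JavaScript actually executes
        driver = webdriver.Firefox()
        driver.get("http://www.conrad.nl/modelspoor")

        # page_source holds the DOM after JavaScript has run
        html = driver.page_source
        print(html[:500])

        driver.quit()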

  • scraping text from multiple html files into a single csv file

    - by Lulu
    I have just over 1500 HTML pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need but "misses" some of the data within the table. My input: e.g. file 1500.html. My code:

        #!/usr/bin/env python
        import glob
        import codecs
        from BeautifulSoup import BeautifulSoup

        with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
            for file in glob.glob('*html*'):
                print 'Processing', file
                soup = BeautifulSoup(open(file).read())
                rows = soup.findAll('tr')
                for tr in rows:
                    cols = tr.findAll('td')
                    #print >> csvfile, "#".join(col.string for col in cols)
                    #print >> csvfile, "#".join(td.find(text=True))
                    for col in cols:
                        print >> csvfile, col.string
                    print >> csvfile, "==="
                print >> csvfile, "***"

    Output: one CSV file with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some, e.g. the Address1 and Address2 data at the start of the table do not come out. I modified the code to put in *** and === separators, and I then use Perl to turn the result into a clean CSV file. Unfortunately, I'm not sure how to rework my code to get all the data I'm looking for!

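    The usual culprit here is col.string, which returns None whenever a cell contains nested markup (a <br/> inside an address, for instance), so those cells come out empty. A sketch that flattens each cell's text and writes real CSV, assuming a bs4/Python 3 environment rather than the old BeautifulSoup 3:

        import csv
        import glob
        from bs4 import BeautifulSoup

        with open('dump2.csv', 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            for name in glob.glob('*.html'):
                with open(name, encoding='utf-8') as page:
                    soup = BeautifulSoup(page.read(), 'html.parser')
                for tr in soup.find_all('tr'):
                    # get_text() flattens nested tags, so cells with <br/> are not lost
                    cells = [td.get_text(' ', strip=True) for td in tr.find_all('td')]
                    if cells:
                        writer.writerow(cells)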

  • Automating scraping of table data to XML

    - by thewinchester
    Problem: I have a YQL query result that I'm trying to convert and sort into a clean XML file. Background: being the pains that they are, information from the World Cup isn't freely available in an easy-to-reuse format. So, after a bit of finessing with YQL, I have managed to liberate the required table rows which contain the data I'm after. The YQL query can be viewed at: http://query.yahooapis.com/v1/public/yql/ravingbeefsteak/worldcup2010groupliberator?diagnostics=true I'd like to convert this information into XML, and being an absolute n00b I don't know where to start or what to look for. I also need to do a find-and-replace on the data to get the URLs working as they should without manual changes, and hopefully an initial sort of the data. If anyone can point me in the right direction of what I need to be doing to make my needs a reality, it would be greatly appreciated.

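    Any language with an XML library can take it from here. A minimal sketch in Python using xml.etree.ElementTree, where the row dicts and element names are purely illustrative (the real field names depend on what the YQL table returns):

        import xml.etree.ElementTree as ET

        # Illustrative rows; in practice these come from the parsed YQL result
        rows = [
            {'team': 'Uruguay', 'points': '7', 'href': '/worldcup/uruguay'},
            {'team': 'Mexico',  'points': '4', 'href': '/worldcup/mexico'},
        ]

        root = ET.Element('group')
        for row in sorted(rows, key=lambda r: int(r['points']), reverse=True):
            team = ET.SubElement(root, 'team', name=row['team'])
            ET.SubElement(team, 'points').text = row['points']
            # the find-and-replace step: prefix a base URL (assumed) to make links absolute
            ET.SubElement(team, 'url').text = 'http://example.org' + row['href']

        ET.ElementTree(root).write('group.xml', encoding='utf-8', xml_declaration=True)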

  • Difficulty screen scraping http://www.momondo.com using nokogiri

    - by Khai Kiong
    I'm having some difficulty extracting the total price (CSS selector '.total') from the flight results at: http://www.momondo.com/multicity/?Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false#Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false I get the error "undefined method `text' for nil:NilClass" from Nokogiri. My code:

        desc "Fetch product prices"
        task :fetch_details => :environment do
          require 'nokogiri'
          require 'open-uri'
          include ERB::Util

          OneWayFlight.find_all_by_money(nil).each do |flight|
            url = "http://www.momondo.com/multicity/Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false#Search=true&TripType=oneway&SegNo=1&SO0=KUL&SD0=KBR&SDP0=31-12-2012&AD=2&CA=0,0&DO=false&NA=false"
            doc = Nokogiri::HTML(open(url))
            price = doc.at_css(".total").text[/[0-9\.]+/]
            flight.update_attribute(:price, price)
          end
        end


  • Scraping *.aspx content using Python

    - by tomato
    I'm having difficulty scraping a dynamically generated table in ASPX. I'm trying to scrape the gas prices from a site like this one: GasPrices. I can extract all the information in the gas price table (address, time submitted, etc.) except for the actual gas price. Is there a way I could scrape the gas prices? i.e. somehow get a text representation of them. I'm not very familiar with ASP/ASPX, but what's being generated is not showing up in the final HTML. I'm using Python to do the scraping, but that's irrelevant unless there's a specific library...

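    When a value is visible in the browser but missing from the served HTML, it is usually filled in afterwards by JavaScript, often from a separate endpoint that returns JSON; the browser's network tab will show the request. Hitting that endpoint directly is generally easier than parsing the page. A sketch of the idea, with a completely made-up endpoint and field names:

        import json
        import urllib.request

        # Hypothetical JSON endpoint (not a real URL), the kind you'd spot in the network tab
        url = 'http://example-gasprices.com/api/stations?zip=90210'
        with urllib.request.urlopen(url) as resp:
            stations = json.load(resp)

        for station in stations:
            # field names are illustrative; inspect the real response first
            print(station['address'], station['price'])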

  • Having trouble scraping an ASP .NET web page

    - by Seth
    I am trying to scrape an ASP.NET website but am having trouble getting the results from a POST. I have the following Python code, using httplib2 and BeautifulSoup:

        conn = Http()

        # do a GET first to retrieve important values
        page = conn.request(u"http://somepage.com/Search.aspx", "GET")
        # event_validation and viewstate variables retrieved from the GET here...

        body = {"__EVENTARGUMENT": "",
                "__EVENTTARGET": "",
                "__EVENTVALIDATION": event_validation,
                "__VIEWSTATE": viewstate,
                "ctl00_ContentPlaceHolder1_GovernmentCheckBox": "On",
                "ctl00_ContentPlaceHolder1_NonGovernmentCheckBox": "On",
                "ctl00_ContentPlaceHolder1_SchoolKeyValue": "",
                "ctl00_ContentPlaceHolder1_SchoolNameTextBox": "",
                "ctl00_ContentPlaceHolder1_ScriptManager1": "ctl00_ContentPlaceHolder1_UpdatePanel1|cct100_ContentPlaceHolder1_SearchImageButton",
                "ct100_ContentPlaceHolder1_SearchImageButton.x": "375",
                "ct100_ContentPlaceHolder1_SearchImageButton.y": "11",
                "ctl00_ContentPlaceHolder1_SuburbTownTextBox": "Adelaide,SA,5000",
                "hiddenInputToUpdateATBuffer_CommonToolkitScripts": 1}
        headers = {"Content-type": "application/x-www-form-urlencoded"}
        resp, content = conn.request(url, "POST", headers=headers, body=urlencode(body))

    When I print content I still seem to be getting the same results as the GET. Is there a fundamental concept I'm missing to retrieve the result values of an ASP.NET POST?

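    One detail worth checking: WebForms renders its inputs with name attributes that use $ separators (ctl00$ContentPlaceHolder1$...); the underscore form is the client-side id, not the field name the server expects. A sketch of the same POST with requests and bs4, assuming those really are the form's fields:

        import requests
        from bs4 import BeautifulSoup

        url = "http://somepage.com/Search.aspx"
        session = requests.Session()
        soup = BeautifulSoup(session.get(url).text, "html.parser")

        body = {
            # hidden state fields must be echoed back exactly as served
            "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"],
            "__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
            # note the $ separators: this is the name= attribute, not the id
            "ctl00$ContentPlaceHolder1$GovernmentCheckBox": "on",
            "ctl00$ContentPlaceHolder1$SuburbTownTextBox": "Adelaide,SA,5000",
        }
        result = session.post(url, data=body)
        print(result.text)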

  • Top techniques to avoid 'data scraping' from a website database

    - by Addsy
    I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably, my client is very keen to prevent anyone from making a copy of the data in the database, yet at the same time wants everything publicly available, even a "view all" link to display every record in the db. Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I removed the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information. Does anyone have any good tactics for preventing, or even just deterring, this that they could share? Thanks

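    There is no complete defence (anything a browser can render, a scraper can read), but throttling makes bulk copying slow and conspicuous. A sketch of per-IP rate limiting, shown in Python for illustration; the same sliding-window idea ports to PHP with APC or memcached as the store:

        import time
        from collections import defaultdict

        WINDOW = 60   # seconds
        LIMIT = 30    # requests allowed per IP per window

        hits = defaultdict(list)  # in-memory store; use memcached/redis in production

        def allowed(ip):
            now = time.time()
            # keep only the hits that are still inside the window
            hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
            if len(hits[ip]) >= LIMIT:
                return False
            hits[ip].append(now)
            return True

        # inside a request handler:
        # if not allowed(client_ip): respond with 429 Too Many Requests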

  • grabbing a substring while scraping with Python2.6

    - by Diego
    Hey, can someone help with the following? I'm trying to scrape a site that has the following information. I need to pull just the number after the </strong> tag:

        [<li><strong>ISBN-13:</strong> 9780375853401</li>, <li><strong>Pub. Date: </strong> 05/11/2010</li>]
        [<li><strong>UPC:</strong> 490355000372</li>, <li><strong>Catalog No:</strong> 15024/25</li>, <li><strong>Label:</strong> CAMERATA</li>]

    Here's a piece of the code I've been using to grab the above data with mechanize and BeautifulSoup. I'm stuck here, as it won't let me use the find() function on a list:

        br_results = mechanize.urlopen(br_results)
        html = br_results.read()
        soup = BeautifulSoup(html)
        local_links = soup.findAll("a", {"class": "down-arrow csa"})
        upc_code = soup.findAll("ul", {"class": "bc-meta3"})
        for upc in upc_code:
            upc_text = upc.contents.contents
            print upc_text

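    Since contents is a plain Python list, it has no contents attribute of its own, hence the failure. The number wanted here is the text node that immediately follows each <strong>. A self-contained sketch using the sample markup from the question, assuming bs4 on Python 3 (the old BeautifulSoup 3 spells the same idea nextSibling):

        from bs4 import BeautifulSoup

        # sample markup from the question
        html = '<ul class="bc-meta3"><li><strong>UPC:</strong> 490355000372</li></ul>'
        soup = BeautifulSoup(html, "html.parser")

        for ul in soup.find_all("ul", class_="bc-meta3"):
            for li in ul.find_all("li"):
                label = li.strong.get_text()            # "UPC:"
                value = li.strong.next_sibling.strip()  # text right after </strong>
                print(label, value)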

  • scraping blog contents

    - by goh
    Hi lads, after obtaining the URLs for various Blogspot, Tumblr and WordPress pages, I've run into problems processing the HTML. I want to distinguish between the content, title and date of each blog post. I might be able to get the date through a regex, but people use so many custom themes now that the HTML classes and structure differ from blog to blog. Does anyone have a solution that may help?

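    There is no universal answer, but most themes expose a few common hooks worth trying before falling back to per-site rules. A heuristic sketch; every selector here is just a usual suspect, not a guarantee:

        from bs4 import BeautifulSoup

        def extract_post(html):
            soup = BeautifulSoup(html, "html.parser")
            # og: meta tags are the most reliable cross-theme signal
            og = soup.find("meta", property="og:title")
            title = og["content"] if og else (soup.h1.get_text(strip=True) if soup.h1 else None)
            # <time datetime="..."> is common in WordPress and Tumblr themes
            time_tag = soup.find("time")
            date = time_tag.get("datetime") if time_tag else None
            # likely content containers, roughly in order of popularity
            body = None
            for sel in (".entry-content", ".post-content", ".post-body", "article"):
                node = soup.select_one(sel)
                if node:
                    body = node.get_text(" ", strip=True)
                    break
            return title, date, body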

  • rcurl web scraping timeout exits program

    - by user1742368
    I am using a loop and RCurl to scrape data from multiple pages, which seems to work fine at certain times but fails when there is a timeout due to the server not responding. I am using timeout=30, which traps the timeout error, but the program stops after the timeout. I would like the program to continue to the next page when a timeout occurs, but I can't figure out how to do this.

        curl = getCurlHandle(cookiefile = "", verbose = TRUE)

    Here is the statement that causes the timeout. I am happy to share the rest of the code if there is interest.

        webpage = getURLContent(url, followlocation = TRUE, curl = curl,
                                .opts = list(verbose = TRUE, timeout = 90, maxredirs = 2))

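    Whatever the language, the fix is to trap the error per iteration so a single timeout doesn't kill the whole run; in R that wrapper is tryCatch around the getURLContent call. The pattern, sketched in Python with requests for illustration:

        import requests

        urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholders

        for url in urls:
            try:
                resp = requests.get(url, timeout=30)  # seconds
            except requests.exceptions.RequestException as exc:
                print("skipping", url, ":", exc)  # log it and move on
                continue
            print(url, "->", len(resp.text), "bytes")  # per-page processing goes here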

  • json service from data scraping with php

    - by fredz0003
    I am trying to figure out the best way to make this work; I am new to PHP. I was able to make my script find specific data in my HTML file with the following script, tested on my local server:

        <?php
        include('simple_html_dom.php');

        // create DOM from URL or local file
        $html = file_get_html('Lotto Texas.htm');

        // find each td with class currLotWinnum and print winNumbers
        echo "<b>The winning numbers are</b><br>";
        foreach ($html->find('td.currLotWinnum') as $winNumbers) {
            echo $winNumbers->innertext . '<br>';
        }
        ?>

    I need some light here: ultimately I would like to create a web service that returns JSON and access that data from my iOS application using the NSJSONSerialization class.


  • Scraping html WITHOUT unique identifiers using python

    - by Nicholas Law
    I would like to design an algorithm using Python that scrapes thousands of pages like this one and this one, gathers all the data and inserts it into a MySQL database. The script will run on a weekly or bi-weekly basis to update the database with any new information added to each individual page. Ideally I would like a scraper that is easy to work with for table-structured data, but also for data that does not have unique identifiers (i.e. id and class attributes). Which scraper add-on should I use: BeautifulSoup, Scrapy or Mechanize? Are there any particular tutorials/books I should be looking at for this desired result? In the long run I will be implementing a mobile app that works with all this data by querying the database.

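    When rows carry no ids or classes, position becomes the selector. A sketch of the usual approach with bs4; the URL and column indexes are illustrative and depend entirely on the real pages:

        import requests
        from bs4 import BeautifulSoup

        html = requests.get("http://example.com/listing").text  # placeholder URL
        soup = BeautifulSoup(html, "html.parser")

        records = []
        for table in soup.find_all("table"):
            for tr in table.find_all("tr")[1:]:   # skip the header row
                cells = [td.get_text(strip=True) for td in tr.find_all("td")]
                if len(cells) >= 3:               # guard against layout-only rows
                    records.append({"name": cells[0], "phone": cells[1], "city": cells[2]})

        # records can then be bulk-inserted with MySQLdb / mysql-connector
        print(records[:5])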

  • Scraping non-absolute URL

    - by cooldude
    I am trying to scrape www.weather.bm. I want all 10 radar images, but I can only get one (the image updates regularly), and it's not an absolute image URL. I was hoping I could use the images as an image slideshow like the one on that page, but I don't know how. Also, how can I drop images/Radarlegend.png? I just need the radar images. Here is my code:

        include('simple_html_dom.php');
        $html = file_get_html('http://www.weather.bm/radarMobile.asp');
        foreach ($html->find('img') as $element) {
            echo $element->src . '<br>';
        }

    My output is:

        <div id="main"> images/Radar/CurrentRadarAnimation_100km_sri/100km_sri-radar-2011-01-04-1556.jpg<br>images/Radarlegend.png<br></div> </div>


  • Perl scraping script not recognising certain characters

    - by user1849286
    I have a script that works fine locally but fails on the server. It displays the non-breaking space entity &nbsp; as ? when printing to standard output. In the parsing of the page, if I try to get rid of the non-breaking space with s/&nbsp;//g nothing happens, and neither does getting rid of the question mark with s/?//g. It seems to stick no matter what. Bizarrely, this is not an issue when running the script locally. Additionally, question marks inside a diamond symbol (the Unicode replacement character) are inserted everywhere, on both the server and locally, instead of apostrophes, although at least that is not breaking the parsing of the page locally. I'm confused, please help.


  • How to insert scraping data to mysql

    - by user1887288
    I am fetching data from other websites. Can anyone tell me how to insert the fetched data into a MySQL database? Below is the code I am using to fetch the results:

        $urls = $_POST["urls"];
        require_once('simple_html_dom.php');
        $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        foreach ($urls as $url) {
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 20);
            curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
            $str = curl_exec($curl);
            curl_close($curl);

            $html = str_get_html($str);
            foreach ($html->find('span.price') as $e) {
                echo $e->innertext . '<br>';
            }
        }

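    The shape of the answer is the same in any language: gather the scraped values, then use a parameterized INSERT rather than concatenating strings into SQL (in PHP that means mysqli or PDO prepared statements). The pattern, sketched in Python with mysql-connector; the table and column names are assumptions:

        import mysql.connector

        conn = mysql.connector.connect(host="localhost", user="scraper",
                                       password="secret", database="shop")
        cur = conn.cursor()

        # assumed schema: CREATE TABLE prices (url VARCHAR(255), price VARCHAR(32))
        scraped = [("http://example.com/a", "19.99"), ("http://example.com/b", "4.50")]
        cur.executemany("INSERT INTO prices (url, price) VALUES (%s, %s)", scraped)

        conn.commit()
        cur.close()
        conn.close()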

  • PDF printer in Wine?

    - by Arkapravo
    I wish to convert (print) my MS Word files to PDF on the fly! I am on Ubuntu 9.10 and using Wine 1.1.40. Can someone help? I have heard that a PDF printer can be installed in Wine via CUPS! Thanks!


  • Encrypt uploaded pdf files with mcrypt and php

    - by microchasm
    I'm currently set up with a CentOS box that uses mcrypt to encrypt/decrypt data to/from the database. In my haste, I forgot that I also need a solution to encrypt files (primarily PDF, with an XLS and TXT file here and there). Is there a way to use mcrypt to encrypt uploaded PDF files? I understand the possibility of file_get_contents() with TXT; is a similar solution available for other formats? Thanks!


  • open pdf when usb is plugged in

    - by Funky Dude
    I have a PDF file on a USB drive. How do I get it to open automatically when I plug in the USB drive? No dialog or anything, just open the PDF directly after the drive is plugged in. Let's say we have to do this in Windows XP; autorun.inf doesn't seem to be able to do it.


  • Adobe pdf icon disappeared!

    - by Nano8Blazex
    Starting last week, I noticed that all my PDF files have a generic white document icon instead of the original Adobe PDF icon... I've reinstalled Adobe Reader and repaired it, with no success in getting the original icon back. The generic document icon is really getting into my head now... it's just... generic. Is there any way to fix this?


