scrape - Page 6 - Developer IT

how to convert webpage apostrophe (’) to ascii 39 in ruby 1.8.7

- by maninwarren

That's pretty much it. I'm using Nokogiri to scrape a web page what has ’ ; characters in it, and I can't figure out how to do the conversion. here's what I tried: str.gsub(/’/,"'") str.gsub("’","'") str.gsub("GÇÖ","'") # that's how it looks when I do a puts (In the above, there's no space between the ’ and the ";", but if I don't put the space in, SO converts it to an apostrophe -- the cruel, cruel irony!) I'm sure this is covered somewhere, but couldn't find the solution here or on the web. TIA

Read the article

PHP Magento Screen Scraping

- by Grant unwin

I am trying to scrape a suppliers magento site in an effort to save some time because of there being around 2000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything but i've encountered a major problem. Im using get_file_contentsto gather the html of the product page. The problem is: You need to be logged in, to view the product page. Its a standard magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method. Thanks

Read the article

Programmatic Reaction to Receiving New Email

- by vicatcu

I'm interested in automating some reactive work I do when receiving certain emails in one of my email accounts. What I would like to have happen is: On receipt of new email in the account If the new email passes the "Need to React" criteria (based on body content and subject line) 3a. Scrape some content out of the email body and subject lines 3b. Populate a template form (e.g. Excel spreadsheet) with the scraped data 3c. Print the populated form and save the populated form in some folder (e.g. as a pdf) What's the best (defined as easiest to implement by myself) approach / combination of technologies for achieving this automation?

Read the article

wxPython formatting questions

- by Kevin

I have an app I was working on to learn more about wxPython( I have been primarily been a scripter ). I forgot about it now I am opening it back up. It's a screen scraper, and I have it working almost the way I want it, going to build a regex parser to strip out the links in every scrape that I don't need. The questions I have are this. In it current state, if I check more than one site, it goes out and scrapes, and returns it in separate windows, the for:each section in the Clicked function. I want to put them in a frame, in the window, altogether. I also want to know if I can take the list they are read into and send it to a checklist, so someone could check off separate items, I want to build a save function and keep certain ones. In regards to a save function, I want to keep saved checks, are there calls to the widgets to save their states? I know it's a lot, but thanks for the help.

Read the article

Scraping non-absolute URL

- by cooldude

I am trying to scrape www.weather.bm. I want all 10 radar images, but I can only get one (the image updates regularly) and it's not a absolute image url. I was hoping I could use the image as a image slideshow like the link but dont know how. Also, how can I remove images/Radarlegend.png? I just need the radar images. Here is my code: include('simple_html_dom.php'); $html = file_get_html('http://www.weather.bm/radarMobile.asp'); foreach($html->find('img') as $element) echo $element->src . '<br>' My output is: <div id="main"> images/Radar/CurrentRadarAnimation_100km_sri/100km_sri-radar-2011-01-04-1556.jpg<br>images/Radarlegend.png<br></div> </div>

Read the article

How to know if the website being scraped has changed?

- by Lost_in_code

I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead. It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.

Read the article

What micro web-framework has the lowest overhead but includes templating

- by Simon Martin

I want to rewrite a simple small (10 page) website and besides a contact form it could be written in pure html. It is currently built with classic asp and Dreamweaver templates. The reason I'm not simply writing 10 html pages is that I want to keep the layout all in 1 place so would need either includes or a masterpage. I don't want to use Dreamweaver templates, or batch processing (like org-mode) because I want to be able to edit using notepad (or Visual Studio) because occasionally I might need to edit a file on the server (Go Daddy's IIS admin interface will let me edit text). I don't want to use ASP.NET MVC or WebForms (which I use in my day job) because I don't need all the overhead they bring with them when essentially I'm serving up 9 static files, 1 contact form and 1 list of clubs (that I aim to use jQuery to filter). The shared hosting package I have on Go Daddy seems to take a long time to spin up when serving aspx files. Currently the clubs page is driven from an MS SQL database that I try to keep up to date by manually checking the dojo locator on the main HQ pages and editing the entries myself, this is again way over the top. I aim to get a text file with the club details (probably in JSON or xml format) and use that as the source for the clubs page. There will need to be a bit of programming for this as the HQ site is unable to provide an extract / feed so something will have to scrape the site periodically to update my clubs persistence file. I'd like that to be automated - but I'm happy to have that triggered on a visit to the clubs page so I don't need to worry about scheduling a job. I would probably have a separate process that updates the persistence that has nothing to do with the rest of the site. Ideally I'd like to use Mercurial (or git) to publish, I know Bitbucket (and github) both serve static page sites so they wouldn't work in this scenario (dynamic pages and a contact form) but that's the model I'd like to use if there is such a thing. My requirements are: Simple templating system, 1 place to define header, footers, menu etc., that can be edited using just notepad. Very minimal / lightweight framework. I don't need a monster for 10 pages Must run either on IIS7 (shared Go Daddy Windows hosting) or other free host

Read the article

How to wrap console utils in webserver

- by Alex Brown

I have a big dataset (100Mbs/day) and a bunch of console a TCL/TK tools to view it - I want to turn it into a web app that I can build, and others can maintain. In long: my group runs simulations yielding 100s of Mbs of data daily, in multiple (mostly but not only) text forms. We have a bunch of scripts and tools, mostly old school 1990's style stuff requiring a 5-button mouse, as well as lots of ad-hoc scripts that engineers build out of frustration every month or so. These produces UIs, graphs, spreadsheets (various sizes), logs, event histories etc. I want to replace (or at least supplement) the xwindows / console style UI with a web-based one, so I need the following properties: pleasant to program can wrap existing command-line tools in separate views (I don't need to scrape GUIs or anything) as I port logic from the existing scripts I can create a modularised and pleasant codebase to replace it I can attach a web-ui to navigate between views - each view is likely to contain keys which might make sense to view in another I am new to building systems that have logic on the back-end and front-end of a web-server. from that point of view, they do this: backend wraps old-school executables, constructs calls into them and them takes the output and wraps it up, niceifies it and delivers it to the web client. For instance the tool might generate a number of indexed images (per invocation) which I might deliver all at once or on-demand. May (probably) need to to heavy stats on some sources. frontend provides navigation connecting multiple views, performs requests from one view for data from another (or self to self), etc. Probably will have some views with a lot of interactivity. Can people please point me towards viable solutions for this? I know it's a bit of an open question so as answers come in I hope to refine the spec until we have a good match. I guess I expect to see answers like "RoR!" "beans!" "Scala!" but please give an indication of why those are a good fit; I know nothing! I got bumped off SO for asking an open-ended question, so sorry if its OT here too (let me know). I take the policy that I use the best/closest matched language for a project but most of my team are extremely low level (ie pipeline stages and CDyn) so I don't have the peer group to know where to start.

Read the article

Screen scraping software that will traverse pages

- by nilbus

We're creating a mashup site that pulls information from many sources all over the web. Many of these sites don't provide RSS feeds or APIs to access the information they provide. This leaves us with screen scraping as our method for collecting the data. There are many scripting tools out there written in different scripting languages for screen scraping that require you to write scraping scripts in the language the scraper was written in. Scrapy, scrAPI, and scrubyt are a few written in Ruby and Python. There are other web-based tools I've seen like Dapper that create XML or RSS feeds based on a webpage. It has a beautiful web-based interface that requires no scripting skills to use. This would be a great tool, if it were able to traverse multiple pages to gather data from hundreds pages of results. We need something that will scrape information from paginated web sites, much like scrubyt, but with a user interface that a non-programmer could use. We'll script up our own solution if we need to, probably using scrubyt, but if there's a better solution out there, we want to use it. Does anything like this exist?

Read the article

Zabbix Trigger for SELinux (type=AVC) Errors

- by Kevin Soviero

I would like to create a trigger in Zabbix to alert me anytime a type=AVC error appears in a CentOS 6 server's /var/log/audit/audit.log file. I've already tried creating a basic log scrape. E.g.: log[/var/log/audit/audit.log,type=AVC,"UTF-8",100] However, it does not work. I believe this is due to the /var/log/audit/audit.log and it's parent folder using the following permissions: drwxr-x---. 2 root root 4096 Apr 20 04:29 . drwxr-xr-x. 13 root root 4096 Apr 14 12:07 .. -rw-------. 1 root root 5948185 Apr 20 15:27 audit.log -r--------. 1 root root 6291566 Apr 20 04:29 audit.log.1 -r--------. 1 root root 6291704 Apr 19 16:56 audit.log.2 -r--------. 1 root root 6291499 Apr 19 05:22 audit.log.3 -r--------. 1 root root 6291552 Apr 18 17:48 audit.log.4 I would prefer not to change the permissions for security reasons. Has anyone done log monitoring of /var/log/audit/audit.log using Zabbix? And if so, how?

Read the article

Ubuntu 13.04 to 13.10: Filesystem check or mount failed [migrated]

- by SamHuckaby

I attempted to upgrade from Ubuntu 13.04 to 13.10 today, and mid upgrade the system started flaking out, and eventually locked up entirely. I was forced to restart the computer, and am now unable to get the computer to boot up at all. When I boot currently, it takes me to the GRUB menu, and I can choose to boot normally, or boot in an older version. I have tried several things, which I list below, but no matter what, when I try to finish booting into Ubuntu, I receive the following error: Filesystem check or mount failed. A maintenance shell will now be started. CONTROL-D will terminate this shell and continue booting after re-trying filesystems. Any further errors will be ignored root@ubuntu-computername:~# I have fun fsck -f and everything appears correct, no errors are reported. and it passes all 5 checks. If I run fdisk -l then I get the following information: Disk /dev/sda: 320.1 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 4096 bytes / 4096 bytes Disk identifier: 0x00010824 Device Boot Start End Blocks Id System /dev/sda1 * 2048 608456703 304227328 83 Linux /dev/sda2 608458750 625141759 8341505 5 Extended Partition 2 does not start on physical sector boundary. /dev/sda5 608458752 625141759 8341504 82 Linux swap / Solaris Disk /dev/sdb: 320.1 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes Disk identifier: 0x0fb4b7e8 Device Boot Start End Blocks Id System /dev/sdb1 8192 625139711 312565760 7 HPFS/NTFS/exFAT I am considering just installing a new OS on the other disk, that currently has nothing on it, and then just attempting to scrape my data off the old disk (thankfully I didn't encrypt the files). Really my question is this: Can I salvage this Ubuntu install, or should I give up and just reinstall?

Read the article

How can I recover a Fedora 12 installation that is showing signs of disk errors?

- by Bob Cross

I am currently overseas (i.e., very far from my normal library of tools) and my primary machine that would normally act as the data server in the performance test that we're trying to run is failing to boot to Fedora 12 properly. This is a machine that, as of yesterday, was booting fine. However, this morning, very strange portions of the boot process were complaining with messages such as "unexpected 0x0 in rpcbind" and "bad file descriptor" (I don't have the error in front of me - scavenged a windows installation to get onto serverfault). Eventually, the boot hung for a long time at the NFS service and then brought up what looked like the KDE login screen but neither the mouse nor keyboard functioned. In olden days, I would try to get to a point where I could manage to run fsck and pray that the bad sectors would come back into alignment just long enough for me to scrape the critical data off of the machine. However, now that we live in the future, it seems like our options in situations like this should be a little more varied. Is there a way to recover a Fedora 12 installation with bad disk sectors that won't boot properly? For completeness, I am comfortable working with bootable recovery distros-on-CD and such but I don't know which one is likely to work best with modern Fedora. In the absence of guidance, I'm frantically torrenting the Fedora 12 Live CD and DVD, hoping to try rescue mode before tomorrow morning.

Read the article

webscraper grabbing images, but not entering info into database

- by Jason

Hello, again. I'm having more issues with my script entering info into my database. The script below grabs a page, strips down the necessary info, then downloads the related image file. After that, it is supposed to enter the information gleaned from the URL into the database. For some reason, the script seems to iterate through the URLs, as I get downloaded images for each URL, but each URL's product is not entered into the database. The script will insert the first product's categories and product info, and then it just stops, and continues to download images. Any suggestions? <?php define('IN_PHPBB', true); $phpbb_root_path = (defined('PHPBB_ROOT_PATH')) ? PHPBB_ROOT_PATH : './'; $phpEx = substr(strrchr(__FILE__, '.'), 1); include($phpbb_root_path . 'common.' . $phpEx); include($phpbb_root_path . 'includes/simple_html_dom.' . $phpEx); // Start session management $user->session_begin(); $auth->acl($user->data); $user->setup(); set_time_limit(259200); function save($in, $out) { $ch = curl_init ($in); curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_BINARYTRANSFER,1); $rawdata=curl_exec($ch); curl_close ($ch); if(file_exists($out)) { unlink($out); } $fp = fopen($out,'x'); fwrite($fp, $rawdata); fclose($fp); } function scrape($i) { $url = 'http:/xxxxxxxx/index.php?main_page=product_info&products_id='.$i.'&zenid=e4b7dde8de02e1df005d4549e2e3e529'; echo "$url -- "; $exists = file_get_contents($url); if ($exists != false) { $html = file_get_html($url); foreach($html->find('body') as $html) { $test = $html->find('#productName', 0); if ($test) { $item['title'] = trim($html->find('#productName', 0)->plaintext); $item['price'] = trim($html->find('#productPrices', 0)->plaintext); $item['cat'] = $html->find('#navBreadCrumb', 0)->plaintext; list($home, $item['cat'], $item['subcat'], $title) = explode("::", $item['cat']); $item['cat'] = str_replace(" ", "", $item['cat']); $item['subcat'] = str_replace("\n", "", str_replace(" ", "", $item['subcat'])); $item['desc'] = trim($html->find('#productDescription', 0)->plaintext); $item['model'] = $html->find('ul#productDetailsList', 0)->find('li', 0)->plaintext; $item['model'] = explode(":", $item['model']); $item['model'] = trim($item['model'][1]); $item['manufacturer'] = $html->find('ul#productDetailsList', 0)->find('li', 1)->plaintext; $item['manufacturer'] = explode(":", $item['manufacturer']); $item['manufacturer'] = trim($item['manufacturer'][1]); foreach($html->find('img') as $img) { if($img->alt == $item['title']) { $item['img_sm'] = $img->src; } } $ret[] = $item; } } $html->clear(); unset($html); unset($item); return $ret; } else { echo "Could not find page<br />"; } unset($exists); } $i = 1; $end = 9999999; while($i < $end) { $ret = scrape($i); if(isset($ret)) { foreach($ret as $v) { $item['title'] = $v['title']; $item['price'] = $v['price']; $item['desc'] = $v['desc']; $item['model'] = $v['model']; $item['manufacturer'] = $v['manufacturer']; $item['image'] = $v['image']; $item['cat'] = $v['cat']; $item['subcat'] = $v['subcat']; $item['img_sm'] = $v['img_sm']; } unset($ret); unset($v); $sm_img_src = "http://xxxxxx/".$item['img_sm']; $ext = strrchr($item['img_sm'], '.'); $filename = $item['model'] . $ext; $lg_img_src = "http://xxxxx/images/STC/".$filename; $new_sm = "./rip_images/small/{$filename}"; $new_lg = "./rip_images/large/{$filename}"; $item['image'] = $filename; save($lg_img_src,$new_lg); save($sm_img_src,$new_sm); //see if parent cat exists $sql = 'SELECT cat_id FROM ' . SHOP_CAT_TABLE . ' WHERE cat_name = "'.$db->sql_escape($item['cat']).'"'; $result = $db->sql_query($sql); $parent = $db->sql_fetchrow($result); $db->sql_freeresult($result); // if not exists if($parent['cat_id'] == '') { //add the parent cat to the db $sql_ary = array( 'cat_name' => $item['cat'], 'cat_parent' => 0 ); $sql = 'INSERT INTO '.SHOP_CAT_TABLE.' '.$db->sql_build_array('INSERT', $sql_ary); $db->sql_query($sql); $cat_id = $db->sql_nextid(); //see if subcat exists $sql = 'SELECT cat_id FROM ' . SHOP_CAT_TABLE . ' WHERE cat_name = "'.$db->sql_escape($item['subcat']).'"'; $result = $db->sql_query($sql); $row = $db->sql_fetchrow($result); $db->sql_freeresult($result); // if not exists if($row['cat_id'] == '') { //add subcat to db $sql_ary = array( 'cat_name' => $db->sql_escape($item['subcat']), 'cat_parent' => $cat_id ); $sql = 'INSERT INTO '.SHOP_CAT_TABLE.' '.$db->sql_build_array('INSERT', $sql_ary); $db->sql_query($sql); $item_cat = $db->sql_nextid(); } else //if exists { $item_cat = $row['cat_id']; } } else //if parent cat exists { //see if subcat exists $sql = 'SELECT cat_id FROM ' . SHOP_CAT_TABLE . ' WHERE cat_name = "'.$db->sql_escape($item['subcat']).'"'; $result = $db->sql_query($sql); $row = $db->sql_fetchrow($result); $db->sql_freeresult($result); // if not exists if($row['cat_id'] == '') { //add the subcat to the db $sql_ary = array( 'cat_name' => $db->sql_escape($item['subcat']), 'cat_parent' => $parent['cat_id'] ); $sql = 'INSERT INTO '.SHOP_CAT_TABLE.' '.$db->sql_build_array('INSERT', $sql_ary); $db->sql_query($sql); $item_cat = $db->sql_nextid(); } else //if exists { $item_cat = $row['cat_id']; } } $sql_ary = array( 'item_title' => $db->sql_escape($item['title']), 'item_price' => $db->sql_escape($item['price']), 'item_desc' => $db->sql_escape($item['desc']), 'item_model' => $db->sql_escape($item['model']), 'item_manufacturer' => $db->sql_escape($item['manufacturer']), 'item_image' => $db->sql_escape($item['image']), 'item_cat' => $db->sql_escape($item_cat) ); $sql = 'INSERT INTO ' . SHOP_ITEM_TABLE . ' ' . $db->sql_build_array('INSERT', $sql_ary); $db->sql_query($sql); garbage_collection(); echo 'Done<br />'; } $i++; unset($item); } ?>

Read the article

Automated Error Reporting = More Robust Software

- by Laila

I would like to tell you how to revolutionize your software development process </marketing hyperbole> On a more serious note, we (Red Gate's .NET Development team) recently rolled a new tool into our development process which has made our lives dramatically easier AND improved the quality of our software, and I (& one of our developers, Alex Davies) just wanted to take a quick moment to share the love. I work with a development team that takes pride in what they ship, so we take software testing rather seriously. For every development project we run, we allocate at least one software tester for every two developers, and we never ship software without first shipping early access releases and betas to get user feedback. And therein lies the challenge -encouraging users to provide consistent, useful feedback is a headache, but without that feedback, improving the software is. tricky. Until fairly recently, we used the standard (if long-winded) approach of receiving bug reports of variable quality via email or through our support forums. If that didn't give us enough information to reproduce the problem - which was most of the time - we had to enter into a time-consuming to-and-fro conversation with the end-user, to get scrape together the data we needed to work out where the problem lay. As I'm sure you're aware, this is painfully slow. To the delight of the team, we recently got to work with SmartAssembly, which lets us embed automated exception and error reporting into our software with very little pain, and we decided to do a little dogfooding. As a result, we've have made a really handy (if perhaps slightly obvious) discovery: As soon as we release a beta, or indeed any release of software, we now get tonnes of customer feedback through automated error reports. Making this process easier for our users has dramatically increased the amount (and quality) of feedback we get. From their point of view, they get an experience similar to Microsoft's error reporting, and process is essentially idiot-proof. From our side of things, we can now react much faster to the information we get, fixing the bugs and shipping a new-and-improved release, which our users rather appreciate. Smiles and hugs all round. Even more so because, as we're use SmartAssembly's Automated Error Reporting, we get to avoid having to spend weeks building an exception reporting mechanism. It takes just a few minutes to add reporting to a project, and we get a bunch of useful information back, like a stack trace and the values of all the local variables, which we can use to fix bugs. Happily, "Automated Error Reporting = More Robust Software" can actually be read two ways: we've found that we not only ship higher quality software, but we also release within a shorter time. We can ship stable software that our users are happy to upgrade to, and we then bask in the glory of lots of positive customer feedback. Once we'd starting working with SmartAssembly, we were curious to know how widespread error reporting was as a practice. Our product manager ran a survey in autumn last year, and found that 40% of software developers never really considered deploying error reporting. Considering how we've now got plenty of experience on the subject, one of our dev guys, Alex Davies, thought we should share what we've learnt, and he's kindly offered to host a webinar on delivering robust software with Automated Error Reporting. Drawing on our own in-house development experiences, he'll cover how to add error reporting to your program, how to actually use the error reports to fix bugs (don't snigger, not everyone's as bright as you), how to customize the error report dialog that your users see, and how to automatically get log files from your users' machine. The webinar will take place on Jan 25th (that's next week). It's free to attend, but you'll still need to register to hear Alex's dulcet tones.

Read the article

There are 2 jobs available - which one sounds better all round [closed]

- by Steve Gates

I am currently employed at a company where we scrape by each year breaking even, sometimes having a little profit. The development environment is very relaxed and we have a laugh. My colleagues are not interested in improving their knowledge unless they have to, so trying to get them to adopt things like TDD is a non-starter. My development manager is stuck in .Net 2 land and refuses to use things like LINQ. He over complicates architecture and writes very unreadable code, heres an example SortedList<int,<SortedList<int,SortedList<int, MyClass>>>> The MD of the company has no drive and lets the one sales guy bring in the contracts. We are not busy all the time and this allows me time to look at new technology and learn. In terms of using things like TDD, my development manager has no problem with it and can kind of see the purpose of it, he just wont use it himself. This means I am alone in learning new things and am often resorting to StackOverflow to make sure I get things right. The company has a lot of flexibility, I can work from home if needs be and when my daughter was born they let me work from home 1 day a week however they expect this flexibility in return often asking me to travel occasionally on a Friday afternoon for the following week. Sometimes its abroad. We are also pretty much on call 24/5 as we have engineers in various countries. Also we have no testers so most of the testing is done by us developers and some testing by engineers. Either way no-one likes testing! I have been offered a role at a company I worked at 5 years ago. They were quite Victorian in their working practices but it appears to have relaxed now although I suspect still reasonably formal. There is a new team of developers I don't know and they are about to move to new offices. The team lead is a guy that was there when I was and I get the impression he takes his role seriously and likes his formal procedures and documentation. I think some of the Victorian practices may have rubbed off on him. However he did say if things crop up then as long as I can trust the person they can work at home although he prefers people in the office. The team uses SCRUM, TDD and SOLID design principles so they are quite up to date in technology. They are reasonably Microsoft focused. It appears the Technical Director might be the R&D man and research new technology on his own not allowing developers to play with new technology. He possibly might be a super developer and makes all the decisions that no can argue with. They are currently moving to Entity Framework away from NHibernate based on issues that their queries seem to fail sometimes and they feel NHibernate is stagnant. They have analysts and a QA team. The MD is focused and they are an expanding company making profit each year. I'm not sure what the team morale is and whether they have a laugh. When I had a tour around the office they were there in dead silence. I'm really unsure which role is the best for me and going with my gut instinct is useless as I'm not sure what my gut is telling me. Based on the information above which role would you choose and why?

Read the article

Jquery.get() not returning any data

- by David Waters

Hi I am trying to scrape other people web pages (for the forces of good not evil). I am currently trying to do this with javascript/jquery from with in a browser. I am finding that the no data is returned from the jquery.get() success call back function. My code. $.get('http://www.google.co.uk/', function (data, textStatus, XMLHttpRequest){ alert("status " + textStatus); alert('data:' + data); window.data=data; window.textStatus=textStatus; window.httpReq = XMLHttpRequest}); In my mind this should simply do a get on google store the data in window.data and we are all good. What happens is we get textStatus == success and data == "". the status on the XMLHttpRequest is 4(success). I have looked at the network traffic using a transparent proxy (Charles) and everything looks find there http status 200 plenty of data being returned. I am running this just from the Firebug console in Firefox. Any ideas?

Read the article

Parsing with BeautifulSoup, error message TypeError: coercing to Unicode: need string or buffer, NoneType found

- by Samsun Knight

so I'm trying to scrape an Amazon page for data, and I'm getting an error when I try to parse for where the seller is located. Here's my code: #getting the html request = urllib2.Request('http://www.amazon.com/gp/offer-listing/0393934241/') opener = urllib2.build_opener() #hiding that I'm a webscraper request.add_header('User-Agent', 'Mozilla/5 (Solaris 10) Gecko') #opening it up, putting into soup form html = opener.open(request).read() soup = BeautifulSoup(html, "html5lib") #parsing for the seller info sellers = soup.findAll('div', {'class' : 'a-row a-spacing-medium olpOffer'}) for eachseller in sellers: #parsing for price price = eachseller.find('span', {'class' : 'a-size-large a-color-price olpOfferPrice a-text-bold'}) #parsing for shipping costs shippingprice = eachseller.find('span' , {'class' : 'olpShippingPrice'}) #parsing for condition condition = eachseller.find('span', {'class' : 'a-size-medium'}) #parsing for seller name sellername = eachseller.find('b') #parsing for seller location location = eachseller.find('div', {'class' : 'olpAvailability'}) #printing it all out print "price, " + price.string + ", shipping price, " + shippingprice.string + ", condition," + condition.string + ", seller name, " + sellername.string + ", location, " + location.string I get the error message, pertaining to the 'print' command at the end, "TypeError: coercing to Unicode: need string or buffer, NoneType found" I know that it's coming from this line - location = eachseller.find('div', {'class' : 'olpAvailability'}) - because the code works fine without that line, and I know that I'm getting NoneType because the line isn't finding anything. Here's the html from the section I'm looking to parse: <*div class="olpAvailability"> In Stock. Ships from WI, United States. <*br/><*a href="/gp/aag/details/ref=olp_merch_ship_9/175-0430757-3801038?ie=UTF8&asin=0393934241&seller=A1W2IX7T37FAMZ&sshmPath=shipping-rates#aag_shipping">Domestic shipping rates</a> and <*a href="/gp/aag/details/ref=olp_merch_return_9/175-0430757-3801038?ie=UTF8&asin=0393934241&seller=A1W2IX7T37FAMZ&sshmPath=returns#aag_returns">return policy</a>. <*/div> (but without the stars - just making sure the HTML doesn't compile out of code form) I don't see what's the problem with the 'location' line of code, or why it's not pulling the data I want. Help?

Read the article

Convert a (nested)HTML unordered list of links to PHP array of links

- by Klark

Hi, I have a regular, nested HTML unordered list of links, and I'd like to scrape it with PHP and convert it to an array. The original list looks something like this: <ul> <li><a href="http://someurl.com">First item</a> <ul> <li><a href="http://someotherurl.com/">Child of First Item</a></li> <li><a href="http://someotherurl.com/">Second Child of First Item</a></li> </ul> </li> <li><a href="http://bogusurl.com">Second item</a></li> <li><a href="http://bogusurl.com">Third item</a></li> <li><a href="http://bogusurl.com">Fourth item</a></li> </ul> Any of the items can have children. (The actual screen scraping is not a problem, I can do that.) I'd like to turn this into a PHP array, of just the links, while keeping the hierarchical nature of the list. Any ideas? I've looked at using htmlsimpledom and phpQuery, which both use jQuery like syntax. But, I can't seem to get the syntax right. Thanks.

Read the article

number->string and related procedures in GIMP scheme scripting

- by dim fish

I am frustrated with string-to-number and number-to-string conversion in GIMP scripting. I am runnning GIMP 2.6.8 in Windows Vista. I understand that GIMP's internal Scheme implementation changes over the versions and I can't seem to nail down the documentation. From what I can gather GIMP's Scheme is a subset of TinyScheme and/or supports the R5RS standard procedures. In any case, I usually just look in the packaged script directory for examples when I want to try something new, because that should work for sure, right? For example, grid-system.scm comes with the latest GIMP release and has the expression, (string-append (number->string obj) " ") which is exactly what I want. However, if I use number-string in my own script, or even type it into GIMP's script console (which is how I usually test out new stuff I want to do) it tells me number-string is an unbound variable: > (number->string 3) Error: eval: unbound variable: number->string Other standard procedures from, say R5RS, work just fine: > (string-append "frust" "rated") "frustrated" So, 1) Is there some lurking documentation for current GIMP Scheme scripting other than something drastic like searching GIMP's source code? 2) Can I use the GIMP console to spit out a list of all defined procedures to find something I need? 3) Anyone else confirm that number-string is not defined for the current Windows build, even though it appears in the packaged scripts? My web searches haven't turned up any related problems, and a complete uninstall of all GIMP versions, back to latest puts me in the same scrape.

Read the article

Get coordinates of google maps markers

- by App_beginner

I am creating a databse containing the names and coordinates of all bus stops in my local area. I have all the names stored in my database, and now I need to add the coordinates. I am trying to get these of a website that contains them all as placemarks on a google map. It seems to me like they are being generated from a local server, and then added to the map. However I am unable to find exactly where the server is queried for the coordinates. I am hoping to collect these coordinates through the use of a screen scraper. However unless I am able to find where in the source code the coordinates are created this seems to be impossible. I can of course search and collect all these placemarks manually, but that will be very time consuming. So I am hoping that someone in here can help me. This is the website I am trying to scrape. The placemarks are marked by the blue bus sign: http://reiseplanlegger.skyss.no/scripts/travelmagic/TravelMagicWE.dll/?from=Brimnes+ferjekai+%28Eidfjord%29&to= You can also get the coordinates of a single placemark by writing the name of the stop in the search field and pushing the "Vis i kart" button. I hope someone can help me with this.

Read the article

Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)

- by Horace Loeb

I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form: However, it turns out that this isn't even a real form. Instead, when you click "Log in" some javascript grabs the values of the fields, creates a new form dynamically, and submits it. "Log in" button HTML: <button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log in</span></button> The handleLogin() function: function handleLogin() { if (ValidateMsisdnPassword()) { // client-side form validation logic var a = document.createElement("FORM"); a.name = "form1"; a.method = "POST"; a.action = mytmoUrl; // defined elsewhere as https://my.t-mobile.com/Login/LoginController.aspx var c = document.createElement("INPUT"); c.type = "HIDDEN"; c.value = document.getElementById("myTMobile-phone").value; // the value of the phone number input field c.name = "txtMSISDN"; a.appendChild(c); var b = document.createElement("INPUT"); b.type = "HIDDEN"; b.value = document.getElementById("myTMobile-password").value; // the value of the password input field b.name = "txtPassword"; a.appendChild(b); document.body.appendChild(a); a.submit(); return true } else { return false } } I could simulate this form submission by POSTing the form data to https://my.t-mobile.com/Login/LoginController.aspx with Net::HTTP#post_form, but I don't know how to get the resultant cookie into Mechanize so I can continue to scrape the UI available when I'm logged in. Any ideas?

Read the article

Python web scraping involving HTML tags with attributes

- by rohanbk

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </body> </html> I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do? At the moment, my code looks like what is below: import re import urllib2,sys import lxml from lxml import etree from lxml.html.soupparser import fromstring from lxml.etree import tostring from lxml.cssselect import CSSSelector from BeautifulSoup import BeautifulSoup, NavigableString address='http://www.example.com/' html = urllib2.urlopen(address).read() soup = BeautifulSoup(html) html=soup.prettify() html=html.replace('&nbsp', '&#160') html=html.replace('&iacute','&#237') root=fromstring(html) I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in more source file. EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in page that I want to scrape.

Read the article

Get Mechanize to handle cookies from an arbitrary POST (to log into https://www.t-mobile.com/ progra

- by Horace Loeb

I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form: However, it turns out that this isn't even a real form. Instead, when you click "Log in" some javascript grabs the values of the fields, creates a new form dynamically, and submits it. "Log in" button HTML: <button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log in</span></button> The handleLogin() function: function handleLogin() { if (ValidateMsisdnPassword()) { // client-side form validation logic var a = document.createElement("FORM"); a.name = "form1"; a.method = "POST"; a.action = mytmoUrl; // defined elsewhere as https://my.t-mobile.com/Login/LoginController.aspx var c = document.createElement("INPUT"); c.type = "HIDDEN"; c.value = document.getElementById("myTMobile-phone").value; // the value of the phone number input field c.name = "txtMSISDN"; a.appendChild(c); var b = document.createElement("INPUT"); b.type = "HIDDEN"; b.value = document.getElementById("myTMobile-password").value; // the value of the password input field b.name = "txtPassword"; a.appendChild(b); document.body.appendChild(a); a.submit(); return true } else { return false } } I could simulate this form submission by POSTing the form data to https://my.t-mobile.com/Login/LoginController.aspx with Net::HTTP#post_form, but I don't know how to get the resultant cookie into Mechanize so I can continue to scrape the UI available when I'm logged in. Any ideas?

Read the article

How to isolate a single element from a scraped web page in R

- by PaulHurleyuk

Hello, I'm trying to do soemone a favour, and it's a tad outside my comfort zone, so I'm stuck. I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html ) and others, to get the goal scorers and times. So far, this is what I've got require(RCurl) require(XML) theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html" webpage <- getURL(theURL, header=FALSE, verbose=TRUE) webpagecont <- readLines(tc <- textConnection(webpage)); close(tc) pagetree <- htmlTreeParse(webpagecont, error=function(...){}, useInternalNodes = TRUE) and the pagetree object now contains a pointer to my parsed html (I think). The part I want is <div class="cont")<ul> <div class="bold medium">Goals scored</div> <li>Philipp LAHM (GER) 6', </li> <li>Paulo WANCHOPE (CRC) 12', </li> <li>Miroslav KLOSE (GER) 17', </li> <li>Miroslav KLOSE (GER) 61', </li> <li>Paulo WANCHOPE (CRC) 73', </li> <li>Torsten FRINGS (GER) 87'</li> </ul></div> but I'm now lost as to how to isolate them, and frankly xpathSApply, xpathApply confuse the beejeebies out of me !. So, does anyone know how to fomulate a command to suck out the element conmtaiend within the tags ? Thanks Paul.

Read the article

Translate XSD element names to English

- by Coov

I have an XML Schema Definition file .XSD who's elements are in Spanish. I have an XML data file that is also in Spanish that matches the schema definition. I created a dataset from the xsd using xsd.exe and I'm loading the XML into the dataset. I want to translate the element names to English. I can think of two options off the top of my head. Either scrape the XSD & XML files and translate the elements prior to generating the dataset with esd.exe, or iterate the dataset after I have loaded it with the xml, and translate my objects. I do have a written document that provides the English names for every element name in Spanish. The problem is there are hundreds of elements and I was trying to avoid coding that manually. Getting precise translations is not that important, it just needs to be readable by an English speaking person. Here is an example of what an element may look like "Apellidos": <xs:element name="Apellidos" type="xs:string"/> that I will translate to "SirName": <xs:element name="SirName" type="xs:string"/> I'm looking for ideas and or opinions on a quick way to do this. It's a one time deal so I'm not working about it scaling or being functional for things other than this single xml file. I'll be taking this dataset and writing out a flat file for English speaking users to read.

Search Results

Search found 210 results on 9 pages for 'scrape'.

Page 6/9 | < Previous Page | 2 3 4 5 6 7 8 9 | Next Page >

- by maninwarren

- by Grant unwin

- by vicatcu

- by Kevin

- by cooldude

- by Lost_in_code

- by Simon Martin

- by Alex Brown

- by nilbus

- by Kevin Soviero

- by SamHuckaby

- by Bob Cross

- by Jason

- by Laila

- by Steve Gates

- by David Waters

- by Samsun Knight

- by Klark

- by dim fish

- by App_beginner

- by Horace Loeb

- by rohanbk

- by Horace Loeb

- by PaulHurleyuk

- by Coov

< Previous Page | 2 3 4 5 6 7 8 9 | Next Page >