Search Results

Search found 101 results on 5 pages for 'beautifulsoup'.

Page 4/5

  • ASP .NET, Javascript, AjaxControlToolkit render results with Selenium?

    - by Seth
    I'm a newbie to web stuff. However, I wish to scrape some data from multiple websites. I'm currently using the following technologies: Selenium, Python, and BeautifulSoup. I believe the site I am trying to scrape uses a combination of ASP.NET, JavaScript and the AjaxControlToolkit. I believe the key results I am looking for are produced by the following script:

        <script type="text/javascript">
        //<![CDATA[
        Sys.Application.initialize();
        Sys.Application.add_init(function() {
            $create(AjaxControlToolkit.AutoCompleteBehavior,
                {"completionInterval":50,
                 "completionListCssClass":"autocomplete_completionListElement",
                 "completionListItemCssClass":"autocomplete_listItem",
                 "completionSetCount":20,
                 "delimiterCharacters":"",
                 "highlightedItemCssClass":"autocomplete_highlightedListItem",
                 "id":"ctl00_ContentPlaceHolder1_AutoCompleteExtender1",
                 "minimumPrefixLength":4,
                 "serviceMethod":"GetSchoolNames",
                 "servicePath":"AutoComplete.asmx"},
                {"itemSelected":ItemSelected}, null,
                $get("ctl00_ContentPlaceHolder1_SchoolNameTextBox"));
        });
        Sys.Application.add_init(function() {
            $create(AjaxControlToolkit.AutoCompleteBehavior,
                {"completionInterval":50,
                 "completionListCssClass":"autocomplete_completionListElement",
                 "completionListItemCssClass":"autocomplete_listItem",
                 "delimiterCharacters":"",
                 "highlightedItemCssClass":"autocomplete_highlightedListItem",
                 "id":"ctl00_ContentPlaceHolder1_AutoCompleteExtender2",
                 "minimumPrefixLength":2,
                 "serviceMethod":"GetSuburbNames",
                 "servicePath":"AutoComplete.asmx"},
                null, null,
                $get("ctl00_ContentPlaceHolder1_SuburbTownTextBox"));
        });
        //]]>
        </script>

    Is there an easy way to get the results of the above script processed using Selenium so that I may pass them to BeautifulSoup?
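
    A minimal sketch of the usual Selenium-to-BeautifulSoup handoff: let the browser execute the page's JavaScript, then hand the rendered source to the parser. The URL and the driving steps below are placeholders, not taken from the question:

        from selenium import webdriver
        from BeautifulSoup import BeautifulSoup

        driver = webdriver.Firefox()
        driver.get("http://example.com/Search.aspx")   # hypothetical URL
        # interact with the autocomplete boxes / submit the form here,
        # then wait until the AJAX results are present in the DOM
        html = driver.page_source                      # source after the scripts have run
        soup = BeautifulSoup(html)
        driver.quit()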

    Read the article

  • Python - Problems using mechanize to log into a difficult website

    - by user1781599
    I am trying to log in to betfair.com by using mechanize. I have tried several ways but it always fails. This is the code I have developed so far; can anyone help me identify what is wrong with it and how I can improve it to log into my betfair account? Thanks,

        import cookielib
        import urllib
        import urllib2
        from BeautifulSoup import BeautifulSoup
        import mechanize
        from mechanize import Browser
        import re

        bf_username_name = "username"
        bf_password_name = "password"
        bf_form_name = "loginForm"
        bf_username = "xxxxx"
        bf_password = "yyyyy"
        urlLogIn = "http://www.betfair.com/"
        accountUrl = "https://myaccount.betfair.com/account/home?rlhm=0&"  # this url I will use to verify if log in has been successful

        br = mechanize.Browser(factory=mechanize.RobustFactory())
        br.addheaders = [("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.90 Safari/537.1")]
        br.open(urlLogIn)
        br.select_form(nr=0)
        print br.form
        br.form[bf_username_name] = bf_username
        br.form[bf_password_name] = bf_password
        print br.form  # just to check username and password have been recorded correctly
        responseSubmit = br.submit()
        response = br.open(accountUrl)

        text_file = open("LogInResponse.html", "w")
        text_file.write(responseSubmit.read())  # this file should show the home page with me logged in, but it shows the home page as if I was not logged in
        text_file.close()

        text_file = open("Account.html", "w")
        text_file.write(response.read())  # this file should show my account page, but it shows a pop-up with an error
        text_file.close()

    Read the article

  • What is meant by namespaced content and what advantages does it have?

    - by Geek
    I was reading this blog by James Bennett regarding HTML vs. XHTML. He writes: I don’t have any need for namespaced content; I’m not displaying any complex mathematical notation here and don’t plan to, and I don’t use SVG for any images. So that’s one advantage of XHTML out the window. I also don’t have any need for XML tools; all the processing I need to do can be handled by HTML-parsing libraries like BeautifulSoup. That’s the other advantage gone. What does he mean by namespaced content, and what advantage does it provide us?
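
    For context, a minimal illustration of what namespaced content means in XHTML: elements from another XML vocabulary (here SVG) embedded inline and identified by an xmlns namespace declaration, which plain HTML parsers do not interpret:

        <html xmlns="http://www.w3.org/1999/xhtml">
          <body>
            <p>A circle drawn with inline SVG:</p>
            <svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
              <circle cx="50" cy="50" r="40" fill="red"/>
            </svg>
          </body>
        </html>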

    Read the article

  • Mining Groups of people from Wikipedia

    - by AlgoMan
    I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation . I have to go through all the sections and get the people from each section. How should I go about it? Should I use a crawler to fetch the pages and search through them using BeautifulSoup? Or is there any other alternative to get the same from Wikipedia?
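
    One alternative worth noting: Wikipedia exposes category members through the MediaWiki API, so no HTML scraping is needed. A minimal sketch (pagination via cmcontinue, and recursion into subcategories, which this category mostly contains, are omitted):

        import json
        import urllib
        import urllib2

        params = urllib.urlencode({
            "action": "query",
            "list": "categorymembers",
            "cmtitle": "Category:People_by_occupation",
            "cmlimit": "500",
            "format": "json",
        })
        data = json.load(urllib2.urlopen("http://en.wikipedia.org/w/api.php?" + params))
        for member in data["query"]["categorymembers"]:
            print member["title"]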

    Read the article

  • Finding inline style with lxml.cssselector

    - by ropa
    New to this library (no more familiar with BeautifulSoup either, sadly), trying to do something very simple (search by inline style):

        <td style="padding: 20px">blah blah </td>

    I just want to select all tds where style="padding: 20px", but I can't seem to figure it out. All the examples show how to select td, such as:

        for col in page.cssselect('td'):

    but that doesn't help me much.
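
    A minimal sketch using a CSS attribute selector, which cssselect supports; note it matches the style string exactly, so "padding:20px" or a value with extra properties would not match. The file name is hypothetical:

        from lxml import html

        page = html.fromstring(open("page.html").read())
        for col in page.cssselect('td[style="padding: 20px"]'):
            print col.text_content()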

    Read the article

  • HTML parser for GAE

    - by Richard
    Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results. Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.

    Read the article

  • Trouble with encoding and urllib

    - by Ockonal
    Hello, I'm loading a web page using urllib. There are Russian symbols in it, and the page encoding is 'utf-8'.

    1)  pageData = unicode(requestHandler.read()).decode('utf-8')

        UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 262: ordinal not in range(128)

    2)  pageData = requestHandler.read()
        soupHandler = BeautifulSoup(pageData)
        print soupHandler.findAll(...)

        UnicodeEncodeError: 'ascii' codec can't encode characters in position 340-345: ordinal not in range(128)
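
    The first error comes from unicode() itself: handed raw bytes, it attempts an implicit ASCII decode before the explicit .decode('utf-8') ever runs. A minimal sketch of the usual fix, assuming the page really is UTF-8:

        # decode the raw bytes directly, without wrapping in unicode()
        pageData = requestHandler.read().decode('utf-8')

        soupHandler = BeautifulSoup(pageData)
        for tag in soupHandler.findAll('a'):   # 'a' stands in for the elided findAll(...)
            # the second error is raised when printing unicode to an ASCII
            # terminal; encode explicitly instead
            print unicode(tag).encode('utf-8')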

    Read the article

  • Parsing text file in python

    - by Ockonal
    Hello, I have an html file. I have to replace all text between markers like this: [%anytext%]. As I understand it, parsing the html is very easy to do with BeautifulSoup. But what would the regular expression be, and how do I remove the text and write the new data back?
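
    A minimal sketch of the regex part, assuming every placeholder has the form [%...%] and should be swapped for new text; BeautifulSoup is not actually needed for this step. The file name and replacement string are hypothetical:

        import re

        html = open("page.html").read()
        # non-greedy match so each [%...%] is replaced separately
        newHtml = re.sub(r'\[%.*?%\]', 'replacement text', html)
        open("page.html", "w").write(newHtml)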

    Read the article

  • Which XML library for what purposes?

    - by John Mee
    A search for "python" and "xml" returns a variety of libraries for combining the two. This list is probably faulty: xml.dom, xml.etree, xml.sax, xml.parsers.expat, PyXML, BeautifulSoup(?), HTMLParser, htmllib, sgmllib. It would be nice if someone could offer a quick summary of when to use which, and why.

    Read the article

  • jquery-like HTML parsing in Python?

    - by Roy Tang
    Is there any Python library that allows me to parse an HTML document similar to what jQuery does? i.e. I'd like to be able to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc. The only Python HTML parsing lib I've used before was BeautifulSoup, and even though it's fine I keep thinking it would be faster to do my parsing if I had jQuery syntax available. :D
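
    A minimal sketch of one library in this space, pyquery, which layers jQuery-style selectors over lxml (lxml's own cssselect() is a lower-level alternative); the file name and selector are hypothetical:

        from pyquery import PyQuery as pq

        d = pq(open("page.html").read())
        for link in d('div.content a.title'):
            print link.get('href'), link.text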

    Read the article

  • Python: How to extract xml embedded in a html file?

    - by georgehu
    I have an html file with an xml snippet embedded; the source code is pasted at pastebin: http://pastebin.com/Hy0QaWk8 My task is to extract the text enclosed in the first textarea, which is an xml snippet, from the html, without any change to the original snippet. I'm able to get it by using BeautifulSoup, but it changes all the tag names into lower case.
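
    A minimal sketch of a parser-free approach: because any HTML parser may normalise tag case, grab the raw contents of the first textarea with a regex so the XML inside is left untouched. The file name is hypothetical:

        import re

        html = open("page.html").read()
        match = re.search(r'<textarea[^>]*>(.*?)</textarea>', html,
                          re.IGNORECASE | re.DOTALL)
        if match:
            xmlSnippet = match.group(1)   # verbatim, case preserved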

    Read the article

  • Downloading a web page and all of its resource files in Python

    - by Mark
    I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc) using Python. I am (somewhat) familiar with urllib2 and know how to download individual urls, but before I go and start hacking at BeautifulSoup + urllib2 I wanted to be sure that there wasn't already a Python equivalent to "wget --page-requisites http://www.google.com". Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources. Thanks Mark
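
    For the statistics-gathering part, a minimal sketch under stated assumptions (only img/script/link resources, no CSS imports, no error handling, sequential downloads):

        import time
        import urllib2
        import urlparse
        from BeautifulSoup import BeautifulSoup

        url = "http://www.google.com"
        start = time.time()
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)

        resources = [img["src"] for img in soup.findAll("img", src=True)]
        resources += [s["src"] for s in soup.findAll("script", src=True)]
        resources += [l["href"] for l in soup.findAll("link", href=True)]

        for res in resources:
            urllib2.urlopen(urlparse.urljoin(url, res)).read()

        print "page plus requisites took %.2f seconds" % (time.time() - start)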

    Read the article

  • How to use python and beautifulsoup to print the timestamp/last updated time (from the HTML) for each row?

    - by cesalo
    How can I use python and beautifulsoup to print the timestamp/last-updated time (from the HTML) for each row? Thanks a lot!

    Can I print, after each row:
    a) the date/time - the time when the python code is executed, and
    b) the last-updated time taken from the HTML?

    HTML structure: one td containing two tables; each table has a few "tr", and within each "tr" are a few "td" data cells.

    HTML:

        <td>
          <table width="100%" border="4" cellspacing="0" bordercolor="white" align="center">
            <tbody>
              <tr>
                <td colspan="2" class="verd_black11">Last Updated: 18/08/2014 10:19</td>
              </tr>
              <tr>
                <td colspan="3" class="verd_black11">All data delayed at least 15 minutes</td>
              </tr>
            </tbody>
          </table>
          <table width="100%" border="4" cellspacing="0" bordercolor="white" align="center">
            <tbody id="tbody">
              <tr id="tr0" class="tableHdrB1" align="center">
                <td align="centre">C Aug-14 - 15000</td>
                <td align="right"> - </td>
                <td align="right">5</td>
                <td align="right">9,904</td>
              </tr>
            </tbody>
          </table>
        </td>

    Code:

        import urllib2
        from bs4 import BeautifulSoup

        contenturl = "HTML:"
        soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
        table = soup.find('tbody', attrs={'id': 'tbody'})
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                t = td.find(text=True)
                if t:
                    text = t + ';'
                    print text,
            print

    Output from the above code:

        C Aug-14 - 15000 ; - ; 5 ; 9,904

    Expected output:

        C Aug-14 - 15000 ; - ; 5 ; 9,904 ; 18/08/2014 ; 13:48:00 ; 18/08/2014 ; 10:19

    (the first date/time pair is when the python code is executed; the second is the last-updated time)
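
    A minimal sketch of the two extra columns, meant to run after the soup and rows from the code above exist; the "Last Updated" cell is located by its text, and the run time comes from the machine's clock:

        import re
        from datetime import datetime

        updatedCell = soup.find('td', text=re.compile('Last Updated'))
        lastUpdated = updatedCell.string.replace('Last Updated: ', '')   # "18/08/2014 10:19"
        runTime = datetime.now().strftime('%d/%m/%Y ; %H:%M:%S')

        for tr in rows:
            cells = [td.find(text=True) for td in tr.findAll('td')]
            row = ' ; '.join(c.strip() for c in cells if c)
            print row, ';', runTime, ';', lastUpdated.replace(' ', ' ; ')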

    Read the article

  • How to remove package from apt-get autoremove "queue"

    - by Darth
    I just installed Calibre for ebook management via apt-get on Ubuntu 10.04, but found out that it's one major version behind the current release, so I decided to reinstall it directly from sources. When I uninstalled the packaged version, apt added a bunch of dependencies to the autoremove queue, and as I installed the newer version of Calibre from sources, it has no knowledge of being dependent on those packages. Now I basically have all the libraries that I want, but they are still in the autoremove queue.

        The following packages were automatically installed and are no longer required:
          libqt4-script libqt4-designer libqt4-dbus python-lxml python-cherrypy3
          python-encutils libqt4-xmlpatterns libqt4-help python-qt4 python-clientform
          python-sip python-django python-mechanize libqt4-svg python-django-tagging
          libphonon4 libqt4-xml libqt4-assistant libqt4-webkit libqt4-scripttools
          python-beautifulsoup python-pypdf python-dateutil python-cssutils
        Use 'apt-get autoremove' to remove them.

    How do I tell apt that I want to keep these packages installed, without reinstalling them manually?
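
    A sketch of the usual approach: clear each package's "automatically installed" flag so autoremove leaves it alone. aptitude's unmarkauto does this; whether the apt-mark variant is available on 10.04 is an assumption here:

        sudo aptitude unmarkauto python-lxml python-mechanize python-beautifulsoup
        # repeat for (or list) the remaining packages from the message above;
        # newer releases also offer: sudo apt-mark unmarkauto <packages>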

    Read the article

  • Which revision of html5lib is stable?

    - by Mat
    html5lib notes that its latest release (0.11) is somewhat old. Using the Python portion, I have recursion problems as noted in Issue 70 and Issue 59, but can't find a recent Mercurial revision that is stable. The latest tip is no good; I got the following error from python setup.py install:

        byte-compiling build/bdist.linux-x86_64/egg/html5lib/treewalkers/_base.py to _base.pyc
          File "build/bdist.linux-x86_64/egg/html5lib/treewalkers/_base.py", line 40
            "data": []}
                      ^
        SyntaxError: invalid syntax

    And I get the following errors at runtime:

        soup = parser.parse(page.read())
          File "build/bdist.linux-x86_64/egg/html5lib/html5parser.py", line 165, in parse
          File "build/bdist.linux-x86_64/egg/html5lib/html5parser.py", line 144, in _parse
          File "build/bdist.linux-x86_64/egg/html5lib/html5parser.py", line 454, in processDoctype
        TypeError: insertDoctype() takes exactly 4 arguments (2 given)

    I'm using it on Python 2.5.2 with lxml and BeautifulSoup.

    Read the article

  • Interpreting Search Results

    - by Simon
    Hi all, I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning. What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are greatly appreciated. Thanks, Simon

    Read the article

  • Having trouble scraping an ASP .NET web page

    - by Seth
    I am trying to scrape an ASP.NET website but am having trouble getting the results from a post. I have the following python code and am using httplib2 and BeautifulSoup:

        conn = Http()

        # do a get first to retrieve important values
        page = conn.request(u"http://somepage.com/Search.aspx", "GET")

        # event_validation and viewstate variables retrieved from GET here...

        body = {"__EVENTARGUMENT": "",
                "__EVENTTARGET": "",
                "__EVENTVALIDATION": event_validation,
                "__VIEWSTATE": viewstate,
                "ctl00_ContentPlaceHolder1_GovernmentCheckBox": "On",
                "ctl00_ContentPlaceHolder1_NonGovernmentCheckBox": "On",
                "ctl00_ContentPlaceHolder1_SchoolKeyValue": "",
                "ctl00_ContentPlaceHolder1_SchoolNameTextBox": "",
                "ctl00_ContentPlaceHolder1_ScriptManager1": "ctl00_ContentPlaceHolder1_UpdatePanel1|cct100_ContentPlaceHolder1_SearchImageButton",
                "ct100_ContentPlaceHolder1_SearchImageButton.x": "375",
                "ct100_ContentPlaceHolder1_SearchImageButton.y": "11",
                "ctl00_ContentPlaceHolder1_SuburbTownTextBox": "Adelaide,SA,5000",
                "hiddenInputToUpdateATBuffer_CommonToolkitScripts": 1}

        headers = {"Content-type": "application/x-www-form-urlencoded"}
        resp, content = conn.request(url, "POST", headers=headers, body=urlencode(body))

    When I print content I still seem to be getting the same results as the GET. Is there a fundamental concept I'm missing about how to retrieve the result values of an ASP.NET post?
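
    One detail worth checking (an observation about WebForms in general, not a confirmed diagnosis): ASP.NET renders client-side ids with underscores but the form-field name attributes, which are what a POST must use as its keys, with dollar signs. A sketch of the difference:

        # id attribute (as seen in the rendered DOM):
        #     ctl00_ContentPlaceHolder1_SchoolNameTextBox
        # name attribute (the key the server expects in the POST body):
        #     ctl00$ContentPlaceHolder1$SchoolNameTextBox
        body = {"ctl00$ContentPlaceHolder1$SchoolNameTextBox": ""}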

    Read the article

  • ImportError and Django driving me crazy

    - by John Peebles
    OK, I have the following directory structure (it's a django project):

        project/
            app/

    Within the app folder there is a scraper.py file which needs to reference a class defined within models.py. I'm trying to do the following:

        import urllib2
        import os
        import sys
        import time
        import datetime
        import re
        import BeautifulSoup

        sys.path.append('/home/userspace/Development/')
        os.environ['DJANGO_SETTINGS_MODULE'] = 'project.settings'
        from project.app.models import ClassName

    and this code just isn't working. I get an error of:

        Traceback (most recent call last):
          File "scraper.py", line 14, in <module>
            from project.app.models import ClassName
        ImportError: No module named project.app.models

    This code above used to work, but broke somewhere along the line and I'm extremely confused as to why I'm having problems. On Snow Leopard using python2.5.
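
    One common cause of this ImportError (an assumption, not confirmed from the traceback alone): Python 2 only treats a directory as a package if it contains an __init__.py. A sketch of the layout the import expects:

        /home/userspace/Development/
            project/
                __init__.py      # needed for 'import project'
                settings.py
                app/
                    __init__.py  # needed for 'import project.app'
                    models.py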

    Read the article

  • Loading url with cyrillic symbols

    - by Ockonal
    Hi guys, I have to load some urls with cyrillic symbols. My script should work with this: http://wincode.org/%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5/ If I use this in a browser, the escapes are displayed as normal symbols, but the urllib code fails with a 404 error. How do I decode this url correctly? When I use that url directly in the code, like address = 'that address', it works perfectly. But I obtained this url by parsing a page. I have a list of urls containing cyrillic. Maybe they have incorrect encoding? Here is more code:

        requestData = urllib2.Request(%SOME_ADDRESS%, None, {"User-Agent": user_agent})
        requestHandler = pageHandler.open(requestData)
        pageData = requestHandler.read().decode('utf-8')
        soupHandler = BeautifulSoup(pageData)

        topicLinks = []
        for postBlock in soupHandler.findAll('a', href=re.compile('%SOME_REGEXP%')):
            topicLinks.append(postBlock['href'])

        postAddress = choice(topicLinks)
        postRequestData = urllib2.Request(postAddress, None, {"User-Agent": user_agent})
        postHandler = pageHandler.open(postRequestData)
        postData = postHandler.read()

        File "/usr/lib/python2.6/urllib2.py", line 518, in http_error_default
            raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
        urllib2.HTTPError: HTTP Error 404: Not Found
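
    A minimal sketch of one likely fix (assuming the 404 comes from sending the raw unicode href in the request line): BeautifulSoup returns hrefs as unicode, so encode to UTF-8 bytes and percent-encode the path before opening it:

        import urllib

        postAddress = choice(topicLinks)
        if isinstance(postAddress, unicode):
            postAddress = postAddress.encode('utf-8')
        postAddress = urllib.quote(postAddress, safe=':/?&=%')
        postRequestData = urllib2.Request(postAddress, None, {"User-Agent": user_agent})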

    Read the article

  • Python Scraper for Javascript?

    - by Diego
    Hey all, can anyone direct me to a good Python screen-scraping library for javascript code (hopefully one with good documentation/tutorials)? I'd like to see what options are out there, but most of all the easiest to learn with the fastest results... wondering if anyone has experience. I've heard some things about spidermonkey, but maybe there are better ones out there? Specifically, I use BeautifulSoup and Mechanize to get to this point, but need a way to open the javascript popup, submit data, and download/parse the results in the javascript popup.

        <a href="javascript:openFindItem(12510109)" onclick="s_objectID=&quot;javascript:openFindItem(12510109)_1&quot;;return this.s_oc?this.s_oc(e):true">Find Item</a>

    I'd like to implement this with Google App Engine and Django. Thanks!
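
    A minimal sketch of a browserless approach: pull the item id out of the href with a regex, then fetch the popup's underlying URL directly (that URL pattern would have to be read out of openFindItem's javascript source; it is not known here):

        import re

        href = 'javascript:openFindItem(12510109)'   # taken from the question
        match = re.search(r'openFindItem\((\d+)\)', href)
        if match:
            item_id = match.group(1)                 # '12510109'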

    Read the article

  • Is there any Python library that allows me to parse an HTML document similar to what jQuery does?

    - by Sachin Tendulkar
    Is there any Python library that allows me to parse an HTML document similar to what jQuery does? i.e. I'd like to be able to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content/attributes, etc. The only Python HTML parsing lib I've used before was BeautifulSoup, and even though it's fine I keep thinking it would be faster to do my parsing if I had jQuery syntax available. :D Write an iterative program that finds the largest number of McNuggets that cannot be bought in exact quantity. Your program should print the answer in the following format (where the correct number is provided in place of n): "Largest number of McNuggets that cannot be bought in exact quantity: n"
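
    The McNuggets exercise pasted at the end is unrelated to HTML parsing, but a minimal iterative sketch, assuming the classic pack sizes of 6, 9 and 20:

        def buyable(n):
            """True if n can be bought exactly with packs of 6, 9 and 20."""
            return any(6*a + 9*b + 20*c == n
                       for a in range(n/6 + 1)
                       for b in range(n/9 + 1)
                       for c in range(n/20 + 1))

        # once 6 consecutive values are buyable, every larger value is too
        consecutive, n = 0, 0
        while consecutive < 6:
            n += 1
            consecutive = consecutive + 1 if buyable(n) else 0
        print "Largest number of McNuggets that cannot be bought in exact quantity: %d" % (n - 6)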

    Read the article

  • Python: Data Object or class

    - by arg20
    I enjoy all the python libraries for scraping websites and I am experimenting with BeautifulSoup and IMDB just for fun. As I come from Java, I have some Java practices incorporated into my programming style. I am trying to get the info of a certain movie; I can either create a Movie class or just use a dictionary with keys for the attributes. My question is: should I just use dictionaries when a class would only contain data and almost no behaviour? In other languages creating a type helps you enforce certain restrictions, and because of type checks the IDE can help you program; this is not always the case in python. So what should I do? Should I resort to creating a class only when there's both behaviour and data, or create a Movie class even though it'll probably be just a data container? This all depends on your model; in this particular case either one is fine, but I'm wondering what good practice is.

    Read the article

  • Scraping html WITHOUT unique identifiers using python

    - by Nicholas Law
    I would like to design an algorithm using python that scrapes thousands of pages like this one and this one, gathers all the data and inserts it into a MySQL database. The script will be run on a weekly or bi-weekly basis to update the database with any new information added to each individual page. Ideally I would like a scraper that is easy to work with for table-structured data, but also for data that does not have unique identifiers (i.e. id and class attributes). Which scraper add-on should I use? BeautifulSoup, Scrapy or Mechanize? Are there any particular tutorials/books I should be looking at for this desired result? In the long run I will be implementing a mobile app that works with all this data through querying the database.

    Read the article

  • Python: Removing particular character (u"\u2610") from string

    - by duhaime
    I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different lists of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character. (You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.) To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.

        for work in glob.glob(pathtofiles):
            openfile = open(work)
            readfile = openfile.read()
            stringfile = str(readfile)
            decodefile = stringfile.decode('utf-8', 'strict')  # is this the dodgy line?
            soup = BeautifulSoup(decodefile)
            textwithtags = soup.findAll('text')
            textwithtagsasstring = str(textwithtags)
            # this method strips everything between angle brackets as it should
            textwithouttags = stripTags(textwithtagsasstring)
            # clean text
            nonewlines = textwithouttags.replace("\n", " ")
            noextrawhitespace = re.sub(' +', ' ', nonewlines)
            print noextrawhitespace  # the boxes appear

    I tried to remove the boxes by using

        noboxes = noextrawhitespace.replace(u"\u2610", "")

    But Python threw an error flag:

        UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)

    Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
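
    The error comes from calling a unicode replace on a byte string: str(textwithtags) re-encodes the parsed unicode back to bytes, and the subsequent .replace(u"\u2610", "") then forces an implicit ASCII decode. A minimal sketch of the usual fix, reusing the question's stripTags helper and keeping everything unicode until the final print:

        textwithtagsasstring = u"".join(unicode(t) for t in textwithtags)  # not str(...)
        textwithouttags = stripTags(textwithtagsasstring)
        nonewlines = textwithouttags.replace(u"\n", u" ")
        noextrawhitespace = re.sub(u' +', u' ', nonewlines)
        noboxes = noextrawhitespace.replace(u"\u2610", u"")
        print noboxes.encode('utf-8')   # encode explicitly for the console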

    Read the article
