I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used.
Can anyone help me with this?
I am using BeautifulSoup for parsing and then saving the results into a DB. If I comment out the save statement, it still takes a long time, so the database is not the bottleneck.
def parse(self,text):
    soup = BeautifulSoup(text)
    arr = soup.findAll('tbody')
    for i in range(0,len(arr)-1):
        data=Data()
        soup2 = BeautifulSoup(str(arr[i]))
        arr2 = soup2.findAll('td')
        c=0
        for j in arr2:
            if str(j).find("<a href=") > 0:
                data.sourceURL = self.getAttributeValue(str(j),'<a href="')
            else:
                if c == 2:
                    data.Hits=j.renderContents()
                #and few others...
            c = c+1
        data.save()
Any suggestions?
Note: I already asked this question here, but it was closed due to incomplete information.
I need to remove all the HTML tags from the data of a given webpage. I tried this using regular expressions:
import urllib2
import re
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString, Comment
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
description_tag = souprss.find_all('description')
content_tag = souprss.find_all('content:encoded')
print re.sub('<[^>]*>', '', content_tag)
But the syntax of the re.sub is:
re.sub(pattern, repl, string, count=0)
So, I modified the code as (instead of the print statement above):
for row in content_tag:
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
But it gives the following error:
Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module>
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
What am I doing wrong?
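Two things are likely at play here. find_all returns parsed element objects rather than strings, and re.sub only accepts strings. In addition, the fourth positional argument of re.sub is count, so re.UNICODE in that position is read as a count, not as a flag. A minimal sketch in Python 3, where FakeTag and strip_tags are hypothetical names and FakeTag stands in for a parsed element:

```python
import re

class FakeTag(object):
    """Hypothetical stand-in for a parsed element; it is not a string."""
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup

# Compile once and pass the flag by keyword so it cannot be mistaken for count.
TAG_RE = re.compile(r"<[^>]*>", flags=re.UNICODE)

def strip_tags(node):
    # Convert to text first; the re functions accept only str (or bytes).
    return TAG_RE.sub("", str(node))

row = FakeTag("<p>Hello <b>world</b></p>")

error_message = None
try:
    re.sub(r"<[^>]*>", "", row)      # same mistake as in the question
except TypeError as exc:
    error_message = str(exc)

print(strip_tags(row))
```

If the parsed elements come from BeautifulSoup, asking the element for its text is probably a more robust route than a regular expression.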
I know that I can dynamically add an instance method to an object by doing something like:
import types
def my_method(self):
    # logic of method
    # ...
# instance is some instance of some class
instance.my_method = types.MethodType(my_method, instance)
Later on I can call instance.my_method() and self will be bound correctly and everything works.
Now, my question: how to do the exact same thing to obtain the behavior that decorating the new method with @property would give?
I would guess something like:
instance.my_method = types.MethodType(my_method, instance)
instance.my_method = property(instance.my_method)
But after doing that, instance.my_method returns a property object.
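A property is a descriptor, and descriptors only take effect when they are stored on the class, not on the instance. One possible workaround, sketched here in Python 3 with the hypothetical helper add_property and example class Point, is to give the instance its own one-off subclass that carries the property:

```python
def add_property(instance, name, getter):
    """Attach a read-only property to a single instance (hypothetical helper)."""
    cls = type(instance)
    # Build a one-off subclass that carries the property, then rebind the
    # instance to it; other instances of the original class are unaffected.
    patched = type(cls.__name__, (cls,), {name: property(getter)})
    instance.__class__ = patched

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(3, 4)
q = Point(1, 1)
add_property(p, "norm", lambda self: (self.x ** 2 + self.y ** 2) ** 0.5)
print(p.norm)
```

The instance still passes isinstance checks against the original class, because the new class is a subclass of it.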
from google.appengine.ext import db
class Log(db.Model):
    content = db.StringProperty(multiline=True)
class MyThread(threading.Thread):
    def run(self,request):
        #logs_query = Log.all().order('-date')
        #logs = logs_query.fetch(3)
        log=Log()
        log.content=request.POST.get('content',None)
        log.put()
def Log(request):
    thr = MyThread()
    thr.start(request)
    return HttpResponse('')
The error is:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "D:\Python25\lib\threading.py", line 486, in __bootstrap_inner
    self.run()
  File "D:\zjm_code\helloworld\views.py", line 33, in run
    log.content=request.POST.get('content',None)
NameError: global name 'request' is not defined
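Thread.start() takes no arguments and calls run() with no arguments, so whatever the thread needs has to be handed over some other way, for example through the constructor. A minimal sketch in Python 3, where Worker and its content attribute are hypothetical and the "save" is only simulated:

```python
import threading

class Worker(threading.Thread):
    def __init__(self, content):
        super().__init__()
        # Store the data the thread needs; run() is called with no arguments.
        self.content = content
        self.result = None

    def run(self):
        # Hypothetical stand-in for saving a log entry.
        self.result = "saved: %s" % self.content

worker = Worker("hello")
worker.start()      # start() takes no arguments
worker.join()
print(worker.result)
```

In a view, it is probably safer to extract the needed value from the request first and pass only that value to the thread.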
Hi,
I have some strings from which I want to delete some unwanted characters.
For example: Adam'sApple ---- AdamsApple (case insensitive).
Can someone help me? I need the fastest way to do it, because I have a couple of million records that have to be cleaned up.
Thanks
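One common approach for deleting a fixed set of characters is str.translate with a deletion table, which avoids a Python-level loop over each character. A minimal sketch in Python 3; the set of unwanted characters in UNWANTED and the name polish are assumptions for illustration:

```python
# Assumed set of characters to delete; adjust to the real data.
UNWANTED = "'\"-"

# Third argument of str.maketrans lists characters to remove.
DELETE_TABLE = str.maketrans("", "", UNWANTED)

def polish(text):
    return text.translate(DELETE_TABLE)

print(polish("Adam'sApple"))
```

Building the table once and reusing it for every record keeps the per-record cost low.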
Hello, I have some code to pass a variable into a script from the command line. The script is:
import sys, os
def function(var):
    print var
class function_call(object):
    def __init__(self, sysArgs):
        try:
            self.function = None
            self.args = []
            self.modulePath = sysArgs[0]
            self.moduleDir, tail = os.path.split(self.modulePath)
            self.moduleName, ext = os.path.splitext(tail)
            __import__(self.moduleName)
            self.module = sys.modules[self.moduleName]
            if len(sysArgs) > 1:
                self.functionName = sysArgs[1]
                self.function = self.module.__dict__[self.functionName]
                self.args = sysArgs[2:]
        except Exception, e:
            sys.stderr.write("%s %s\n" % ("PythonCall#__init__", e))
    def execute(self):
        try:
            if self.function:
                self.function(*self.args)
        except Exception, e:
            sys.stderr.write("%s %s\n" % ("PythonCall#execute", e))
if __name__=="__main__":
    test = test()
    function_call(sys.argv).execute()
This works by entering ./function <function> <arg1 arg2 ....>. The problem is that I want to select a function that is in a class rather than just a function by itself. The code I have tried is the same, except that function(var): is in a class. I was hoping for some ideas on how to modify my function_call class to accept this.
Thanks for any help.
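One way to dispatch to a method by name is getattr on an instance of the class, which returns a bound method that can be called like a plain function. A minimal sketch in Python 3; Commands, greet and call_method are hypothetical names, and argv mimics the shape of sys.argv:

```python
class Commands(object):
    """Hypothetical class holding the callable commands."""
    def greet(self, name):
        return "hello %s" % name

def call_method(obj, argv):
    # argv mimics sys.argv: [script, method_name, arg1, arg2, ...]
    method = getattr(obj, argv[1], None)
    if not callable(method):
        raise ValueError("unknown command: %s" % argv[1])
    return method(*argv[2:])

print(call_method(Commands(), ["script.py", "greet", "world"]))
```

Looking the name up on an instance rather than in the module's __dict__ means self is bound automatically.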
I'm trying to write a script to import a database file. I wrote the script to export the file like so:
import sqlite3
con = sqlite3.connect('../sqlite.db')
with open('../dump.sql', 'w') as f:
    for line in con.iterdump():
        f.write('%s\n' % line)
Now I want to be able to import that database. I tried:
import sqlite3
con = sqlite3.connect('../sqlite.db')
f = open('../dump.sql','r')
str = f.read()
con.execute(str)
but I'm not allowed to execute more than one statement. Is there a way to get it to run a .sql script directly?
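The sqlite3 module provides executescript for running several statements in one call. A minimal round-trip sketch in Python 3 that uses in-memory databases and a hypothetical items table instead of the files from the question:

```python
import sqlite3

# Build a small source database and dump it to SQL text.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
source.execute("INSERT INTO items (name) VALUES ('apple')")
source.commit()
dump_sql = "\n".join(source.iterdump())

# executescript accepts a string containing many statements.
target = sqlite3.connect(":memory:")
target.executescript(dump_sql)
rows = target.execute("SELECT name FROM items").fetchall()
print(rows)
```

With a dump file on disk, the same call should work on the text returned by reading the file.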
The unique.txt file contains 2 tab-separated columns. The total.txt file contains 3 tab-separated columns.
I take each row from unique.txt and look for it in total.txt. If it is present, I extract the entire row from total.txt and save it in a new output file.
###Total.txt
column a column b column c
interaction1 mitochondria_205000_225000 mitochondria_195000_215000
interaction2 mitochondria_345000_365000 mitochondria_335000_355000
interaction3 mitochondria_345000_365000 mitochondria_5000_25000
interaction4 chloroplast_115000_128207 chloroplast_35000_55000
interaction5 chloroplast_115000_128207 chloroplast_15000_35000
interaction15 2_10515000_10535000 2_10505000_10525000
###Unique.txt
column a column b
mitochondria_205000_225000 mitochondria_195000_215000
mitochondria_345000_365000 mitochondria_335000_355000
mitochondria_345000_365000 mitochondria_5000_25000
chloroplast_115000_128207 chloroplast_35000_55000
chloroplast_115000_128207 chloroplast_15000_35000
mitochondria_185000_205000 mitochondria_25000_45000
2_16595000_16615000 2_16585000_16605000
4_2785000_2805000 4_2775000_2795000
4_11395000_11415000 4_11385000_11405000
4_2875000_2895000 4_2865000_2885000
4_13745000_13765000 4_13735000_13755000
My program:
file=open('total.txt')
file2 = open('unique.txt')
all_content=file.readlines()
all_content2=file2.readlines()
store_id_lines = []
ff = open('match.dat', 'w')
for i in range(len(all_content)):
    line=all_content[i].split('\t')
    seq=line[1]+'\t'+line[2]
    for j in range(len(all_content2)):
        if all_content2[j]==seq:
            ff.write(seq)
            break
Problem:
This does not give the desired output (the first-column values of the rows that fulfil the if condition). I need something like: if the jth row of unique.txt matches the ith row of total.txt, then write the ith row of total.txt into a new file.
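One possible approach is to load the pairs from unique.txt into a set and then write out every complete row of total.txt whose last two columns form a pair in that set. A minimal sketch in Python 3 that works on lists of lines rather than files; match_rows is a hypothetical name, and the sample lines are taken from the data above:

```python
def match_rows(total_lines, unique_lines):
    # Build a set of (column b, column c) pairs for fast lookups.
    wanted = set()
    for line in unique_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            wanted.add((parts[0], parts[1]))

    # Keep the whole row from total when its last two columns are wanted.
    matched = []
    for line in total_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and (parts[1], parts[2]) in wanted:
            matched.append(line.rstrip("\n"))
    return matched

total = [
    "interaction1\tmitochondria_205000_225000\tmitochondria_195000_215000\n",
    "interaction15\t2_10515000_10535000\t2_10505000_10525000\n",
]
unique = [
    "mitochondria_205000_225000\tmitochondria_195000_215000\n",
    "4_2785000_2805000\t4_2775000_2795000\n",
]
print(match_rows(total, unique))
```

Stripping the newline before comparing avoids mismatches on a final line that has no trailing newline, and the set lookup replaces the inner loop.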
Here's the deal. I'm trying to write an Arkanoid clone, and I need a window menu like the one you get in pyGTK, for example File-(Open/Save/Exit) or something like that, plus an "About" dialog that names the author.
I'm already using pyGame for writing the game logic. I've tried pgu for the GUI, but that doesn't help: although it has the menu elements I'm talking about, you can't include the game screen in its container.
Does anybody know how to include such window menus when using pyGame?
Hello everybody,
I have two nested lists of different sizes:
A = [[1, 7, 3, 5], [5, 5, 14, 10]]
B = [[1, 17, 3, 5], [1487, 34, 14, 74], [1487, 34, 3, 87], [141, 25, 14, 10]]
I'd like to gather every nested list b from B for which some nested list a in A satisfies a[2:4] == b[2:4], and put them into list L:
L = [[1, 17, 3, 5], [141, 25, 14, 10]]
Would you help me with this?
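One way to do this is to collect the last two elements of each list in A as tuples in a set and then filter B with a list comprehension. A minimal sketch in Python 3; the name keys is introduced only for illustration:

```python
A = [[1, 7, 3, 5], [5, 5, 14, 10]]
B = [[1, 17, 3, 5], [1487, 34, 14, 74], [1487, 34, 3, 87], [141, 25, 14, 10]]

# Tuples are hashable, so the slices can be stored in a set.
keys = {tuple(a[2:4]) for a in A}

# Keep each list from B whose elements at positions 2 and 3 appear in A.
L = [b for b in B if tuple(b[2:4]) in keys]
print(L)
```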
I am looking into the unittest package, and I'm not sure of the proper way to structure my test cases when writing a lot of them for the same method. Say I have a fact function which calculates the factorial of a number; would this testing file be OK?
import unittest
class functions_tester(unittest.TestCase):
    def test_fact_1(self):
        self.assertEqual(1, fact(1))
    def test_fact_2(self):
        self.assertEqual(2, fact(2))
    def test_fact_3(self):
        self.assertEqual(6, fact(3))
    def test_fact_4(self):
        self.assertEqual(24, fact(4))
    def test_fact_5(self):
        self.assertFalse(1==fact(5))
    def test_fact_6(self):
        self.assertRaises(RuntimeError, fact, -1)
        #fact(-1)
if __name__ == "__main__":
    unittest.main()
It seems sloppy to have so many test methods for one method. I'd like to have just one testing method and put a ton of basic test cases in it (e.g. 4! == 24, 3! == 6, 5! == 120, and so on), but unittest doesn't let you do that.
What is the best way to structure a testing file in this scenario?
Thanks in advance for the help.
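One possible structure is a single test method that loops over a table of (input, expected) pairs. In Python 3.4 and later, subTest reports each failing pair separately instead of stopping at the first one. A minimal sketch; the fact implementation here is a hypothetical stand-in so the example is self-contained:

```python
import unittest

def fact(n):
    # Hypothetical implementation so the example is self-contained.
    if n < 0:
        raise RuntimeError("factorial is undefined for negative numbers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

class FactTests(unittest.TestCase):
    CASES = [(0, 1), (1, 1), (2, 2), (3, 6), (4, 24), (5, 120)]

    def test_known_values(self):
        for n, expected in self.CASES:
            # subTest labels each pair so failures are reported per case.
            with self.subTest(n=n):
                self.assertEqual(expected, fact(n))

    def test_negative_raises(self):
        self.assertRaises(RuntimeError, fact, -1)

# Run the suite programmatically instead of calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FactTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Adding a new case then means adding one tuple to CASES rather than writing a new method.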
The tutorial on the django website shows this code for the models:
from django.db import models
class Poll(models.Model):
    question = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published')
class Choice(models.Model):
    poll = models.ForeignKey(Poll)
    choice = models.CharField(max_length=200)
    votes = models.IntegerField()
Now, each of those attributes is a class attribute, right? So the same attribute should be shared by all instances of the class. A bit later, they present this code:
class Poll(models.Model):
    # ...
    def __unicode__(self):
        return self.question
class Choice(models.Model):
    # ...
    def __unicode__(self):
        return self.choice
How did they turn from class attributes into instance attributes? Did I get class attributes wrong?
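One way to picture what happens, as a simplified sketch and not Django's actual code: the model base class uses a metaclass that collects the field declarations from the class body, and each instance then receives its own values in __init__. Field, ModelMeta and Model below are hypothetical stand-ins written in Python 3:

```python
class Field(object):
    """Hypothetical stand-in for a model field declared on the class."""
    def __init__(self, default=None):
        self.default = default

class ModelMeta(type):
    def __new__(mcls, name, bases, namespace):
        # Collect the Field declarations and remove them from the class body.
        fields = {k: v for k, v in namespace.items() if isinstance(v, Field)}
        for key in fields:
            del namespace[key]
        cls = super().__new__(mcls, name, bases, namespace)
        cls._fields = fields
        return cls

class Model(metaclass=ModelMeta):
    def __init__(self, **kwargs):
        # Each instance receives its own value for every declared field.
        for key, field in self._fields.items():
            setattr(self, key, kwargs.get(key, field.default))

class Poll(Model):
    question = Field(default="")

first = Poll(question="What's up?")
second = Poll(question="Favourite colour?")
print(first.question, second.question)
```

In this sketch the class-level declaration only describes the column, while self.question reads a value stored on the instance.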
I am trying to integrate OpenMeetings with a Django website, but I can't work out how to configure ImportDoctor properly:
(here :// is replaced with __ because of spam protection)
print url
http://sovershenstvo.com.ua:5080/openmeetings/services/UserService?wsdl
imp = Import('http__schemas.xmlsoap.org/soap/encoding/')
imp.filter.add('http__services.axis.openmeetings.org')
imp.filter.add('http__basic.beans.hibernate.app.openmeetings.org/xsd')
imp.filter.add('http__basic.beans.data.app.openmeetings.org/xsd')
imp.filter.add('http__services.axis.openmeetings.org')
d = ImportDoctor(imp)
client = Client(url, doctor = d)
client.service.getSession()
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.6/site-packages/suds/client.py", line 539, in call
    return client.invoke(args, kwargs)
  File "/usr/lib/python2.6/site-packages/suds/client.py", line 598, in invoke
    result = self.send(msg)
  File "/usr/lib/python2.6/site-packages/suds/client.py", line 627, in send
    result = self.succeeded(binding, reply.message)
  File "/usr/lib/python2.6/site-packages/suds/client.py", line 659, in succeeded
    r, p = binding.get_reply(self.method, reply)
  File "/usr/lib/python2.6/site-packages/suds/bindings/binding.py", line 159, in get_reply
    resolved = rtypes[0].resolve(nobuiltin=True)
  File "/usr/lib/python2.6/site-packages/suds/xsd/sxbasic.py", line 63, in resolve
    raise TypeNotFound(qref)
suds.TypeNotFound: Type not found: '(Sessiondata, http__basic.beans.hibernate.app.openmeetings.org/xsd, )'
What am I doing wrong? Please help, and sorry for my English, but you are my last chance to save my position :(
I need the webinars working by morning (it is 2:26 am now).
I have used the 2to3 utility to convert code from the command line. What I would like to do is run it basically as a unittest, even if it tests the whole file rather than parts (functions, methods...) as would be normal for a unittest.
It does not need to be a unittest, and I don't want to automatically convert the files; I just want to monitor the py3 compliance of files in a unittest-like manner. I can't seem to find any documentation or examples for this.
An example and/or documentation would be great.
Thanks
Implement this loop: total up the product of the numbers from 1 to x.
Implement this loop: total up the product of the numbers from a to b.
Implement this loop: total up the sum of the numbers from a to b.
Implement this loop: total up the sum of the numbers from 1 to x.
Implement this loop: count the number of characters in a string s.
I'm very lost on implementing loops. These are just some examples that I am having trouble with. If someone could help me understand how to do them, that would be awesome.
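All of these follow the same accumulator pattern: start with a neutral value, loop over the items, and update the running total on each pass. A minimal sketch in Python 3, with the hypothetical function names product_range, sum_range and count_chars; the "1 to x" variants are the "a to b" variants with a set to 1:

```python
def product_range(a, b):
    total = 1                     # neutral value for multiplication
    for n in range(a, b + 1):     # range stops before b + 1, so b is included
        total *= n
    return total

def sum_range(a, b):
    total = 0                     # neutral value for addition
    for n in range(a, b + 1):
        total += n
    return total

def count_chars(s):
    count = 0
    for _ in s:                   # one pass per character
        count += 1
    return count

x = 5
print(product_range(1, x), sum_range(1, x), count_chars("hello"))
```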
Hi,
I'm using Windows and Linux machines for the same project. The default encoding for stdin on Windows is cp1252 and on Linux is utf-8.
I would like to change everything to utf-8.
Is it possible? How can I do it?
Thanks
Eduardo
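In Python 3, one option is to wrap the underlying byte stream in an io.TextIOWrapper with an explicit encoding, so decoding no longer depends on the platform default. The sketch below uses an in-memory byte stream as a stand-in for stdin; for the real stream, the assumption is that sys.stdin.buffer would be wrapped the same way. Setting the PYTHONIOENCODING environment variable before starting the interpreter is another route.

```python
import io

# Stand-in for the raw stdin bytes; the text is "cafe" with an accented e.
raw_bytes = io.BytesIO("caf\u00e9\n".encode("utf-8"))

# Decode explicitly as utf-8 regardless of the platform default.
reader = io.TextIOWrapper(raw_bytes, encoding="utf-8")
line = reader.readline()
print(ascii(line))
```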
I'm trying to use reserved words in my grammar:
reserved = {
    'if' : 'IF',
    'then' : 'THEN',
    'else' : 'ELSE',
    'while' : 'WHILE',
}
tokens = [
    'DEPT_CODE',
    'COURSE_NUMBER',
    'OR_CONJ',
    'ID',
] + list(reserved.values())
t_DEPT_CODE = r'[A-Z]{2,}'
t_COURSE_NUMBER = r'[0-9]{4}'
t_OR_CONJ = r'or'
t_ignore = ' \t'
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    if t.value in reserved.values():
        t.type = reserved[t.value]
        return t
    return None
However, the t_ID rule somehow swallows up DEPT_CODE and OR_CONJ. How can I get around this? I'd like those two to take higher precedence than the reserved words.
So I have a list that I want to convert to a list that contains a list for each group of objects.
ie
['objA.attr1', 'objC', 'objA.attr55', 'objB.attr4']
would return
[['objA.attr1', 'objA.attr55'], ['objC'], ['objB.attr4']]
currently this is what I use:
givenList = ['a.attr1', 'b', 'a.attr55', 'c.attr4']
trgList = []
objNames = []
for val in givenList:
    obj = val.split('.')[0]
    if obj in objNames:
        id = objNames.index(obj)
        trgList[id].append(val)
    else:
        objNames.append(obj)
        trgList.append([val])
#print trgList
It seems to run at a decent speed when the original list has around 100,000 ids, but I am curious whether there is a better way to do this. The order of the objects or attributes does not matter. Any ideas?
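One alternative is to group with a dictionary keyed by object name, which replaces the linear objNames.index search with a constant-time lookup. A minimal sketch in Python 3 using collections.defaultdict; group_by_object is a hypothetical name:

```python
from collections import defaultdict

def group_by_object(names):
    groups = defaultdict(list)
    for name in names:
        # The text before the first dot identifies the object.
        groups[name.split(".", 1)[0]].append(name)
    return list(groups.values())

given = ['objA.attr1', 'objC', 'objA.attr55', 'objB.attr4']
print(group_by_object(given))
```

Since the order does not matter here, the dictionary's ordering is not something the result needs to rely on.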
Any time I want to replace a piece of text that is part of a larger piece of text, I always have to do something like:
"(?P<start>some_pattern)(?P<replace>foo)(?P<end>end)"
And then concatenate the start group with the new data for replace and then the end group.
Is there a better method for this?
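Two options that avoid manual concatenation are backreferences in the replacement string and lookaround assertions in the pattern. A minimal sketch in Python 3; the sample text and the replacement BAR are made up for illustration:

```python
import re

text = "some_patternfooend and otherfooend"

# Option 1: put the neighbours back with named backreferences.
with_groups = re.sub(
    r"(?P<start>some_pattern)foo(?P<end>end)",
    r"\g<start>BAR\g<end>",
    text,
)

# Option 2: lookbehind and lookahead check the context without consuming it,
# so only "foo" itself is replaced.
with_lookaround = re.sub(r"(?<=some_pattern)foo(?=end)", "BAR", text)

print(with_groups)
print(with_lookaround)
```

Note that Python's lookbehind requires a fixed-width pattern, so option 1 is the more general of the two.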
I'm having a new problem here ..
CODE 1:
try:
    urlParams += "%s=%s&"%(val['name'], data.get(val['name'], serverInfo_D.get(val['name'])))
except KeyError:
    print "expected parameter not provided - "+val["name"]+" is missing"
    exit(0)
CODE 2:
try:
    urlParams += "%s=%s&"%(val['name'], data.get(val['name'], serverInfo_D[val['name']]))
except KeyError:
    print "expected parameter not provided - "+val["name"]+" is missing"
    exit(0)
Note the difference between serverInfo_D[val['name']] and serverInfo_D.get(val['name']).
Code 2 fails, but code 1 works.
The data:
serverInfo_D:{'user': 'usr', 'pass': 'pass'}
data: {'par1': 9995, 'extraparam1': 22}
val: {'par1','user','pass','extraparam1'}
Exceptions are raised for keys that are in the data dict. All of this code is in a for loop that iterates over val.
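Python evaluates all arguments before making a call, so the default passed to data.get is computed even when the key is present in data. With square brackets, that default lookup raises KeyError for any key missing from the second dict. A minimal sketch in Python 3 using the data from the question; server_info and lookup are hypothetical names:

```python
server_info = {'user': 'usr', 'pass': 'pass'}
data = {'par1': 9995, 'extraparam1': 22}

def lookup(name):
    # Only touch server_info when data does not have the key.
    if name in data:
        return data[name]
    return server_info[name]   # raises KeyError if missing from both

eager_failed = False
try:
    # The default argument is evaluated first, so this raises KeyError
    # even though 'par1' is present in data.
    data.get('par1', server_info['par1'])
except KeyError:
    eager_failed = True

print(eager_failed, lookup('par1'), lookup('user'))
```

The lookup helper keeps the KeyError for parameters that are missing from both dicts, which appears to be the case the except clause was meant to catch.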
How do I strip a comma from the end of a string? I tried:
awk = subprocess.Popen([r"awk", "{print $10}"], stdin=subprocess.PIPE)
awk_stdin = awk.communicate(uptime_stdout)[0]
print awk_stdin
temp = awk_stdin
t = temp.strip(",")
I also tried t = temp.rstrip(","); neither works.
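One possible cause, stated as an assumption about the data: output from a command usually ends with a newline, so the comma is not the last character and strip(",") has nothing to remove. A minimal sketch in Python 3 with a made-up value in raw:

```python
raw = " 0.15,\n"          # hypothetical field as a subprocess might return it

# The comma is followed by a newline, so this leaves the string unchanged.
unchanged = raw.rstrip(",")

# Remove surrounding whitespace first, then the trailing comma.
cleaned = raw.strip().rstrip(",")
print(repr(unchanged), repr(cleaned))
```

Printing the value with repr is a quick way to check whether hidden whitespace is present.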
Is there a more Pythonic way to put this loop together?
while True:
    children = tree.getChildren()
    if not children:
        break
    tree = children[0]
UPDATE:
I think this syntax is probably what I'm going to go with:
while tree.getChildren():
    tree = tree.getChildren()[0]
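The shorter form calls getChildren twice per step, which may matter if the call is expensive. One alternative, sketched here in Python 3, wraps the loop in a small function that fetches the children once per step. Node and leftmost_leaf are hypothetical names, and Node is only a stand-in for the real tree type:

```python
class Node(object):
    """Hypothetical stand-in for the tree type in the question."""
    def __init__(self, children=None):
        self._children = children or []

    def getChildren(self):
        return self._children

def leftmost_leaf(tree):
    # Fetch the children once per step and descend into the first child.
    children = tree.getChildren()
    while children:
        tree = children[0]
        children = tree.getChildren()
    return tree

leaf = Node()
root = Node([Node([leaf]), Node()])
print(leftmost_leaf(root) is leaf)
```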