Python, dictionaries, and chi-square contingency table
Posted by rohanbk on Stack Overflow
Published on 2010-06-12T17:57:05Z
I have a file whose lines have the following format (the word, the time period in which the word occurred, and the frequency of documents containing that word within that time period):
#inputfile
<word, time, frequency>
apple, 1, 3
banana, 1, 2
apple, 2, 1
banana, 2, 4
orange, 3, 1
I have the Python class below, which I use to create 2-D dictionaries that store the above file with <word, time> as the key and the frequency as the value:
class Ddict(dict):
    '''
    2D dictionary class: a missing key is filled in by calling default()
    '''
    def __init__(self, default=None):
        self.default = default

    def __getitem__(self, key):
        # Create the inner value the first time a key is accessed
        if key not in self:
            self[key] = self.default()
        return dict.__getitem__(self, key)

wordtime = Ddict(dict)  # Store each inputfile entry with a <word,time> key
timeword = Ddict(dict)  # Store each inputfile entry with a <time,word> key

# Loop over every line of the inputfile
for line in open('inputfile'):
    # Assumption: blank lines and the '#'/'<' header lines of the sample are not data
    if not line.strip() or line.startswith(('#', '<')):
        continue
    word, time, count = [field.strip() for field in line.split(',')]
    count = int(count)  # counts must be numbers; strings would be concatenated by +=
    # If <word,time> already a key, increment count
    try:
        wordtime[word][time] += count
    # Otherwise, create the key
    except KeyError:
        wordtime[word][time] = count
    # If <time,word> already a key, increment count
    try:
        timeword[time][word] += count
    # Otherwise, create the key
    except KeyError:
        timeword[time][word] = count
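For comparison, the same two-level structure can probably be built with the standard library's collections.defaultdict, which removes the need for both the custom class and the try/except blocks. This is only a minimal sketch: the `rows` list is a stand-in for reading the input file, and it assumes the frequency column should be summed as an integer.

```python
from collections import defaultdict

# Inner dictionaries default to 0, so "+=" works on the first occurrence too
wordtime = defaultdict(lambda: defaultdict(int))  # wordtime[word][time] -> count
timeword = defaultdict(lambda: defaultdict(int))  # timeword[time][word] -> count

# Stand-in for the lines of the input file (sample rows from the question)
rows = [
    "apple, 1, 3",
    "banana, 1, 2",
    "apple, 2, 1",
    "banana, 2, 4",
    "orange, 3, 1",
]

for line in rows:
    # Strip the whitespace left around each field after splitting on commas
    word, time, count = [field.strip() for field in line.split(',')]
    wordtime[word][time] += int(count)
    timeword[time][word] += int(count)
```

Note that the time values stay as strings here ('1', '2', '3'), just as they would when read straight from the file.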
My question is about computing four counts while iterating over the entries of this 2D dictionary. For each word 'w' at each time 't', I want to calculate:
- The number of documents with word 'w' within time 't'. (a)
- The number of documents without word 'w' within time 't'. (b)
- The number of documents with word 'w' outside time 't'. (c)
- The number of documents without word 'w' outside time 't'. (d)
Each of the items above represents one of the cells of a chi-square contingency table for each word and time. Can all of these be calculated within a single loop or do they need to be done one at a time?
Ideally, I would like the output to be what's below, where a,b,c,d are all the items calculated above:
print("%s, %s, %s, %s" % (a, b, c, d))
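One way this might be approached is sketched below. Note the big assumption: items (b) and (d) need the total number of documents in each time period, which the input file does not contain, so the sketch takes it from a hypothetical `total_docs` dictionary with made-up values. The function name `contingency_cells` is also just illustrative. With the per-word and overall totals computed up front, all four cells fall out of a single pass over the entries.

```python
def contingency_cells(wordtime, total_docs):
    """Yield (word, time, a, b, c, d) for every <word, time> pair in wordtime.

    wordtime   -- wordtime[word][time] = documents containing word at that time
    total_docs -- HYPOTHETICAL input: total_docs[time] = all documents at that time
    """
    grand_total = sum(total_docs.values())  # all documents over all times
    for word, by_time in wordtime.items():
        word_total = sum(by_time.values())  # documents with the word, all times
        for time, a in by_time.items():
            b = total_docs[time] - a     # without word, within time
            c = word_total - a           # with word, outside time
            d = grand_total - a - b - c  # without word, outside time
            yield word, time, a, b, c, d

# Counts taken from the sample input file
wordtime = {
    'apple': {'1': 3, '2': 1},
    'banana': {'1': 2, '2': 4},
    'orange': {'3': 1},
}
# Assumed totals per time period; these numbers are invented for illustration
total_docs = {'1': 10, '2': 10, '3': 5}

cells = {}
for word, time, a, b, c, d in contingency_cells(wordtime, total_docs):
    cells[(word, time)] = (a, b, c, d)

for (word, time), (a, b, c, d) in sorted(cells.items()):
    print("%s, %s, %s, %s" % (a, b, c, d))
```

This only visits <word, time> pairs that actually appear in the file. A pair such as apple at time 3, where (a) would be 0, is skipped unless the loop is changed to run over every word and every time.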
© Stack Overflow or respective owner