Python unicode search not giving correct answer

Posted by user1318912 on Stack Overflow
Published on 2012-04-07T10:40:47Z Indexed on 2012/04/07 11:29 UTC

I am trying to search for Hindi words, listed one per line in file-1, in the lines of file-2. For each line of file-2 I have to print the line number together with the number of words found in it. This is the code:

import codecs

hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []

for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >=0:
            count_arr[counter] +=1

for iterator, count in enumerate(count_arr):
    if count > 0:
        print iterator, ' ', count

This finds some words but ignores others. The input files are:

File-1:
????
???????

File-2:
???????, ????-????
?????-???, ?????-???, ?????_???, ?????_???
????_????, ????-????, ???????_????
????-????

This gives output:
0 1
3 1

Clearly, it is ignoring ??????? and searching for ???? only. I have tried other inputs as well, and it only ever finds one of the words. Any idea how to correct this?
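
A likely cause, assuming every word in file-1 is followed by a line break: readlines() keeps the trailing newline on each element, so each `word` is really "word\n", and `line.find(word)` can only succeed when the word is the last thing on a line of file-2. That would be consistent with ??????? never being found, since in file-2 it appears to be followed by a comma or an underscore every time. Below is a minimal sketch of one possible fix, not a tested drop-in replacement. It is written for Python 3, the helper name `count_matches` is hypothetical, and the sample words are made up because the original ones are not readable here.

```python
def count_matches(lines, words):
    """For each line, count how many of the words occur in it as substrings."""
    # readlines() keeps the trailing "\n" on each element; strip() removes it
    # (and any surrounding whitespace). Blank entries are dropped, because an
    # empty string is a substring of every line.
    cleaned = [w.strip() for w in words]
    cleaned = [w for w in cleaned if w]
    return [sum(1 for w in cleaned if w in line) for line in lines]


# Made-up sample data, in the shape readlines() would return it.
words = ["राम\n", "सीता\n"]
lines = ["सीता, राम-कथा\n", "कुछ और\n", "राम\n"]

# Without stripping, only a word at the very end of a line is found.
unstripped = [sum(1 for w in words if w in line) for line in lines]

counts = count_matches(lines, words)
for number, count in enumerate(counts):
    if count > 0:
        print(number, count)
```

In the original loop the equivalent change would be to search for `word.strip()` instead of `word`, or to strip the words once, right after reading file-1.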

© Stack Overflow or respective owner
