Python unicode search not giving correct answer
Posted by user1318912 on Stack Overflow
Published on 2012-04-07T10:40:47Z
Indexed on 2012/04/07 11:29 UTC
I am trying to take Hindi words, stored one per line in file-1, and find them in the lines of file-2. I have to print each matching line number along with the number of words found on that line. This is the code:
import codecs
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []
for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >= 0:
            count_arr[counter] += 1
for iterator, count in enumerate(count_arr):
    if count > 0:
        print iterator, ' ', count
This finds some words but ignores others.
The input files are:
File-1:
????
???????
File-2:
???????, ????-????
?????-???, ?????-???, ?????_???, ?????_???
????_????, ????-????, ???????_????
????-????
This gives output:
0 1
3 1
Clearly, it is ignoring ??????? and searching for ???? only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?
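One likely cause, which the post does not confirm: `readlines()` keeps the trailing newline on every entry, so each search term is really `word + "\n"` and can only match where the word sits at the very end of a line. That would fit the output above, where both reported lines end with the shorter word. The sketch below strips each word before searching. The function names `count_matches` and `load_lines` and the sample Hindi words are hypothetical, chosen only for illustration, and the code is written for Python 3 rather than the Python 2 of the original.

```python
import io


def count_matches(lines, words):
    """Return (line_index, hits) pairs for lines containing at least one word."""
    # Strip surrounding whitespace, including the newline kept by readlines(),
    # and drop empty entries, since "" is a substring of every line.
    cleaned = [w.strip() for w in words if w.strip()]
    results = []
    for index, line in enumerate(lines):
        hits = sum(1 for w in cleaned if w in line)
        if hits > 0:
            results.append((index, hits))
    return results


def load_lines(path):
    """Read a UTF-8 text file into a list of lines (hypothetical helper)."""
    # "utf-8-sig" also discards a leading byte order mark if the file has one.
    with io.open(path, "r", encoding="utf-8-sig") as handle:
        return handle.readlines()


# Hypothetical sample data standing in for the two input files.
sample_words = ["जल\n", "पानी\n"]
sample_lines = ["पानी, जल-धारा\n", "वायु\n"]

for index, hits in count_matches(sample_lines, sample_words):
    print(index, hits)
```

With the real files, the call would be `count_matches(load_lines("hindi_hypernym.txt"), load_lines("hypernyms_en2hi.txt"))`. Note that this is still a substring search, so a short word also matches inside a longer one. If only whole-word matches should count, the lines would need to be split on the separators (commas, hyphens, underscores) first.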
© Stack Overflow or respective owner