Filtering Wikipedia's XML dump: error on some accents

Posted by streetpc on Stack Overflow
Published on 2010-03-31T10:40:41Z

Filed under: encoding

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.

I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those matching the prefixes.
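A minimal sketch of the prefix test described above. The class and method names are hypothetical, not the ones from my code. It also normalizes both sides to NFC before comparing, which is only a precaution under the unconfirmed assumption that a title could arrive with a decomposed accent (e followed by a combining acute) while the prefix uses the precomposed character.

```java
import java.text.Normalizer;

public class PrefixFilter {
    // Prefixes written with Unicode escapes so the encoding of the
    // source file itself cannot alter them at compile time.
    private static final String[] IGNORED = {
        "Category:", "Wikipedia:",
        "Cat\u00E9gorie:", "Mod\u00E8le:", "Wikip\u00E9dia:"
    };

    // Bring a string to canonical composed form (NFC).
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True if the title starts with any ignored prefix after normalization.
    public static boolean isIgnored(String title) {
        String normalized = nfc(title);
        for (String prefix : IGNORED) {
            if (normalized.startsWith(nfc(prefix))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Precomposed accent
        System.out.println(isIgnored("Cat\u00E9gorie:Rock par ville"));
        // Plain e followed by a combining acute accent
        System.out.println(isIgnored("Cate\u0301gorie:Rock par ville"));
        // Ordinary article title
        System.out.println(isIgnored("Rock par ville"));
    }
}
```

If normalization is not the cause, this behaves exactly like the plain startsWith loop, so it should be harmless to keep while investigating.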

In French, the prefixes with no accent also work (i.e. they filter the corresponding pages). Some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), yet I cannot see any difference between the corresponding lines (in less).
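Since the lines look identical in less, one way to compare them is to print the code points of the title as the sink receives it, next to those of the prefix. The helper below is a hypothetical diagnostic sketch, not part of my indexer.

```java
public class CodePoints {
    // Render each code point of the string as U+XXXX, separated by spaces.
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Precomposed accent: a single code point
        System.out.println(dump("\u00E9"));   // U+00E9
        // Decomposed accent: base letter plus combining mark
        System.out.println(dump("e\u0301"));  // U+0065 U+0301
    }
}
```

Calling dump(article.getTitle()) on a page that escapes the filter would show whether the title is truncated, decomposed, or contains replacement characters such as U+FFFD.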

I can't really inspect all the differences in the file because of its size (5 GB), but it looks like valid UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the offending pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
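The fact that the same page behaves differently depending on where it sits in the file suggests one thing worth ruling out: the SAX contract allows characters() to be called several times for a single text node, for example when the text straddles an internal buffer boundary. Whether that is what happens here is an assumption, not something confirmed. The sketch below (a hypothetical TitleHandler, not my actual handler) appends every chunk to a buffer and only reads the title in endElement.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> titles = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per text node, so append instead of assigning.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString()); // the complete title is only known here
        }
        text.setLength(0);
    }

    public List<String> getTitles() {
        return titles;
    }

    // Convenience wrapper for the sketch: parse a document held in memory.
    public static List<String> parseTitles(byte[] xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
            return handler.getTitles();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<mediawiki><page><title>Cat\u00E9gorie:Rock par ville</title></page></mediawiki>";
        System.out.println(parseTitles(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A handler that instead does something like title = new String(ch, start, length) inside characters() would keep only the last chunk, which could explain titles that fail the prefix test only at certain positions in a large file.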

Any ideas?

Alternatives I tried:

Opening the file (the commented lines were tried without success):

FileInputStream fis = new FileInputStream(new File(xmlFileName));
// Custom function: opens the stream, reads it as UTF-8 into a Reader, and returns another byte stream.
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8");
//InputSource is = new InputSource(fis); is.setEncoding("UTF-8");
parser.parse(fis, handler);

Filtered prefixes:

ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
    "Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
    "CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
    "Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
