Parsing Wiki XML Dumps ver0.4 just got tough
Posted by syed on Stack Overflow
Published on 2010-06-05T16:51:36Z
Indexed on 2010/06/05 19:12 UTC
Hello, I am trying to parse a Wikipedia XML dump using "Parse-MediaWikiDump-1.0.4" together with the "Wikiprep.pl" script. I believe the script works fine with ver0.3 Wiki XML dumps but not with the latest ver0.4 dumps. I get the following error:
Can't locate object method "page" via package "Parse::MediaWikiDump::Pages" at wikiprep.pl line 390.
Also, the "Parse-MediaWikiDump-1.0.4" documentation at http://search.cpan.org/~triddle/Parse-MediaWikiDump-1.0.4/lib/Parse/MediaWikiDump/Pages.pm says: "LIMITATIONS Version 0.4 This class was updated to support version 0.4 dump files from a MediaWiki instance but it does not currently support any of the new information available in those files."
Any workarounds would help me get to the next level.
Note: one may wonder why I don't use a SAX or StAX parser directly instead. The Wikipedia dump is a single file of more than 25 GB, so I expect stack/memory issues. The Perl script above gets around this, but I am currently stuck on this version problem.
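As a point of comparison for the memory concern above, here is a minimal sketch of reading a dump one page at a time. It is not the Wikiprep approach and it does not use Parse::MediaWikiDump: it assumes Python's standard-library `xml.etree.ElementTree.iterparse`, and the names `iter_pages`, `local_name`, and `SAMPLE` are hypothetical, made up for illustration. The sample data is a tiny hand-written stand-in for a ver0.4 dump, not real Wikipedia content.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written stand-in for a ver0.4 dump (assumption: real dumps use this
# namespace and the same <page>/<title>/<revision>/<text> nesting).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" version="0.4" xml:lang="en">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page>
    <title>Alpha</title>
    <id>1</id>
    <revision><id>10</id><text xml:space="preserve">First article</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <id>2</id>
    <revision><id>20</id><text xml:space="preserve">Second article</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page> element, one page at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Detach finished elements from the root so they can be freed;
        # without this the tree would grow with the size of the dump.
        root.clear()


if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

For a real dump, `source` would be an open file (or a decompressing stream) instead of the in-memory `io.BytesIO` used here. Matching on the local tag name rather than the full namespaced name is a deliberate choice in this sketch, so the same loop does not depend on whether the dump declares the ver0.3 or ver0.4 namespace.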
© Stack Overflow or respective owner