Monday, June 15, 2009

Working with the Stack Overflow XML dump

Even though the extension says .zip, the archive is actually in 7-Zip format, so extract it with a 7-Zip tool rather than unzip.

Add a newline before every tag to make the files easier to fiddle with:

mh@schrute /Volumes/data/Users/mh/so/xml --> cat splitxml.c

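#include <stdio.h>

/* Copy stdin to stdout, starting a new line before every '<' except the first. */
int main(void)
{
    int c;
    int first = 1;   /* 1 until the first '<' has been seen */

    while ((c = getchar()) != EOF) {
        if (c == '<') {
            if (first == 1)
                first = 0;
            else
                putchar('\n');
        }
        putchar(c);
    }
    return 0;
}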

Python's xml.sax reports errors on character references like these; list the files that contain them:

grep -l '&#x[0-9][0-9];' *.xml
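
To see the failure in isolation, here is a minimal sketch. It assumes the offending references point at control characters such as &#x1F;, which XML 1.0 does not allow; the post does not say which ones were actually in the dump. The helper name parses_cleanly and the sample row elements are made up for illustration.

```python
import xml.sax


def parses_cleanly(data):
    """Return True if xml.sax parses the given bytes without a parse error."""
    try:
        # A bare ContentHandler ignores all events; we only care about errors.
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# A newline reference is legal XML 1.0; a reference to U+001F is not.
good = b'<row Body="ok&#xA;"/>'
bad = b'<row Body="bad&#x1F;"/>'
print(parses_cleanly(good), parses_cleanly(bad))
```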

To fix, strip them in place:

perl -pi -e 's/&#x[0-9][0-9];//g' *.xml
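
One caveat: the pattern above only matches references made of exactly two decimal digits, so it would miss something like &#x1F; and would also delete legal ones like &#x26;. A narrower alternative, sketched here in Python, drops only references to code points outside the XML 1.0 Char range. This is not what the post ran; the function names are hypothetical and it works on a string rather than editing files in place.

```python
import re

# Numeric character references: hex (&#x1F;) or decimal (&#31;).
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def strip_invalid_char_refs(text):
    """Remove character references to forbidden code points; keep the rest."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        return match.group(0) if is_valid_xml_char(cp) else ""
    return CHAR_REF.sub(repl, text)
```

For example, strip_invalid_char_refs('a&#x1F;b&#x26;c') returns 'ab&#x26;c': the control-character reference goes away and the escaped ampersand stays.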