How to remove a node based on contents of a subnode and then convert to CSV with headers intact
- by Morris Cox
I'm downloading a 10MB zipped XML file with wget, unzipping it to 40MB, trying to weed out test and expired entries (if the text of a certain node is "This is a test opportunity. Please DO NOT apply!" or if is in the past), and then convert to CSV.
However, I get this error:
PHP Notice: Array to string conversion in /home/morris/projects/grantsgov/xml2csv.php on line 46
I get Array errors because some entries in the XML file have more than one occurrence of a node (different text contents). The test entries are still present.
Contents of grantsgov.sh:
#!/bin/bash
wget --clobber "http://www.grants.gov/search/downloadXML.do;jsessionid=1n7GNpNF2tKZnRGLyQqf7Tl32hFJ1zndhfQpLrJJD11TTNzWMwDy!368676377?fname=GrantsDBExtract$(date +'%Y%m%d').zip" -O GrantsDBExtract$(date +"%Y%m%d").zip
unzip GrantsDBExtract$(date +"%Y%m%d").zip
php xml2csv.php
zip GrantsDBExtracted$(date +"%Y%m%d").zip GrantsDBExtracted$(date +"%Y%m%d").csv
Contents of xml2csv.php:
<?
$current_date = date("Ymd");
//$getfile=fopen("http://www.grants.gov/search/downloadXML.do;jsessionid=GqJNNmdLyJyMlsQqTzS2KdzgT5NMdhPp0QhG946JTmHzRltNTpMQ!368676377?fname=GrantsDBExtract$current_date.zip", "r");
$zip = zip_open("GrantsDBExtract$current_date.zip");
if(is_resource($zip)) {
while ($zip_entry = zip_read($zip)) {
$fp = fopen("./".zip_entry_name($zip_entry), "w");
if (zip_entry_open($zip, $zip_entry, "r")) {
$buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
fwrite($fp,"$buf");
zip_entry_close($zip_entry);
fclose($fp);
}
}
zip_close($zip);
}
// For future potential use
$xml = new DOMDocument('1.0', 'ascii');
$xpath = new DOMXpath($xml);
$xslFile = "feddata.xsl";
$filexml="GrantsDBExtract$current_date.xml";
if (file_exists($filexml)) {
$xml = simplexml_load_file($filexml);
echo "Loaded $filexml\n";
$xslt = new XSLTProcessor();
$xsl = new DOMDocument();
//$XSL->load('feddata.xsl', LIBXML_NOCDATA);
$xsl->load($xslFile, LIBXML_NOCDATA);
$xslt->importStylesheet($xsl);
$xslt->transformToXML($xml);
$f = fopen("GrantsDBExtracted$current_date.csv", 'w');
// create the CSV header row on the first time here
$first = FALSE;
$fields = array();
foreach($record as $key => $value)
{
$fields[] = $key;
}
fwrite($f,implode(";",$fields)."\n");
foreach ($xml->FundingOppSynopsis as $fos) {
fputcsv($f, get_object_vars($fos),',','"');
}
}
fclose($f);
} else {
exit('Failed to open file.');
}
?>
Contents of feddata.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="FundingOppSynopsis[AgencyMailingAddress = 'This is a test opportunity. Please DO NOT apply!']">
</xsl:template>
</xsl:stylesheet><xsl:strip-space elements="*"/>
Part of the XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Grants SYSTEM "http://www.grants.gov/search/dtd/XMLExtract.dtd">
<Grants>
<FundingOppSynopsis>
<PostDate>08312007</PostDate>
<UserID>None</UserID>
<Password>None</Password>
<FundingInstrumentType>CA</FundingInstrumentType>
<FundingActivityCategory>DPR</FundingActivityCategory>
<OtherCategoryExplanation>This is a test opportunity. Please DO NOT apply!</OtherCategoryExplanation>
<NumberOfAwards>5</NumberOfAwards>
<EstimatedFunding>4</EstimatedFunding>
<AwardCeiling>2</AwardCeiling>
<AwardFloor>1</AwardFloor>
<AgencyMailingAddress>This is a test opportunity. Please DO NOT apply!</AgencyMailingAddress>
<FundingOppTitle>This is a test opportunity. Please DO NOT apply!</FundingOppTitle>
<FundingOppNumber>IVV-08312007-RG-OPP5</FundingOppNumber>
<ApplicationsDueDate>09102007</ApplicationsDueDate>
<ApplicationsDueDateExplanation>This is a test opportunity. Please DO NOT apply!</ApplicationsDueDateExplanation>
<ArchiveDate>10102007</ArchiveDate>
<Location>None</Location>
<Office>None</Office>
<Agency>None</Agency>
<FundingOppDescription>This is a test opportunity. Please DO NOT apply!</FundingOppDescription>
<CFDANumber>000000</CFDANumber>
<EligibilityCategory>21</EligibilityCategory>
<AdditionalEligibilityInfo>This is a test opportunity. Please DO NOT apply!</AdditionalEligibilityInfo>
<CostSharing>N</CostSharing>
<ObtainFundingOppText FundingOppURL="">Not Available</ObtainFundingOppText>
<AgencyContact AgencyEmailDescriptor="This is a test opportunity. Please DO NOT apply!" AgencyEmailAddress="This is a test opportunity. Please DO NOT apply!">This is a test opportunity. Please DO NOT apply!</AgencyContact>
</FundingOppSynopsis>
How can I fix the error and remove expired entries (based on ArchiveDate)? I suspect the second to last line in feddata.xsl needs to be fixed.