How To Parse CSV Using XSLT

Parsing simple CSV files is easy using the EXSLT String extension.
Let’s have the following source file.

Source File

<ROOT>
  <WorkQueueEntry>
    <Data>first1,first2,first3
second1,second2,second3
third1,third2,third3</Data>
  </WorkQueueEntry>
</ROOT>

We will use the following transformation.

Don’t forget to include the xmlns:str="http://exslt.org/strings" namespace declaration and to include it also in exclude-result-prefixes so the str namespace does not appear in the target document.

Transformation

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:AS="http://schema.advantages.cz/AsapFunctions"
    xmlns:date="http://exslt.org/dates-and-times"
    xmlns:str="http://exslt.org/strings"
    exclude-result-prefixes="AS date str">
    <xsl:template match="ROOT">
        <ROOT>
            <xsl:apply-templates select="WorkQueueEntry"/>
        </ROOT>
    </xsl:template>
    <xsl:template match="WorkQueueEntry">
        <!-- We split the source text into lines in this case by CR LF -->
        <xsl:variable name="lines" select="str:split(Data, '&#xD;&#xA;')"/>

        <!-- Now we process each line -->
        <xsl:for-each select="$lines">
            <!-- We split the line into fields in this case by comma -->
            <xsl:variable name="fields" select="str:split(., ',')"/>
            <!-- Now we can process each field -->
            <TargetEntity
                field1="{$fields[1]}"
                field2="{$fields[2]}"
                field3="{$fields[3]}"
            />
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

This will produce the following result.

Result

<ROOT>
  <TargetEntity field1="first1" field2="first2" field3="first3" />
  <TargetEntity field1="second1" field2="second2" field3="second3" />
  <TargetEntity field1="third1" field2="third2" field3="third3" />
</ROOT>

I agree. But what if you have different data input source file, where numbers are without quotation marks and strings are closed inside quotation marks. Then comes very nice situation, that inside quotation markes can appear COMMA. It only complicates it, of course.

Source file

 <ROOT>
    <WorkQueueEntry>
      <Data>"first1",first2,"first3, first4"
  second1,second2,second3
  third1,"third2,third3",third4</Data>
    </WorkQueueEntry>
 </ROOT>

Then I suggest using the following transformation.

Don’t forget to include the xmlns:regex="http://exslt.org/regular-expressions" namespace declaration and to include it also in exclude-result-prefixes so the regex namespace does not appear in the target document and you can use regular expression.

Transformation

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
  xmlns:AS="http://schema.advantages.cz/AsapFunctions"
  xmlns:date="http://exslt.org/dates-and-times"
  xmlns:str="http://exslt.org/strings"
  xmlns:regex="http://exslt.org/regular-expressions"
  exclude-result-prefixes="AS date str regex">

  <xsl:template match="ROOT">
    <ROOT>
      <xsl:apply-templates select="WorkQueueEntry"/>
    </ROOT>
  </xsl:template>

  <xsl:template match="WorkQueueEntry">
    <!-- We split the source text into lines in this case by CR LF -->
    <xsl:variable name="lines" select="str:split(Data,'&#xD;&#xA;')"/>
    <!-- Now we process each line -->
    <xsl:for-each select="$lines">
      <!-- We split the line into Fld/row in this case by comma and replace all commas inside quotation marks for "_"-->
      <xsl:variable name="fields" select="str:split(regex:replace(.,',(?=[^&quot;]*&quot;(?:[^&quot;\r\n]*&quot;[^&quot;]*&quot;)[^&quot;\r\n]*$)','ig','_'),',')"/>
      <xsl:variable name="fldClr">
        <xsl:for-each select="$fields">
          <row>
            <xsl:value-of select="str:replace(.,'&quot;','')" />
          </row>
        </xsl:for-each>
      </xsl:variable>

      <xsl:variable name="Fld" select="AS:NodeSet($fldClr)" />

      <!-- now the result NodeSet can be accessed and processed -->
      <TargetEntity>
        <xsl:for-each select="$Fld/row">
          <xsl:attribute name="concat('field',position())">
            <xsl:value-of select="current()" />
          </xsl:attribute>
        </xsl:for-each>
      </TargetEntity>
      </xls:for-each>
    </xsl:template>
</xsl:stylesheet>

This will produce the following result.

 <ROOT>
   <TargetEntity field1="first1" field2="first2" field3="first3_ first4" />
   <TargetEntity field1="second1" field2="second2" field3="second3" />
   <TargetEntity field1="third1" field2="third2_third3" field3="third4" />
 </ROOT>
2 Likes