Split giant XML file into n-child versions

For example, the giant file has 50 million lines like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <activity>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
  </activity>
</root>

Each 'child' file would have the same structure but be about 5 million lines long, or 1/10th of the original.

The reason for this is to make importing the data into a database (via SQL Server's OPENXML) more manageable, without exhausting memory.

Is XSLT the best choice here?

The Enterprise Edition of Saxon 9.8 (Saxon 9.8 EE) supports the streaming feature of the XSLT 3.0 specification (a W3C Recommendation since June 2017). Streaming lets you use a subset of XSLT to read through an XML document in a forwards-only way, using only the memory needed to hold the currently visited node and its ancestors.

Using that approach you can write code like for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size" to do a positional grouping that reads through the file deliv by deliv element and collects them into groups of $size:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <xsl:param name="size" as="xs:integer" select="1000"/>

    <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

    <xsl:template match="root">
        <xsl:for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size">
            <xsl:result-document href="split-{format-number(current-grouping-key() + 1, '00000')}.xml" indent="yes">
                <root>
                    <activity>
                        <xsl:copy-of select="current-group()"/>
                    </activity>
                </root>
            </xsl:result-document>
        </xsl:for-each-group>
    </xsl:template>

</xsl:stylesheet>

That splits your input into a number of files, each containing $size deliv elements, except possibly the last one, which holds the remaining deliv elements if fewer than $size are left.

Using Saxon EE requires a commercial license, but trial licenses are available.

XSLT 2.0 and above is a good fit for this task, and XSLT 3.0 even supports streaming.

XSLT could do this job. I'd recommend getting your hands on an XSLT v2.0 processor so that you can use xsl:result-document. Then you'd need a little bit of logic to decide when to split between your files. You could base this off the position() of the deliv elements, or try using xsl:for-each-group to make groups that are sent to each file.
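
As a rough illustration of the position()-based idea, here is a minimal XSLT 2.0 sketch. It is not taken from any of the answers; it assumes the root/activity/deliv structure from the question and borrows the size parameter and split-NNNNN.xml naming from the streaming stylesheet above, and the variable names delivs and chunk are made up for the example. It is not streamed, so the processor has to hold the whole input in memory, which may not be feasible for multi-gigabyte files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <!-- Number of deliv elements per output file (assumed default) -->
    <xsl:param name="size" as="xs:integer" select="1000"/>

    <xsl:template match="/root">
        <!-- All deliv elements in document order; requires the full tree in memory -->
        <xsl:variable name="delivs" select="activity/deliv"/>

        <!-- Visit the first deliv of every chunk: positions 1, size + 1, 2 * size + 1, ... -->
        <xsl:for-each select="$delivs[(position() - 1) mod $size = 0]">
            <!-- Here position() counts the chunks, starting at 1 -->
            <xsl:variable name="chunk" as="xs:integer" select="position()"/>
            <xsl:result-document href="split-{format-number($chunk, '00000')}.xml" indent="yes">
                <root>
                    <activity>
                        <!-- Copy up to $size deliv elements belonging to this chunk -->
                        <xsl:copy-of select="subsequence($delivs, ($chunk - 1) * $size + 1, $size)"/>
                    </activity>
                </root>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```

The xsl:for-each-group variant tends to be the more natural fit, since the grouping key does the chunk arithmetic for you; the version here only shows that plain position() arithmetic is enough when grouping is not wanted.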

Comments
  • Thanks @zx485, I am new to XSLT so this is helpful.
  • Tried this on my 3.6Gb file, but running out of memory unfortunately: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  • That's unfortunate. I have never used XSLT-3.0 streaming which would be your number one choice for this case. Maybe someone else can help out.
  • Martin Honnen shows how to make the solution streamable.