Reading Big XLS and XLSX files

I'm aware of the existing posts on this topic and have tried several approaches to reach my objective, as I elaborate below:

I have a .zip/.rar that contains multiple xls and xlsx files.

Each Excel file contains from dozens up to thousands of rows and around 90 columns, give or take (each file can have more or fewer columns).

I've created a Java WindowBuilder application in which I select a .zip/.rar file and a destination folder; the files are unzipped there and written using FileOutputStream. After each file is saved, I read its content.
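A minimal sketch of the extraction step described above, using only `java.util.zip` from the JDK. The class and method names (`SpreadsheetUnzipper`, `extractSpreadsheets`) are hypothetical, and the sketch assumes a .zip archive; the JDK has no .rar support, so that case would need a third-party library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not taken from the original application.
final class SpreadsheetUnzipper {

    /** Extracts every .xls/.xlsx entry into targetDir and returns the written paths. */
    static List<Path> extractSpreadsheets(InputStream archive, Path targetDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (entry.isDirectory() || !(name.endsWith(".xls") || name.endsWith(".xlsx"))) {
                    continue; // skip folders and non-spreadsheet entries
                }
                // Refuse entries such as "../../x.xlsx" that would land outside targetDir.
                Path out = root.resolve(entry.getName()).normalize();
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                Files.createDirectories(out.getParent());
                // Copies only the current entry; the zip stream itself stays open.
                Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                written.add(out);
            }
        }
        return written;
    }
}
```

Each returned path can then be handed to whatever reads the workbook, one file at a time, so only one spreadsheet is open at any moment.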

So far so good. After several attempts to avoid OOM (OutOfMemoryError) and speed things up, I've reached the 'final version' (which is quite awful, but it will do until I figure out how to read things properly), which I will explain:

File file = new File("certainFile.xlsx"); // or .xls; for example purposes
Workbook wb;
Sheet sheet;
/*
There is a ton of other things up to this point that I don't consider relevant, as it's related to unzipping and renaming, etc.
This is within a loop.

In every zip file, there are at least 1 or 2 files that, when passed to
WorkbookFactory.create(), still give an OOM because POI recognizes that the file has
a bit over a million rows, meaning it's a 2007-format file (according to our friend Google.com), or so I believe.
When I open the xlsx file, it is indeed around 10-20 MB and has thousands of empty rows. When I save it again,
it is 1 MB with a couple thousand rows. After many attempts to read it as an InputStream or a File, or to re-save it
automatically, I settled on converting it to a CSV and reading that instead,
hence this 'solution'. If parseAsXLS is true, I apply my regular logic
per row per cell; otherwise I parse the CSV.
*/
if (file.getName().contains("xlsx")) {
    this.parseAsXLS = false;
    OPCPackage pkg = OPCPackage.open(file);
    // This only writes the content to a CSV file that I read later on; it gets overwritten every time this code runs
    FileOutputStream fo = new FileOutputStream(this.filePath + File.separator + "excel.csv");
    PrintStream ps = new PrintStream(fo);
    XLSX2CSV xlsxCsvConverter = new XLSX2CSV(pkg, ps, 90);
    try {
        xlsxCsvConverter.process();
    } catch (Exception e) {
        // I've added a counter to the XLSX2CSV class to limit the number of rows I fetch; it throws an exception on purpose
        System.out.println("Limited the file at 60k rows");
    }
} else {
    this.parseAsXLS = true;
    this.wb = WorkbookFactory.create(file);
    this.sheet = wb.getSheetAt(0);
}
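The `contains("xlsx")` test above trusts the file name. A minimal sketch of checking the file's leading bytes instead, assuming only the well-known container signatures (OLE2 `D0 CF 11 E0 A1 B1 1A E1` for .xls, ZIP `PK 03 04` for .xlsx). The class name `ExcelFormatSniffer` is hypothetical and not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: guesses the spreadsheet container from its first bytes.
final class ExcelFormatSniffer {

    enum Format { XLS, XLSX, UNKNOWN }

    // OLE2 compound document header, used by the binary .xls format.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header, used by .xlsx (and by every other ZIP-based format).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    static Format sniff(InputStream in) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        if (startsWith(head, filled, OLE2_MAGIC)) {
            return Format.XLS;
        }
        if (startsWith(head, filled, ZIP_MAGIC)) {
            return Format.XLSX;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int filled, int[] magic) {
        if (filled < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((head[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The signature only identifies the container: a .docx is also a ZIP and a .doc is also OLE2, so this is best combined with the extension filter rather than used in place of it.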

What happens now is that an .xlsx file (from a .zip with several other .xls and .xlsx files) has a certain character in a row that XLSX2CSV seems to treat as the end of the row, which results in incorrect output.

This is an example: imagelink

Note: The objective is to fetch only a certain set of columns that the files have in common (or might have; they are not required) from each Excel file and put them together in a new Excel file. The email column (which contains multiple emails separated by commas) has what I believe to be a line break ('enter') before the email, because if I erase it manually, the problem goes away. However, the objective is to avoid manually opening and fixing every file; otherwise I'd just open each one and copy-paste the columns I need. In that example, I'd require columns fieldAA, fieldAG, fieldAL and fieldAN.
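The column selection described in the note can be sketched independently of how the rows are read. This is a minimal sketch assuming each row is already available as a list of strings and that the first row holds the header names; `ColumnPicker`, `locate` and `pick` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: projects rows onto a fixed set of wanted headers.
final class ColumnPicker {

    /** For each wanted header, returns its index in headerRow, or -1 if this file lacks it. */
    static int[] locate(List<String> headerRow, List<String> wanted) {
        int[] idx = new int[wanted.size()];
        for (int i = 0; i < wanted.size(); i++) {
            idx[i] = -1;
            for (int j = 0; j < headerRow.size(); j++) {
                String header = headerRow.get(j);
                // Tolerate stray whitespace and case differences between files.
                if (header != null && header.trim().equalsIgnoreCase(wanted.get(i))) {
                    idx[i] = j;
                    break;
                }
            }
        }
        return idx;
    }

    /** Returns the wanted cells of one row, with "" for columns missing from the file or the row. */
    static List<String> pick(List<String> row, int[] idx) {
        List<String> out = new ArrayList<>(idx.length);
        for (int i : idx) {
            boolean present = i >= 0 && i < row.size() && row.get(i) != null;
            out.add(present ? row.get(i) : "");
        }
        return out;
    }
}
```

Calling `locate` once per file with the wanted names (fieldAA, fieldAG, fieldAL, fieldAN in the example) and `pick` once per row yields rows of identical width for every file, which can then be appended to the combined output.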

XLSX2CSV.java (I'm not the creator of this file, I just applied my needs to it)

import java.awt.List;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * A rudimentary XLSX -> CSV processor modeled on the
 * POI sample program XLS2CSVmra from the package
 * org.apache.poi.hssf.eventusermodel.examples.
 * As with the HSSF version, this tries to spot missing
 *  rows and cells, and output empty entries for them.
 * <p>
 * Data sheets are read using a SAX parser to keep the
 * memory footprint relatively small, so this should be
 * able to read enormous workbooks.  The styles table and
 * the shared-string table must be kept in memory.  The
 * standard POI styles table class is used, but a custom
 * (read-only) class is used for the shared string table
 * because the standard POI SharedStringsTable grows very
 * quickly with the number of unique strings.
 * <p>
 * For a more advanced implementation of SAX event parsing
 * of XLSX files, see {@link XSSFEventBasedExcelExtractor}
 * and {@link XSSFSheetXMLHandler}. Note that for many cases,
 * it may be possible to simply use those with a custom 
 * {@link SheetContentsHandler} and no SAX code needed of
 * your own!
 */
public class XLSX2CSV {
    /**
     * Uses the XSSF Event SAX helpers to do most of the work
     *  of parsing the Sheet XML, and outputs the contents
     *  as a (basic) CSV.
     */
    private class SheetToCSV implements SheetContentsHandler {
        private boolean firstCellOfRow;
        private int currentRow = -1;
        private int currentCol = -1;
        private int maxrows = 60000;



        private void outputMissingRows(int number) {

            for (int i=0; i<number; i++) {
                for (int j=0; j<minColumns; j++) {
                    output.append(',');
                }
                output.append('\n');
            }
        }

        @Override
        public void startRow(int rowNum) {
            // If there were gaps, output the missing rows
            outputMissingRows(rowNum-currentRow-1);
            // Prepare for this row
            firstCellOfRow = true;
            currentRow = rowNum;
            currentCol = -1;

            if (rowNum == maxrows) {
                    throw new RuntimeException("Force stop at maxrows");
            }
        }

        @Override
        public void endRow(int rowNum) {
            // Ensure the minimum number of columns
            for (int i=currentCol; i<minColumns; i++) {
                output.append(',');
            }
            output.append('\n');
        }

        @Override
        public void cell(String cellReference, String formattedValue,
                XSSFComment comment) {
            if (firstCellOfRow) {
                firstCellOfRow = false;
            } else {
                output.append(',');
            }            

            // gracefully handle missing CellRef here in a similar way as XSSFCell does
            if(cellReference == null) {
                cellReference = new CellAddress(currentRow, currentCol).formatAsString();
            }

            // Did we miss any cells?
            int thisCol = (new CellReference(cellReference)).getCol();
            int missedCols = thisCol - currentCol - 1;
            for (int i=0; i<missedCols; i++) {
                output.append(',');
            }
            currentCol = thisCol;

            // Number or string?
            try {
                //noinspection ResultOfMethodCallIgnored
                Double.parseDouble(formattedValue);
                output.append(formattedValue);
            } catch (NumberFormatException e) {
                output.append('"');
                output.append(formattedValue);
                output.append('"');
            }
        }

        @Override
        public void headerFooter(String arg0, boolean arg1, String arg2) {
            // TODO Auto-generated method stub

        }
    }


    ///////////////////////////////////////

    private final OPCPackage xlsxPackage;

    /**
     * Number of columns to read starting with leftmost
     */
    private final int minColumns;

    /**
     * Destination for data
     */
    private final PrintStream output;

    /**
     * Creates a new XLSX -> CSV converter
     *
     * @param pkg        The XLSX package to process
     * @param output     The PrintStream to output the CSV to
     * @param minColumns The minimum number of columns to output, or -1 for no minimum
     */
    public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
        this.xlsxPackage = pkg;
        this.output = output;
        this.minColumns = minColumns;
    }

    /**
     * Parses and shows the content of one sheet
     * using the specified styles and shared-strings tables.
     *
     * @param styles The table of styles that may be referenced by cells in the sheet
     * @param strings The table of strings that may be referenced by cells in the sheet
     * @param sheetInputStream The stream to read the sheet-data from.

     * @exception java.io.IOException An IO exception from the parser,
     *            possibly from a byte stream or character stream
     *            supplied by the application.
     * @throws SAXException if parsing the XML data fails.
     */
    public void processSheet(
            StylesTable styles,
            ReadOnlySharedStringsTable strings,
            SheetContentsHandler sheetHandler, 
            InputStream sheetInputStream) throws IOException, SAXException {
        DataFormatter formatter = new DataFormatter();
        InputSource sheetSource = new InputSource(sheetInputStream);
        try {
            XMLReader sheetParser = SAXHelper.newXMLReader();
            ContentHandler handler = new XSSFSheetXMLHandler(
                  styles, null, strings, sheetHandler, formatter, false);
            sheetParser.setContentHandler(handler);
            sheetParser.parse(sheetSource);
         } catch(ParserConfigurationException e) {
            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
         }
    }

    /**
     * Initiates the processing of the XLS workbook file to CSV.
     *
     * @throws IOException If reading the data from the package fails.
     * @throws SAXException if parsing the XML data fails.
     */
    public void process() throws IOException, OpenXML4JException, SAXException {
        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
        XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            try (InputStream stream = iter.next()) {
                processSheet(styles, strings, new SheetToCSV(), stream);
            }
            ++index;
        }
    }
} 

I'm in search of different (and working) approaches to my objective.

Thank you for your time


Okay, so I tried replicating your Excel file and completely threw XLSX2CSV out the window. I don't think converting the xlsx to CSV is the right approach because, depending on the XLSX file, it can read all the empty rows (you probably know that, since you set a row counter of 60k). On top of that, fields containing special characters may produce incorrect output, as in your problem.

What I did was use this library, https://github.com/davidpelfree/sjxlsx, to read and rewrite the file. It's pretty straightforward, and the newly generated xlsx file has the fields corrected.

I suggest you try this approach (maybe not with this lib) of rewriting the file in order to correct it.
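If the CSV intermediate step is kept anyway, the cell text can be neutralised before it is written. A minimal sketch, assuming the culprit really is a raw line break inside the email cell and that the CSV is later read line by line; `CsvField.escape` is a hypothetical helper, not part of POI or XLSX2CSV.

```java
// Hypothetical helper: makes a cell value safe for a line-oriented CSV reader.
final class CsvField {

    /** Collapses embedded line breaks to a single space and quotes the value CSV-style. */
    static String escape(String value) {
        if (value == null) {
            return "\"\"";
        }
        // Replace each run of line breaks, plus any whitespace around it, with one space.
        String flat = value.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
        // Double any embedded quotes, as CSV requires inside a quoted field.
        return '"' + flat.replace("\"", "\"\"") + '"';
    }
}
```

In the posted XLSX2CSV, the three `output.append` calls in the `catch` branch of `cell()` could be replaced by a single `output.append(CsvField.escape(formattedValue))`, so a cell never spans more than one line of the CSV.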



How about this:

// Get the zip file; billWater is the archive, and "gbk" is the charset of its entry names
ZipFile zipFile = new ZipFile(billWater, Charset.forName("gbk"));

// Alternatively, stream the archive entry by entry
ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(billWater), Charset.forName("gbk"));

// Parse one CSV entry of the archive into beans using OpenCSV
public static <T> List<T> processCSVFileByZip(ZipFile zipFile, ZipEntry zipEntry, Class<? extends T> clazz, Charset charset) throws IOException {
    // try-with-resources closes the reader once parsing is done
    try (Reader in = new InputStreamReader(zipFile.getInputStream(zipEntry), charset)) {
        return processCSVFile(in, clazz, charset, ',');
    }
}

public static <T> List<T> processCSVFile(Reader in, Class<? extends T> clazz, Charset charset, char sep) {
    // charset is unused here because the Reader already decodes the bytes
    CsvToBean<T> csvToBean = new CsvToBeanBuilder<T>(in)
            .withType(clazz).withSkipLines(1)
            .withIgnoreLeadingWhiteSpace(true).withSeparator(sep)
            .build();
    return csvToBean.parse();
}

It seems to depend on the xlsx file format, though.



I think there are at least two open questions in here:

  1. Out of memory in WorkbookFactory.create() when opening old-style XLS files which are sparse

  2. XLSX2CSV is corrupting your new-style XLSX files, possibly due to "a certain character [incorrectly treated as] endRow"

For (1), I would say that you need to find a Java XLS library which either handles sparse files without allocating memory for the empty cells, or can process the file in a streaming manner instead of loading it all at once as WorkbookFactory does.

For (2), you need to find a Java XLSX library which won't corrupt your data.

I don't know of any good Java libraries for (1) or (2), sorry.

However, I would like to suggest that you write this script in Excel rather than in Java. Excel has a built-in scripting language, Excel VBA, which can handle opening multiple files, extracting data from them, etc. Also, a script running in Excel VBA should not have the trouble with sparse tables or XLSX parsing that you are encountering in Java.

(You might also like to take a step back and evaluate how long it might take to do this by hand, if it is a one-off job, compared to how long you will need to spend to script this task.)

Good luck!
