reading Excel Open XML is ignoring blank cells

I am using the accepted solution here to convert an excel sheet into a datatable. This works fine if I have "perfect" data but if I have a blank cell in the middle of my data it seems to put the wrong data in each column.

I think this is because in the below code:

row.Descendants<Cell>().Count()

is number of populated cells (not all columns) AND:

GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));

seems to find the next populated cell (not necessarily what is in that index) so if the first column is empty and i call ElementAt(0), it returns the value in the second column.

Here is the full parsing code.

DataRow tempRow = dt.NewRow();

for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
    tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
        Console.Write(tempRow[i].ToString());
    }
}

This makes sense since Excel will not store a value for a cell that is null. If you open your file using the Open XML SDK 2.0 Productivity Tool and traverse the XML down to the cell level you will see that only the cells that have data are going to be in that file.

Your options are to insert blank data in the range of cells you are going to traverse or programmatically figure out a cell was skipped and adjust your index appropriately.

I made an example excel document with a string in cell reference A1 and C1. I then opened up the excel document in the Open XML Productivity Tool and here is the XML that was stored:

<x:row r="1" spans="1:3" 
   xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:c r="A1" t="s">
    <x:v>0</x:v>
  </x:c>
  <x:c r="C1" t="s">
    <x:v>1</x:v>
  </x:c>
</x:row>

Here you will see that the data corresponds to the first row and that only two cells worth of data are saved for that row. The data saved corresponds to A1 and C1 and that no cells with null values are saved.

To get the functionality that you need, you can traverse over the Cells as you are doing above, but you will need to check what the value the Cell is referencing and determine if any Cells have been skipped. to do that you will need two utility functions to get the Column Name from the cell reference and to then translate that column name into a zero based index:

    private static List<char> Letters = new List<char>() { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ' };

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Create a regular expression to match the column name portion of the cell name.
        Regex regex = new Regex("[A-Za-z]+");
        Match match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index), it will return the zero based column index.
    /// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ). 
    /// A length of three can be implemented when needed.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful; otherwise null</returns>
    public static int? GetColumnIndexFromName(string columnName)
    {
        int? columnIndex = null;

        string[] colLetters = Regex.Split(columnName, "([A-Z]+)");
        colLetters = colLetters.Where(s => !string.IsNullOrEmpty(s)).ToArray();

        if (colLetters.Count() <= 2)
        {
            int index = 0;
            foreach (string col in colLetters)
            {
                List<char> col1 = colLetters.ElementAt(index).ToCharArray().ToList();
                int? indexValue = Letters.IndexOf(col1.ElementAt(index));

                if (indexValue != -1)
                {
                    // The first letter of a two digit column needs some extra calculations
                    if (index == 0 && colLetters.Count() == 2)
                    {
                        columnIndex = columnIndex == null ? (indexValue + 1) * 26 : columnIndex + ((indexValue + 1) * 26);
                    }
                    else
                    {
                        columnIndex = columnIndex == null ? indexValue : columnIndex + indexValue;
                    }
                }

                index++;
            }
        }

        return columnIndex;
    }

Then you can iterate over the Cells and check to see what the cell reference is compared to the columnIndex. If it is less than then you add blank data to your tempRow, otherwise just read in the value contained in the cell. (Note: I did not test the code below, but the general idea should help):

DataRow tempRow = dt.NewRow();

int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
   // Gets the column index of the cell with data
   int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));

   if (columnIndex < cellColumnIndex)
   {
      do
      {
         tempRow[columnIndex] = //Insert blank data here;
         columnIndex++;
      }
      while(columnIndex < cellColumnIndex);
    }
    tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);

    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
       Console.Write(tempRow[i].ToString());
    }
    columnIndex++;
}

If you want to iterate cells via Open XML look at the free library Spreadsheet light. All of the code samples are in C# but easy enough to transpose to vb.net. All of the code samples are in C# but easy enough to transpose to vb.net.

Here's an implementation of IEnumerable that should do what you want, compiled and unit tested.

    ///<summary>returns an empty cell when a blank cell is encountered
    ///</summary>
    public IEnumerator<Cell> GetEnumerator()
    {
        int currentCount = 0;

        // row is a class level variable representing the current
        // DocumentFormat.OpenXml.Spreadsheet.Row
        foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
            row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
        {
            string columnName = GetColumnName(cell.CellReference);

            int currentColumnIndex = ConvertColumnNameToNumber(columnName);

            for ( ; currentCount < currentColumnIndex; currentCount++)
            {
                yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
            }

            yield return cell;
            currentCount++;
        }
    }

Here are the functions it relies on:

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Match the column name portion of the cell name.
        Regex regex = new Regex("[A-Za-z]+");
        Match match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index),
    /// it will return the zero based column index.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful</returns>
    /// <exception cref="ArgumentException">thrown if the given string
    /// contains characters other than uppercase letters</exception>
    public static int ConvertColumnNameToNumber(string columnName)
    {
        Regex alpha = new Regex("^[A-Z]+$");
        if (!alpha.IsMatch(columnName)) throw new ArgumentException();

        char[] colLetters = columnName.ToCharArray();
        Array.Reverse(colLetters);

        int convertedValue = 0;
        for (int i = 0; i < colLetters.Length; i++)
        {
            char letter = colLetters[i];
            int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
            convertedValue += current * (int)Math.Pow(26, i);
        }

        return convertedValue;
    }

Throw it in a class and give it a try.

To read your blank cell : If you don't want format the result depending what the cell contain, try private static string GetCellValue(Cell cell) { return cell.InnerText; } if you want to format your cell before returning the value of it

Here's a slightly modified version of Waylon's answer which also relied on other answers. It encapsulates his method in a class.

I changed

IEnumerator<Cell> GetEnumerator()

to

IEnumerable<Cell> GetRowCells(Row row)

Here's the class, you don't need to instantiate it, it just serves as an utility class:

public class SpreedsheetHelper
{
    ///<summary>returns an empty cell when a blank cell is encountered
    ///</summary>
    public static IEnumerable<Cell> GetRowCells(Row row)
    {
        int currentCount = 0;

        foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
            row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
        {
            string columnName = GetColumnName(cell.CellReference);

            int currentColumnIndex = ConvertColumnNameToNumber(columnName);

            for (; currentCount < currentColumnIndex; currentCount++)
            {
                yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
            }

            yield return cell;
            currentCount++;
        }
    }

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Match the column name portion of the cell name.
        var regex = new System.Text.RegularExpressions.Regex("[A-Za-z]+");
        var match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index),
    /// it will return the zero based column index.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful</returns>
    /// <exception cref="ArgumentException">thrown if the given string
    /// contains characters other than uppercase letters</exception>
    public static int ConvertColumnNameToNumber(string columnName)
    {
        var alpha = new System.Text.RegularExpressions.Regex("^[A-Z]+$");
        if (!alpha.IsMatch(columnName)) throw new ArgumentException();

        char[] colLetters = columnName.ToCharArray();
        Array.Reverse(colLetters);

        int convertedValue = 0;
        for (int i = 0; i < colLetters.Length; i++)
        {
            char letter = colLetters[i];
            int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
            convertedValue += current * (int)Math.Pow(26, i);
        }

        return convertedValue;
    }
}

Now you're able to get all rows' cells in this way:

// skip the part that retrieves the worksheet sheetData
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach(Row row in rows)
{
    IEnumerable<Cell> cells = SpreedsheetHelper.GetRowCells(row);
    foreach (Cell cell in cells)
    {
         // skip part that reads the text according to the cell-type
    }
}

It will contain all cells even if they are empty.

If you open your file using the Open XML SDK 2.0 Productivity Tool and traverse the XML down to the cell level you will see that only the cells that have data are going to be in that file. Your options are to insert blank data in the range of cells you are going to traverse or programmatically figure out a cell was skipped and adjust your index appropriately.

See my implementation:

  Row[] rows = worksheet.GetFirstChild<SheetData>()
                .Elements<Row>()
                .ToArray();

  string[] columnNames = rows.First()
                .Elements<Cell>()
                .Select(cell => GetCellValue(cell, document))
                .ToArray();

  HeaderLetters = ExcelHeaderHelper.GetHeaderLetters((uint)columnNames.Count());

  if (columnNames.Count() != HeaderLetters.Count())
  {
       throw new ArgumentException("HeaderLetters");
  }

  IEnumerable<List<string>> cellValues = GetCellValues(rows.Skip(1), columnNames.Count(), document);

//Here you can enumerate through the cell values, based on the cell index the column names can be retrieved.

HeaderLetters are collected using this class:

    private static class ExcelHeaderHelper
    {
        public static string[] GetHeaderLetters(uint max)
        {
            var result = new List<string>();
            int i = 0;
            var columnPrefix = new Queue<string>();
            string prefix = null;
            int prevRoundNo = 0;
            uint maxPrefix = max / 26;

            while (i < max)
            {
                int roundNo = i / 26;
                if (prevRoundNo < roundNo)
                {
                    prefix = columnPrefix.Dequeue();
                    prevRoundNo = roundNo;
                }
                string item = prefix + ((char)(65 + (i % 26))).ToString(CultureInfo.InvariantCulture);
                if (i <= maxPrefix)
                {
                    columnPrefix.Enqueue(item);
                }
                result.Add(item);
                i++;
            }
            return result.ToArray();
        }
    }

And the helper methods are:

    private static IEnumerable<List<string>> GetCellValues(IEnumerable<Row> rows, int columnCount, SpreadsheetDocument document)
    {
        var result = new List<List<string>>();
        foreach (var row in rows)
        {
            List<string> cellValues = new List<string>();
            var actualCells = row.Elements<Cell>().ToArray();

            int j = 0;
            for (int i = 0; i < columnCount; i++)
            {
                if (actualCells.Count() <= j || !actualCells[j].CellReference.ToString().StartsWith(HeaderLetters[i]))
                {
                    cellValues.Add(null);
                }
                else
                {
                    cellValues.Add(GetCellValue(actualCells[j], document));
                    j++;
                }
            }
            result.Add(cellValues);
        }
        return result;
    }


private static string GetCellValue(Cell cell, SpreadsheetDocument document)
{
    bool sstIndexedcell = GetCellType(cell);
    return sstIndexedcell
        ? GetSharedStringItemById(document.WorkbookPart, Convert.ToInt32(cell.InnerText))
        : cell.InnerText;
}

private static bool GetCellType(Cell cell)
{
    return cell.DataType != null && cell.DataType == CellValues.SharedString;
}

private static string GetSharedStringItemById(WorkbookPart workbookPart, int id)
{
    return workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(id).InnerText;
}

The solution deals with shared cell items (SST indexed cells).

In Open XML, xml file does not contain an entry for the blank cell that's why blank cells are skipped. I faced the same problem. The only solution is apply some logic. For Example: When we read a cell we can get its ColumnName (A,B,C etc.) by the following code. string cellIndex = GetColumnName( objCurrentSrcCell.CellReference ); where

All good examples. Here is the one I am using since I need to keep track of all rows, cells, values, and titles for correlation and analysis.

The method ReadSpreadsheet opens an xlxs file and goes through each worksheet, row, and column. Since the values are stored in a referenced string table, I also explicitly use that per worksheet. There are other classes used: DSFunction and StaticVariables. The latter holds oft used parameter values, such as the referenced 'quotdouble' ( quotdouble = "\u0022"; ) and 'crlf' (crlf = "\u000D" + "\u000A"; ).

The relevant DSFunction method GetIntColIndexForLetter is included below. It returns an integer value for the column index corresponding to letter names such as (A,B, AA, ADE, etc.). This is used along with the parameter 'ncellcolref' to determine if any columns have been skipped and to enter empty string values for each one that is missing.

I also do some cleaning of the values before storing temporarily in a List object (using Replace method).

Subsequently, I use the hash table (Dictionary) of column names to extract values across different worksheets, correlate them, create normalized values, and then create an object used in our product which is then stored as an XML file. None of this is shown but is why this approach is used.

    public static class DSFunction {

    /// <summary>
    /// Creates an integer value for a column letter name starting at 1 for 'a'
    /// </summary>
    /// <param name="lettstr">Column name as letters</param>
    /// <returns>int value</returns>
    public static int GetIntColIndexForLetter(string lettstr) {
        string txt = "", txt1="";
        int n1, result = 0, nbeg=-1, nitem=0;
        try {
            nbeg = (int)("a".ToCharArray()[0]) - 1; //1 based
            txt = lettstr;
            if (txt != "") txt = txt.ToLower().Trim();
            while (txt != "") {
                if (txt.Length > 1) {
                    txt1 = txt.Substring(0, 1);
                    txt = txt.Substring(1);
                }
                else {
                    txt1 = txt;
                    txt = "";
                }
                if (!DSFunction.IsNumberString(txt1, "real")) {
                    nitem++;
                    n1 = (int)(txt1.ToCharArray()[0]) - nbeg;
                    result += n1 + (nitem - 1) * 26;
                }
                else {
                    break;
                }
            }
        }
        catch (Exception ex) {
            txt = ex.Message;
        }
        return result;
    }


}


    public static class Extractor {

    public static string ReadSpreadsheet(string fileUri) {
        string msg = "", txt = "", txt1 = "";
        int i, n1, n2, nrow = -1, ncell = -1, ncellcolref = -1;
        Boolean haveheader = true;
        Dictionary<string, int> hashcolnames = new Dictionary<string, int>();
        List<string> colvalues = new List<string>();
        try {
            if (!File.Exists(fileUri)) { throw new Exception("file does not exist"); }
            using (SpreadsheetDocument ssdoc = SpreadsheetDocument.Open(fileUri, true)) {
                var stringTable = ssdoc.WorkbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
                foreach (Sheet sht in ssdoc.WorkbookPart.Workbook.Descendants<Sheet>()) {
                    nrow = 0;
                    foreach (Row ssrow in ((WorksheetPart)(ssdoc.WorkbookPart.GetPartById(sht.Id))).Worksheet.Descendants<Row>()) {
                        ncell = 0;
                        ncellcolref = 0;
                        nrow++;
                        colvalues.Clear();
                        foreach (Cell sscell in ssrow.Elements<Cell>()) {
                            ncell++;
                            n1 = DSFunction.GetIntColIndexForLetter(sscell.CellReference);
                            for (i = 0; i < (n1 - ncellcolref - 1); i++) {
                                if (nrow == 1 && haveheader) {
                                    txt1 = "-missing" + (ncellcolref + 1 + i).ToString() + "-";
                                    if (!hashcolnames.TryGetValue(txt1, out n2)) {
                                        hashcolnames.Add(txt1, ncell - 1);
                                    }
                                }
                                else {
                                    colvalues.Add("");
                                }
                            }
                            ncellcolref = n1;
                            if (sscell.DataType != null) {
                                if (sscell.DataType.Value == CellValues.SharedString && stringTable != null) {
                                    txt = stringTable.SharedStringTable.ElementAt(int.Parse(sscell.InnerText)).InnerText;
                                }
                                else if (sscell.DataType.Value == CellValues.String) {
                                    txt = sscell.InnerText;
                                }
                                else txt = sscell.InnerText.ToString();
                            }
                            else txt = sscell.InnerText;
                            if (txt != "") txt1 = txt.ToLower().Trim(); else txt1 = "";
                            if (nrow == 1 && haveheader) {
                                txt1 = txt1.Replace(" ", "");
                                if (txt1 == "table/viewname") txt1 = "tablename";
                                else if (txt1 == "schemaownername") txt1 = "schemaowner";
                                else if (txt1 == "subjectareaname") txt1 = "subjectarea";
                                else if (txt1.StartsWith("column")) {
                                    txt1 = txt1.Substring("column".Length);
                                }
                                if (!hashcolnames.TryGetValue(txt1, out n1)) {
                                    hashcolnames.Add(txt1, ncell - 1);
                                }
                            }
                            else {
                                txt = txt.Replace(((char)8220).ToString(), "'");  //special "
                                txt = txt.Replace(((char)8221).ToString(), "'"); //special "
                                txt = txt.Replace(StaticVariables.quotdouble, "'");
                                txt = txt.Replace(StaticVariables.crlf, " ");
                                txt = txt.Replace("  ", " ");
                                txt = txt.Replace("<", "");
                                txt = txt.Replace(">", "");
                                colvalues.Add(txt);
                            }
                        }
                    }
                }
            }
        }
        catch (Exception ex) {
            msg = "notok:" + ex.Message;
        }
        return msg;
    }





}

Count if cell is not blank; ISBLANK in Excel not working; Excel ISBLANK function. The ISBLANK function in Excel checks whether a cell is blank or not. Like other IS functions, it always returns a Boolean value as the result: TRUE if a cell is empty and FALSE if a cell is not empty. The syntax of ISBLANK assumes just one argument:

For example, you want to add 1 to all cells in range D2:D7, and if you directly use the formula =D2+1, and the blank cells also will be added 1. So if you want to ignore all blank cells in this formula, you need to use IF function to check if it is a blank cell, if TRUE, then ignore it. Otherwise, apply this formula.

If you examine the Xml of a Worksheet part, you will see that its cell data is stored in the "sheetData" element, which is empty when no cells have values. In order to test the value of a cell, you first need to test whether it exists, which, in turn, entails testing whether its containing row exists.

This example shows how to loop through Excel cells and how to deal with empty cells which are omitted or ignored by OpenXML More details with ready to copy and paste code: http

Comments
  • Please see this answer in the same thread that you've mentioned. It has the fix for empty cells.
  • do you know anyway to detect if there was a blank cell. that is my problem. i want a solution that reads in exactly what is on the sheet (including blanks)
  • @ooo - The only way to detect it is if the cell reference doesn't exist in the list of child descendents
  • See @amurra's answer here to see a definition for the "Letters" List.
  • What if I have columns till AH?
  • Also sometimes cell.CellReference is null
  • Can someone show me how to implement this explicitly? Thanks!
  • Some context would be much better for the enumerable example.