How can I split data from a string with Unicode?

Good morning, I have a question. I need to recover data from a string with Unicode escapes, for example:

"\u001f\u0001\u0013FERREIRA RAMOS MUZI\u001f\u0002\0\u001f\u0003\aRICARDO\u001f\u0004\u0003URY\u001f\u0005\b09031979\u001f\u0006\u000eMONTEVIDEO/URY\u001f\a\b34946682\u001f\b\u0004\"\a \u0016\u001f\t\b22072026\u001f\n\0"

The same string as bytes (hex):

1F011346455252454952412052414D4F53204D555A491F02001F03075249434152444F1F04035552591F050830393033313937391F060E4D4F4E5445564944454F2F5552591F070833343934363638321F0804220720161F090832323037323032361F0A00

I need to recover the name, last name, etc. into an ArrayList or a string array, for example:

string[] array = { "Stephen", "King", "11301958", "NewYork/Usa" };

My problem is that if I use

System.Text.Encoding.UTF8.GetString(ByteArray);

to get the data, I only get the name and last name, not the dates or the place of origin.

How can I get those from this string?

You will probably have to create a custom parser:

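byte[] bytes = ByteArray; // your data here
// Parser: each field is 0x1F, a field index, a length byte, then the data.
List<string> words = new List<string>();
for (var i = 0; i + 2 < bytes.Length; i++) {
    if (0x1F == bytes[i]) {
        int index = bytes[i+1]; // field number, ignored here
        int len = bytes[i+2];   // data length in bytes
        // Convert the field's data bytes to a string
        words.Add(System.Text.Encoding.UTF8.GetString(bytes, i+3, len));
        // Jump to the last byte of this field; the loop's i++ then lands on the next 0x1F
        i += len + 2;
    }
}
Console.WriteLine(String.Join("\n", words.ToArray()));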

Output:

FERREIRA RAMOS MUZI

RICARDO
URY
09031979
MONTEVIDEO/URY
34946682
"           - some non-printable chars here
22072026

Looks like some fields will need special parsing.
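
Field 8 is one such field. Its four data bytes are 22 07 20 16, and read as hex digits they give 22072016, which looks like a packed (BCD) date. The following is only a sketch under that assumption, since the format is undocumented; `BcdField` and `DecodeBcd` are hypothetical names, not part of any library.

```csharp
using System;
using System.Text;

static class BcdField
{
    // Hypothetical helper: renders packed BCD bytes as a digit string,
    // two decimal digits per byte, high nibble first.
    public static string DecodeBcd(byte[] data)
    {
        var sb = new StringBuilder(data.Length * 2);
        foreach (byte b in data)
        {
            int high = b >> 4;
            int low = b & 0x0F;
            // A nibble above 9 means the bytes are not BCD after all.
            if (high > 9 || low > 9)
                throw new FormatException($"0x{b:X2} is not a valid BCD byte.");
            sb.Append((char)('0' + high)).Append((char)('0' + low));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Data bytes of field 8 from the sample in the question.
        byte[] field8 = { 0x22, 0x07, 0x20, 0x16 };
        Console.WriteLine(DecodeBcd(field8)); // prints 22072016
    }
}
```

Whether 22072016 really is a date (for example an issue date next to the 22072026 in field 9) is a guess; the helper only makes the digits readable.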

It looks like a combination of binary data and strings, where each string is preceded by a count. This code may help:

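string input = "\u001f\u0001\u0013FERREIRA RAMOS MUZI\u001f\u0002\0\u001f\u0003\aRICARDO\u001f\u0004\u0003URY\u001f\u0005\b09031979\u001f\u0006\u000eMONTEVIDEO/URY\u001f\a\b34946682\u001f\b\u0004\"\a \u0016\u001f\t\b22072026\u001f\n\0";
// HtmlDecode leaves this input unchanged because it contains no HTML entities.
string output = System.Net.WebUtility.HtmlDecode(input);
// Split on the field separator U+001F; each piece still starts with
// two header characters (field index and length).
string[] lines = output.Split(new char[] { '\u001f' });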
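
Each piece produced by the split still begins with the field index and the length character. A minimal sketch of stripping those two characters follows; `SplitFields` and `Parse` are hypothetical names. It assumes that no field's index, length, or data ever contains the character U+001F and that the data survives UTF-8 decoding, which the sample satisfies but the format does not guarantee.

```csharp
using System;
using System.Collections.Generic;

static class SplitFields
{
    // Hypothetical helper: splits on U+001F and drops the two header
    // characters (field index, length) at the start of each piece.
    public static List<string> Parse(string input)
    {
        var fields = new List<string>();
        foreach (string piece in input.Split('\u001f'))
        {
            // Skips the empty piece in front of the first separator.
            if (piece.Length < 2) continue;
            fields.Add(piece.Substring(2));
        }
        return fields;
    }

    static void Main()
    {
        // First four fields of the sample string from the question.
        string input = "\u001f\u0001\u0013FERREIRA RAMOS MUZI\u001f\u0002\0\u001f\u0003\aRICARDO\u001f\u0004\u0003URY";
        foreach (string field in Parse(input))
            Console.WriteLine(field);
    }
}
```

Because the split is on the separator rather than on spaces, a multi-word value such as "FERREIRA RAMOS MUZI" stays in one entry, and empty fields come back as empty strings.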

My Solution:

I detect only letters (a-zA-Z) and digits with a regular expression. When the regular expression fails or the character is a space, the current word is complete and is added to a list. At the end I have a list with all the words and numbers I need.

1- Convert Byte[] Data to string

// Convert utf-8 bytes to a string.
s_unicode2 = System.Text.Encoding.UTF8.GetString(apduRsp.Data);

List<string> test = new List<string>();
if (s_unicode2.Length > 0)
{
   test = GetWords(s_unicode2);
}

2- Call GetWords() with string converted from Byte[]

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
private List<string> GetWords(string text)
{
    // A word character is an ASCII letter, a digit, or '/'.
    Regex reg = new Regex("[a-zA-Z0-9/]");
    string word = "";
    List<string> words = new List<string>();
    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        if (char.IsHighSurrogate(c) && i + 1 < text.Length)
        {
            // Keep a surrogate pair together as its own entry.
            i++;
            words.Add(new string(new[] { c, text[i] }));
        }
        else if (reg.IsMatch(c.ToString()))
        {
            word += c;
        }
        else
        {
            // Any other character (space, control byte, ...) ends the current word.
            if (word.Length > 0)
                words.Add(word);
            word = "";
        }
    }
    // Flush a word that runs to the end of the text.
    if (word.Length > 0)
        words.Add(word);
    return words;
}

3- Result from GetWords()

That solution is good enough for me at the moment, but some people have two names, and that is a small problem when displaying the result.

Comments
  • 1F 01 13FERREIRA RAMOS MUZI; 1F field start, 01 field index, 13 length in bytes (19). This is much better processed as a byte array than as a string. This is a custom binary format; where are you getting it from and does it have documentation with some recommended way of processing it?
  • I have added the byte string.
  • The problem is that some fields aren't strings and even the fields that are strings aren't always clear in their purpose. Field 8 contains 0x22072016, which appears to be a BCD encoding of similar data to what's encoded as a string in field 9 (22072026). To properly decode this, you need to know what all those fields mean. Of course you can guess, but this doesn't look like the kind of data where you're supposed to guess.
  • It is not Unicode, use BinaryReader to read this data. There are 10 fields, each field starts with 0x1f. The second byte is the field number (0x01..0x0A). The third byte is the data length, followed by the data bytes. Fields 2 and 10 are empty, field 8 is a pretty wonky one that resembles a date (22072016).
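
The BinaryReader approach from the last comment could look roughly like this. It is a sketch based only on the layout described in the comments (0x1F, field number, length, data); `FieldReader` and `ReadFields` are hypothetical names, and keeping the data as raw bytes is a design choice so that non-text fields such as field 8 are not damaged by string decoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class FieldReader
{
    // Hypothetical helper: reads (0x1F, field number, length, data) records
    // and returns the raw data bytes keyed by field number.
    public static Dictionary<int, byte[]> ReadFields(byte[] bytes)
    {
        var fields = new Dictionary<int, byte[]>();
        using (var reader = new BinaryReader(new MemoryStream(bytes)))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                if (reader.ReadByte() != 0x1F)
                    throw new InvalidDataException("Expected field marker 0x1F.");
                int number = reader.ReadByte();
                int length = reader.ReadByte();
                byte[] data = reader.ReadBytes(length);
                // ReadBytes returns fewer bytes when the stream ends early.
                if (data.Length != length)
                    throw new EndOfStreamException("Field data is truncated.");
                fields[number] = data;
            }
        }
        return fields;
    }

    static void Main()
    {
        // Fields 2 to 4 of the sample data: empty, "RICARDO", "URY".
        byte[] sample = {
            0x1F, 0x02, 0x00,
            0x1F, 0x03, 0x07, 0x52, 0x49, 0x43, 0x41, 0x52, 0x44, 0x4F,
            0x1F, 0x04, 0x03, 0x55, 0x52, 0x59
        };
        foreach (var field in ReadFields(sample))
            Console.WriteLine($"{field.Key}: {Encoding.ASCII.GetString(field.Value)}");
    }
}
```

Text fields can then be decoded one by one, for example with `Encoding.ASCII.GetString(fields[3])`, while binary fields are handled separately once their meaning is known.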