Regex to capture hyphenated words separated by a newline character

I have a pattern such as word-\nword, i.e. the words are hyphenated and separated by a newline character.

I would like the output to be word-word, but I get word-\nword with the code below.

    import re

    text_string = "word-\nword"
    result = re.findall(r"[A-Za-z]+-\n[A-Za-z]+", text_string)
    print(result)

I tried this, but it did not work; I get no result.

    text_string = "word-\nword"
    result = re.findall(r"[A-Za-z]+-(?=\n)[A-Za-z]+", text_string)
    print(result)

How can I achieve this? Thank you!

Edit:

Would it be more efficient to do a replace and then run a simple regex:

    text_string = "aaa bbb ccc-\nddd eee fff"
    replaced_text = text_string.replace('-\n', '-')
    result = re.findall(r"\w+-\w+", replaced_text)
    print(result)

or to use the method suggested by CertainPerformance?

    text_string = "word-\nword"
    result = re.sub(r"(?i)(\w+)-\n(\w+)", r'\1-\2', text_string)
    print(result)

You should use re.sub instead of re.findall:

    result = re.sub(r"(?<=-)\n+", "", text_string)

This matches one or more newlines that follow a - and replaces them with an empty string.
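
As a minimal sketch of the lookbehind approach on a longer string (the sample text here is made up for illustration and is not from the question):

```python
import re

# Hypothetical sample: one hyphen followed by a single newline,
# and one hyphen followed by two newlines.
text_string = "aaa bbb ccc-\nddd eee-\n\nfff"

# (?<=-) only asserts that a hyphen precedes the match; it is not consumed,
# so the hyphen stays and only the newline characters are removed.
result = re.sub(r"(?<=-)\n+", "", text_string)
print(repr(result))  # 'aaa bbb ccc-ddd eee-fff'
```

Because of the `+`, a run of several newlines after a hyphen is removed in one go.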

Alternatively, you can use

    (?<=-)\n(?=\w)

which matches a newline only if it is preceded by a - and followed by a word character.
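
A short sketch of how the stricter pattern behaves, assuming a made-up sample string with a dangling hyphen at the end. It also illustrates why the lookahead attempt in the question returns nothing: a lookahead checks for the newline but does not consume it.

```python
import re

# Hypothetical sample: the last hyphen is followed by a newline but no word.
text_string = "word-\nword and a trailing dash-\n"

# Both lookarounds are zero-width, so only the newline itself is replaced.
joined = re.sub(r"(?<=-)\n(?=\w)", "", text_string)
print(repr(joined))  # 'word-word and a trailing dash-\n'

# In the question's pattern, (?=\n) leaves the newline unconsumed, so the
# following [A-Za-z]+ would have to match "\n" and the whole match fails.
failed = re.findall(r"[A-Za-z]+-(?=\n)[A-Za-z]+", "word-\nword")
print(failed)  # []
```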

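If the string is composed of just that, then a pure regex solution is to use re.sub: capture the first word and the second word in separate groups, then echo those two groups back (without the dash and newline):

    result = re.sub(r"(?i)([a-z]+)-\n([a-z]+)", r'\1\2', text_string)

Note that this drops the hyphen as well. To keep it, as in the desired word-word, use r'\1-\2' as the replacement instead.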

Otherwise, if there is other stuff in the string, iterate over each match and join the groups:

    text_string = "wordone-\nwordtwo wordthree-\nwordfour"
    result = re.findall(r"(?i)([a-z]+)-\n([a-z]+)", text_string)
    for match in result:
        print(''.join(match))
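
For reference, a small sketch of what re.findall returns when the pattern has two groups, and how the pairs could be joined with the hyphen kept. Joining with "-" is an assumption based on the desired word-word output in the question; the names `pairs` and `hyphenated` are illustrative.

```python
import re

text_string = "wordone-\nwordtwo wordthree-\nwordfour"

# With two capture groups, findall returns a list of (group1, group2) tuples.
pairs = re.findall(r"(?i)([a-z]+)-\n([a-z]+)", text_string)
print(pairs)  # [('wordone', 'wordtwo'), ('wordthree', 'wordfour')]

# Join each pair with a hyphen to rebuild the hyphenated word.
hyphenated = ["-".join(pair) for pair in pairs]
print(hyphenated)  # ['wordone-wordtwo', 'wordthree-wordfour']
```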

You can simply replace any occurrences of '-\n' with '-' instead:

    result = text_string.replace('-\n', '-')
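
A quick sketch comparing the plain replace with the group-based re.sub on the sample string from the question's edit. The variable names `via_replace` and `via_sub` are illustrative. On this input the two give the same text, although in general str.replace also joins a '-\n' that is not surrounded by word characters.

```python
import re

text_string = "aaa bbb ccc-\nddd eee fff"

# Plain string replacement: no regex engine involved.
via_replace = text_string.replace("-\n", "-")

# Regex replacement that requires a word on both sides of the hyphen.
via_sub = re.sub(r"(\w+)-\n(\w+)", r"\1-\2", text_string)

print(via_replace == via_sub)  # True
print(re.findall(r"\w+-\w+", via_replace))  # ['ccc-ddd']
```

If speed matters for a large document, both variants can be timed with the standard timeit module on representative input.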

Comments
  • Works! Thank you very much. Could you explain the code very briefly, please? I don't get the (?i) part.
  • It uses the i flag (case insensitive) so you don't have to repeat [A-Za-z], which gets repetitive. Each word is captured in a group (the parentheses ()), and then \1\2 echoes both groups back to you, without the rest of the match.
  • It's an entire document. So I guess I'll use result=re.sub("(?i)(\w+)-\n(\w+)", r'\1-\2', text_string), otherwise it captures all the words before and after.