Extract URLs with regex into a new data frame column

I want to use a regex to extract all URLs from text in a data frame into a new column. I have some older code that I have used to extract keywords, so I'm looking to adapt it for a regex. I want to save the regex as a string variable and apply it here:

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that it's a regular expression, but R doesn't like how I am trying to save the regex as:

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

And would hopefully look like:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013                        

Hadleyverse solution (stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

You can use str_extract_all if there are multiple URLs in Content, but that will involve some extra processing on your end afterwards.
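
If you do need every URL per row, here's a minimal sketch of that extra processing (the all_urls name and the comma separator are my own choices, not part of the original answer):

# extract all matches per row (a list of character vectors), then collapse
all_urls <- str_extract_all(data$Content, url_pattern)
data$ContentURL <- sapply(all_urls, paste, collapse = ",")  # "" when no URL found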

Here's one approach using the qdapRegex library:

library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data

##                                            Content       date                     url
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

To see the regular expression used by the function (qdapRegex aims to help analyze and educate about regexes), you can use the grab function with the function name prefixed with @:

grab("@rm_url")

## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

grepl gives you a logical output: yes, the string contains a match, or no, it does not. grep gives you the matching indexes or, with value = TRUE, the values, but those values are the whole string, not the substring you want.
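
A quick illustration of the difference (a small made-up vector, not the OP's data):

x <- c("a house a home https://www.foo.com", "motel is a hotel")
grepl("http", x)               # TRUE FALSE -- does each string contain a match?
grep("http", x)                # 1 -- index of the matching string
grep("http", x, value = TRUE)  # "a house a home https://www.foo.com" -- the whole string, not just the URL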

So to pass this regex along to base R or the stringi package (qdapRegex wraps stringi for extraction) you could do:

regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))

library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))

I'm sure there's a stringr approach too, but I'm not familiar with the package.
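
For completeness, a possible stringr equivalent reusing the qdapRegex pattern (an untested sketch, so treat it as an assumption rather than a verified answer):

library(stringr)
str_extract(data[["Content"]], grab("@rm_url"))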

Split on space then find "http":

data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i){
                                   # keep the word(s) containing "http"
                                   x <- i[ grepl("http", i)]
                                   # return NA when a row has no URL
                                   if(length(x) == 0) x <- NA
                                   x
                                 }))


data
#                                            Content       date              ContentURL
# 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
# 3                                 motel is a hotel   1/4/2013                    <NA>
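
One caveat: if a row contained more than one URL, unlist() would return more elements than rows and the assignment would fail. A hedged sketch that collapses multiples instead (the comma separator is my own assumption):

data$ContentURL <- sapply(strsplit(as.character(data$Content), split = " "),
                          function(i) paste(i[grepl("http", i)], collapse = ","))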

Comments
  • For R, the entire regex must go in a character variable. Where did you get the idea that \\< and \\> would be parsed?
  • You are playing with fire if you use grep/regex on an HTML document.
  • Perhaps showing us the data and what you are trying to extract would also help.
  • All urls or a certain url?
  • Sorry for the confusion! I want to extract all urls.