BigQuery parse URL web address

I need help parsing a web URL in BigQuery. I need to remove the text after the last forward slash '/' and return the rest of the URL. The input URL length varies from record to record. If the input URL has no text after the domain, it should be returned as is.

Here are some examples.

Input Web URL

https://www.stackoverflow.com

https://www.stackoverflow.com/questions

https://www.stackoverflow.com/questions/ask

https://stackoverflow.com/questions/ask/some-text

Expected Output

https://www.stackoverflow.com

https://www.stackoverflow.com

https://www.stackoverflow.com/questions

https://www.stackoverflow.com/questions/ask

I have tried the SPLIT function, which converts the URL string into an ARRAY, and calculated the array size using ARRAY_LENGTH. However, it doesn't cover all the scenarios mentioned above.

Please advise how to tackle this using Standard SQL in BigQuery.

You can use a simple REGEXP_REPLACE to remove the last "/" and the string after it.

SELECT REGEXP_REPLACE(url, r"([^/])/[^/]*$", "\\1")
FROM (SELECT 'https://www.stackoverflow.com/questions/ask' as url UNION ALL
  SELECT 'https://www.stackoverflow.com/questions' as url UNION ALL
  SELECT 'https://www.stackoverflow.com' as url
)

Note: \\1 (the first capture group) represents the character just before the "/". We need to capture that character so that the pattern does not match the "//" after the scheme.

Test Result:

https://www.stackoverflow.com/questions

https://www.stackoverflow.com

https://www.stackoverflow.com
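
For a quick check outside BigQuery, the same pattern can be tried in plain JavaScript, whose regex syntax agrees with BigQuery's for this particular expression. This is only a sketch: the function name stripLastSegment is made up for illustration, and "$1" is JavaScript's spelling of the first capture group.

```javascript
// Hypothetical helper mirroring the REGEXP_REPLACE call above.
// The captured character before the last "/" is put back via "$1",
// so the "//" after the scheme can never be the match.
function stripLastSegment(url) {
  return url.replace(/([^\/])\/[^\/]*$/, '$1');
}

// The four inputs from the question.
const samples = [
  'https://www.stackoverflow.com',
  'https://www.stackoverflow.com/questions',
  'https://www.stackoverflow.com/questions/ask',
  'https://stackoverflow.com/questions/ask/some-text',
];

for (const url of samples) {
  console.log(url + ' -> ' + stripLastSegment(url));
}
```

The fourth input, which the query above does not include, comes back as https://stackoverflow.com/questions/ask.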

I think a case expression helps fill in the blank:

select (case when url like '%//%/%' then regexp_replace(url, '/[^/]+$', '')
             else url
        end)
from (select 'https://www.stackoverflow.com/questions/ask' as url union all
      select 'https://www.stackoverflow.com/questions' as url union all
      select 'https://www.stackoverflow.com' as url
      ) x;

Below is for BigQuery Standard SQL

#standardSQL
SELECT url, 
  REPLACE(REGEXP_REPLACE(REPLACE(url, '//', '\\'), r'/[^/]+$', ''), '\\', '//')
FROM `project.dataset.table`  

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'https://www.stackoverflow.com' url UNION ALL
  SELECT 'https://www.stackoverflow.com/questions' UNION ALL
  SELECT 'https://www.stackoverflow.com/questions/ask' UNION ALL
  SELECT 'https://stackoverflow.com/questions/ask/some-text' 
)
SELECT url, 
  REPLACE(REGEXP_REPLACE(REPLACE(url, '//', '\\'), r'/[^/]+$', ''), '\\', '//') value
FROM `project.dataset.table`  

with result

Row url                                                 value    
1   https://www.stackoverflow.com                       https://www.stackoverflow.com    
2   https://www.stackoverflow.com/questions             https://www.stackoverflow.com    
3   https://www.stackoverflow.com/questions/ask         https://www.stackoverflow.com/questions  
4   https://stackoverflow.com/questions/ask/some-text   https://stackoverflow.com/questions/ask  

Here is a JavaScript UDF solution. Not because it is better for this scenario, but because it is always your last hope when things get really complicated.

(Also, I want to point out that double slashes can appear inside a URL, for example https://www.stackoverflow.com//questions//ask; handling that may need extra logic coded in JavaScript.)

#standardSQL
CREATE TEMP FUNCTION
  remove_last_part_from_url(url STRING)
  RETURNS STRING
  LANGUAGE js AS """
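  // Position of the final slash and of the "//" that follows the scheme.
  var last_slash = url.lastIndexOf('/');
  var first_double_slash = url.indexOf('//');
  // Cut only when the final slash is not part of the scheme's "//".
  if (first_double_slash != -1 
      && last_slash != -1 
      && last_slash != first_double_slash + 1) {
    return url.substring(0, last_slash);
  }
  return url;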
  """ ;
SELECT remove_last_part_from_url(url)
FROM (SELECT 'https://www.stackoverflow.com/questions/ask' as url UNION ALL
  SELECT 'https://www.stackoverflow.com/questions' as url UNION ALL
  SELECT 'https://www.stackoverflow.com//questions' as url UNION ALL -- double slash after the host
  SELECT 'https:/invalid_url' as url UNION ALL
  SELECT 'https://www.stackoverflow.com' as url
)
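
The extra logic for double slashes inside the path could look something like the sketch below. It is not part of the UDF above: the function name removeLastPathSegment is hypothetical, and it assumes that a run of slashes in the path should count as a single separator and that any input without "://" should be returned unchanged. It is written as plain JavaScript so it can be tried locally before its body is pasted into a CREATE TEMP FUNCTION.

```javascript
// Hypothetical helper: strips the last path segment, treating runs of
// slashes in the path (e.g. "//questions//ask") as a single separator.
function removeLastPathSegment(url) {
  const schemeEnd = url.indexOf('://');
  if (schemeEnd === -1) {
    return url; // no scheme separator: leave the input untouched
  }
  const prefix = url.slice(0, schemeEnd + 3); // e.g. "https://"
  const rest = url.slice(schemeEnd + 3);      // host plus path
  const lastSlash = rest.lastIndexOf('/');
  if (lastSlash === -1) {
    return url; // host only, nothing to remove
  }
  // Drop the last segment, then any slashes left dangling at the end.
  const trimmed = rest.slice(0, lastSlash).replace(/\/+$/, '');
  return prefix + trimmed;
}

console.log(removeLastPathSegment('https://www.stackoverflow.com//questions//ask'));
console.log(removeLastPathSegment('https://www.stackoverflow.com//questions'));
console.log(removeLastPathSegment('https:/invalid_url'));
```

With these assumptions, https://www.stackoverflow.com//questions reduces to https://www.stackoverflow.com with no trailing slash left behind, and https:/invalid_url is returned as is.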

Comments
  • Bravo! I knew there was a better way, but missed that trick. Bravo!
  • Forgot to mention - instead of "\\1" - you can use r"\1"
  • @MikhailBerlyant - Thank you for your help!
  • @kshaikh - Sure, consider also voting up helpful answers :o)