What actions does job.commit perform in AWS Glue?

Every Glue job script is supposed to end with job.commit(), but what exactly does this function do?

  1. Is it just a job-end marker or not?
  2. Can it be called twice during one job (if yes, in what cases)?
  3. Is it safe to execute any Python statement after job.commit() is called?

P.S. I have not found any description of it in PyGlue.zip, which contains the AWS Python source code :(

As of today, the only case where the Job object is useful is when you use Job Bookmarks. When you read files from Amazon S3 (the only source supported for bookmarks so far) and call job.commit, the time and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you will only get back unread (new) files.

In this code sample, I read and process two different paths separately and commit after each path is processed. If for some reason the job is stopped partway, files from paths that were already committed will not be processed again.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on it, and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)
        # Commit file read to Job Bookmark
        job.commit()
    except Exception:
        # Something failed; skip this path and continue with the next one
        pass

Calling the commit method on a Job object only works if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a job.commit, and, as shown in the previous code sample, committing multiple times is also valid.
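Note that job.commit() has no effect unless Job Bookmarks are actually enabled for the job or for the run. As a minimal sketch, assuming a job named my-etl-job already exists (the name is a placeholder), bookmarks can be enabled for a single run from boto3 like this:

import boto3

glue = boto3.client('glue')

# Start a run with job bookmarks enabled for this run.
# 'my-etl-job' is a placeholder job name; the --job-bookmark-option
# argument is the part that matters.
response = glue.start_job_run(
    JobName='my-etl-job',
    Arguments={'--job-bookmark-option': 'job-bookmark-enable'})
print(response['JobRunId'])

The same option can also be set on the job itself via the "Job bookmark" property on the Glue console.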

Hope this helps

According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:

The method job.commit() can be called multiple times and it would not throw any error. However, if job.commit() is called multiple times in a Glue script, the job bookmark will be updated only once in a single job run, after the first time job.commit() is called; the other calls to job.commit() are ignored by the bookmark. Hence, the job bookmark may get stuck in a loop and would not be able to work well with multiple job.commit() calls. Thus, I would recommend you to use job.commit() once in the Glue script.
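In line with that recommendation, the safest pattern is to call job.init() once at the top of the script and job.commit() once at the very end. A minimal sketch, where do_something() and the S3 path are placeholders:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)

job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Every read that should be tracked by the bookmark needs a transformation_ctx.
dynamic_frame = glue_context.create_dynamic_frame_from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://bucket-name/my_partition=apples/']},
    format='json',
    transformation_ctx='read_apples')

do_something(dynamic_frame)  # placeholder for transforms and writes

# Single commit at the very end, as recommended by AWS support
job.commit()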

To expand on @yspotts' answer: it is possible to execute more than one job.commit() in an AWS Glue job script, although the bookmark will be updated only once, as they mentioned. However, it is also safe to call job.init() more than once, and in that case the bookmarks will be updated correctly with the S3 files processed since the previous commit.

In the init() function there is an "initialised" marker that gets set to true. In the commit() function this marker is checked: if it is true, commit() performs the steps to update the bookmark and resets the "initialised" marker; if it is false, commit() does nothing.
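A rough sketch of that flag logic, as I understand it (an illustration of the described behaviour only, not the actual aws-glue-libs source):

# Illustrative sketch only -- NOT the real awsglue.job.Job implementation.
class Job:
    def __init__(self, glue_context):
        self.glue_context = glue_context
        self.initialised = False

    def init(self, job_name, args):
        # ... set up the job run ...
        self.initialised = True           # arm the next commit

    def commit(self):
        if self.initialised:
            # ... persist the bookmark state for everything read since init() ...
            self.initialised = False      # further commits are no-ops until init() again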

So, the only thing to change from @hoaxz's answer would be to call job.init() in every iteration of the for loop:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Create my job (init is called inside the loop)
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on it, and commit
for path in paths:
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [path]},
        format='json',
        transformation_ctx="path={}".format(path))
    do_something(dynamic_frame)
    # Commit files read so far to the Job Bookmark
    job.commit()

Comments
  • I can confirm. I am reading from another db and table and, with job bookmark enabled, the job fails on subsequent runs. This is how I came to this Stack Overflow question. Does the bookmark only track which partitions have been read in a Hive-formatted path (for example /my_partition=apples/), or does it keep track of which folders it has read inside the partition folder as well?
  • @doorfly technically all files are inside the bucket at the same level (prefixes are used to index files, but the concept of folders doesn't exist within S3). With that being said, bookmarks will read any new files (doesn't matter which prefix they have) based on the timestamp of the file.
  • Yes, I know S3 doesn't have "folders"; it was for brevity. That said, I can't seem to get job bookmarking to work. There doesn't seem to be a way to get the bookmark position. There is a reset-job-bookmark call in the API, but not something like get-job-bookmark, which would help with debugging.
  • @doorfly, I'd love to dig deeper into your scenario. Can you show me a code sample of how you're reading your data from the S3 bucket?
  • here is the snippet: years = [2017, 2018]; months = range(1, 13); days = range(1, 32); glue0 = glueContext.create_dynamic_frame.from_options(connection_type='s3', connection_options={'paths': ['s3://dev-bucket/aws-glue/data/{}/{:02d}/{:02d}'.format(y, m, d) for y in years for m in months for d in days]}, format='json'). Not all paths are actually there, since I enumerated over all months and days in 2017 and 2018. The API seems to ignore those that are not found and the job finishes executing, just not with any sort of bookmarking behaviour that I could see. @hoaxz