A very simple multithreaded parallel URL fetcher (without a queue)

I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most of the scripts I found use queues, multiprocessing, or complex libraries.

Finally I wrote one myself, which I am posting as an answer. Please feel free to suggest any improvements.

I guess other people might have been looking for something similar.

The main example in the concurrent.futures documentation does everything you want, a lot more simply. Plus, it can handle huge numbers of URLs by only doing 5 at a time, and it handles errors much more nicely.

Of course this module is only built in with Python 3.2 or later… but if you're using 2.5-3.1, you can just install the backport, futures, off PyPI. All you need to change from the example code is to search-and-replace concurrent.futures with futures, and, for 2.x, urllib.request with urllib2.

Here's the sample backported to 2.x, modified to use your URL list and to add the times:

import concurrent.futures
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib2.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print '%r generated an exception: %s' % (url, exc)
        else:
            print '"%s" fetched in %ss' % (url,(time.time() - start))
print "Elapsed Time: %ss" % (time.time() - start)

But you can make this even simpler. Really, all you need is:

def load_url(url, timeout=60):
    conn = urllib2.urlopen(url, timeout=timeout)
    data = conn.read()
    print '"%s" fetched in %ss' % (url, (time.time() - start))
    return data

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    pages = executor.map(load_url, urls)

print "Elapsed Time: %ss" % (time.time() - start)

I am now publishing a different solution: the worker threads are non-daemon and are joined to the main thread (which blocks the main thread until all worker threads have finished), instead of each worker thread notifying the end of its execution with a callback to a global function (as I did in the previous answer), since some comments noted that that approach is not thread-safe.

import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        print "'%s\' fetched in %ss" % (self.url,(time.time() - start))

for url in urls:
    FetchUrl(url).start()

#Join all existing threads to main thread.
for thread in threading.enumerate():
    if thread is not threading.currentThread():
        thread.join()

print "Elapsed Time: %s" % (time.time() - start)

This script fetches the content from a set of URLs defined in an array. It spawns a thread for each URL to be fetched, so it is meant to be used for a limited set of URLs.

Instead of using a queue object, each thread notifies its end with a callback to a global function, which keeps count of the number of threads still running.

import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]
left_to_fetch = len(urls)

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.daemon = True
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        finished_fetch_url(self.url)


def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    left_to_fetch -= 1
    if left_to_fetch == 0:
        # all urls have been fetched
        print "Elapsed Time: %ss" % (time.time() - start)


# spawn a FetchUrl thread for each url to fetch
for url in urls:
    FetchUrl(url).start()
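
As noted in the join-based answer, the weak spot here is that several threads decrement the shared left_to_fetch counter at the same time, and -= on a global is not atomic. If you wanted to keep the callback design anyway, a hedged sketch of the same callback guarded by a threading.Lock (it relies on urls, start, left_to_fetch and FetchUrl as defined in the script above):

counter_lock = threading.Lock()  # protects left_to_fetch

def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    with counter_lock:
        left_to_fetch -= 1
        all_done = (left_to_fetch == 0)
    if all_done:
        # all urls have been fetched
        print "Elapsed Time: %ss" % (time.time() - start)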

Comments
  • Just to add: in Python's case, multithreading does not run threads on multiple cores in parallel, due to the GIL.
  • It still looks like fetching the URLs in parallel is faster than doing it serially. Why is that? Is it because (I assume) the Python interpreter is not running continuously during an HTTP request?
  • What about if I want to parse the content of those web pages I fetch? Is it better to do the parsing within each thread, or should I do it sequentially after joining the worker threads to the main thread?
  • I made sure to claim that this was simplified "as far as possible", because that's the best way to make sure someone clever comes along and finds a way to simplify it even further just to make me look silly. :)
  • I believe it's not easy to beat that! :-) It's a great improvement over the first version I published here.
  • Maybe we can combine the first 2 loops into one, by instantiating and starting the threads in the same for loop?
  • @DanieleB: Well, then you have to change the list comprehension into an explicit loop around append, like this. Or, alternatively, write a wrapper which creates, starts, and returns a thread, like this (a sketch of both options is shown after these comments). Either way, I think it's less simple (although the second one is a useful way to refactor complicated cases, it doesn't work when things are already simple).
  • @DanieleB: In a different language, however, you could do that. If thread.start() returned the thread, you could put the creation and start together into a single expression. In C++ or JavaScript, you'd probably do that. The problem is that, while method chaining and other "fluent programming" techniques make things more concise, they can also break down the expression/statement boundary and are often ambiguous. So Python goes in almost the exact opposite direction, and almost no methods or operators return the object they operate on. See en.wikipedia.org/wiki/Fluent_interface.
  • I have a question regarding the code: does the print in the fourth line from the bottom really show the time it took to fetch the url, or the time it takes to return the url from the 'results' object? In my understanding the timestamp should be printed in the fetch_url() function, not in the result-printing part.
  • @UweZiegenhagen imap_unordered() returns the result as soon as it is ready. I assume the overhead is negligible compared to the time it takes to make the http request.
  • Thank you, I am using it in a modified form to compile LaTeX files in parallel: uweziegenhagen.de/?p=3501
  • This is by far the best, fastest and simplest way to go. I have been trying twisted, scrapy and others using both python 2 and python 3, and this is simpler and better
  • Thanks! Is there a way to add a delay between the calls?
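
To illustrate the two alternatives mentioned a few comments above (an explicit loop around append, or a wrapper that creates, starts, and returns the thread), here is a hedged sketch against the FetchUrl class and urls list from the join-based answer; the helper name start_fetch is made up for the example:

# option 1: explicit loop around append, starting each thread as it is created
threads = []
for url in urls:
    thread = FetchUrl(url)
    thread.start()
    threads.append(thread)

# option 2: a small wrapper that creates, starts, and returns the thread,
# so a list comprehension can still be used
def start_fetch(url):
    thread = FetchUrl(url)
    thread.start()
    return thread

threads = [start_fetch(url) for url in urls]

# either way, join them afterwards
for thread in threads:
    thread.join()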