PySpark java.io.IOException: No FileSystem for scheme: https

I am running Spark locally on Windows and trying to load an XML file with the following Python code, and I am getting this error. Does anyone know how to resolve it?

This is the code:

df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")

and this is the error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-7-4832eb48a4aa> in <module>()
----> 1 df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")

C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
    157         self.options(**options)
    158         if isinstance(path, basestring):
--> 159             return self._df(self._jreader.load(path))
    160         elif path is not None:
    161             if type(path) != list:

C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

C:\SPARK_HOME\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o38.load.
: java.io.IOException: No FileSystem for scheme: https
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1160)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1148)
    at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
    at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
    at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:47)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
    at scala.Option.getOrElse(Option.scala:121)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:45)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:65)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:43)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Unknown Source)

Somehow PySpark is unable to load from an http or https URL directly. One of my colleagues found the answer, so here is the solution.

Before creating the Spark context and SQL context, we need to run these two lines of code (the environment variable must be set before the JVM is launched):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.4.1 pyspark-shell'

After creating the SparkContext and SQLContext with sc = pyspark.SparkContext.getOrCreate() and sqlContext = SQLContext(sc), add the http or https URL to the SparkContext using sc.addFile(url), and then load the downloaded copy:

Data_XMLFile = sqlContext.read.format("xml").options(rowTag="anytaghere").load(pyspark.SparkFiles.get("*_public.xml")).coalesce(10).cache()

This solution worked for me.
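
Putting the steps together, here is a minimal end-to-end sketch of that workflow. The URL and row tag come from the question above; the rest is assumed (Spark 2.2 in local mode, where the driver-side path returned by SparkFiles.get() is readable, with the spark-xml 0.4.1 package). SparkFiles.get() is called with the file's base name, because addFile() saves the download under the name that appears in the URL.

import os
import pyspark
from pyspark import SparkFiles
from pyspark.sql import SQLContext

# Must be set before the SparkContext is created, i.e. before the JVM starts
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.4.1 pyspark-shell'

sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# addFile() downloads the remote file; SparkFiles.get() resolves the local
# path it was saved under
url = "https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml"
sc.addFile(url)

df1 = (sqlContext.read.format("xml")
       .options(rowTag="IRS990EZ")
       .load(SparkFiles.get("201611339349202661_public.xml")))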

The error message says it all: you cannot use the dataframe reader and load to access files on the web (http or https). I suggest you first download the file locally.

See the pyspark.sql.DataFrameReader docs for more on the available sources (in general, local file system, HDFS, and databases via JDBC).

Unrelated to the error, notice that you seem to be using the format part of the command incorrectly: assuming that you use the XML Data Source for Apache Spark package, the correct usage is format('com.databricks.spark.xml') (see the example).
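
For example, here is a minimal sketch of the download-first approach, assuming Python 3 and that the spark-xml package is already on the classpath; the local path is a hypothetical location.

import urllib.request

# Download the remote file to the local file system first
url = "https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml"
local_path = "C:/tmp/201611339349202661_public.xml"  # hypothetical location
urllib.request.urlretrieve(url, local_path)

# Point the reader at the local copy, spelling out the full data source name
df1 = (sqlContext.read.format("com.databricks.spark.xml")
       .options(rowTag="IRS990EZ")
       .load(local_path))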

I made a similar but slightly different mistake: I forgot the "s3://" prefix in the file path. After adding the prefix to form "s3://path/to/object", the following code works:

my_data = spark.read.format("com.databricks.spark.csv")\
               .option("header", "true")\
               .option("inferSchema", "true")\
               .option("delimiter", ",")\
               .load("s3://path/to/object")

Comments
  • Answer not useful?
  • Nope, that was not it; I got it a different way. I will post the answer below.
  • You are correct - actually addFile() can be used with HTTP(S) & FTP too! Nice catch (+1), but my solution would also work if you downloaded the file first (and the XML package I linked includes instructions for using the --packages flag on the command line, which I guessed you already knew).
  • That's for CSV, but in the case of XML that format was giving an error.