How to get the input file name in the mapper in a Hadoop program?

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, and each mapper may read a different file, so I need to know which file a given mapper has read.


First you need to get the input split. With the newer mapreduce API this is done as follows:

context.getInputSplit();

But in order to get the file path and the file name you first need to cast the result to FileSplit.

So, to get the input file path you can do the following:

Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();

Similarly, to get the file name, you can just call getName(), like this:

String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
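
Putting that together, a minimal mapper sketch using the new API might look like this (the class name FileNameMapper and the output types are purely illustrative, not taken from the question):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The split is fixed for the whole task, so resolve the file name once in setup()
        // instead of repeating the cast for every record.
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the file name as the key so records can be grouped per input file.
        context.write(new Text(fileName), NullWritable.get());
    }
}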


Use this inside your mapper:

FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();

Edit:

Try this if you want to do it inside configure() through the old API:

String fileName;
public void configure(JobConf job)
{
   // "map.input.file" holds the path of the file this map task is reading
   fileName = job.get("map.input.file");
}
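
As a rough sketch of how this fits into a complete old-API mapper (the class name FileNameOldApiMapper and the output types are made up for illustration; see the note further down about this property returning null on newer Hadoop versions):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameOldApiMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    private String fileName;

    @Override
    public void configure(JobConf job) {
        // "map.input.file" is set by the framework to the file this map task reads
        fileName = job.get("map.input.file");
    }

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, NullWritable> out, Reporter reporter) throws IOException {
        out.collect(new Text(fileName), NullWritable.get());
    }
}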


If you are using Hadoop Streaming, you can use the JobConf variables in a streaming job's mapper/reducer.

As for the mapper's input file name, see the Configured Parameters section: the map.input.file variable (the filename that the map is reading from) is the one that gets the job done. But note that:

Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. To get the values in a streaming job's mapper/reducer use the parameter names with the underscores.


For example, if you are using Python, you can put these lines in your mapper file:

import os

# mapred parameter names have their dots replaced by underscores in the environment
file_name = os.getenv('map_input_file')
print(file_name)
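
A slightly fuller streaming mapper sketch, assuming you also want to handle newer releases where the variable is exposed as mapreduce_map_input_file instead of map_input_file:

#!/usr/bin/env python
# Streaming mapper: tag every input line with the name of the file it came from.
import os
import sys

# Older releases expose map_input_file; newer ones use mapreduce_map_input_file.
file_name = os.environ.get('mapreduce_map_input_file',
                           os.environ.get('map_input_file', 'unknown'))

for line in sys.stdin:
    # Emit "<file name>\t<line>" so the reducer can group records per file.
    sys.stdout.write('%s\t%s' % (file_name, line))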


If you're using the regular InputFormat, use this in your Mapper:

InputSplit is = context.getInputSplit();
// call getInputSplit() reflectively on the concrete split class to reach the FileSplit
Method method = is.getClass().getMethod("getInputSplit");
method.setAccessible(true);
FileSplit fileSplit = (FileSplit) method.invoke(is);
String currentFileName = fileSplit.getPath().getName();
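
If the reflective call feels fragile, a small helper along these lines (my own sketch, not from the original answer) first checks whether the split already is a FileSplit and only falls back to reflection for wrapping split classes such as the one produced by MultipleInputs:

import java.io.IOException;
import java.lang.reflect.Method;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public final class SplitUtils {

    // Returns the FileSplit behind the given split, unwrapping it via reflection
    // when the concrete split class hides the real FileSplit inside.
    public static FileSplit toFileSplit(InputSplit split) throws IOException {
        if (split instanceof FileSplit) {
            return (FileSplit) split;
        }
        try {
            Method getInputSplit = split.getClass().getDeclaredMethod("getInputSplit");
            getInputSplit.setAccessible(true);
            return (FileSplit) getInputSplit.invoke(split);
        } catch (ReflectiveOperationException e) {
            throw new IOException("Cannot extract a FileSplit from " + split.getClass(), e);
        }
    }
}

In the mapper you would then call SplitUtils.toFileSplit(context.getInputSplit()).getPath().getName().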

If you're using CombineFileInputFormat, it's a different approach, because it combines several small files into one relatively big split (depending on your configuration). Both the Mapper and the RecordReader run in the same JVM, so you can pass data between them at run time. You need to implement your own CombineFileRecordReaderWrapper and do as follows:

public class MyCombineFileRecordReaderWrapper<K, V> extends RecordReader<K, V> {
    ...
    // shared with the Mapper, which runs in the same JVM as this RecordReader
    private static String mCurrentFilePath;
    ...
    public void initialize(InputSplit combineSplit, TaskAttemptContext context) throws IOException, InterruptedException {
        assert this.fileSplitIsValid(context);
        // remember which underlying file this reader is currently reading
        mCurrentFilePath = mFileSplit.getPath().toString();
        this.mDelegate.initialize(this.mFileSplit, context);
    }
    ...
    public static String getCurrentFilePath() {
        return mCurrentFilePath;
    }
...

Then, in your Mapper, use this:

String currentFileName = MyCombineFileRecordReaderWrapper.getCurrentFilePath();
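
For completeness, the wrapper is typically wired in through a custom CombineFileInputFormat. This is only a sketch, under the assumption that MyCombineFileRecordReaderWrapper exposes the (CombineFileSplit, TaskAttemptContext, Integer) constructor that CombineFileRecordReader instantiates it with:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // One wrapper instance is created per underlying file in the combined split.
        return new CombineFileRecordReader<LongWritable, Text>((CombineFileSplit) split, context,
                (Class) MyCombineFileRecordReaderWrapper.class);
    }
}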

Hope I helped :-)


Note that on Hadoop 2.4 and greater, using the old API, this method produces a null value:

String fileName;
public void configure(JobConf job)
{
   fileName = job.get("map.input.file");
}

Alternatively, you can use the Reporter object passed to your map function to get the InputSplit and cast it to a FileSplit to retrieve the file name:

public void map(LongWritable offset, Text record,
        OutputCollector<NullWritable, Text> out, Reporter rptr)
        throws IOException {

    FileSplit fsplit = (FileSplit) rptr.getInputSplit();
    String inputFileName = fsplit.getPath().getName();
    ....
}
