Finding directories older than N days in HDFS


Can hadoop fs -ls be used to find all directories older than N days (from the current date)?

I am trying to write a cleanup routine to find and delete all directories on HDFS (matching a pattern) which were created N days prior to the current date.

This script lists all the directories that are older than [days]:

#!/bin/bash
usage="Usage: $0 [days]"

if [ ! "$1" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)
# -lsr is deprecated on newer releases; hdfs dfs -ls -R is the modern equivalent
hadoop fs -lsr | grep "^d" | while read -r f; do
  # column 6 of the listing is the modification date (YYYY-MM-DD)
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done
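
Since the goal is to find and delete, one way to act on that output is to keep only the path column and feed it to a recursive remove. A sketch, assuming the script above is saved as old_dirs.sh (the name is mine) and that GNU xargs is available:

# column 8 of each ls line is the path; xargs -r skips the remove
# entirely when nothing matched (paths containing spaces will break this)
./old_dirs.sh 30 | awk '{print $8}' | xargs -r hadoop fs -rm -r -skipTrash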


If you happen to be using the CDH distribution of Hadoop, it comes with a very useful HdfsFindTool command, which behaves like Linux's find command.

If you're using the default parcel layout, here's how you'd do it:

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
org.apache.solr.hadoop.HdfsFindTool -find PATH -mtime +N

Where you'd replace PATH with the search path and N with the number of days.
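
For example, to find and then remove everything under /tmp/logs older than 30 days, you could pipe the matches into a remove. A sketch (the path is an example, and I'm assuming the tool prints matching paths to stdout the way GNU find does):

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool -find /tmp/logs -mtime +30 \
  | xargs -r hadoop fs -rm -r -skipTrash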


For real clusters it is not a good idea to use ls. If you have admin rights, it is more suitable to work from the fsimage.

I modified the script above to illustrate the idea.

First, fetch the fsimage:

curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump

Then convert it to text (the output has the same format as lsr gives):

hdfs oiv -i img.dump -o fsimage.txt
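
Note that on Hadoop 2.x both pieces moved: the image servlet is /imagetransfer rather than /getimage, and oiv's default processor is a web viewer, so you have to ask for a flat listing explicitly. A sketch (the endpoint, port, and processor name are assumptions to verify against your release, and the Delimited columns differ from lsr output, so the awk fields in the script below would need adjusting):

# Hadoop 2.x variant: fetch via /imagetransfer and emit one line per path
curl "http://localhost:50070/imagetransfer?getimage=1&txid=latest" > img.dump
hdfs oiv -p Delimited -i img.dump -o fsimage.txt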

Script:

#!/bin/bash
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]
then
  echo "$usage"
  exit 1
fi

now=$(date +%s)
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
hdfs oiv -i img.dump -o fsimage.txt
# directories start with "d" in the permissions column, as in ls output
grep "^d" fsimage.txt | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    echo "$f"
  fi
done


A simpler one-liner filters directly on the date column of the listing:

hdfs dfs -ls /hadoop/path/*.txt | awk '$6 < "2017-10-24"'
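
If you'd rather not hardcode the date, the cutoff can be computed. A sketch using GNU date (the path is an example):

# compute the YYYY-MM-DD cutoff for "10 days ago" instead of hardcoding it
hdfs dfs -ls /hadoop/path/*.txt | awk -v cutoff="$(date -d '10 days ago' +%F)" '$6 < cutoff'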


I didn't have the HdfsFindTool, nor the fsimage from curl, and I didn't much like piping ls through grep into a while loop that calls date, awk, hadoop, and then awk again. But I appreciated the answers.

I felt it could be done with just one ls, one awk, and maybe an xargs.

I also added options to list the files or total up their size before choosing to delete them, as well as to target a specific directory. Lastly, I leave directories alone and only concern myself with files.

#!/bin/bash
USAGE="Usage: $0 [N days] (list|size|delete) [path, default /tmp/hive]"
if [ ! "$1" ]; then
  echo "$USAGE"
  exit 1
fi
# cutoff timestamp, formatted to match columns 6 and 7 of hdfs dfs -ls
AGO="$(date --date "$1 days ago" "+%F %R")"

echo "# Will search for files older than $AGO"
if [ ! "$2" ]; then
  echo "$USAGE"
  exit 1
fi
INPATH="${3:-/tmp/hive}"

echo "# Will search under $INPATH"
case $2 in
  list)
    # $1 is the permissions column; skipping entries that start with "d"
    # leaves only files, and ($6 " " $7) rebuilds the "date time" string
    hdfs dfs -ls -R "$INPATH" |\
      awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'"'
  ;;
  size)
    hdfs dfs -ls -R "$INPATH" |\
      awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'" {
           sum += $5 ; cnt += 1} END {
           print cnt, "Files with total", sum, "Bytes"}'
  ;;
  delete)
    # xargs -r (GNU) avoids running -rm with no arguments when nothing
    # matches; note that paths containing spaces will still break this
    hdfs dfs -ls -R "$INPATH" |\
      awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'" {print $8}' | \
      xargs -r hdfs dfs -rm -skipTrash
  ;;
  *)
    echo "$USAGE"
    exit 1
  ;;
esac
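
For example, assuming the script is saved as hdfs_file_cleanup.sh (the name is mine):

./hdfs_file_cleanup.sh 30 size              # count and total bytes of /tmp/hive files older than 30 days
./hdfs_file_cleanup.sh 30 list /tmp/hive    # print the matching ls lines
./hdfs_file_cleanup.sh 30 delete /tmp/hive  # remove them, skipping the trash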

I hope others find this useful.


I am writing a ksh script to clean up hdfs directories and files at least 10 days old. I was testing the deletion command in a terminal, but it kept saying it is wrong:

$ hdfs dfs -find "/file/path/file" -depth -type d -mtime +10 -exec rm -rf {} \;
find: Unexpected argument: -depth

This fails because the stock hdfs dfs -find in Apache Hadoop implements only a small subset of GNU find (essentially -name/-iname and -print), so -depth, -type, -mtime, and -exec are all rejected; the ls-plus-awk filtering shown above, or CDH's HdfsFindTool, is the usual workaround.

Comments
  • One of the earlier solutions was partially helpful. I could write a shell script to find and delete all the directories matching a pattern but what I really needed to do was delete just the ones that were older than N days. (stackoverflow.com/questions/7733096/…)
  • Could you explain please why it's better to use fsimage?
  • If you have millions of files, fs -ls probably wouldn't work. So you might either write your own Java code to iterate over the filesystem, or dump the fsimage once and run many subsequent operations against it with simple Unix tools.
  • Please give a proper explanation of your answer.
  • hdfs dfs -ls /hadoop/path/*.txt - this part lists all the .txt files; awk '$6 < "2017-10-24"' - this part compares each file's modification date (column 6) against the condition.