Count lines in large files

I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.

The way I do it now is just cat fname | wc -l, and it takes a very long time. Is there a solution that would be much faster?

I work on a high-performance cluster with Hadoop installed. I was wondering whether a map-reduce approach could help.

I'd like the solution to be as simple as a one-line command, like the wc -l solution, but I'm not sure how feasible that is.

Any ideas?

Try: sed -n '$=' filename

Also, the cat is unnecessary: wc -l filename is enough in your current approach.

Your limiting speed factor is the I/O speed of your storage device, so changing between simple newline/pattern-counting programs won't help, because the difference in execution speed between those programs is likely to be dwarfed by your much slower disk/storage.

But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know the specifics of Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum up their results:

$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &

Notice the & at the end of each command line, so all of them run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script that does this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done above for lack of shell-scripting ability; a sketch of such a script follows.
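
A minimal sketch of such a script, assuming a Linux box with GNU stat; the default of 4 jobs and the 4 KB block size are placeholders for this sketch. It reads all chunks from the same path, so it only helps if the underlying storage can actually serve parallel reads:

#!/bin/sh
# Hypothetical helper: count the lines of $1 using $2 parallel dd | wc -l readers.
FILE=$1
JOBS=${2:-4}
BS=4096                                   # dd block size in bytes
SIZE=$(stat -c %s "$FILE")                # file size in bytes (GNU stat)
BLOCKS=$(( (SIZE + BS - 1) / BS ))        # total number of blocks in the file
CHUNK=$(( (BLOCKS + JOBS - 1) / JOBS ))   # blocks handed to each job, rounded up
for i in $(seq 0 $((JOBS - 1))); do
    # each job counts newlines in its own byte range; every newline falls into
    # exactly one chunk, so the partial counts add up to the exact total
    dd bs=$BS skip=$((i * CHUNK)) count=$CHUNK if="$FILE" 2>/dev/null | wc -l &
done | paste -sd+ - | bc

Saved as, say, count_lines_parallel.sh, it would be run as sh count_lines_parallel.sh /path/to/file 4 and print the total.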

If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split while running many parallel jobs, all using the same file path, and you may still see some speed gain.

EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of a line, both in bytes. You can do this almost instantaneously in a single job. If you only know the mean line size and don't care about the exact line count but want an estimate, you can do the same operation and get a satisfactory result much faster than the exact approach.
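
For example, with fixed-width records of, say, 120 bytes per line including the trailing newline (a made-up width; substitute your own), the count is a single integer division, here using GNU stat to get the file size:

$ echo $(( $(stat -c %s my_file.txt) / 120 ))

With a mean line size instead of an exact one, the same division yields an estimate rather than an exact count.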

On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.

find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc

To save space, you can even keep all files compressed. The following line uncompresses each file and counts its lines in parallel, then sums all counts.

find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc
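
If you have one huge file rather than many small ones, recent versions of GNU parallel can also split a single file into blocks for you with --pipepart; a sketch, with the block size and file name as placeholders:

$ parallel --pipepart -a my_file.txt --block 1G wc -l | paste -sd+ - | bc

Each block is counted by a separate wc -l process, and the partial counts are summed as above.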

As per my test, I can verify that spark-shell (based on Scala) is way faster than the other tools (grep, sed, awk, perl, wc). Here is the result of the test that I ran on a file with 23,782,409 lines:

time grep -c $ my_file.txt;

real 0m44.96s user 0m41.59s sys 0m3.09s

time wc -l my_file.txt;

real 0m37.57s user 0m33.48s sys 0m3.97s

time sed -n '$=' my_file.txt;

real 0m38.22s user 0m28.05s sys 0m10.14s

time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;

real 0m23.38s user 0m20.19s sys 0m3.11s

time awk 'END { print NR }' my_file.txt;

real 0m19.90s user 0m16.76s sys 0m3.12s

spark-shell
import org.joda.time._                          // joda-time, used here just to time the run
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()       // counts the lines, reading the partitions in parallel
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()  // elapsed time in whole seconds

res1: org.joda.time.Seconds = PT15S

If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag, and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming job as follows:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"

Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file, and the final count of rows is the sum of the numbers returned by all reducers. You can get that final count as follows:

$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc

Comments
  • Do each of the nodes already have a copy of the file?
  • Thanks, yes, but to access many nodes I use an LSF system, which sometimes exhibits quite an annoying waiting time; that's why the ideal solution would be to use hadoop/mapreduce on one node, though it would be possible to use other nodes (but then the added waiting time may make it slower than just the cat | wc approach).
  • wc -l fname may be faster. You can also try vim -R fname if that is faster (it should tell you the number of lines after startup).
  • You can do it with a Pig script; see my reply here: stackoverflow.com/questions/9900761/…
  • Somewhat faster: remember the "useless use of cat" rule.
  • Mmm, interesting. Would a map/reduce approach help? I assume that if I save all the files in HDFS and then try to count the lines using map/reduce, it would be much faster, no?
  • @lvella. It depends on how they are implemented. In my experience, I have seen that sed is faster. Perhaps a little benchmarking can help understand it better.
  • @KingsIndian. Indeed, I just tried sed and it was 3-fold faster than wc on a 3 GB file. Thanks, KingsIndian.
  • @Dnaiel If I had to guess, I'd say you ran wc -l filename first and then ran sed -n '$=' filename, so that in the first run wc had to read the whole file from disk, after which it could be cached entirely in your memory (probably bigger than 3 GB), letting sed run much more quickly right after. I did the tests myself with a 4 GB file on a machine with 6 GB of RAM, but I made sure the file was already in the cache; the score: sed - 0m12.539s, wc -l - 0m1.911s, so wc was 6.56 times faster. Redoing the experiment but clearing the cache before each run, they both took about 58 seconds to complete.
  • This solution using sed has the added advantage of not requiring a trailing end-of-line character. wc counts end-of-line characters ("\n"), so if you have, say, one line in the file without a \n, then wc will return 0, while sed will correctly return 1.
  • Good idea. I'm using this. See my answer about using dd instead of wc to read the file if the disk is the bottleneck.