How to identify people's relationships based on name and address and assign them the same ID, via Linux commands or PySpark

I have a CSV file:

0,David,H M,Lee,M,1990,201211.0,B

I want to assign persons with the same last name living at the same address the same ID or index. Ideally the ID is made up of only numbers. If persons at the same address have different last names, their IDs should be different. Such IDs should be unique; namely, for people who differ in either address or last name, the IDs must be different. My expected output is

0,David,H M,Lee,M,1990,201211.0,B,13

My data file is around 30 GB. I am thinking of using Spark's groupBy function on a key consisting of LNAME and address to group those observations together, and then assigning an ID per key. After that, maybe I can use flatMap to split the lines and return the observations with their IDs, but I am not sure about that. In addition, can I also do this in a Linux environment? Thank you.
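Before reaching for Spark, the intended mapping can be sketched in plain Python (a hypothetical one-pass sketch; it assumes the last name sits at index 3 and the address at index 7, as in the sample row):

```python
import csv

def assign_ids(rows, lname_idx=3, addr_idx=7):
    """Yield each row with a numeric ID appended; rows sharing
    (last name, address) get the same ID, in first-seen order."""
    ids = {}                         # (lname, address) -> numeric ID
    for row in rows:
        key = (row[lname_idx], row[addr_idx])
        if key not in ids:
            ids[key] = len(ids) + 1  # IDs run 1, 2, 3, ...
        yield row + [str(ids[key])]

# Streams the file row by row; only the table of unique
# (lname, address) pairs is held in memory.
# with open("data.csv", newline="") as f:
#     for row in assign_ids(csv.reader(f)):
#         print(",".join(row))
```

Whether the key table fits in RAM depends on how many unique (last name, address) pairs the 30 GB file contains, not on the file size itself.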

Since you have 30GB of input data, you probably don't want something that'll attempt to hold it all in in-memory data structures. Let's use disk space instead.

Here's one approach that loads all your data into a sqlite database, and generates an id for each unique last name and address pair, and then joins everything back up together:


# Use an on-disk database instead of in-memory because source data is 30gb.
# This will take a while to run.
csv=$1
db=$(mktemp -p .)

sqlite3 -batch -csv -header "${db}" <<EOF
.import "${csv}" people
CREATE TABLE ids(id INTEGER PRIMARY KEY, lname, address, UNIQUE(lname, address));
INSERT OR IGNORE INTO ids(lname, address) SELECT lname, address FROM people;
SELECT p.*, i.id
FROM people AS p
JOIN ids AS i ON (p.lname, p.address) = (i.lname, i.address)
ORDER BY p.rowid;
EOF

rm -f "${db}"

$ ./assign-ids.sh data.csv
0,David,"H M",Lee,M,1990,201211.0,B,3

It's better that ID is made up of only numbers.

If that restriction can be relaxed, you can do it in a single pass by using a cryptographic hash of the last name and address as the ID:

$ perl -MDigest::SHA=sha1_hex -F, -lane '
   BEGIN { $" = $, = "," } 
   if ($. == 1) { print @F, "ID" }
   else { print @F, sha1_hex("@F[3,7]") }' data.csv
0,David,H M,Lee,M,1990,201211.0,B,e86e81ab2715a8202e41b92ad979ca3a67743421
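For comparison, the same stateless idea in Python with hashlib (a sketch; it joins fields 3 and 7 with a comma, exactly as the Perl one-liner does):

```python
import hashlib

def hash_id(row, lname_idx=3, addr_idx=7):
    """Derive a deterministic ID from last name + address; no lookup
    table is needed, so this is a true single pass over any input size."""
    key = ",".join((row[lname_idx], row[addr_idx]))  # e.g. "Lee,B"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

row = "0,David,H M,Lee,M,1990,201211.0,B".split(",")
print(",".join(row + [hash_id(row)]))
```

Equal (last name, address) pairs always map to the same digest, and different pairs collide only with cryptographic improbability.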

Or if the number of unique lname, address pairs is small enough that they can reasonably be stored in a hash table on your system:

#!/usr/bin/gawk -f
BEGIN {
    FS = OFS = ","
}
NR == 1 {
    print $0, "ID"
    next
}
!(($4, $8) in ids) {
    ids[$4, $8] = ++counter
}
{
    print $0, ids[$4, $8]
}

Or sort the file first and bump the ID whenever the (address, last name) key changes:

$ sort -t, -k8,8 -k4,4 data.csv |
  awk -F, -v OFS=, '$8 "," $4 != last { ++id; last = $8 "," $4 }
                    { $9 = id; print }'
0,David,H M,Lee,M,1990,201211.0,B,1

  • This is beginner awk homework. Give it a try; if you can't figure out something specific, show how far you got and ask about that. At least show a plan for what to do and ask about some step in that plan.
  • For pyspark you need a Window and a rank function.
  • @jthill Thanks for your comment. I have edited my question. I am a beginner with awk, so I am not clear on how to use it to resolve my issue.
  • this is actually a very good question if you removed the linux and awk tags. awk is simply not the right tool for parsing CSV files when the data is 30GB and contains fields like address.
  • @jxc Thank you for your comment. The reason I want to try the Linux environment is that my setup could not use Spark's parallelization, and it may be faster to run it with awk under Linux.
  • Thanks a million for such an informative explanation. It worked well on my computer. But a few things are unclear. First, what does adding `**` in `**row.asDict()` mean? I know that `row.asDict()` turns the row into dictionary format, but I am unclear about the `*`. Second is related to the "clean-out/normalize the column..." note: there is code `coalesce('LNAME', lit('')), coalesce('Address', lit(''))`. Is it supposed to be `coalesce(col('LNAME'), lit('')), coalesce(col('Address'), lit(''))`? The final one is how to add another column representing the frequency of those observations.
  • Continued. My data is large, so I guess it will be better to do this within partitions. I tried to follow your logic and used `df1 = df.repartition(N, 'LNAME', 'Address').withColumn('fre', count('LNAME').over(w2))`, but it doesn't yield what I want: its outcome is the accumulated count value within each partition. Would you mind giving me some advice?
  • @Samson, the `**` is used to unpack a dictionary and merge it with other dicts (Python 3 only). Here it merges the existing keys in the Row objects with two new ones: partition_id and idx. In newer versions of Spark, most API functions can take a column name as an argument, which is the same as using `col('col_name')`. If using the plain column name does not work with your Spark version, just adjust it to `col('col_name')`.
  • to calculate the count, you can try use w3 = Window.partitionBy('LNAME', 'Address') to replace w2 in your code.
  • Thank you always. It works. I was wondering whether partitioning by LNAME and Address is very costly on big data. Or is it more efficient than first using reduceByKey to get the frequencies based on LNAME and Address, and then a join to add the frequency back into the original data file?