How to fix duplicated documents in Elasticsearch when indexing with Logstash?


I'm using the Elastic Stack to handle my log files, but it is generating duplicated documents in Elasticsearch.

I've done some research and already tried adding "document_id", but it did not solve the problem.

This is my Logstash configuration:

input {
  beats {
    port => 5044
  }
}

filter {

  fingerprint {
    source => "message"
    target => "[fingerprint]"
    method => "SHA1"
    key => "key"
    base64encode => true
  } 

  if [doctype] == "audit-log" {
    grok {
      match => { "message" => "^\(%{GREEDYDATA:loguser}@%{IPV4:logip}\) \[%{DATESTAMP:logtimestamp}\] %{JAVALOGMESSAGE:logmessage}$" }
    }
    mutate {
      remove_field => ["host"]
    }
    date {
      match => [ "logtimestamp" , "dd/MM/yyyy HH:mm:ss" ]
      target => "@timestamp"
      locale => "EU"
      timezone => "America/Sao_Paulo"
    } 
  }  

}

output {
  elasticsearch {
    hosts => "192.168.0.200:9200"
    document_id => "%{[fingerprint]}"
  }
}

Here are the duplicated documents:

{
  "_index": "logstash-2019.05.02-000001",
  "_type": "_doc",
  "_id": "EbncP00tf9yMxXoEBU4BgAAX/gc=",
  "_version": 1,
  "_score": null,
  "_source": {
    "@version": "1",
    "fingerprint": "EbncP00tf9yMxXoEBU4BgAAX/gc=",
    "message": "(thiago.alves@192.168.0.200) [06/05/2019 18:50:08] Logout do usuário 'thiago.alves'. (cookie=9d6e545860c24a9b8e3004e5b2dba4a6). IP=192.168.0.200",
    ...
}

######### DUPLICATED #########

{
  "_index": "logstash-2019.05.02-000001",
  "_type": "_doc",
  "_id": "V7ogj2oB8pjEaraQT_cg",
  "_version": 1,
  "_score": null,
  "_source": {
    "@version": "1",
    "fingerprint": "EbncP00tf9yMxXoEBU4BgAAX/gc=",
    "message": "(thiago.alves@192.168.0.200) [06/05/2019 18:50:08] Logout do usuário 'thiago.alves'. (cookie=9d6e545860c24a9b8e3004e5b2dba4a6). IP=192.168.0.200",
    ...
}

That's it. I still don't know why it is duplicating. Does anyone have any idea?

Thank you in advance...

I had this problem once, and after many attempts to solve it I realized that I had made a backup of my conf file in the 'pipeline' folder, and Logstash was using that backup file to process its input rules. Be careful, because Logstash will use other files in the pipeline folder even if their extension is different from '.conf'.

So, please check whether you have other files in your 'pipeline' folder.
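If you want to be sure that only '.conf' files are loaded, you can narrow the glob in pipelines.yml. This is only a sketch; the paths below are typical for a package install and may differ on your system:

# pipelines.yml - restrict which files the main pipeline reads
- pipeline.id: main
  # Only *.conf is loaded; a stray backup such as logstash.conf.bak is ignored
  path.config: "/etc/logstash/conf.d/*.conf"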

Please let me know if this was useful to you.

Little Logstash Lessons: Handling Duplicates covers approaches for de-duplicating data in Elasticsearch using Logstash. Elasticsearch provides a REST API for indexing your documents: if a document already exists with the same ID, Elasticsearch replaces the existing one. Logstash's Elasticsearch output uses this indexing API and by default does not supply an ID, so it treats every single event as a separate document. However, there is an option that lets you easily set a unique ID for every event in Logstash.
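For illustration, this is what that replace behaviour looks like against the REST API (hypothetical index name and document ID); the second request does not create a second document, it just increments _version:

PUT my-logs/_doc/abc123
{ "message": "first write" }

# Re-sending the same _id replaces the document; the response shows "_version": 2
PUT my-logs/_doc/abc123
{ "message": "second write" }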

Generate a UUID key for each document, then your issue will be solved.
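A minimal sketch of that approach with the logstash-filter-uuid plugin (the [@metadata][uuid] field name is just an example). Note that a random UUID only makes the ID explicit; unlike a content fingerprint, it will not collapse re-processed copies of the same line:

filter {
  uuid {
    # Store the generated UUID in @metadata so it is not indexed into _source
    target => "[@metadata][uuid]"
    overwrite => true
  }
}

output {
  elasticsearch {
    hosts => "192.168.0.200:9200"
    document_id => "%{[@metadata][uuid]}"
  }
}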


Your configuration seems fine and shouldn't allow duplicates. Maybe the duplicated document was added before you added document_id => "%{[fingerprint]}" to your Logstash configuration, so Elasticsearch generated its own unique _id for it, which won't be overridden by the fingerprint. Remove the duplicate manually (the one whose _id differs from the fingerprint) and try again; it should work.
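One way to do that cleanup from the REST API (a sketch: the _id and index name are taken from the question above, and querying fingerprint.keyword assumes the default dynamic mapping, which adds a keyword subfield to string fields):

# Delete the copy that kept an auto-generated _id
DELETE logstash-2019.05.02-000001/_doc/V7ogj2oB8pjEaraQT_cg

# Or sweep every document that shares the fingerprint but is not the canonical copy
POST logstash-2019.05.02-000001/_delete_by_query
{
  "query": {
    "bool": {
      "filter": { "term": { "fingerprint.keyword": "EbncP00tf9yMxXoEBU4BgAAX/gc=" } },
      "must_not": { "ids": { "values": [ "EbncP00tf9yMxXoEBU4BgAAX/gc=" ] } }
    }
  }
}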

How to Find and Remove Duplicate Documents in Elasticsearch: Logstash can also be used for detecting and removing duplicate documents from an existing Elasticsearch index; this technique is described in the Elastic blog.

Eliminating Duplicate Documents in Elasticsearch: one easy fix for duplicated lines is to give each line a unique ID in Logstash, e.g. based on the timestamp and a hash of the line. The drawback is that duplicate lines with the same timestamp will appear only once in the index (it effectively acts like deduplication).
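A sketch of that timestamp-plus-hash idea using the fingerprint filter (concatenate_sources joins all listed sources before hashing; the field and key names are only examples):

filter {
  fingerprint {
    # Hash the timestamp together with the raw line, so identical lines
    # logged at different times still get distinct IDs
    source => ["@timestamp", "message"]
    concatenate_sources => true
    method => "SHA1"
    key => "key"
    target => "[fingerprint]"
    base64encode => true
  }
}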

Preventing Duplicate Data for Elasticsearch: rather than adding a lot of processing to filter duplicate data before it is sent to Logstash, index the data in Elasticsearch with a specified ID. Any duplicate record ends up with the same hash ID, so it won't be indexed as a new document.

As the blog post on handling duplicates shows, it is possible to prevent duplicates in Elasticsearch by specifying a document identifier externally prior to indexing the data. The type and structure of the identifier can have a significant impact on indexing performance.
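Putting this together as one minimal pipeline (a sketch based on the question's configuration; the [@metadata] field keeps the fingerprint out of _source while still driving the document ID):

filter {
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "key"
    base64encode => true
  }
}

output {
  elasticsearch {
    hosts => "192.168.0.200:9200"
    # Same message => same fingerprint => same _id, so a re-read line overwrites
    # the existing document instead of creating a duplicate
    document_id => "%{[@metadata][fingerprint]}"
  }
}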

Comments
  • I'm surprised to see the square brackets around fingerprint. Have you tried setting the document_id to "%{fingerprint}"?
  • I second @JoeZack, you need to use the fingerprint as the document ID and the problem will be solved
  • Firstly, thank you for the help. I've tried using the fingerprint without the square brackets, but unfortunately the problem continues... Thank you.
  • OMG!!! Very good! I really did have another file in the pipeline directory, but I didn't know that Logstash was considering it. I have removed the other file and the problem is solved. Thank you very much. Congratulations!!!
  • We've faced an issue similar to OP's, and setting up a UUID in fact solved our issue. The logstash-filter-uuid plugin works fine, but basically any kind of ID-generating method works as long as you use the document_id parameter in the Elasticsearch output, as described here: elastic.co/guide/en/logstash/6.3/…
  • Firstly, thank you for the help. I've tried using the UUID, but unfortunately the problem continues... Here is the link that I followed: logstash-lessons-handling-duplicates Thank you.
  • Firstly, thank you for the help. I am indexing from a log file and every document that is indexed gets duplicated; I don't know how to try again with the same document. Thank you.
  • Delete the whole index then and see if it helps.