Elasticsearch query for user data overrides

I need to design a query that supports user-specific document edits. The document below shows one way to store this data. It includes a root-level Description property, which should be searched for all users except Eric and Alex. For Eric and Alex, the Description property has been customized, so a search executed by either of them should search their custom Description value within the nested UserData array and should not search the root document Description field.

For my use case, users may customize 0 or more root document properties. For any root document property which a user has customized, only the custom value for that property should be searched for that user.

The brute force method to solve this would be to index a separate copy of each customized document. I'm trying to avoid that, fearing that creating multiple copies of each document which a user has customized will unfairly weight the index, by duplicating document content which is not legitimately duplicated.

{ 
  "Name": "doc1",
  "Description": "Base description1",
  "Spec": "Base document spec",
  "UserData":[
  {
    "EnteredBy": "Eric",
    "Description": "Desc entered by Eric, abc"
  },
  {
    "EnteredBy": "Alex",
    "Description": "Desc entered by Alex, def",
    "Spec": "Spec entered by Alex"
  }]
}
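To make the intended override rule concrete, here is a minimal Python sketch of which values should be searchable for a given user. It only illustrates the requirement on the sample document above; it is not an Elasticsearch feature, and the function name `effective_fields` is made up for this example.

```python
from typing import Any, Dict


def effective_fields(doc: Dict[str, Any], user: str) -> Dict[str, Any]:
    """Return the field values that should be searched for `user`."""
    # Start from the root properties, leaving out the override array itself.
    fields = {k: v for k, v in doc.items() if k != "UserData"}
    for entry in doc.get("UserData", []):
        if entry.get("EnteredBy") == user:
            # Every property the user customized replaces the root value.
            fields.update({k: v for k, v in entry.items() if k != "EnteredBy"})
    return fields


doc = {
    "Name": "doc1",
    "Description": "Base description1",
    "Spec": "Base document spec",
    "UserData": [
        {"EnteredBy": "Eric", "Description": "Desc entered by Eric, abc"},
        {
            "EnteredBy": "Alex",
            "Description": "Desc entered by Alex, def",
            "Spec": "Spec entered by Alex",
        },
    ],
}

for user in ("Eric", "Alex", "SomeoneElse"):
    print(user, effective_fields(doc, user))
```

Eric sees his own Description but the base Spec, Alex sees both of his custom values, and any other user sees only the root values.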

Edit 1

Below are listed the options I've considered.

Option 1: I could create a separate index for each user. In that index, I would add all of the base documents that the user has not customized, plus the user's own version of each document they have customized. This would result in 1000+ indexes.

Option 2: I could use the script_score feature and manually compute the score for each document, using the override logic, described above. From what I've seen, the scoring logic would need to be primitive and may end up negating the power of Elasticsearch.

Edit 2

The solution will need to support a maximum of 40 fields and cases where any one field has been customized by up to 200 users. The index will contain 750,000 documents.

What about using a slightly different document structure, where each property holds the base value plus one entry per user who customized it? For example:

POST /st_t2/_doc
{
  "Name": "doc1",
  "Description": [
    {"base": "wtf"},
    {"Alex": "Desc entered by Alex, aaa"}
  ],
  "Spec": [
    {"base": "Base document spec"}
  ]
}

And then you can create boolean queries like this:

GET st_t2/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "exists": {
                  "field": "Description.Eric"
                }
              },
              {
                "match": {
                  "Description.Eric": "wtf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "exists": {
                  "field": "Description.Eric"
                }
              }
            ],
            "must": [
              {
                "match": {
                  "Description.base": "wtf"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
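Since the user name and field name appear several times in that query, it may be easier to generate the body than to write it by hand. The following is a minimal Python sketch that builds the same bool structure as a plain dict; the helper name `override_query` is hypothetical, and it assumes user names are valid as sub-field names (for example, that they contain no dots).

```python
import json
from typing import Any, Dict


def override_query(field: str, user: str, text: str) -> Dict[str, Any]:
    """Search the user's value of `field` if it exists, else the base value."""
    user_field = f"{field}.{user}"
    base_field = f"{field}.base"
    return {
        "bool": {
            "should": [
                # Case 1: the user customized the field, so search only their value.
                {
                    "bool": {
                        "must": [
                            {"exists": {"field": user_field}},
                            {"match": {user_field: text}},
                        ]
                    }
                },
                # Case 2: no customization by this user, so search the base value.
                {
                    "bool": {
                        "must_not": [{"exists": {"field": user_field}}],
                        "must": [{"match": {base_field: text}}],
                    }
                },
            ]
        }
    }


body = {"query": override_query("Description", "Eric", "wtf")}
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after `GET st_t2/_search` or passed as the request body by whichever client you use.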

UPDATED:

While implementing this solution, @Eric Bowden decided to use a nested mapping and to wrap the exists and match clauses shown above in nested queries. See his answer for a working example.

Using input from @Oleksii Baidan, the following query worked for me. The sample query below returns the sample document because user Eric has provided a custom value for the Description field, and that value contains "jkl". If I modify the query to search for "abc" instead of "jkl", it returns no result, as expected, because user Eric has overridden the Description field, which hides the base value of the description.

GET index1/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "nested": {
                  "path": "Description",
                  "query": {
                    "exists": { "field": "Description.Eric" }
                  }
                }
              },
              {
                "nested": {
                  "path": "Description",
                  "query": {
                    "match": { "Description.Eric": "jkl" }
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "nested": {
                  "path": "Description",
                  "query": {
                    "exists": { "field": "Description.Eric" }
                  }
                }
              }
            ],
            "must": [
              {
                "nested": {
                  "path": "Description",
                  "query": {
                    "match": { "Description.base": "jkl" }
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
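The nested variant repeats the same wrapper four times, so a small helper can keep it readable. This is a sketch in Python that produces the same structure as the query above; `nested` and `nested_override_query` are names invented for this example, and it assumes each customizable property is mapped as `nested` with the property name as its path.

```python
import json
from typing import Any, Dict


def nested(path: str, query: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a query so it runs against the nested objects under `path`."""
    return {"nested": {"path": path, "query": query}}


def nested_override_query(field: str, user: str, text: str) -> Dict[str, Any]:
    """Nested form of the exists / must_not override query for one field."""
    user_field = f"{field}.{user}"
    base_field = f"{field}.base"
    user_exists = nested(field, {"exists": {"field": user_field}})
    return {
        "bool": {
            "should": [
                # The user has a custom value: search only that value.
                {
                    "bool": {
                        "must": [
                            user_exists,
                            nested(field, {"match": {user_field: text}}),
                        ]
                    }
                },
                # The user has no custom value: fall back to the base value.
                {
                    "bool": {
                        "must_not": [user_exists],
                        "must": [nested(field, {"match": {base_field: text}})],
                    }
                },
            ]
        }
    }


body = {"query": nested_override_query("Description", "Eric", "jkl")}
print(json.dumps(body, indent=2))
```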

Index definition.

PUT index1
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "Name": {
        "type": "nested"
      },
      "Description": {
        "type": "nested"
      },
      "Spec": {
        "type": "nested"
      }
    }
  }
}

Sample document

POST index1/_doc
{ 
  "Name": [
      {"base":"NameBase2"}
    ],
  "Description": [
    {"base": "DescriptionBase2 abc"},
    {"Alex": "DescriptionAlex2 def"},
    {"Eric": "DescriptionEric2 jkl"}
    ],
  "Spec": [
    {"base": "SpecBase2"}
   ]
}

Update: Working with this further, I realized that it is not necessary to configure the user properties as nested. I'm leaving this post in place as an example but, from my understanding, configuring the user fields as nested is not necessary and provides no additional value.

After experimenting with the prior answer which I posted, I discovered that it produced more fields in my index than the maximum of 1000 recommended by Elasticsearch.
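A rough count shows why the per-user sub-field layout runs into that limit. The sketch below uses the figures from the question (40 fields, up to 200 users customizing a field) and assumes, as a simplification, that the same 200 user names are shared across all documents; with more distinct users the count only grows. The default value of the `index.mapping.total_fields.limit` setting is 1000.

```python
CUSTOMIZABLE_FIELDS = 40     # from the question
USERS_PER_FIELD = 200        # from the question; assumed to be distinct user names
DEFAULT_FIELD_LIMIT = 1000   # default of index.mapping.total_fields.limit

# Each customizable field has one "base" sub-field plus one sub-field per user.
leaf_fields = CUSTOMIZABLE_FIELDS * (USERS_PER_FIELD + 1)

print(leaf_fields)                        # 8040
print(leaf_fields > DEFAULT_FIELD_LIMIT)  # True
```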

As a solution, I tried solving this with parent/child documents. This appears to work. The only downside is that the child document data is not retrieved in the query results.

The index

PUT index1
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "ProductCustomizationField": { 
        "type": "join",
        "relations": {
          "Product": "ProductCustomization" 
        }
      }
    }
  }
}

Sample parent document

POST index1/_doc/TestDoc1
{ 
  "Name": "TestDoc1 abc",
  "ProductCustomizationField":"Product"
}

Sample child document

POST index1/_doc/TestDoc1-Eric?routing=TestDoc1
{ 
  "Name": "TestDoc1-Eric def",
  "Owner":"Eric",
  "ProductCustomizationField": {
    "name":"ProductCustomization",
    "parent":"TestDoc1"
  }
}

Sample query

This is an example query executed by user Eric, searching for "def". A result is found because Eric customized the product name to include "def" (see the sample child document above). If another user, e.g. Alex, were to search for "def", no result would be returned, because "def" exists only within Eric's customized product name.

GET index1/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "has_child": {
                  "type": "ProductCustomization",
                  "query": {
                    "match": { "Owner": "Eric" }
                  }
                }
              },
              {
                "has_child": {
                  "type": "ProductCustomization",
                  "query": {
                    "match": { "Name": "def" }
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "has_child": {
                  "type": "ProductCustomization",
                  "query": {
                    "match": { "Owner": "Eric" }
                  }
                }
              },
              {
                "exists": { "field": "Owner" }
              }
            ],
            "must": [
              {
                "match": { "Name": "def" }
              }
            ]
          }
        }
      ]
    }
  }
}
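As a possible variation (not the query I ran above), the owner check and the text match can be placed in a single has_child clause, so that both conditions must hold for the same child document rather than for any two children of the same parent. The has_child query also accepts an `inner_hits` option, which returns the matching child documents alongside each parent hit and may help with the missing child data. The Python sketch below builds such a body; the helper name `parent_child_query` is hypothetical, and the field names `Owner` and `Name` are taken from the sample documents.

```python
import json
from typing import Any, Dict


def parent_child_query(
    field: str, user: str, text: str, child_type: str = "ProductCustomization"
) -> Dict[str, Any]:
    """Match parents via the user's child document, else via the parent itself."""
    owned_by_user = {"match": {"Owner": user}}
    return {
        "bool": {
            "should": [
                # The user's own child document must contain the text.
                {
                    "has_child": {
                        "type": child_type,
                        "query": {
                            "bool": {
                                "must": [owned_by_user, {"match": {field: text}}]
                            }
                        },
                        # Ask Elasticsearch to return the matching child as well.
                        "inner_hits": {},
                    }
                },
                # No child owned by the user: search the parent document.
                {
                    "bool": {
                        "must_not": [
                            {"has_child": {"type": child_type, "query": owned_by_user}},
                            # Keep child documents out of the top-level hits.
                            {"exists": {"field": "Owner"}},
                        ],
                        "must": [{"match": {field: text}}],
                    }
                },
            ]
        }
    }


body = {"query": parent_child_query("Name", "Eric", "def")}
print(json.dumps(body, indent=2))
```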

Comments
  • Would you expect this approach to scale up to 20 fields and cases where up to 200 users may customize one or more fields on a document? My index will include 735,000 documents.
  • 735k documents is not that many, but yes, I understand your concern. The base search queries will be very fast and simple: no scripts, simple logic. But your task is unusual, and the best answer is to try it. I think this approach can work, but the best way to check is to try it on your dataset.
  • You led me to the correct answer, but your answer didn't work for me as written. In my query, I had to specify the queries as nested. Are you sure that your answer works as it is? I think I need your index definition to test it myself. If you agree that the query needs to be nested, will you update your answer? I'll then mark it as correct and award the bounty. Thanks a ton for your help with this!
  • In my answer I mainly tried to solve the search query problem (the trick with exists and must_not). I looked at the nested variant as well, but left it out to keep my answer simple and to give a direction to think in. I'm glad to hear that my solution helped you! I will update the answer; the bounty will be a good start to the year :)
  • @EricBowden I updated the answer with a link to your solution. I see no reason to duplicate it in my answer :) What do you think?