In this post on the Elasticsearch Update by Query API, I will show a real-world use of the API, what it can do, and what it can't. I will also explain why you should strive not to end up in a situation that warrants using it in the first place.
WHY
After data has been indexed, you may spot a mistake, or you may simply want to change documents that are already in the index. This API lets you modify existing documents in an index. But there is already an Update API, right? So why does the Update by Query API exist? The answer is that the Update API depends on you to pick the document by its id, and only that document gets updated. The Update by Query API instead takes a query and updates every document the query matches. Now that is powerful.
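And to save you disappointment later on: this API cannot repair a bad mapping on an existing index (more on that in the closing thoughts).
On the bright side, you can add or drop fields from an indexed document while indexing it back into the same index.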
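To make the difference between the two APIs concrete, here is a small Python model of their behaviour. It is only an in-memory sketch, not the Elasticsearch client: the index is a plain dict, and the function names update_by_id and update_by_query are made up for illustration.

```python
# In-memory model of the two update styles; this is NOT the Elasticsearch client.
# `index` is a plain dict mapping document id -> _source dict.

def update_by_id(index, doc_id, changes):
    """Update API style: the caller must already know the document id."""
    index[doc_id].update(changes)
    return 1  # exactly one document is touched


def update_by_query(index, matches, script):
    """Update by Query style: every document the query matches is rewritten."""
    updated = 0
    for source in index.values():
        if matches(source):
            script(source)  # plays the role of the update script
            updated += 1
    return updated


index = {
    "1": {"ProductCode": 12, "Shipping": 10},
    "2": {"ProductCode": 12, "Shipping": 10},
    "3": {"ProductCode": 13, "Shipping": 10},
}

# One document, picked by id.
update_by_id(index, "3", {"Price": 125})


# Every document with ProductCode 12, picked by a query.
def free_shipping(source):
    source["Shipping"] = 0


count = update_by_query(index, lambda s: s["ProductCode"] == 12, free_shipping)
print(count)  # 2
```

The point of the model is the shape of the call: the first function needs an id, the second needs only a condition and applies the change to however many documents satisfy it.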
HOW
Let the examples do the talking. First we will quickly push in some data. I am on Elasticsearch 6.5.2, where _doc in the URL is the document type. On Elasticsearch 7.x the same requests work unchanged; there _doc is the name of the endpoint rather than a type.
PUT my_index/_doc/1
{
  "ProductCode": 12,
  "Price": 100,
  "IsNonGMO": true,
  "Shipping": 10
}

PUT my_index/_doc/2
{
  "ProductCode": 12,
  "Price": 110,
  "IsNonGMO": false,
  "Shipping": 10
}

PUT my_index/_doc/3
{
  "ProductCode": 13,
  "Price": 120,
  "Shipping": 10
}
Let us see what we got in the index.
GET my_index/_search
Output
{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "ProductCode" : 12,
          "Price" : 110,
          "IsNonGMO" : false,
          "Shipping" : 10
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "ProductCode" : 12,
          "Price" : 100,
          "IsNonGMO" : true,
          "Shipping" : 10
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "ProductCode" : 13,
          "Price" : 120,
          "Shipping" : 10
        }
      }
    ]
  }
}
Let us make Shipping 0 for ProductCode 12. There are 2 such documents. We will use the Elasticsearch Update by Query API and supply a query to match all documents where the ProductCode is 12. I will also add a field called IsOrganic and drop the field IsNonGMO.
POST my_index/_update_by_query
{
  "script": {
    "source": "ctx._source.Shipping = 0;ctx._source.IsOrganic = 'true';ctx._source.remove('IsNonGMO')",
    "lang": "painless"
  },
  "query": {
    "match": {
      "ProductCode": 12
    }
  }
}
The output, as expected:
{
  "took" : 172,
  "timed_out" : false,
  "total" : 2,
  "updated" : 2,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
And this is how our documents look now.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "ProductCode" : 12,
          "Price" : 110,
          "IsOrganic" : "true",
          "Shipping" : 0
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "ProductCode" : 12,
          "Price" : 100,
          "IsOrganic" : "true",
          "Shipping" : 0
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "ProductCode" : 13,
          "Price" : 120,
          "Shipping" : 10
        }
      }
    ]
  }
}
For ProductCode 12, the Shipping is now 0, the IsNonGMO field is gone, and we have added a new field called IsOrganic.
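The same request can also be sent from a script instead of the console. Below is a minimal sketch using only Python's standard library. The base URL http://localhost:9200 and the helper name build_update_by_query are assumptions for illustration and are not part of the setup above. The code only builds the request, since actually sending it needs a running cluster.

```python
import json
import urllib.request

# Assumption: a local, unsecured node. Adjust for your own cluster.
BASE_URL = "http://localhost:9200"


def build_update_by_query(index, script_source, query, base_url=BASE_URL):
    """Build (but do not send) the POST <index>/_update_by_query request."""
    body = {
        "script": {"source": script_source, "lang": "painless"},
        "query": query,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_update_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_by_query(
    "my_index",
    "ctx._source.Shipping = 0;"
    "ctx._source.IsOrganic = 'true';"
    "ctx._source.remove('IsNonGMO')",
    {"match": {"ProductCode": 12}},
)

# Sending it needs a running cluster, so that step is left commented out:
# with urllib.request.urlopen(request) as response:
#     summary = json.load(response)
#     print(summary["updated"], summary["version_conflicts"])
```

The response body is the same summary shown earlier, so a caller would typically check the updated and version_conflicts counts before moving on.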
Closing thoughts
The Elasticsearch Update by Query API is a very powerful tool in your arsenal. However, there are a few things you should know about it.
1. You cannot really repair bad mappings on existing indices. For that you will need a bigger hammer, the Reindex API.
2. If possible, make sure that the data coming in is correct rather than using this API to correct data that is already indexed. Prevention is better than cure.
3. This can be a very time-consuming call if millions of documents have to be updated. Before you try it at that scale, read the API documentation on throttling, version conflicts, and running the call as a background task.
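For the large-update case in point 3, three query-string parameters of this API are worth knowing: conflicts=proceed counts version conflicts instead of aborting on the first one, wait_for_completion=false returns a task id right away so the work runs in the background, and requests_per_second throttles the update. The sketch below only assembles the request path. The helper name update_by_query_path and the value 500 are arbitrary choices for illustration, not recommendations.

```python
from urllib.parse import urlencode


def update_by_query_path(index, conflicts="proceed", wait_for_completion=False,
                         requests_per_second=None):
    """Build the request path with the options that matter for big updates."""
    params = {
        "conflicts": conflicts,
        # Elasticsearch expects lowercase true/false in the query string.
        "wait_for_completion": str(wait_for_completion).lower(),
    }
    if requests_per_second is not None:
        params["requests_per_second"] = requests_per_second
    return f"/{index}/_update_by_query?{urlencode(params)}"


path = update_by_query_path("my_index", requests_per_second=500)
print(path)
# /my_index/_update_by_query?conflicts=proceed&wait_for_completion=false&requests_per_second=500
```

With wait_for_completion=false the response contains a task id, and progress can then be checked through the Tasks API (GET _tasks/<task_id>).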