Post-Indexing Updates

Documents often require updating — fields need to be incremented, modified, overwritten, or deleted. In this section we'll discuss situations where you:
- know the ID of the doc in question and want to update only that particular document
- want to update a group of documents that have something in common (i.e. match a query)
I'm tracking visits to my site based on the slug. A sample entry in my site_visits index:
POST site_visits/_doc/home_id
{
  "slug": "/home",
  "visits": 0,
  "tags": ["landing", "ab_test_100"],
  "unneeded_attribute": "old"
}
How do I:
- add a modified_at field and set it to now
- increment the visits count
- remove the ab_test_100 tag
- and delete the unneeded attribute?
To add a new field such as modified_at, you may be tempted to repeat the POST call from above with only the new field present:
POST site_visits/_doc/home_id
{
  "modified_at": "2020-12-05T14:11:41.634Z"
}
While this is a perfectly valid call, it would completely overwrite the existing doc, leaving modified_at as its only field.
What's needed instead is a request to the _update API:
POST site_visits/_update/home_id
{
  "doc": {
    "modified_at": "2020-12-05T14:11:41.634Z"
  }
}
Notice that the new field had to be wrapped inside doc, and that the contents of the other fields were left untouched.
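To make the difference between the two calls concrete, here is a minimal local sketch in Python. It does not talk to ES at all: index_doc and partial_update are hypothetical helper names of mine that only mimic the behavior described above, assuming that a partial doc is merged into the stored one (nested objects merged recursively, scalars and arrays replaced).

```python
import copy

def index_doc(existing, new_doc):
    """Mimics POST <index>/_doc/<id>: the stored document is replaced wholesale."""
    return copy.deepcopy(new_doc)

def partial_update(existing, partial):
    """Mimics POST <index>/_update/<id> with a "doc" body: fields are merged in."""
    merged = copy.deepcopy(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # nested objects are merged recursively
            merged[key] = partial_update(merged[key], value)
        else:
            # scalars and arrays simply replace the old value
            merged[key] = copy.deepcopy(value)
    return merged

stored = {
    "slug": "/home",
    "visits": 0,
    "tags": ["landing", "ab_test_100"],
    "unneeded_attribute": "old",
}
patch = {"modified_at": "2020-12-05T14:11:41.634Z"}

overwritten = index_doc(stored, patch)
updated = partial_update(stored, patch)

print(sorted(overwritten))  # only the new field survives
print(sorted(updated))      # the new field plus all the original ones
```

The first result contains nothing but modified_at, while the second keeps slug, visits, tags and unneeded_attribute alongside it.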
Now, as to modifying existing fields, all of that can be done in one go inside a script which targets the same URI path as above:
POST site_visits/_update/home_id
{
  "script": {
    "source": """
      // incrementing a number
      ctx._source.visits++;
      // removing an array list entry
      if (ctx._source.tags.contains(params.tag_to_remove)) {
        ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag_to_remove));
      }
      // removing a field
      ctx._source.remove(params.field_to_remove);
      // assigning a timestamp (ISO 8601, UTC)
      def now_millis = ctx._now;
      def now_date = new Date(now_millis);
      def df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
      // the literal 'Z' is only accurate if the formatter is pinned to UTC
      df.setTimeZone(TimeZone.getTimeZone("UTC"));
      def now_iso = df.format(now_date);
      ctx._source.modified_at = now_iso;
    """,
    "params": {
      "tag_to_remove": "ab_test_100",
      "field_to_remove": "unneeded_attribute"
    }
  }
}
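If you'd like to sanity-check what the script should leave behind, the same four steps can be replayed locally. The Python below is only a rough simulation under my own assumptions, not ES code: apply_script is a hypothetical name, and the timestamp is passed in as a ready-made string so the sketch doesn't depend on any particular date format.

```python
def apply_script(source, params, now_iso):
    """Replays the update script's four steps on a copy of the document."""
    doc = dict(source)
    doc["tags"] = list(source.get("tags", []))

    # incrementing a number
    doc["visits"] += 1

    # removing an array list entry (first occurrence only, like indexOf + remove)
    if params["tag_to_remove"] in doc["tags"]:
        doc["tags"].remove(params["tag_to_remove"])

    # removing a field (a no-op if the field is absent)
    doc.pop(params["field_to_remove"], None)

    # assigning a timestamp
    doc["modified_at"] = now_iso
    return doc

before = {
    "slug": "/home",
    "visits": 0,
    "tags": ["landing", "ab_test_100"],
    "unneeded_attribute": "old",
}
params = {"tag_to_remove": "ab_test_100", "field_to_remove": "unneeded_attribute"}

after = apply_script(before, params, "2020-12-05T14:11:41.634Z")
print(after)
```

For the sample entry, this leaves slug untouched, visits at 1, tags reduced to ["landing"], no unneeded_attribute, and the new modified_at field.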
I've used ctx._now only to illustrate its availability in the _update API. That's essentially the only place where a system-level now is safe to use in a script, due to the (often) distributed nature of ES. If, instead of updating, we were to compare dates, say, and our query were executed on multiple nodes, it'd be very hard to keep now synchronized across them. So, for all intents and purposes, it's safer to work with a parametrized now, i.e. to add a runtime-generated now attribute to the params dictionary, which guarantees that all workers running the script are exposed to the same value.
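As a rough sketch of what a parametrized now could look like on the client side, the Python below generates the timestamp once and ships it inside params. The function name build_touch_request and the one-line script are my own illustration, not something prescribed by ES; the only assumption about the script is that it reads the value from params.now instead of asking the node for the time.

```python
from datetime import datetime, timezone

def build_touch_request(now=None):
    """Builds a request body whose script takes its timestamp from params.now."""
    if now is None:
        now = datetime.now(timezone.utc)
    # normalize to UTC and format as ISO 8601 with millisecond precision
    now_iso = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "script": {
            # every worker sees the same, client-generated value
            "source": "ctx._source.modified_at = params.now;",
            "params": {"now": now_iso},
        }
    }

fixed = datetime(2020, 12, 5, 14, 11, 41, 634000, tzinfo=timezone.utc)
body = build_touch_request(fixed)
print(body["script"]["params"]["now"])  # 2020-12-05T14:11:41.634Z
```

The resulting dictionary can be serialized to JSON and sent as the body of the same _update call used above. Because the value is computed before the request leaves the client, it doesn't matter which node ends up running the script.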