Create readable and maintainable ingest pipelines
There are many ways to achieve similar results when creating ingest pipelines, which can make maintenance and readability difficult. This guide outlines patterns you can follow to make the maintenance and readability of ingest pipelines easier without sacrificing functionality.
This guide does not cover optimizing for ingest pipeline performance.
When creating ingest pipelines, there are a few options for accessing fields in conditional statements and scripts. All formats can be used to reference fields, so choose the one that makes your pipeline easier to read and maintain.
| Notation | Example | Notes |
|---|---|---|
| Dot notation | `ctx.event.action` | Supported in conditionals and Painless scripts. |
| Square bracket notation | `ctx['event']['action']` | Supported in conditionals and Painless scripts. |
| Mixed dot and bracket notation | `ctx.event['action']` | Supported in conditionals and Painless scripts. |
| Field API (newer Stack versions) | `field('event.action', '')` or `$('event.action', '')` | Supported in conditionals and Painless scripts. |
| Field API (older versions) | `field('event.action', '')` or `$('event.action', '')` | Supported only in Painless scripts. |
Below are some general guidelines for choosing the right option in a given situation.
Field API

The field API can be used in conditionals (the `if` statement of your processor) in addition to being used in the script processor itself. This is the preferred way to access fields; a short example follows the lists below.
Benefits

- Clean and easy to read.
- Handles null values automatically.
- Adds support for additional functions like `isEmpty()` to ease comparisons.
- Handles dots as part of a field name.
- Handles dots as dot walking for object notation.
- Handles special characters.

Limitations

- Not available in all versions for conditionals.
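For illustration, here is a minimal sketch, assuming a version where the field API is available in conditionals (the `event.category` field and its value are hypothetical). The second document has no `event.action`, so the fallback `''` keeps the condition safely false:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "event": { "action": "logon" } } },
    { "_source": { "event": {} } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "event.category",
          "value": "authentication",
          "if": "$('event.action', '') == 'logon'"
        }
      }
    ]
  }
}
```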
Dot notation

Benefits

- Clean and easy to read.
- Supports null safe operations with `?.`. Read more in Use null safe operators (`?.`).

Limitations

- Does not support field names that contain a `.` or any special characters such as `@`. Use bracket notation instead.
Square bracket notation

Benefits

- Supports special characters such as `@` in the field name. For example, if there's a field named `has@!%&chars`, you would use `ctx['has@!%&chars']`.
- Supports field names that contain `.`. For example, if there's a field named `foo.bar`, using `ctx.foo.bar` will try to access the field `bar` in the object `foo` in the object `ctx`, while `ctx['foo.bar']` accesses the field directly (see the sketch after this list).

Limitations

- Slightly more verbose than dot notation.
- No support for null safe operations with `?.`. Use dot notation instead.
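To make the difference concrete, here is a small sketch (the field names come from the example above; the `flat_copy` flag is illustrative). The document contains both a literal `foo.bar` key and a nested `foo` object with a `bar` field:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "foo.bar": "flat value",
        "foo": { "bar": "nested value" }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "flat_copy",
          "value": true,
          "if": "ctx['foo.bar'] == 'flat value'"
        }
      }
    ]
  }
}
```

Here `ctx['foo.bar']` matches the flat key, while `ctx.foo.bar` would have returned `nested value`.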
Mixed dot and bracket notation

Benefits

- You can mix dot notation and bracket notation to take advantage of the benefits of both formats. For example, you could use `ctx.my.nested.object['has@!%&chars']`. You can then use the `?.` operator on the fields accessed with dot notation while still reaching a field whose name contains special characters: `ctx.my?.nested?.object['has@!%&chars']`.

Limitations

- Slightly more difficult to read.
Use conditionals (`if` statements) to ensure that an ingest pipeline processor is only applied when specific conditions are met. Anticipate potential problems with the data, and use the null safe operator (`?.`) to prevent data from being processed incorrectly.
It is not necessary to use a null safe operator for first-level objects (for example, use `ctx.openshift` instead of `ctx?.openshift`). `ctx` will only ever be `null` if the entire `_source` is empty.
For example, if you only want data that has a valid string in the `ctx.openshift.origin.threadId` field:

Incorrect:

```
ctx.openshift.origin != null && ctx.openshift.origin.threadId != null
```

- It's unnecessary to check both `openshift.origin` and `openshift.origin.threadId`.
- This will fail if `openshift` is not properly set because it assumes that `ctx.openshift` and `ctx.openshift.origin` both exist.

Correct:

```
ctx.openshift?.origin?.threadId instanceof String
```

- Only if there's a `ctx.openshift` and a `ctx.openshift.origin` will it check for a `ctx.openshift.origin.threadId` and make sure it is a string.
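As a runnable sketch of the correct pattern (the `has_thread_id` flag and sample values are illustrative), note that documents without the `openshift` object pass through without errors:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "openshift": { "origin": { "threadId": "T-1" } } } },
    { "_source": { "openshift": {} } },
    { "_source": { "message": "no openshift object" } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "has_thread_id",
          "value": true,
          "if": "ctx.openshift?.origin?.threadId instanceof String"
        }
      }
    ]
  }
}
```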
If you're using a null safe operator, it will return the value if it is not `null`, so there is no reason to check whether a value is not `null` before checking the type of that value.

For example, if you only want data when the value of the `ctx.openshift.eventPayload` field is a string:

Incorrect:

```
ctx?.openshift?.eventPayload != null && ctx.openshift.eventPayload instanceof String
```

Correct:

```
ctx.openshift?.eventPayload instanceof String
```
When using the boolean OR operator (`||`), you need to use the null safe operator for both conditions being checked.

For example, if you want to include data when the value of the `ctx.event.type` field is either `null` or `'0'`:

Incorrect:

```
ctx.event.type == null || ctx.event.type == '0'
```

- This will fail if `ctx.event` is not properly set because it assumes that `ctx.event` exists. If it fails on the first condition, it won't even try the second condition.

Correct:

```
ctx.event?.type == null || ctx.event?.type == '0'
```

- Both conditions will be checked.
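A quick sketch of the correct version (the `include` flag is illustrative); both documents match, including the one without an `event` object:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "event": { "type": "0" } } },
    { "_source": { "message": "no event object" } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "include",
          "value": true,
          "if": "ctx.event?.type == null || ctx.event?.type == '0'"
        }
      }
    ]
  }
}
```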
It is often unnecessary to use the null safe operator (`?.`) multiple times when you have already traversed the object path.

For example, if you're checking the value of two different child properties of `ctx.arbor.ddos`:

Instead of:

```
ctx.arbor?.ddos?.subsystem == 'CLI' && ctx.arbor?.ddos?.command_line != null
```

you can use:

```
ctx.arbor?.ddos?.subsystem == 'CLI' && ctx.arbor.ddos.command_line != null
```

- Since the `if` condition is evaluated left to right, once `ctx.arbor?.ddos?.subsystem == 'CLI'` passes, you know `ctx.arbor.ddos` exists, so you can safely omit the `?.` operators in the second condition.
When checking if a field is not empty, avoid redundant null safe operators and use clear, concise conditions.

Incorrect:

```
ctx?.user?.geo?.region != null && ctx?.user?.geo?.region != ''
```

Once you've checked `ctx.user?.geo?.region != null`, you can safely access `ctx.user.geo.region` in the next condition.

Correct:

```
ctx.user?.geo?.region != null && ctx.user.geo.region != ''
```

To check that a string field exists and is not empty, you can also use the `isEmpty()` method in your condition. For example:

```
ctx.user?.geo?.region instanceof String && ctx.user.geo.region.isEmpty() == false
```

- This ensures the field exists, is a string, and is not empty.
For such checks you can also omit the `instanceof String` and use an Elvis operator (`?:`), such as `if: !(ctx.user?.geo?.region?.isEmpty() ?: true)`. This will only work when `region` is a `String`. If it is a `double`, an `object`, or any other type that does not have an `isEmpty()` method, it will fail with a Java `Function not found` error.
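To see that failure mode, simulate with a non-string value (the numeric `region` is contrived for the demonstration); the document fails with a script error because `double` has no `isEmpty()` method:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "user": { "geo": { "region": 12.5 } } } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "demo",
          "value": true,
          "if": "!(ctx.user?.geo?.region?.isEmpty() ?: true)"
        }
      }
    ]
  }
}
```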
Full example

Here is a full reproducible example:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "user": {
          "geo": {
            "region": "123"
          }
        }
      }
    },
    {
      "_source": {
        "user": {
          "geo": {
            "region": ""
          }
        }
      }
    },
    {
      "_source": {
        "user": {
          "geo": {
            "region": null
          }
        }
      }
    },
    {
      "_source": {
        "user": {
          "geo": null
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "demo",
          "value": true,
          "if": "ctx.user?.geo?.region != null && ctx.user.geo.region != ''"
        }
      }
    ]
  }
}
```
When using the boolean OR operator (`||`), `if` conditions can become unnecessarily complex and difficult to maintain, especially when chaining many OR checks. Instead, consider using array-based checks like `.contains()` to simplify your logic and improve readability.

Incorrect:

```
"if": "ctx?.kubernetes?.container?.name == 'admin' || ctx?.kubernetes?.container?.name == 'def' || ctx?.kubernetes?.container?.name == 'demo' || ctx?.kubernetes?.container?.name == 'acme' || ctx?.kubernetes?.container?.name == 'wonderful'"
```

Correct:

```
["admin","def","demo","acme","wonderful"].contains(ctx.kubernetes?.container?.name)
```

This example only checks for exact matches. Do not use this approach if you need to check for partial matches.
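Here is the array-based check as a runnable sketch (the `monitored` flag is illustrative); only the first document matches:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "kubernetes": { "container": { "name": "demo" } } } },
    { "_source": { "kubernetes": { "container": { "name": "other" } } } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "monitored",
          "value": true,
          "if": "['admin','def','demo','acme','wonderful'].contains(ctx.kubernetes?.container?.name)"
        }
      }
    ]
  }
}
```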
When working with data sizes, store all values as bytes (using a `long` type) in Elasticsearch. This ensures consistency and allows you to leverage advanced formatting in Kibana Data Views to display human-readable sizes.

Avoid chaining several `gsub` processors to strip units and manually convert values. This approach is error-prone, hard to maintain, and can easily miss edge cases.

Incorrect:
```
{
  "gsub": {
    "field": "document.size",
    "pattern": "M",
    "replacement": "",
    "ignore_missing": true,
    "if": "ctx?.document?.size != null && ctx.document.size.endsWith(\"M\")"
  }
},
{
  "gsub": {
    "field": "document.size",
    "pattern": "(\\d+)\\.(\\d+)G",
    "replacement": "$1$200",
    "ignore_missing": true,
    "if": "ctx?.document?.size != null && ctx.document.size.endsWith(\"G\")"
  }
},
{
  "gsub": {
    "field": "document.size",
    "pattern": "G",
    "replacement": "000",
    "ignore_missing": true,
    "if": "ctx?.document?.size != null && ctx.document.size.endsWith(\"G\")"
  }
}
```
Correct:

The `bytes` processor automatically parses and converts strings like `"100M"` or `"2.5GB"` into their byte values. This is more reliable, easier to maintain, and supports a wide range of units.
```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "document": {
          "size": "100M"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "bytes": {
          "field": "document.size"
        }
      }
    ]
  }
}
```
After storing values as bytes, you can use Kibana's field formatting to display them in a human-friendly format (KB, MB, GB, etc.) without changing the underlying data.
The rename processor renames a field. There are two useful flags:

- `ignore_missing`: Useful when you are not sure that the field you want to rename exists.
- `ignore_failure`: Helps with any failures encountered. For example, the rename processor can only rename to non-existing fields. If you already have the field `abc` and you want to rename `def` to `abc`, the operation will fail.

A sketch demonstrating both flags follows this list.
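This minimal sketch uses the `abc`/`def` fields from the example above (the sample values are illustrative). The first document already has `abc`, so the rename fails and is swallowed by `ignore_failure`; the second document is renamed normally:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "abc": "existing value", "def": "value to move" } },
    { "_source": { "def": "value to move" } }
  ],
  "pipeline": {
    "processors": [
      {
        "rename": {
          "field": "def",
          "target_field": "abc",
          "ignore_missing": true,
          "ignore_failure": true
        }
      }
    ]
  }
}
```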
If no built-in processor can achieve your goal, you may need to use a script processor in your ingest pipeline. Be sure to write scripts that are clear, concise, and maintainable.
All of the previously discussed ways to access fields and retrieve their values are applicable within the script context. Null handling is still an important aspect when accessing fields.

The field API is the recommended way to add new fields.
For example, add a new `system.cpu.total.norm.pct` field based on the value of the `cpu.usage` field. The value of the existing `cpu.usage` field is a number on a scale of 0-100. The value of the new `system.cpu.total.norm.pct` field will be on a scale from 0-1.0, where 1 is the equivalent of 100 in the `cpu.usage` field.
Option 1: Field API (preferred)

Create a new `system.cpu.total.norm.pct` field and set the value to the value of the `cpu.usage` field divided by `100.0`.
```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "cpu": {
          "usage": 90
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            field('system.cpu.total.norm.pct').set($('cpu.usage', 0.0)/100.0)
          """
        }
      }
    ]
  }
}
```
- This field expects a value of 0-1 and not 0-100, so divide the value by 100 to get the correct result.
- The field API is exposed as `field(<field name>)`. The `set(<value>)` call is responsible for setting the value. Inside it, `$(<field name>, fallback)` reads the value out of the existing field. Lastly, we divide by `100.0`. The `.0` is important; otherwise Painless performs an integer-only division and returns 0 instead of 0.9.
Option 2: Without the field API

Without the field API, much more code is involved to ensure that you can walk the full path of `system.cpu.total.norm.pct`.
```
{
  "script": {
    "source": """
      if (ctx.system == null) {
        ctx.system = new HashMap();
      }
      if (ctx.system.cpu == null) {
        ctx.system.cpu = [:];
      }
      if (ctx.system.cpu.total == null) {
        ctx.system.cpu.total = [:];
      }
      if (ctx.system.cpu.total.norm == null) {
        ctx.system.cpu.total.norm = [:];
      }
      ctx.system.cpu.total.norm.pct = $('cpu.usage', 0.0)/100.0;
    """
  }
}
```
- Check whether each object in the path is null, and create it if it is.
- Create a new `HashMap` to store the nested objects in.
- Instead of writing `new HashMap()`, you can use the shortcut `[:]`.
- Perform the same calculation as above and set the value.
For example, consider this script, which manually converts a duration string such as `00:00:06.448` into milliseconds:

Incorrect:

```
{
  "script": {
    "source": """
      String timeString = ctx['temp']['duration'];
      ctx['event']['duration'] = Integer.parseInt(timeString.substring(0,2))*360000 + Integer.parseInt(timeString.substring(3,5))*60000 + Integer.parseInt(timeString.substring(6,8))*1000 + Integer.parseInt(timeString.substring(9,12));
    """,
    "if": "ctx.temp != null && ctx.temp.duration != null"
  }
}
```
- Avoid accessing fields using square brackets instead of dot notation.
- `ctx['event']['duration']`: Do not attempt to access child properties without ensuring the parent property exists.
- `timeString.substring(0,2)`: Avoid parsing substrings manually instead of leveraging date/time parsing utilities.
- `event.duration` should be in nanoseconds, as expected by ECS, instead of milliseconds.
- Avoid redundant null checks; use the null safe operator (`?.`) instead.

This approach is hard to read, error-prone, and doesn't take advantage of the powerful date/time features available in Painless.

Correct:
```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "temp": {
          "duration": "00:00:06.448"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            if (ctx.event == null) {
              ctx.event = [:];
            }
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern("HH:mm:ss.SSS");
            LocalTime time = LocalTime.parse(ctx.temp.duration, formatter);
            ctx.event.duration = time.toNanoOfDay();
          """,
          "if": "ctx.temp?.duration != null"
        }
      }
    ]
  }
}
```
- Ensure the `event` object exists before assigning to it.
- Use `DateTimeFormatter` and `LocalTime` to parse the duration string.
- Store the duration in nanoseconds, as expected by ECS.
- Use the null safe operator to check for field existence.
When reconstructing or normalizing IP addresses in ingest pipelines, avoid unnecessary complexity and redundant operations.
Incorrect:

```
{
  "script": {
    "source": """
      String[] ipSplit = ctx['destination']['ip'].splitOnToken('.');
      String ip = Integer.parseInt(ipSplit[0]) + '.' + Integer.parseInt(ipSplit[1]) + '.' + Integer.parseInt(ipSplit[2]) + '.' + Integer.parseInt(ipSplit[3]);
      ctx['destination']['ip'] = ip;
    """,
    "if": "(ctx['destination'] != null) && (ctx['destination']['ip'] != null)"
  }
}
```
- Uses square bracket notation for field access instead of dot notation.
- Unnecessary casting to `Integer` when parsing string segments.
- Allocates an extra variable for the IP string instead of setting the field directly.
- Does not check if `destination` is available as an object.

Correct:
```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "destination": {
          "ip": "192.168.0.1.3.4.5.6.4"
        }
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": """
            def temp = ctx.destination.ip.splitOnToken('.');
            ctx.destination.ip = temp[0] + "." + temp[1] + "." + temp[2] + "." + temp[3];
          """,
          "if": "ctx.destination?.ip != null"
        }
      }
    ]
  }
}
```
- Uses dot notation for field access.
- Avoids unnecessary casting and extra variables.
- Uses the null safe operator (`?.`) to check for field existence.

This approach is more maintainable, avoids unnecessary operations, and ensures your pipeline scripts are robust and easy to understand.
It's a common mistake to explicitly remove the `@timestamp` field before running a date processor, as shown below:
```
{
  "set": {
    "field": "openshift.timestamp",
    "value": "{{openshift.date}} {{openshift.time}}",
    "if": "ctx?.openshift?.date != null && ctx?.openshift?.time != null && ctx?.openshift?.timestamp == null"
  }
},
{
  "remove": {
    "field": "@timestamp",
    "ignore_missing": true,
    "if": "ctx?.openshift?.timestamp != null || ctx?.openshift?.timestamp1 != null"
  }
},
{
  "date": {
    "field": "openshift.timestamp",
    "formats": [
      "yyyy-MM-dd HH:mm:ss",
      "ISO8601"
    ],
    "timezone": "Europe/Vienna",
    "if": "ctx?.openshift?.timestamp != null"
  }
}
```
This removal step is unnecessary and can even be counterproductive. The `date` processor will automatically overwrite the value in `@timestamp` with the parsed date from your source field, unless you explicitly set a different `target_field`. There's no need to remove `@timestamp` beforehand; the processor will handle updating it for you.

Removing `@timestamp` can also introduce subtle bugs, especially if the date processor is skipped or fails, leaving your document without a timestamp.
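Here is a sketch of the same pipeline with the removal step dropped (the sample date and time values are illustrative); the `date` processor overwrites `@timestamp` on its own:

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "openshift": { "date": "2024-05-01", "time": "08:30:00" } } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "openshift.timestamp",
          "value": "{{openshift.date}} {{openshift.time}}",
          "if": "ctx.openshift?.date != null && ctx.openshift?.time != null"
        }
      },
      {
        "date": {
          "field": "openshift.timestamp",
          "formats": ["yyyy-MM-dd HH:mm:ss", "ISO8601"],
          "timezone": "Europe/Vienna",
          "if": "ctx.openshift?.timestamp != null"
        }
      }
    ]
  }
}
```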
Mustache is a simple templating language used in Elasticsearch ingest pipelines to dynamically insert field values into strings. You can use double curly braces (`{{ }}`) to reference fields from your document, enabling flexible and dynamic value assignment in processors like `set`, `rename`, and others.

For example, `{{host.hostname}}` will be replaced with the value of the `host.hostname` field at runtime. Mustache supports accessing nested fields, arrays, and even provides some basic logic for conditional rendering.
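For instance, a minimal sketch (the `message` field and its text are illustrative):

```
POST _ingest/pipeline/_simulate
{
  "docs": [
    { "_source": { "host": { "hostname": "abc" } } }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "message",
          "value": "heartbeat from {{host.hostname}}"
        }
      }
    ]
  }
}
```

The resulting `message` field is `heartbeat from abc`.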
When you need to reference a specific element in an array using Mustache templates, you can use dot notation with the zero-based index. For example, to access the first value in the `tags` array, use `.0` after the array field name.
```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "host": {
          "hostname": "abc"
        },
        "tags": [
          "cool-host"
        ]
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "host.alias",
          "value": "{{tags.0}}"
        }
      }
    ]
  }
}
```
In this example, `{{tags.0}}` retrieves the first element of the `tags` array (`"cool-host"`) and assigns it to the `host.alias` field. This approach is necessary when you want to extract a specific value from an array for use elsewhere in your document. Using the correct index ensures you get the intended value, and this pattern works for any array field in your source data.
Whenever you need to store the original `_source` within an `event.original` field, use the Mustache function `{{#toJson}}<field>{{/toJson}}`.
```
POST _ingest/pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "foo": "bar",
        "key": 123
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "event.original",
          "value": "{{#toJson}}_source{{/toJson}}"
        }
      }
    ]
  }
}
```