We dive into multiple tasks every day and run into hundreds of difficulties, which is a natural part of a software engineer’s life.
Things get challenging, but we gladly accept the challenge of working through each task.
Our clients rely on us to remove obstacles, and we are always up for the task at hand.
We seek solutions that benefit both our clients and their users, and we love making things easier for them.
Today, I’d like to give a real-world example of how we solved a problem for one of our development processes: detecting and stopping recursive loops in AWS Lambda functions.
This story shows the kind of technical issue we face on a regular basis and how we overcome it by applying AWS principles and our own experience, and by looking at each problem from a new perspective.
Event-Driven Solutions in a Retry Mechanism
Allow me to relate our real-life experience. We built a pipeline to collect data from source items.
This is how it worked: think of it as a line at the store, with events waiting their turn. When an event reaches the front, we process it according to our requirements and then send the data to its new location.
From a technical point of view, we have a FIFO queue that receives the events. Each event is processed according to our requirements, and the data is fetched from the source URL and downloaded to the destination bucket.
However, we occasionally encountered a stumbling block and were unable to retrieve data from the source URL in a timely manner.
Isn’t this frustrating?
Especially since our major purpose, from a business standpoint, was to promote our merchandise immediately.
We brainstormed ways to tackle that hiccup and settled on the following:
We set up a retry system to handle it smoothly.
Here is the basic idea: if an item cannot be processed because the source data is not available yet, we purposely move it to the dead letter queue. We then wait for a predetermined length of time (DelaySeconds) before attempting to process it again.
On our dead letter queue, we initially set the DelaySeconds property to 15 seconds.
With this approach, we avoid waiting inside the Lambda function. Once the 15 seconds have passed, the message becomes available in the Dead Letter Queue (DLQ), where a handler updates the record by increasing the retry count and renewing the deduplication ID and group ID, then sends it back to the main queue. The new IDs guarantee that the main queue treats the event as new. This process continues until either the item is successfully processed or the maximum number of retries is reached.
What Is DelaySeconds?
DelaySeconds is a parameter of Amazon Simple Queue Service (SQS). It can be set as a queue attribute or, on standard queues, per message; FIFO queues support only the queue-level setting.
In the context of SQS, DelaySeconds refers to the amount of time a message is held in the queue before it is accessible for processing by consumers. This delay is useful when you wish to separate the generation and consumption of messages in your system, or when you need to verify that certain requirements are met before a message is handled.
For example, if your system allows users to cancel an activity within a specified time frame, you may wish to delay performing the action until the cancellation window has passed.
In this situation, you might add a DelaySeconds parameter to the message indicating the action to ensure it is not handled until the cancellation window expires.
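As a minimal sketch of the queue-level setting, the helper below builds the attribute map that SQS expects. The function name `delay_queue_attributes` and the queue URL in the usage comment are illustrative assumptions; the 0 to 900 second range is the limit SQS documents for DelaySeconds.

```python
# Queue-level delay for SQS: every message sent to the queue stays hidden
# from consumers for DelaySeconds. SQS accepts values from 0 to 900 seconds.
MAX_SQS_DELAY_SECONDS = 900


def delay_queue_attributes(delay_seconds: int) -> dict:
    """Build the Attributes map for sqs.create_queue / set_queue_attributes.

    Hypothetical helper, not part of boto3.
    """
    if not 0 <= delay_seconds <= MAX_SQS_DELAY_SECONDS:
        raise ValueError("DelaySeconds must be between 0 and 900")
    # SQS expects attribute values as strings.
    return {"DelaySeconds": str(delay_seconds)}


# Usage (requires boto3 and AWS credentials; the queue URL is a placeholder):
# import boto3
# boto3.client("sqs").set_queue_attributes(
#     QueueUrl="https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>",
#     Attributes=delay_queue_attributes(15),
# )
```

Validating the range locally is a design choice that surfaces a bad value before any AWS call is made.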
Basics of Dead Letter Queue: Why Messages End Up There
A dead letter queue serves as a holding area for messages that cannot be sent successfully. It is used in messaging systems to manage messages that cannot be processed or sent as expected.
Here’s an example. Assume you’re sending messages from one app to another. If a message experiences an error, such as being malformed or being unable to locate the receiver, it is placed in the dead letter queue rather than being permanently lost. This allows you to analyze why the communication failed and take appropriate steps to resolve the issue.
Now, what our DLQ basically handles is this:
- It updates the retry count
- It changes the message deduplication ID and the message group ID
- Later, it bounces the message back to the Main queue for another round.
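The three steps above can be sketched as a small DLQ handler. This is an illustration under assumptions, not our production code: the `retryCount` attribute name, the `MAIN_QUEUE_URL` environment variable, and the cap of 15 are hypothetical, and fresh UUIDs stand in for whatever ID scheme is actually used.

```python
import os
import uuid
from typing import Optional

MAX_RETRY_COUNT_VALUE = 15  # assumed cap for illustration
MAIN_QUEUE_URL = os.environ.get("MAIN_QUEUE_URL", "")  # hypothetical variable


def build_retry_message(record: dict) -> Optional[dict]:
    """Turn one SQS record from the DLQ into send_message keyword arguments.

    Returns None once the retry budget is exhausted.
    """
    attrs = record.get("messageAttributes", {})
    retry_count = int(attrs.get("retryCount", {}).get("stringValue", "0")) + 1
    if retry_count > MAX_RETRY_COUNT_VALUE:
        return None
    return {
        "MessageBody": record["body"],
        # New IDs so the FIFO main queue treats the message as a new one.
        "MessageGroupId": str(uuid.uuid4()),
        "MessageDeduplicationId": str(uuid.uuid4()),
        "MessageAttributes": {
            "retryCount": {"DataType": "Number", "StringValue": str(retry_count)}
        },
    }


def handler(event, context, sqs_client=None):
    """Lambda entry point for the DLQ: requeue each record that has retries left."""
    if sqs_client is None:
        import boto3  # imported lazily so the helper above can be tested offline

        sqs_client = boto3.client("sqs")
    requeued = 0
    for record in event.get("Records", []):
        message = build_retry_message(record)
        if message is not None:
            sqs_client.send_message(QueueUrl=MAIN_QUEUE_URL, **message)
            requeued += 1
    return {"requeued": requeued}
```

Keeping the message-building logic in a pure function makes the retry rules testable without touching SQS.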
In the worst case, we expect the data to appear in the source item within 5 minutes. We therefore scheduled the attempts 15 seconds apart, so that the retries together cover that 5-minute window.
That works out to around 20 attempts, one every 15 seconds, in the hope of a breakthrough.
Normally a couple of tries are enough, but this time, no dice.
Jump Into The Scenario:
Our system had been running smoothly for months when suddenly AWS sent us a warning: its recursive loop detection had flagged our function.
As we started investigating, we looked at the message attribute switcheroo that happens before our events go back to the main queue.
We initially assumed it was harmless, but AWS was not having it.
They have this rule: if your Lambda function is invoked about 16 times in the same chain of requests, it is marked as a recursive loop.
We had modified the retry count, deduplication ID, and message group ID so that the event would not look repeated. Despite our efforts, AWS identified it as the same event, and we were puzzled as to why. The likely explanation is that Lambda’s detection follows the tracing header that the AWS SDK propagates with each request, not the message contents, so changing those IDs does not reset the count.
But wait, there is more than one way to crack this nut.
We could have contacted AWS for assistance, but where’s the fun in that?
We decided to roll up our sleeves and take on the challenge ourselves.
In essence, our approach involved adjusting two key parameters: the DelaySeconds property and the MAX_RETRY_COUNT_VALUE.
By increasing the DelaySeconds property value, we effectively extended the time interval before retrying a process.
This adjustment allows for more time between attempts, potentially providing additional opportunities for success or resolving underlying issues.
At the same time, we reduced MAX_RETRY_COUNT_VALUE to below 16, which limits the maximum number of retry attempts so that a single event never reaches the recursion threshold.
This reduction implies a more cautious approach, limiting the number of retries to prevent excessive resource consumption or potential system overload.
Overall, this strategy aimed to strike a balance between giving processes adequate time to execute successfully while also preventing excessive retry attempts, thereby optimizing system performance and reliability.
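The trade-off between the two parameters is simple arithmetic: the delay times the number of retries has to cover the worst-case wait. The sketch below works that out, assuming the 5-minute (300-second) window mentioned earlier; `min_delay_seconds` is a hypothetical helper, and the numbers are illustrative since the final production values are not stated here.

```python
import math


def min_delay_seconds(window_seconds: int, max_retries: int) -> int:
    """Smallest whole-second delay so that max_retries attempts span the window."""
    if max_retries <= 0:
        raise ValueError("max_retries must be positive")
    return math.ceil(window_seconds / max_retries)


# Original setup: 20 retries across a 300-second window.
print(min_delay_seconds(300, 20))  # 15

# Staying below the 16-invocation threshold with 15 retries.
print(min_delay_seconds(300, 15))  # 20
```

In other words, dropping from 20 retries to 15 only requires raising the delay from 15 to 20 seconds to cover the same window.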
Rolling Into Some Technical Facts: Detecting and Stopping Recursive Loops in AWS Lambda Functions
If you’ve ever used Lambda functions, you know how effective they are for developing serverless apps.
However, with tremendous power comes great responsibility, and recursive loops can cause devastation if unchecked.
We’ll go over everything you need to know about recursive loops in Lambda functions, including how they’re recognized, what actions AWS takes when recursion is found, and how you may change recursion detection settings.
Recursive loops can occur unintentionally in Lambda functions owing to programming faults or unanticipated input data.
Consider a scenario in which one Lambda function invokes another, which then triggers the first.
Without suitable recursion detection tools, this could result in an unending loop that consumes resources and reduces application performance.
To avoid such circumstances, AWS Lambda has built-in recursion detection techniques that detect and terminate recursive loops before they cause harm.
Overview
Lambda functions can be triggered in a variety of ways within AWS.
Events generated by AWS services can be used to activate Lambda functions, which can then communicate with other AWS services via message sending.
Ideally, the service or resource that triggers the Lambda function should be different from the one the function writes its output to.
However, due to misconfiguration or code issues, a function may mistakenly return a processed event to the same service, resulting in a cyclical loop.
Lambda now detects when a function runs in a loop between supported services and stops it with a RecursiveInvocationException after approximately 16 invocations.
There are no additional charges associated with this functionality. Lambda routes asynchronous invocations to a dead-letter queue or an on-failure destination if defined.
Consider the following example of an order processing system:
- An initial order information message is sent to the source SQS queue.
- Lambda retrieves the message from the source queue via an event source mapping (ESM).
- Following processing, Lambda delivers the modified orders message to a destination SQS queue via the SQS SendMessage API.
- A dead-letter queue (DLQ) is configured for the source queue to handle any failed or unprocessed messages.
When Lambda sends the message back to the source SQS queue instead of the destination, a recursive loop of function invocations is created.
After 16 invocations, Lambda throws a RecursiveInvocationException to the ESM.
The ESM stops further invocations, and when the maxReceiveCount is reached, SQS forwards the message to the preset DLQ for the source queue.
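One inexpensive safeguard against the misconfiguration in this example is to refuse to send when the destination equals the source. A minimal sketch, assuming `sqs_client` behaves like a boto3 SQS client; `forward_order` and its parameter names are hypothetical.

```python
def forward_order(sqs_client, source_queue_url: str, destination_queue_url: str, body: str):
    """Send a processed order to the destination queue, never back to the source."""
    if destination_queue_url == source_queue_url:
        # Writing back to the queue that triggers this function would create a loop.
        raise ValueError("destination queue must differ from the source queue")
    return sqs_client.send_message(QueueUrl=destination_queue_url, MessageBody=body)
```

This does not replace Lambda’s own detection; it only catches the most obvious configuration mistake before any message is sent.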
Troubleshooting instructions for the function are supplied via an AWS Health Dashboard message and an email sent to the registered account address.
Lambda also emits a CloudWatch metric called RecursiveInvocationsDropped, which can be viewed in the CloudWatch console.
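Rather than checking the console by hand, you could alarm on that metric. The sketch below builds the arguments for CloudWatch’s `put_metric_alarm` call; it assumes the metric is published in the `AWS/Lambda` namespace with the usual `FunctionName` dimension, and the alarm name and helper name are made up for illustration.

```python
def recursion_alarm_request(function_name: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": f"{function_name}-recursive-invocations-dropped",
        "Namespace": "AWS/Lambda",  # assumed namespace for Lambda metrics
        "MetricName": "RecursiveInvocationsDropped",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        # Fire as soon as a single invocation is dropped.
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Usage (requires boto3 and AWS credentials; the function name is a placeholder):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**recursion_alarm_request("my-function"))
```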
How Does Lambda Detect Recursion?
Lambda does not inspect your code or its call stack. According to the AWS documentation, detection works roughly as follows:
Request Tracing: supported AWS SDKs propagate an AWS X-Ray tracing header with each request, and Lambda uses it to track how many times a function has been invoked as part of the same chain of requests. Active X-Ray tracing does not need to be enabled for this to work.
Invocation Threshold: if the function is invoked approximately 16 times in the same chain, Lambda treats it as a recursive loop and stops the next invocation.
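A toy model can make the chain-counting idea concrete. AWS documents that detection counts invocations along a chain of requests using tracing metadata propagated by the AWS SDKs; the simulation below is only an illustration of that idea, not Lambda’s implementation, and every name in it is hypothetical. It shows why rewriting the message body does not reset the count: the counter travels alongside the event, outside the handler’s reach.

```python
RECURSION_LIMIT = 16


class RecursiveLoopStopped(Exception):
    """Raised by the toy model when the chain reaches the limit."""


def simulate_chain(handler, body, limit: int = RECURSION_LIMIT) -> int:
    """Re-invoke handler for every body it emits; return the invocation count."""
    trace = {"id": "trace-1", "invocations": 0}  # stands in for the trace header
    while body is not None:
        if trace["invocations"] >= limit:
            raise RecursiveLoopStopped(
                f"stopped after {trace['invocations']} invocations"
            )
        trace["invocations"] += 1
        # The handler may rewrite the body freely; the trace is untouched.
        body = handler(body)
    return trace["invocations"]


def requeue_forever(body):
    """Handler that always emits a 'new looking' message."""
    return {"retry": body["retry"] + 1}


try:
    simulate_chain(requeue_forever, {"retry": 0})
except RecursiveLoopStopped as exc:
    print(exc)  # stopped after 16 invocations
```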
Actions to Take When Lambda Stops a Recursive Loop
When Lambda identifies a recursive loop in a function, it takes numerous steps to address the problem and prevent more resource consumption:
Abort Invocation: Lambda aborts the current function invocation.
Log Error Message: Lambda generates an error message indicating that a recursive loop was found and terminated.
Throttle Invocation: Lambda may use throttling to prevent excessive function invocation.
Let’s break these actions down with coding examples:
1. Abort Invocation:
Consider a Lambda function written in Python that performs a recursive Fibonacci calculation:
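A minimal sketch of such a handler, assuming the input number arrives as `event["n"]` (that key is an assumption for illustration):

```python
def fibonacci(n: int) -> int:
    """Naive recursive Fibonacci; fine for small n, slow for large n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def lambda_handler(event, context):
    """Lambda entry point: compute the n-th Fibonacci number."""
    n = int(event["n"])
    return {"n": n, "result": fibonacci(n)}
```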
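If this function recurses without bound because of a programming error or unexpected input, the invocation ends when Python’s recursion limit or the function timeout is reached, which prevents further resource consumption. Note that this kind of in-process recursion is different from the cross-service loops that Lambda’s recursive loop detection stops.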
2. Log Error Message:
Lambda logs error messages to provide visibility into the occurrence of recursive loop detections. You can view these logs in the AWS CloudWatch console.
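A minimal sketch of logging a recursion error from inside a function, under the assumption that the function catches Python’s `RecursionError` itself; anything written through the standard `logging` module in Lambda ends up in CloudWatch Logs. The `countdown` function is deliberately buggy and purely illustrative, and the messages Lambda itself produces on loop detection are not reproduced here.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def countdown(n: int) -> int:
    """Deliberately buggy: no base case, so it recurses until Python gives up."""
    return countdown(n - 1)


def lambda_handler(event, context):
    """Run the buggy function and log a recursion error instead of crashing."""
    try:
        return {"status": "ok", "result": countdown(event.get("n", 0))}
    except RecursionError as exc:
        logger.error("Recursion error: %s", exc)
        return {"status": "error", "reason": "recursion limit exceeded"}
```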
In this example, if a recursive loop is detected, Lambda logs an error message with details about the recursion error.
3. Throttle Invocation:
Lambda may apply throttling measures to prevent excessive invocation of the function. Throttling limits the rate at which requests are accepted by the function.
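A minimal sketch of backing off inside your own code, assuming the failure shows up as a Python `RecursionError`; `call_with_backoff` and its parameters are hypothetical, and this is application-level pacing, not Lambda’s own throttling.

```python
import time


def call_with_backoff(func, retries: int = 3, delay_seconds: float = 1.0, sleep=time.sleep):
    """Call func, sleeping a little longer after each RecursionError."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except RecursionError:
            if attempt == retries:
                raise  # out of attempts: let the error surface
            # Linear backoff: 1x, 2x, 3x ... the base delay.
            sleep(delay_seconds * attempt)
```

The `sleep` parameter is injectable so the pacing can be tested without actually waiting.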
In this example, if a recursion error occurs, the function sleeps for a specified duration before retrying the invocation. This helps prevent overwhelming the Lambda service with repeated invocations during a recursive loop.
By implementing these actions, Lambda effectively manages recursive loops and ensures the reliability and stability of serverless applications.
Disabling Recursion Detection
While recursion detection is a useful function, there may be times when you need to disable it temporarily.
For example, if your design purposely has a function write back to the same queue or resource that triggers it, recursion detection may stop invocations you actually want.
To deactivate recursion detection, change the function’s recursive loop detection setting, which is available in the Lambda console and through the PutFunctionRecursionConfig API.
By setting the RecursiveLoop value to Allow instead of the default Terminate, you turn off recursion detection for that function.
However, this functionality should only be used when you are convinced that the recursive behavior is purposeful and will not result in infinite loops.
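Lambda exposes a per-function recursion setting through the PutFunctionRecursionConfig API, whose RecursiveLoop value is either Allow or Terminate. The sketch below wraps the corresponding boto3 call; the helper names and the function name in the usage comment are illustrative assumptions.

```python
VALID_MODES = ("Allow", "Terminate")


def recursion_config_request(function_name: str, mode: str) -> dict:
    """Keyword arguments for lambda_client.put_function_recursion_config."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    return {"FunctionName": function_name, "RecursiveLoop": mode}


def set_recursion_mode(function_name: str, mode: str, lambda_client=None):
    """Apply the setting; 'Allow' turns loop detection off for this function."""
    if lambda_client is None:
        import boto3  # imported lazily so the request builder works offline

        lambda_client = boto3.client("lambda")
    return lambda_client.put_function_recursion_config(
        **recursion_config_request(function_name, mode)
    )


# Usage (requires boto3 and AWS credentials; the function name is a placeholder):
# set_recursion_mode("my-function", "Allow")
# set_recursion_mode("my-function", "Terminate")  # restore the default
```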
For further information on how to properly employ recursive invocation patterns, see the AWS Lambda Operator Guide’s section titled “Recursive Patterns Causing Run-Away Lambda Functions”.
Final Verdict
While recursion is an effective programming method, it can potentially have unforeseen repercussions if not used correctly.
Fortunately, AWS Lambda includes built-in capabilities for detecting and mitigating recursive loops, helping keep serverless applications reliable and stable.
Understanding how recursion detection works and taking appropriate actions when necessary allows you to maximize the potential of AWS Lambda while reducing the hazards associated with recursive loops.
Remember to thoroughly test your Lambda functions and watch their behavior in production to identify any unexpected recursion concerns early on.
With careful planning and attention to detail, you can confidently use serverless computing.
I also suggest reading the article “Detecting and stopping recursive loops in AWS Lambda functions” for a deeper understanding.