
Commit f36de79

Improve watching resources (PR #38)
* Extracted the resource watching logic into a separate class; implemented the observer logic in a ResourceWatcher class; added a method to stop the thread gracefully.
* Finished the observer logic extraction; merged changes from main; added a resource watcher instance to enable_joblogs so it subscribes to the event watcher when the log feature is configured; deleted the event watcher logic from main; passed the container (instead of the container name) to the list-objects function; removed the start method from the log handler class; modified the joblogs init to subscribe to the event watcher.
* Added a number of retry attempts and an exponential backoff time to the reconnect loop for the event watcher; made the number of reconnect attempts, the backoff time, and the coefficient for exponential growth configurable via the config; added backoff_time, reconnection_attempts and backoff_coefficient as attributes in the resource watcher init; added resource_version as a parameter to w.stream so a failed stream can resume from the last resource it was able to catch; added handling of urllib3.exceptions.ProtocolError with reconnection after an exponential backoff time to avoid API flooding; added config as an init parameter for the resource watcher; modified the config in kubernetes.yaml and the k8s config to contain backoff_time, reconnection_attempts and backoff_coefficient.
* Added logic to reset the number of reconnect attempts and the backoff time once a connection to the Kubernetes API is achieved, so that only sequential failures are counted; added exception handling to watch_pods for urllib3 failures, for a resource version that is too old and no longer available, and for an ended stream; removed the k8s resource watcher initialization from the run function in api.py and moved it to the k8s.py launcher as _init_resource_watcher; refactored the existing logic from joblogs/__init__.py into _init_resource_watcher and enable_joblogs in the k8s launcher.
* Added a CONFIG.md file with detailed explanations of the parameters used to re-establish the connection to the Kubernetes watcher.
* Moved the section about the config file from README.md to CONFIG.md; added a link to CONFIG.md in README.md; removed the variables for reconnection_attempts, backoff_time and backoff_coefficient from the sample config, since default values are provided in the code.
1 parent 33bae3e commit f36de79

11 files changed: +251 −96 lines

CONFIG.md (new file, +38)

```markdown
## About
This file provides a detailed description of the parameters listed in the config file, explaining why they are used
and when you are expected to provide or change them.

## Configuration file

* `http_port` - defaults to `6800` ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#http-port))
* `bind_address` - defaults to `127.0.0.1` ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#bind-address))
* `max_proc` - _(implementation pending)_, if unset or `0` it will use the number of nodes in the cluster, defaults to `0` ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#max-proc))
* `repository` - Python class for accessing the image repository, defaults to `scrapyd_k8s.repository.Remote`
* `launcher` - Python class for managing jobs on the cluster, defaults to `scrapyd_k8s.launcher.K8s`
* `username` - Set this and `password` to enable basic authentication ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#username))
* `password` - Set this and `username` to enable basic authentication ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#password))

The Docker and Kubernetes launchers have their own additional options.

## [scrapyd] section, reconnection_attempts, backoff_time, backoff_coefficient

### Context
The Kubernetes event watcher is used in the code as part of the joblogs feature and is also utilized for limiting the
number of jobs running in parallel on the cluster. Both features are disabled by default and can be activated if you
choose to use them.

The event watcher establishes a connection to the Kubernetes API and receives a stream of events from it. However, this
long-lived connection is inherently unstable; it can be interrupted by network issues, proxies configured to terminate
long-lived connections, and other factors. For this reason, a mechanism was implemented to re-establish the long-lived
connection to the Kubernetes API. To achieve this, three parameters were introduced: `reconnection_attempts`,
`backoff_time` and `backoff_coefficient`.

### What are these parameters about?
- `reconnection_attempts` - defines how many consecutive attempts will be made to reconnect if the connection fails;
- `backoff_time` and `backoff_coefficient` - are used to gradually slow down each subsequent attempt to establish a
connection with the Kubernetes API, preventing the API from becoming overloaded with requests. The `backoff_time` increases
exponentially and is calculated as `backoff_time *= self.backoff_coefficient`.

### When do I need to change it in the config file?
Default values for these parameters are provided in the code and are tuned to an "average" cluster setting. If your network
requirements or other conditions are unusual, you may need to adjust these values to better suit your specific setup.
```
README.md (+1 −10)

```diff
@@ -241,16 +241,7 @@ Not supported, by design.
 If you want to delete a project, remove it from the configuration file.
 
 ## Configuration file
-
-* `http_port` - defaults to `6800` ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#http-port))
-* `bind_address` - defaults to `127.0.0.1` ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#bind-address))
-* `max_proc` - _(implementation pending)_, if unset or `0` it will use the number of nodes in the cluster, defaults to `0` ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#max-proc))
-* `repository` - Python class for accessing the image repository, defaults to `scrapyd_k8s.repository.Remote`
-* `launcher` - Python class for managing jobs on the cluster, defaults to `scrapyd_k8s.launcher.K8s`
-* `username` - Set this and `password` to enable basic authentication ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#username))
-* `password` - Set this and `username` to enable basic authentication ([➽](https://scrapyd.readthedocs.io/en/latest/config.html#password))
-
-The Docker and Kubernetes launchers have their own additional options.
+To read in detail about the config file, please, navigate to the [Configuration Guide](CONFIG.md)
 
 ## License
```
kubernetes.yaml (+2)

```diff
@@ -87,6 +87,8 @@ data:
 launcher = scrapyd_k8s.launcher.K8s
 
 namespace = default
+
+max_proc = 2
 
 # This is an example spider that should work out of the box.
 # Adapt the spider config to your use-case.
```

scrapyd_k8s.sample-k8s.conf (+3)

```diff
@@ -19,6 +19,9 @@ namespace = default
 # Optional pull secret, in case you have private spiders.
 #pull_secret = ghcr-registry
 
+# Maximum number of jobs running in parallel
+max_proc = 10
+
 # For each project, define a project section.
 # This contains a repository that points to the remote container repository.
 # An optional env_secret is the name of a secret with additional environment
```

scrapyd_k8s/__main__.py (+1 −3)

```diff
@@ -1,7 +1,6 @@
 import logging
 import sys
-from .api import run, config
-from .joblogs import joblogs_init
+from .api import run
 
 def setup_logging():
     logging.basicConfig(
@@ -14,5 +13,4 @@ def setup_logging():
 
 if __name__ == "__main__":
     setup_logging()
-    joblogs_init(config)
     run()
```

scrapyd_k8s/api.py (−6)

```diff
@@ -155,11 +155,5 @@ def run():
     if config_username is not None and config_password is not None:
         enable_authentication(app, config_username, config_password)
 
-    if config.joblogs() is not None:
-        launcher.enable_joblogs(config)
-        logger.info("Job logs handling enabled.")
-    else:
-        logger.debug("Job logs handling not enabled; 'joblogs' configuration section is missing.")
-
     # run server
     app.run(host=host, port=port)
```

scrapyd_k8s/joblogs/__init__.py (+1 −25)

```diff
@@ -1,25 +1 @@
-import logging
-from scrapyd_k8s.joblogs.log_handler_k8s import KubernetesJobLogHandler
-
-logger = logging.getLogger(__name__)
-
-def joblogs_init(config):
-    """
-    Initializes job logs handling by starting the Kubernetes job log handler.
-
-    Parameters
-    ----------
-    config : Config
-        Configuration object containing settings for job logs and storage.
-
-    Returns
-    -------
-    None
-    """
-    joblogs_config = config.joblogs()
-    if joblogs_config and joblogs_config.get('storage_provider') is not None:
-        log_handler = KubernetesJobLogHandler(config)
-        log_handler.start()
-        logger.info("Job logs handler started.")
-    else:
-        logger.warning("No storage provider configured; job logs will not be uploaded.")
+from scrapyd_k8s.joblogs.log_handler_k8s import KubernetesJobLogHandler
```

scrapyd_k8s/joblogs/log_handler_k8s.py (+29 −49)

```diff
@@ -63,25 +63,7 @@ def __init__(self, config):
         self.pod_tmp_mapping = {}
         self.namespace = config.namespace()
         self.num_lines_to_check = int(config.joblogs().get('num_lines_to_check', 0))
-        self.object_storage_provider = None
-
-    def start(self):
-        """
-        Starts the pod watcher thread for job logs.
-
-        Returns
-        -------
-        None
-        """
-        if self.config.joblogs() and self.config.joblogs().get('storage_provider') is not None:
-            pod_watcher_thread = threading.Thread(
-                target=self.watch_pods
-            )
-            pod_watcher_thread.daemon = True
-            pod_watcher_thread.start()
-            logger.info("Started pod watcher thread for job logs.")
-        else:
-            logger.warning("No storage provider configured; job logs will not be uploaded.")
+        self.object_storage_provider = LibcloudObjectStorage(self.config)
 
     def get_last_n_lines(self, file_path, num_lines):
         """
@@ -236,7 +218,7 @@ def stream_logs(self, job_name):
         except Exception as e:
             logger.exception(f"Error streaming logs for job '{job_name}': {e}")
 
-    def watch_pods(self):
+    def handle_events(self, event):
         """
         Watches Kubernetes pods and handles events such as starting log streaming or uploading logs.
@@ -245,36 +227,34 @@ def watch_pods(self):
         None
         """
         self.object_storage_provider = LibcloudObjectStorage(self.config)
-        w = watch.Watch()
-        v1 = client.CoreV1Api()
         try:
-            for event in w.stream(v1.list_namespaced_pod, namespace=self.namespace):
-                pod = event['object']
-                if pod.metadata.labels.get("org.scrapy.job_id"):
-                    job_id = pod.metadata.labels.get("org.scrapy.job_id")
-                    pod_name = pod.metadata.name
-                    thread_name = f"{self.namespace}_{pod_name}"
-                    if pod.status.phase == 'Running':
-                        if (thread_name in self.watcher_threads
-                                and self.watcher_threads[thread_name] is not None
-                                and self.watcher_threads[thread_name].is_alive()):
-                            pass
-                        else:
-                            self.watcher_threads[thread_name] = threading.Thread(
-                                target=self.stream_logs,
-                                args=(pod_name,)
-                            )
-                            self.watcher_threads[thread_name].start()
-                    elif pod.status.phase in ['Succeeded', 'Failed']:
-                        log_filename = self.pod_tmp_mapping.get(pod_name)
-                        if log_filename is not None and os.path.isfile(log_filename) and os.path.getsize(log_filename) > 0:
-                            if self.object_storage_provider.object_exists(job_id):
-                                logger.info(f"Log file for job '{job_id}' already exists in storage.")
-                            else:
-                                self.object_storage_provider.upload_file(log_filename)
-                        else:
-                            logger.info(f"Logfile not found for job '{job_id}'")
-                else:
-                    logger.debug(f"Other pod event type '{event['type']}' for pod '{pod.metadata.name}' - Phase: '{pod.status.phase}'")
+
+            pod = event['object']
+            if pod.metadata.labels.get("org.scrapy.job_id"):
+                job_id = pod.metadata.labels.get("org.scrapy.job_id")
+                pod_name = pod.metadata.name
+                thread_name = f"{self.namespace}_{pod_name}"
+                if pod.status.phase == 'Running':
+                    if (thread_name in self.watcher_threads
+                            and self.watcher_threads[thread_name] is not None
+                            and self.watcher_threads[thread_name].is_alive()):
+                        pass
+                    else:
+                        self.watcher_threads[thread_name] = threading.Thread(
+                            target=self.stream_logs,
+                            args=(pod_name,)
+                        )
+                        self.watcher_threads[thread_name].start()
+                elif pod.status.phase in ['Succeeded', 'Failed']:
+                    log_filename = self.pod_tmp_mapping.get(pod_name)
+                    if log_filename is not None and os.path.isfile(log_filename) and os.path.getsize(log_filename) > 0:
+                        if self.object_storage_provider.object_exists(job_id):
+                            logger.info(f"Log file for job '{job_id}' already exists in storage.")
+                        else:
+                            self.object_storage_provider.upload_file(log_filename)
+                    else:
+                        logger.info(f"Logfile not found for job '{job_id}'")
+            else:
+                logger.debug(f"Other pod event type '{event['type']}' for pod '{pod.metadata.name}' - Phase: '{pod.status.phase}'")
         except Exception as e:
             logger.exception(f"Error watching pods in namespace '{self.namespace}': {e}")
```
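The `watcher_threads` bookkeeping used by the handler can be isolated into a small standalone sketch (hypothetical names, no Kubernetes dependency): a log-streaming thread is started per pod only if no live thread is already registered under that pod's key.

```python
import threading
import time

watcher_threads = {}  # thread_name -> Thread, as in the handler

def stream_logs(pod_name):
    # Stand-in for the real log-streaming loop.
    time.sleep(0.5)

def ensure_streaming(namespace, pod_name):
    """Start a log-streaming thread for the pod unless one is already alive."""
    thread_name = f"{namespace}_{pod_name}"
    existing = watcher_threads.get(thread_name)
    if existing is not None and existing.is_alive():
        return False  # already streaming this pod
    t = threading.Thread(target=stream_logs, args=(pod_name,), daemon=True)
    watcher_threads[thread_name] = t
    t.start()
    return True
```

Calling `ensure_streaming` twice for the same pod while the first thread is still alive starts only one thread, which is why duplicate `Running` events for a pod are harmless.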

scrapyd_k8s/k8s_resource_watcher.py (new file, +151)

```python
import threading
import logging
import time
from kubernetes import client, watch
from typing import Callable, List
import urllib3

logger = logging.getLogger(__name__)

class ResourceWatcher:
    """
    Watches Kubernetes pod events and notifies subscribers about relevant events.

    Attributes
    ----------
    namespace : str
        Kubernetes namespace to watch pods in.
    subscribers : List[Callable]
        List of subscriber callback functions to notify on events.
    """

    def __init__(self, namespace, config):
        """
        Initializes the ResourceWatcher.

        Parameters
        ----------
        namespace : str
            Kubernetes namespace to watch pods in.
        """
        self.namespace = namespace
        self.reconnection_attempts = int(config.scrapyd().get('reconnection_attempts', 5))
        self.backoff_time = int(config.scrapyd().get('backoff_time', 5))
        self.backoff_coefficient = int(config.scrapyd().get('backoff_coefficient', 2))
        self.subscribers: List[Callable] = []
        self._stop_event = threading.Event()
        self.watcher_thread = threading.Thread(target=self.watch_pods, daemon=True)
        self.watcher_thread.start()
        logger.info(f"ResourceWatcher thread started for namespace '{self.namespace}'.")

    def subscribe(self, callback: Callable):
        """
        Adds a subscriber callback to be notified on events.

        Parameters
        ----------
        callback : Callable
            A function to call when an event is received.
        """
        if callback not in self.subscribers:
            self.subscribers.append(callback)
            logger.debug(f"Subscriber {callback.__name__} added.")

    def unsubscribe(self, callback: Callable):
        """
        Removes a subscriber callback.

        Parameters
        ----------
        callback : Callable
            The subscriber function to remove.
        """
        if callback in self.subscribers:
            self.subscribers.remove(callback)
            logger.debug(f"Subscriber {callback.__name__} removed.")

    def notify_subscribers(self, event: dict):
        """
        Notifies all subscribers about an event.

        Parameters
        ----------
        event : dict
            The Kubernetes event data.
        """
        for subscriber in self.subscribers:
            try:
                subscriber(event)
            except Exception as e:
                logger.exception(f"Error notifying subscriber {subscriber.__name__}: {e}")

    def watch_pods(self):
        """
        Watches Kubernetes pod events and notifies subscribers.
        Runs in a separate thread.
        """
        v1 = client.CoreV1Api()
        w = watch.Watch()
        resource_version = None

        logger.info(f"Started watching pods in namespace '{self.namespace}'.")
        backoff_time = self.backoff_time
        reconnection_attempts = self.reconnection_attempts
        while not self._stop_event.is_set() and reconnection_attempts > 0:
            try:
                kwargs = {
                    'namespace': self.namespace,
                    'timeout_seconds': 0,
                }
                if resource_version:
                    kwargs['resource_version'] = resource_version
                first_event = True
                for event in w.stream(v1.list_namespaced_pod, **kwargs):
                    if first_event:
                        # Reset reconnection attempts and backoff time upon successful reconnection
                        reconnection_attempts = self.reconnection_attempts
                        backoff_time = self.backoff_time
                        first_event = False  # Ensure this only happens once per connection
                    pod_name = event['object'].metadata.name
                    resource_version = event['object'].metadata.resource_version
                    event_type = event['type']
                    logger.debug(f"Received event: {event_type} for pod: {pod_name}")
                    self.notify_subscribers(event)
            except (urllib3.exceptions.ProtocolError,
                    urllib3.exceptions.ReadTimeoutError,
                    urllib3.exceptions.ConnectionError) as e:
                reconnection_attempts -= 1
                logger.exception(f"Encountered network error: {e}")
                logger.info(f"Retrying to watch pods after {backoff_time} seconds...")
                time.sleep(backoff_time)
                backoff_time *= self.backoff_coefficient
            except client.ApiException as e:
                # Resource version is too old and cannot be accessed anymore
                if e.status == 410:
                    logger.error("Received 410 Gone error, resetting resource_version and restarting watch.")
                    resource_version = None
                    continue
                else:
                    reconnection_attempts -= 1
                    logger.exception(f"Encountered ApiException: {e}")
                    logger.info(f"Retrying to watch pods after {backoff_time} seconds...")
                    time.sleep(backoff_time)
                    backoff_time *= self.backoff_coefficient
            except StopIteration:
                logger.info("Watch stream ended, restarting watch.")
                continue
            except Exception as e:
                reconnection_attempts -= 1
                logger.exception(f"Watcher encountered exception: {e}")
                logger.info(f"Retrying to watch pods after {backoff_time} seconds...")
                time.sleep(backoff_time)
                backoff_time *= self.backoff_coefficient

    def stop(self):
        """
        Stops the watcher thread gracefully.
        """
        self._stop_event.set()
        self.watcher_thread.join()
        logger.info(f"ResourceWatcher thread stopped for namespace '{self.namespace}'.")
```
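The subscribe/notify mechanism can be exercised in isolation. Below is a minimal sketch without the Kubernetes client (`MiniWatcher` is a hypothetical stand-in, keeping only the subscriber handling):

```python
from typing import Callable, List

class MiniWatcher:
    """Stripped-down version of the subscriber handling in ResourceWatcher."""
    def __init__(self):
        self.subscribers: List[Callable] = []

    def subscribe(self, callback: Callable):
        # Duplicate subscriptions are ignored, as in ResourceWatcher.subscribe.
        if callback not in self.subscribers:
            self.subscribers.append(callback)

    def notify_subscribers(self, event: dict):
        for subscriber in self.subscribers:
            try:
                subscriber(event)
            except Exception:
                # A failing subscriber must not prevent the others from running.
                pass

seen = []
w = MiniWatcher()
w.subscribe(seen.append)
w.subscribe(seen.append)  # ignored: already subscribed
w.notify_subscribers({'type': 'ADDED', 'object': 'pod-1'})
print(seen)  # [{'type': 'ADDED', 'object': 'pod-1'}]
```

This is the pattern that lets both the joblogs handler and the parallel-job limiter receive the same pod event stream from a single watch connection.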
