Fix test_server.py for Python 3.14: robust IPv4 binding and error surfacing (!352) · Merge requests · onecommons / unfurl

Open Adam requested to merge copilot/fix-server-test-connection-error into testing Nov 29, 2025

Created by: Copilot

Add wait_for_status helper function near top of tests/test_server.py to poll until expected status code or timeout
Replace direct assertion assert res.status_code == 304 in test_server_export_remote for the application-blueprint test with call to wait_for_status
Also replace the direct assertion for the dashboard cache hit test (the loop at "cache hit for" case)
Add comment explaining the change is to make test robust against async cache population

Combined changes from both PR #352 (robust IPv4 binding and error surfacing) and PR #355 (wait_for_status helper for async cache timing).

Rebased cleanly against testing branch.

Changes from PR #352:

Explicit IPv4 binding: Added HOST = "127.0.0.1" constant to avoid getaddrinfo resolution differences across platforms
Error queue for child tracebacks: New serve_server wrapper forwards exceptions to a multiprocessing.Queue, enabling parent process to surface actual server crash reasons
Robust connection polling: Replaced urllib-based health check with socket.create_connection against both IPv4/IPv6, increased timeout from 2s to 12s
Updated all Process creation sites: Use get_context() and attach error queue to fixture, set_up_deployment, test_server_export_local, test_server_export_remote
Updated all request URLs: Use HOST instead of hardcoded 127.0.0.1

Changes from PR #355:

wait_for_status helper: Polls HTTP endpoint until expected status code or timeout elapses
Async cache timing tolerance: Replaced direct assert res.status_code == 304 assertions with wait_for_status(expected=304, timeout=15.0) calls

Original prompt

Problem

The py3.14 CI job is failing with ConnectionRefusedError in tests/test_server.py::test_server_version because the test helper that starts the server (start_server_process) is fragile: it polls only briefly, uses urllib against "localhost" (subject to IPv6/IPv4 resolution differences), and does not surface server-side exceptions when the child process exits. That hides the root cause when the server process crashes or binds to a different address family under Python 3.14.

Goal

Create a PR that makes tests/test_server.py robust so the test harness:

consistently binds/connects to IPv4 (127.0.0.1) by default,
uses a robust start_server_process that polls for a TCP listener across families and longer timeout,
captures and surfaces server-side Python tracebacks from the child process via a multiprocessing.Queue so failing starts show a clear error in test output,
uses multiprocessing.get_context() to create processes and attaches the error queue to the Process object so the waiter can inspect it.

Files to change

tests/test_server.py (ref ee68f046)

Detailed changes to apply

Add imports near the top (or reuse existing ones) and define a HOST constant:

import traceback
import socket
from multiprocessing import get_context, Queue

# Prefer explicit IPv4 loopback for tests to avoid getaddrinfo resolution ordering differences
HOST = "127.0.0.1"

Replace the current serve_server wrapper with one that forwards child start errors to a Queue:

def serve_server(*args, error_queue: Queue = None, **kw):
    try:
        return server.serve(*args, **kw)
    except Exception:
        tb = traceback.format_exc()
        if error_queue is not None:
            error_queue.put(tb)
        logging.warning("server.serve unexpectedly failed", exc_info=True)
        raise

Replace start_server_process with a robust implementation that:

starts the Process (assumes it was created with kwargs {"error_queue": error_queue}),
polls socket.create_connection against HOST and '::1' for up to timeout,
if the child exits early, reads the queued traceback and raises RuntimeError with the traceback.

def start_server_process(process_obj, port, hosts=(HOST, "::1"), timeout=12.0):
    process_obj.start()
    start = time.time()
    last_exc = None

    def _child_traceback():
        eq = getattr(process_obj, "_error_queue", None)
        if not eq:
            return None
        try:
            return eq.get_nowait()
        except Exception:
            return None

    while time.time() - start < timeout:
        if not process_obj.is_alive():
            tb = _child_traceback()
            if tb:
                raise RuntimeError(f"server process exited prematurely; traceback:\n{tb}")
            else:
                raise RuntimeError(f"server process exited prematurely with exitcode {process_obj.exitcode}")
        for h in hosts:
            try:
                with socket.create_connection((h, port), timeout=1):
                    return process_obj
            except Exception as e:
                last_exc = e
        time.sleep(0.1)

    tb = _child_traceback()
    if tb:
        raise RuntimeError(f"server not reachable on port {port} after {timeout}s; server traceback:\n{tb}")
    raise RuntimeError(f"server not reachable on port {port} after {timeout}s; last error: {last_exc}")

Replace Process creation sites in tests/test_server.py to use get_context() and pass an error_queue, attach the queue to the Process object. Also pass HOST instead of literal 'localhost' in args. Examples (apply to all occurrences):

Before (existing pattern):

server_process = Process(
    target=server.serve,
    args=("localhost", _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
)

After:

ctx = get_context()
error_queue = Queue()
server_process = ctx.Process(
    target=serve_server,
    args=(HOST, _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
    kwargs={"error_queue": error_queue},
)
server_process._error_queue = error_queue

Do the same in set_up_deployment, test_server_export_local, test_server_export_remote, and other places where Process is used to start server.serve.

Ensure all client request URLs use HOST rather than hard-coded 'localhost'. For example:

res = requests.get(f"http://{HOST}:{_static_server_port}/health", params={"secret": "secret"})

(Optional) Add a small try/except around waitress.serve in unfurl/server/serve.py to make the server log any startup exception clearly. This is optional; the test-side error queue should be sufficient to surface tracebacks.

Why this change

The error_queue surfaces the actual server traceback when the child dies on startup under Python 3.14 (so CI failure becomes actionable).
Using HOST = 127.0.0.1 avoids getaddrinfo ordering differences between IPv4/IPv6.
socket.create_connection is a more d...

This pull request was created as a result of the following prompt from Copilot chat.

Problem

The py3.14 CI job is failing with ConnectionRefusedError in tests/test_server.py::test_server_version because the test helper that starts the server (start_server_process) is fragile: it polls only briefly, uses urllib against "localhost" (subject to IPv6/IPv4 resolution differences), and does not surface server-side exceptions when the child process exits. That hides the root cause when the server process crashes or binds to a different address family under Python 3.14.

Goal

Create a PR that makes tests/test_server.py robust so the test harness:

consistently binds/connects to IPv4 (127.0.0.1) by default,

uses a robust start_server_process that polls for a TCP listener across families and longer timeout,

captures and surfaces server-side Python tracebacks from the child process via a multiprocessing.Queue so failing starts show a clear error in test output,

uses multiprocessing.get_context() to create processes and attaches the error queue to the Process object so the waiter can inspect it.

Files to change

tests/test_server.py (ref ee68f046)

Detailed changes to apply

Add imports near the top (or reuse existing ones) and define a HOST constant:
import traceback
import socket
from multiprocessing import get_context, Queue

# Prefer explicit IPv4 loopback for tests to avoid getaddrinfo resolution ordering differences
HOST = "127.0.0.1"
Replace the current serve_server wrapper with one that forwards child start errors to a Queue:
def serve_server(*args, error_queue: Queue = None, **kw):
    try:
        return server.serve(*args, **kw)
    except Exception:
        tb = traceback.format_exc()
        if error_queue is not None:
            error_queue.put(tb)
        logging.warning("server.serve unexpectedly failed", exc_info=True)
        raise
Replace start_server_process with a robust implementation that:

starts the Process (assumes it was created with kwargs {"error_queue": error_queue}),

polls socket.create_connection against HOST and '::1' for up to timeout,

if the child exits early, reads the queued traceback and raises RuntimeError with the traceback.
def start_server_process(process_obj, port, hosts=(HOST, "::1"), timeout=12.0):
    process_obj.start()
    start = time.time()
    last_exc = None

    def _child_traceback():
        eq = getattr(process_obj, "_error_queue", None)
        if not eq:
            return None
        try:
            return eq.get_nowait()
        except Exception:
            return None

    while time.time() - start < timeout:
        if not process_obj.is_alive():
            tb = _child_traceback()
            if tb:
                raise RuntimeError(f"server process exited prematurely; traceback:\n{tb}")
            else:
                raise RuntimeError(f"server process exited prematurely with exitcode {process_obj.exitcode}")
        for h in hosts:
            try:
                with socket.create_connection((h, port), timeout=1):
                    return process_obj
            except Exception as e:
                last_exc = e
        time.sleep(0.1)

    tb = _child_traceback()
    if tb:
        raise RuntimeError(f"server not reachable on port {port} after {timeout}s; server traceback:\n{tb}")
    raise RuntimeError(f"server not reachable on port {port} after {timeout}s; last error: {last_exc}")
Replace Process creation sites in tests/test_server.py to use get_context() and pass an error_queue, attach the queue to the Process object. Also pass HOST instead of literal 'localhost' in args. Examples (apply to all occurrences):

Before (existing pattern):
server_process = Process(
    target=server.serve,
    args=("localhost", _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
)
After:
ctx = get_context()
error_queue = Queue()
server_process = ctx.Process(
    target=serve_server,
    args=(HOST, _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
    kwargs={"error_queue": error_queue},
)
server_process._error_queue = error_queue
Do the same in set_up_deployment, test_server_export_local, test_server_export_remote, and other places where Process is used to start server.serve.

Ensure all client request URLs use HOST rather than hard-coded 'localhost'. For example:
res = requests.get(f"http://{HOST}:{_static_server_port}/health", params={"secret": "secret"})
(Optional) Add a small try/except around waitress.serve in unfurl/server/serve.py to make the server log any startup exception clearly. This is optional; the test-side error queue should be sufficient to surface tracebacks.

Why this change

The error_queue surfaces the actual server traceback when the child dies on startup under Python 3.14 (so CI failure becomes actionable).

Using HOST = 127.0.0.1 avoids getaddrinfo ordering differences between IPv4/IPv6.

socket.create_connection is a more direct liveness check than urllib.request.urlopen and avoids name resolution differences.

Increasing the timeout and polling avoids races where the parent attempts to connect too quickly.

Testing

Run the py3.14 tox job locally or in CI after the PR is merged; failing tests should now include a clear server traceback if the server is crashing or should pass if it was only a race/address-family issue.

Please create a PR that updates tests/test_server.py with the changes described above and use ref ee68f046 as the base for modifications.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.