
RfC: Standards for external services in the Wikimedia infrastructure.
Closed, Resolved · Public

Authored By
Imarlier
Nov 1 2018, 5:21 PM

Description

This document is an attempt to formalize the output of the "Architecting Core: Standalone Services" session from the 2018 Wikimedia Technical Conference (see https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2018/Session_notes/Architecting_Core:_stand-alone_services and T206082).

Proposal

The following proposals aim to set architectural principles and standards for when, how and why to extend MediaWiki via an external service for use in the WMF production environment.

Definition

''Standalone services'' are applications that extend MediaWiki's functionality, but that are operationally distinct in some meaningful way. The mechanism of extension is not specified, beyond saying that it's necessarily interprocess as opposed to intraprocess; queues, API calls and other types of RPC mechanisms, and XHR from a client are all valid examples, while shelling out from MediaWiki itself is not. It is important that the core business logic is implemented in the service, rather than within MediaWiki itself.

Selection Criteria: Deciding whether an external service is appropriate

The properties listed here are intended to be a guide as to whether a given feature can be provided externally to MediaWiki or not. They are intended to be necessary, but not sufficient. If a proposed feature has one or more properties that appear in the left column, but no properties that appear in the right column, then that feature could be implemented as a standalone service. If the proposed feature has one or more properties that appear in the right column, then a wide consensus must be reached before implementing it as a standalone service.

Properties that make a feature ''suitable'' for implementation as a standalone service:

  • State is independent - the functionality provided does not require a view of MediaWiki state that is guaranteed to be current.
  • A 3rd-party library or service exists that can provide the needed functionality with minimal integration.
  • A non-PHP language or framework exists that significantly simplifies implementation.
  • Functionality degrades gracefully if the external service is unavailable.
  • The feature is independently useful, and is likely to have non-MediaWiki use cases.

Properties that make a feature ''unsuitable'' for implementation as a standalone service:

  • The feature only works correctly with a consistent and current view of MediaWiki state.
  • The feature requires direct access to the MediaWiki database, and cannot use an API to retrieve or update data.
  • The feature requires features/functionality provided by other MediaWiki extensions that are implemented internally.
  • Unavailability of the external service compromises the general availability of the site to the user (e.g. results in a MediaWiki fatal error).
  • The feature involves directly parsing wikitext, or accessing MediaWiki i18n messages (also wikitext).
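To illustrate the graceful-degradation property above, here is a minimal sketch (the service URL and response shape are invented for illustration; this is not a prescribed client implementation):

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Returned when the external service cannot be reached: the feature
# degrades, but the caller (and the site) keeps working.
EMPTY_RESULT = {"results": [], "degraded": True}

def search(query: str, base_url: str = "http://search.internal",
           timeout: float = 2.0) -> dict:
    """Query a hypothetical external search service, degrading gracefully.

    If the service is slow or unreachable, return an empty result set
    instead of propagating the failure to the page being rendered.
    """
    url = base_url + "/v1/search?q=" + urllib.parse.quote(query)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return dict(EMPTY_RESULT)
```

The key point is that the caller never sees an exception: search becomes unavailable, but the rest of the application stays up.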

If any of the following are true, then the feature '''absolutely should be implemented as an external service''', with appropriate architectural changes made elsewhere to eliminate disqualifying properties.

Properties that ''require'' that a service is provided externally to MediaWiki:

  • Elevated security need: due to data isolation or other operational requirements, a given feature cannot be provided in the same operational environment as MediaWiki itself.
  • Excessive or potentially unbounded resource needs: image thumbnailing, video transcoding, and machine translation are all examples of features where unpredictable factors such as request rate and input size have a significant impact on the resources required; since the operator cannot control those factors, they may result in resource contention and denial of service.
  • Long-running processes are involved.
  • The feature in question is used to triage or fix MediaWiki in the case of failures.
  • The application is going to be run in a separate environment from MediaWiki itself.

Given that MediaWiki is not just part of the Wikimedia production infrastructure but also software used by many third parties, we can't simply delegate some of its fundamental functions to an external service completely. Whenever an external service provides a functionality, we also need to ask ourselves whether said functionality is fundamental or optional to MediaWiki.

If the functionality replaced by the external service is fundamental, a fallback solution must be present within MediaWiki that substitutes what is being implemented in the service. As an example: for Wikimedia purposes, async processing for MediaWiki is implemented via a series of external services; for simple installations, a MySQL-backed version of the same mechanism exists and cannot be dropped from MediaWiki. What constitutes a core functionality of MediaWiki and what doesn't will need to be further defined elsewhere, and is beyond the scope of this document.
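As a concrete illustration of such a fallback, MediaWiki selects its job queue backend via the $wgJobTypeConf setting, with the MySQL-backed JobQueueDB class as the in-core default (the values below are illustrative, not a recommended configuration):

```php
// LocalSettings.php -- illustrative values only.
// The default backend is the MySQL-backed job queue shipped with core.
$wgJobTypeConf = [
    'default' => [
        'class' => 'JobQueueDB',
        'order' => 'random',
    ],
];
// A specialized installation can point the same setting at an external
// queue implementation instead, while JobQueueDB remains available as
// the fallback for simple installations.
```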

Architectural Guidelines for external services

This document only deals with principles (e.g. an application needs to be observable and expose appropriate metrics) and not with implementation guidelines. Practical implementation guidelines will be written to detail how the principles enumerated in this document are to be applied in technical terms (e.g. the application must expose RED metrics from a <tt>/metrics</tt> endpoint in Prometheus format, with a precise naming convention). The reason to split the two is that while we don't expect the principles to change much over time, we do expect the implementation guidelines to change more quickly due to technical evolution.
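For instance, an implementation guideline of the kind mentioned above might require a /metrics endpoint whose output follows the Prometheus text exposition format, roughly like this (metric names and values are illustrative; a counter covers rate and errors, a histogram covers duration):

```
# HELP http_requests_total Total HTTP requests served, by method and status.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 3

# HELP http_request_duration_seconds Request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 933
http_request_duration_seconds_bucket{le="+Inf"} 1030
http_request_duration_seconds_sum 57.4
http_request_duration_seconds_count 1030
```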

There are various aspects of the development and usage cycle of a new service, and several of those need to be as standardized as possible across the board so that the complexity of our ecosystem does not become unmaintainable. In general, adopting a non-monolithic architecture has its costs, and those costs grow out of control unless standards are maintained regarding how different applications interoperate and how they're developed.

There are several aspects of the development of a service that need to be taken into account:

  • Development policies
  • Security/privacy requirements
  • Production deployment

In the next few sections we analyze the requirements a new service must fulfill in each of those categories.

Development policies

Everything we develop should be free, open to collaboration and useful in itself. So, a new service must:

  • Actually do something
  • Be created only if there is no well crafted, well maintained, architecturally compatible FLOSS software that provides comparable functionality that can be adopted and improved/modified if needed.
  • Avoid needlessly duplicating features or functionality provided in other services
  • Be licensed under an OSI-approved license
  • Provide a configuration mechanism that does not involve changing the distributed code
  • Use a language and toolset that have been approved by TechCom

While some of our services will be only useful in the WMF context, in other cases the standalone service is intended to be distributed for general use. In that case, it must have the following properties:

  • Have a documented installation and uninstallation process that conforms to our implementation guidelines
  • Have a documented upgrade process that conforms to our implementation guidelines
  • Be versioned using semver
  • Indicate versions of MediaWiki with which it's compatible
  • Provide a mechanism by which support (community or otherwise) can be requested
  • Provide a mechanism by which patches can be proposed
  • Provide a mechanism by which public security advisories are issued

Security and privacy

All features implemented as standalone services must have the following properties:

  • Minimize data collection for any type of PII
  • Be compliant with the WMF privacy/data retention policies.
  • Implement privacy controls that are ''at least'' equivalent to those of any calling service. For example, if the privacy controls of the calling service specify that IP addresses will not be stored for more than 90 days, the external service may not store IP addresses for longer than that time.
  • Have a privacy policy and privacy practices that are compatible with those of the WMF/Wikimedia properties
  • Have passed a Security review
  • Have resources allocated so that a prompt response to any security incident is possible

Production deployment

If the standalone service is intended to be used in the Wikimedia production environment, it should comply with the guidelines above, and in addition must:

  • Be deployable with standard WMF tooling (as specified in the implementation guidelines)
  • Have an owner, and a plan for ongoing maintenance. If the owner of a service goes missing (because the team is disbanded or has a different focus), a new owner must be found via the code stewardship process
  • Have logging that conforms to the WMF standards - specified in the service implementation guidelines
  • Be able to collect and expose operational metrics according to the current WMF standards specified in the implementation guidelines
  • Have a runbook for operational purposes
  • Support a multi-datacenter active-active (or active-passive) deployment
  • Have defined Service Level Indicators, and agreed-upon Service Level Objectives. Failure to meet said Service Level Objectives SHOULD result in actions aimed at getting back on track. The Service Level Objectives can of course be reevaluated and changed, but preferably as part of an informed process rather than as a reaction to a violation
  • Have pinned / pinnable dependencies that don't need to be downloaded at runtime and/or from untrusted sources
  • Have backups and a restoration/emergency plan if the service stores any data
  • Have users, or a plan to acquire users
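To make the service-level bullet above concrete, the arithmetic behind an availability SLO can be sketched as follows (the 99.9% target and the request counts are purely illustrative):

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent for an availability SLO.

    With a 99.9% target, the budget is 0.1% of all requests. Returns 1.0
    when no budget has been spent, 0.0 when it is exhausted, and a
    negative value when the SLO has been violated.
    """
    budget = (1.0 - slo_target) * total_requests  # failures we may tolerate
    if budget == 0:
        return 0.0 if failed_requests else 1.0
    return 1.0 - failed_requests / budget
```

A service team would track this figure over the SLO window and, per the bullet above, treat a shrinking budget as a trigger for corrective action before the objective is actually violated.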

Service-to-service interaction

Services will likely interact with each other; if that is the case, measures must be taken so that the failure of a single component cannot take down the whole system. Increased observability of the flow of requests is also needed. So any new service that needs to be deployed in production should:

  • Gracefully degrade its functionality if it can't access another service. If that's not possible, the new service should perhaps be logically tied to the other. An exception is explicitly made for the MediaWiki API, given that quite a few services might depend on its availability to be useful.
  • Be able to perform requests to a specific hostname/ip provided via configuration
  • Be able to use infrastructure middleware for inter-service communication functionalities including, but not limited to, encryption and circuit-breaking. Alternatively, the service SHOULD implement those functionalities internally.
  • Add the appropriate tracing headers to the request, according to the WMF standards specified in the implementation guidelines
  • Log actions via the production logging facilities
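The circuit-breaking functionality mentioned above can be sketched as follows; this is an illustrative minimal circuit breaker, not the middleware implementation the guidelines would mandate:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for calls to a downstream service (sketch).

    After `max_failures` consecutive failures the circuit opens and calls
    fail fast for `reset_after` seconds, instead of piling requests onto
    a struggling dependency.
    """

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call through.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast like this is what keeps one slow service from daisy-chaining its outage into every service that depends on it.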

Event Timeline

Krinkle triaged this task as Medium priority.
Krinkle moved this task from Inbox to In progress on the TechCom board.

Reading through the session notes and this proposal, several aspects come to mind that I believe would be worth mentioning. I'll try to give a quick brain dump:

  • Distinction between public services (that clients use instead of or in addition to MediaWiki), and internal services (that are used by MediaWiki).
  • Distinction between stateless and stateful services. MediaWiki uses several external services to maintain state (MySQL, memcached, redis, etc).
  • Distinction between long-running tasks (transcoding), async notifications (event logging), and tasks to be performed synchronously during a request "while the user waits" (like the proposed new session storage service).
  • Per the above, stateless services that perform synchronous in-request tasks tend to be pointless. The same functionality can generally be implemented inline, in core or as an extension, without the communication and operational overhead of a standalone service.
  • Discussion of whether HTTP is the preferred method of communication between services (probably yes, at least for synchronous calls)
  • The "chattiness with the API" criterion also works the other way around: if MediaWiki would have to talk to the service a lot, that's an indicator that the functionality should probably not be in a separate process.
  • Having the need to be synchronous as a counter-indicator doesn't seem useful. Several prime candidates for external services would need synchronous call semantics, e.g. an authentication service.
  • Functionality invoked via shell-out or implemented by forking should be treated separately, as it has very different characteristics.
  • The statement that it's "very likely" that an extension would be used for integration is misleading, as it implies that functionality that is presently in core should not be factored out into an external service. But factoring core functionality out into external services should indeed be considered, especially for security relevant functionality.
  • Anything that needs to function in a shared hosting environment (LAMP) can't be an external service (or needs to have an alternative PHP implementation)
  • Anything that needs to parse wikitext or need access to MediaWiki i18n messages (which are wikitext) should probably not be an external service.

That's it from the top of my head. Too bad I couldn't be in the session at TechConf.

I think @daniel raises quite a few good points; I will try to go through the document and integrate some more observations in the coming days. But I'd first propose we move the discussion mostly on-wiki: this is a big wall of text, and I feel a wiki page is a better place to collaboratively enhance it.

Also, this is very large, and I can see from a first read that quite a few needed things are missing. I would suggest we split this RfC into separate sections that we could discuss separately, and more specifically:

  1. Selection criteria (as above)
  2. Architectural principles for service-service interaction (communication methods, failure management, retry policies, etc.; this section is not very well defined in the current document and would need expanding)
  3. Architectural Implementation Guidelines (as above)

@Imarlier if you agree, I'll move this document to mediawiki.org and we can discuss how to integrate it there? Else, I'll start proposing amendments here.

@daniel one detail I'd want to add: external storage systems like mysql or redis should not be considered "external services" - they're part of where MediaWiki stores its state.

Services that augment the functionality of MediaWiki in a meaningful way, and have their own storage (if any), are external services.

So, just to make concrete examples:

  • redis-for-sessions is not a service, as it's used as a pure datastore by MediaWiki, which is where all the business logic lives
  • the new Session Storage service, if it includes some business logic on the service side (like rate-limiting, security checks, invalidation logic), should be considered an external service.

We will need guidelines for data storage adoption, sure, but those should be in a separate document.

@Joe the only distinction I see there is "stuff we write" vs "stuff we don't write", really.

But you are right that criteria for data storage technology selection don't belong here. But "we use an external service for storing X" should follow from this guideline, even if that external service is Redis or Swift. But the criteria for deciding *which* data storage tech to use do not belong here.

Actually, Swift over local files is a good example for a case where "inline" functionality was replaced with an off-the-shelf service. And the criteria developed here should apply to, and be consistent with, that decision.


Ok, I don't think the same rules should apply for storage backends and business logic services - the main driver for having external storage backends is usually something absolutely tangential to the criteria expressed here - either scalability, functionality or storage guarantees.

In short:

  • We already have guidelines for choosing whether local storage or a storage backend is preferable. I think those can be summarized with "it is absolutely forbidden to write something as if the local disk was available for non-ephemeral storage", where non-ephemeral means "expects such state to be preserved across requests".
  • The quality criteria (about instrumentation, ownership, stability, language choice, architecture) we must apply to a storage backend are quite different from the ones you'd apply to an actual application performing business logic.

Broadening the scope of the principles expressed here would weaken the prescriptive value of this document (and - I guess - of the more detailed implementation documents we will have to write starting from this).

Since I got no feedback on the proposal to move this on-wiki, I'll start commenting and amending here, although I would eventually like to move everything on-wiki.

Section 1: definitions

I don't think it's correct to define as "external service" any executable that we launch by shelling out from MediaWiki. For the sake of simplicity, I'd go further and assume "external service" means a separate logical cluster that MediaWiki reaches via an IP connection.

Section 2: selection criteria
This is the section I find most problematic, as basically none of our currently deployed services would fit that description.

I would remove "Feature is either async to servicing requests, or if sync, provides optional features." from the "DO's" and " Feature is synchronous to the request, and the request cannot complete until the feature is successfully provided." from the DON'Ts because they're worded in a misleading way. What we really want, if we're writing a sound architecture, is that the failure of an external service will not compromise the functionality of the rest of the "application" as seen from the user.
To make an example using CirrusSearch, which was cited here:
If the Elasticsearch cluster is down or very slow, the search extension in MediaWiki will just time out with a sensible default, and not block the functionality of the rest of the wikis. Search will be unavailable but the sites will nonetheless be up and running.
That's ok. What we need to avoid is to make daisy chains of microservices (so that service A cannot work unless service B works too - with a possible exception, in our case, for MediaWiki itself, that might be fundamental for the functionality of other services).

I am also definitely not ok with "A desire for the feature to be owned and maintained autonomously, pace of development is unlikely to correspond to the pace of MediaWiki development, or other organizational factors" as a criterion for having an external service.
This statement is unquantifiable and sounds like a "free-for-all" card that can always be assumed to be true by anyone, and is hard to prove or disprove. Moreover, it's the only criterion among all the ones we established that is not technical in nature. I have a further reason to find this problematic: this principle was the justification put forward in the past for creating external services (some of which are now without an owner following reorgs/strategy shifts) when there was no real need for them.

I think we should instead indicate clearly that, architecturally, creating an external service adds complexity to the infrastructure, and thus should be done only if there are strong reasons for building it rather than just a MediaWiki extension.

@Imarlier you wrote this RFC as the outcome of a session that you took on at the last minute. I suppose it's not fair to expect you to see this RFC through the process. Reaching consensus on this is not going to be quick or easy. Are you interested in (and do you have capacity to) take this on? If not, we should designate someone else to drive this.

If no one else wants to pick this up, I'm willing to work on this RfC. The fact that I didn't participate in the session can be both an advantage (I have a fresh perspective) and a disadvantage (I don't have more background than what's written here).

Can we add, to the "and in addition must" criteria, something along the lines of having someone responsible for fixing any security issues that come up? Particularly for things we make ourselves: they don't just need to pass security review now, there also needs to be someone responsible for responding to security issues over the long term, possibly long after development is done. (Things like the lack of response on T207222 are making me concerned about this point.)

@daniel @Joe I don't think that I'm really the right person to drive this RfC -- I took on moderating the session, and I think that it's important that there are architectural guidelines in place around the integration of external services, but I don't pretend to be qualified to represent any particular community consensus on this question. I'm happy to continue to be involved to the extent necessary/desired, but don't have any attachment if someone else wants to pick it up.

(I will note, though, that someone should be driving this. I do think that there's a consensus, or close to it, that having actionable guidelines for external vs. internal decisions is important in the very near term.)

@Joe on-wiki is fine, from my perspective.

@Joe I think it would be excellent if you could take this on, thank you!

+1 to moving on-wiki.

I'll also throw in a quibble with this criterion: "A non-PHP language or framework exists that significantly simplifies implementation." To me it seems better to keep the focus here squarely on architectural principles, and to consider the language out of scope as an implementation detail (subject to criteria set elsewhere about things like how many languages we can practically support in production). For our purposes here, could we simply say, "An external tool or framework exists that significantly simplifies implementation"?


I agree we can be more generic in the wording ("implementing the feature in a different language or framework significantly simplifies implementation"), but the concept as expressed is correct - and a good reason for developing a functionality as an external service.

I will start moving the RfC on-wiki, and add some proposed amendments in the talk page.

I think the current version of the RfC is reasonably well structured, to the point I think we should move the discussion here.

  • Anything that needs to function in a shared hosting environment (LAMP) can't be an external service (or needs to have an alternative PHP implementation)

I strongly second this point, and I see it's missing from the current version of the RFC. One of MediaWiki's strengths as a software product outside of Wikimedia is that it generally works for hosting a low-traffic wiki without a whole lot of MediaWiki-specific setup and configuration.

For example, on a LAMP stack MediaWiki will use a MySQL table for search and key-value storage, while more specialized installations can use Elasticsearch (via the CirrusSearch extension) for the former and memcached, redis, and/or others for the latter. The job queue on a generic LAMP stack is run opportunistically on requests post-send, while more specialized installations can run jobs via cron or do something like Wikimedia's job runner architecture. While not a service, Scribunto on a LAMP stack will shell out to a standalone Lua binary while more specialized installations can install the LuaSandbox PHP extension, and we do similar things in core for image scaling and diff generation. Personally I've always thought it a bit of a failing that WYSIWYG editing (VisualEditor) only works with Parsoid and Restbase, and I'm hopeful that Parsoid-PHP will fix that.

There's also a bit of a spectrum here. Services that have their own communities (memcached, redis, Elasticsearch, etc.) and are packaged by most Linux distributions are likely easier to set up for third parties than bespoke services designed for use in Wikimedia's infrastructure, if only because externally-developed services' communities have had more people installing them in more different environments.

  • Anything that needs to function in a shared hosting environment (LAMP) can't be an external service (or needs to have an alternative PHP implementation)

I strongly second this point, and I see it's missing from the current version of the RFC. One of MediaWiki's strengths as a software product outside of Wikimedia is that it generally works for hosting a low-traffic wiki without a whole lot of MediaWiki-specific setup and configuration.

That's not really a requirement for external services, and while I tend to agree that MediaWiki needs to be a self-contained standalone piece of software, I'm not convinced there is consensus on this point, or on the details of it.

Actually, I think the wording used here is also completely wrong. One could have an external service written in PHP, host it in a LAMP environment under a separate URL, and it would work like a charm.

That MediaWiki keeps working correctly in the future when you just clone its repository (or download the tarball) is a requirement of MediaWiki, not of the Wikimedia architecture.

For example, on a LAMP stack MediaWiki will use a MySQL table for search and key-value storage, while more specialized installations can use Elasticsearch (via the CirrusSearch extension) for the former and memcached, redis, and/or others for the latter. The job queue on a generic LAMP stack is run opportunistically on requests post-send, while more specialized installations can run jobs via cron or do something like Wikimedia's job runner architecture. While not a service, Scribunto on a LAMP stack will shell out to a standalone Lua binary while more specialized installations can install the LuaSandbox PHP extension, and we do similar things in core for image scaling and diff generation. Personally I've always thought it a bit of a failing that WYSIWYG editing (VisualEditor) only works with Parsoid and Restbase, and I'm hopeful that Parsoid-PHP will fix that.

Gracefully degrading a feature from a full-fledged service to a simple implementation in the LAMP stack makes sense for features that are fundamental to MediaWiki (async processing, search), but IMHO not in every case. While I agree that it's desirable to have VE available to most installations, I don't think that's the reason why implementing Parsoid outside of MediaWiki was a bad idea.

There's also a bit of a spectrum here. Services that have their own communities (memcached, redis, Elasticsearch, etc.) and are packaged by most Linux distributions are likely easier to set up for third parties than bespoke services designed for use in Wikimedia's infrastructure, if only because externally-developed services' communities have had more people installing them in more different environments.

I don't consider storage backends as "external services", and this RfC is most surely not directed at those. This is a set of standards for when and how to implement a feature within our infrastructure.

So while I understand the concern you and @daniel have, I'm not sure how that requirement fits this RfC rather than a set of principles regarding how we develop MediaWiki (not delegating completely core functionalities outside of the "monolith").

I'd like to clarify on the point of "Anything that needs to function in a shared hosting environment (LAMP) can't be an external service (or needs to have an alternative PHP implementation)"

If there is a feature that we want to implement outside of MW core, as a standalone service, not in PHP, there are then three options:

  1. we make explicit that this feature will simply not work in a minimal environment (that is, the feature is optional).
  2. we provide and maintain an alternative in-process implementation in PHP.
  3. we drop the requirement that "minimal environment" means "LAMP stack with no root access", and require e.g. container-based hosting instead.

Parsoid went with (1). The problems we have with it are mainly caused by its need to call back to the core API all the time. As Anomie points out, we use (2) for search. For number (3), I think the question is when and how, rather than if. One, three, five years?

The reason I think this point should be mentioned in the standards is that these three options need to be considered when making the decision to implement some functionality as an external service. For every such service, we need to explicitly choose one of the three options, and document the rationale behind that choice. I agree that this is dictated by requirements of MediaWiki rather than of the Wikimedia architecture, but requirements of MediaWiki are constraints placed on the Wikimedia architecture, so they need to be considered.

Actually, I think the wording used here is also completely wrong. One could have an external service written in PHP and host it in a LAMP environment under a separate URL and work like a charm.

I suppose you could. Although that would mean you passed up the opportunity to make your service a library installable with Composer so it could be pulled into MediaWiki via composer.json and used that way without the overhead of API calls to localhost.
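As a sketch of that library route, a hypothetical composer.json for such a package might look like this (the package name and namespace are invented for illustration):

```json
{
    "name": "example/thumbnailer",
    "description": "Hypothetical feature packaged as a Composer library rather than a service",
    "license": "GPL-2.0-or-later",
    "require": {
        "php": ">=7.2"
    },
    "autoload": {
        "psr-4": { "Example\\Thumbnailer\\": "src/" }
    }
}
```

MediaWiki (or an extension) could then depend on it via its own composer.json and call the library in-process, avoiding the overhead of HTTP round-trips to localhost.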

That MediaWiki keeps working correctly in the future when you just clone its repository (or download the tarball) is a requirement of MediaWiki, not of the Wikimedia architecture.

That strikes me as rather short-sighted. As Daniel points out, constraints on MediaWiki are also constraints on Wikimedia architecture, unless we plan on abandoning MediaWiki for use outside of Wikimedia's infrastructure.

Gracefully degrading a feature from a full-fledged service to a simple implementation in the LAMP stack makes sense for features that are fundamental to mediawiki (async processing, search) but IMHO not in every case. While I agree that's desirable to have VE available to most installations, I don't think that's the reason why implementing parsoid outside of MediaWiki was a bad idea.

In general I agree with your first sentence, not everything needs to be a core feature available on minimal installations. But over the years WYSIWYG editing has been touted as something necessary to running an effective wiki for editing by the general public, to the point that some people were pushing for storing Parsoid-HTML as the canonical representation of the page rather than wikitext.

So while I understand the concern you and @daniel have, I'm not sure how that requirement fits this RfC rather than a set of principles regarding how we develop MediaWiki (not completely delegating core functionalities outside of the "monolith").

The RFC is titled "Standards for external services that integrate with MediaWiki". If you drop the "integrate with MediaWiki" bit, what exactly are we trying to discuss here? And why does https://www.mediawiki.org/wiki/Requests_for_comment/Standards_for_external_services#Selection_Criteria:_Deciding_whether_an_external_service_is_appropriate exist?

For number (3), I think the question is when and how, rather than if. One, three, five years?

I would ask "if". What benefits does doing that bring that outweigh breaking compatibility with low-traffic low-effort installations?

The reason I think this point should be mentioned in the standards is that these three options need to be considered when deciding whether to implement some functionality as an external service. For every such service, we need to explicitly choose one of the three options, and document the rationale behind that choice. I agree that this is dictated by requirements of MediaWiki rather than Wikimedia architecture, but requirements of MediaWiki are constraints placed on the Wikimedia architecture, so they need to be considered.

+1 to that.

I would say that what we need is a definition of what functionality is and is not part of a "simple mediawiki installation that should work in a LAMP environment" and thus cannot opt for the third option in your list. I'd expect that's much-needed work that should come from product people, probably. We can add a couple of sentences to further clarify the relationship between MediaWiki and external services in terms of functionality.

The RFC is titled "Standards for external services that integrate with MediaWiki". If you drop the "integrate with MediaWiki" bit, what exactly are we trying to discuss here? And why does https://www.mediawiki.org/wiki/Requests_for_comment/Standards_for_external_services#Selection_Criteria:_Deciding_whether_an_external_service_is_appropriate exist?

You might notice I intentionally dropped the last part from the title on-wiki, and I forgot to do the same here.

These standards should be held up for anything that we want to run in our environment.

That section you cite is there because sometimes we've implemented features in a service when it would've made sense, from an architectural point of view, to implement them in MediaWiki, or vice versa.

As I said, adding a caveat (and a reference to some list of product requirements that should exist but doesn't at the moment) about the duplicated effort needed when replacing a core feature with an external service does no harm, so I'll do it.

For number (3), I think the question is when and how, rather than if. One, three, five years?

I would ask "if". What benefits does doing that bring that outweigh breaking compatibility with low-traffic low-effort installations?

The reason I think this point should be mentioned in the standards is that these three options need to be considered when deciding whether to implement some functionality as an external service. For every such service, we need to explicitly choose one of the three options, and document the rationale behind that choice. I agree that this is dictated by requirements of MediaWiki rather than Wikimedia architecture, but requirements of MediaWiki are constraints placed on the Wikimedia architecture, so they need to be considered.

+1 to that.

In T208524#4895821, @daniel wrote:
For number (3), I think the question is when and how, rather than if. One, three, five years?

I would ask "if". What benefits does doing that bring that outweigh breaking compatibility with low-traffic low-effort installations?

This seems to me like a false dilemma. I can readily imagine a future where both of these are true:

  • MediaWiki provides a 'core' feature set that is implemented directly in PHP, installable on a LAMP stack in the usual way, intended for low-traffic installations
  • MediaWiki is also distributed as itself plus a small suite of external services, with the deployment managed by docker-compose, or a set of Helm charts for deployment on k8s, or similar. This would save substantial effort for high-traffic users, or for those seeking to replicate a maximum-feature-set install.
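For illustration only, the second distribution could look like a compose file of roughly this shape (the standalone-service image and the wiring are hypothetical, not real published artifacts):

```yaml
# Hypothetical sketch of a "MediaWiki + services" bundle; treat image
# names and service wiring as illustrative, not a supported distribution.
version: "3"
services:
  mediawiki:
    image: mediawiki:latest
    ports: ["8080:80"]
    depends_on: [database, parsoid]
  database:
    image: mariadb:latest
    environment:
      MYSQL_DATABASE: wiki
  parsoid:
    image: example/parsoid:latest   # hypothetical standalone service image
```

A low-traffic installation would simply ignore this bundle and install the PHP 'core' feature set the usual way.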

I would ask "if". What benefits does doing that bring that outweigh breaking compatibility with low-traffic low-effort installations?

I'm happy to discuss this, but that is beyond the scope of this ticket. Investigating this question more deeply, and writing up the trade-offs involved, is on my list for this quarter. For now, the minimal platform is LAMP-without-root-access, and bare-bones MediaWiki has to work on it.

I would say that what we need is a definition of what functionality is and is not part of a "simple mediawiki installation that should work in a LAMP environment"

While more explicit guidance on this would be useful, it seems like the question of whether (or to what degree) a given feature needs to be supported in a minimal environment has to be decided on a case-by-case basis. Having a check-list of criteria to guide that decision would be great, but I don't think we need to wait for that to exist.

I added the following paragraph to the RfC:

Given that MediaWiki is not just part of the Wikimedia production infrastructure but also software used by many third parties, we can't completely delegate some of its fundamental functions to an external service. Whenever an external service provides a functionality, we also need to ask ourselves if said functionality is fundamental or optional to MediaWiki. If the functionality replaced by the external service is fundamental, a fallback solution must be present within MediaWiki that substitutes for what is being implemented in the service. As an example: for Wikimedia purposes, async processing for MediaWiki is implemented via a series of external services; for simple installations, a mysql-backed version of the same mechanism exists and cannot be dropped from MediaWiki. What constitutes a core functionality of MediaWiki and what doesn't will need to be further defined elsewhere, and is beyond the scope of this document.
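To illustrate the async-processing example concretely: MediaWiki's job queue backend is selected via configuration, so a simple install can keep the database-backed implementation while a large deployment swaps in an external queue. The sketch below follows MediaWiki's `$wgJobTypeConf` convention, but treat the exact parameter values as illustrative:

```php
<?php
// LocalSettings.php (sketch): the default, DB-backed job queue that a
// simple LAMP installation relies on -- jobs live in the wiki's own database.
$wgJobTypeConf['default'] = [
    'class' => 'JobQueueDB',
    'order' => 'random',
];

// A high-traffic deployment could instead point at an external queue
// service, e.g. a Redis-backed one (hostname and options illustrative):
// $wgJobTypeConf['default'] = [
//     'class'       => 'JobQueueRedis',
//     'redisServer' => 'jobqueue.example.internal:6379',
//     'redisConfig' => [],
// ];
```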

@Anomie @daniel do you think it would address your concerns and clarify the requirements a bit?

I added the following paragraph to the RfC:

Given that MediaWiki is not just part of the Wikimedia production infrastructure but also software used by many third parties, we can't completely delegate some of its fundamental functions to an external service. Whenever an external service provides a functionality, we also need to ask ourselves if said functionality is fundamental or optional to MediaWiki. If the functionality replaced by the external service is fundamental, a fallback solution must be present within MediaWiki that substitutes for what is being implemented in the service. As an example: for Wikimedia purposes, async processing for MediaWiki is implemented via a series of external services; for simple installations, a mysql-backed version of the same mechanism exists and cannot be dropped from MediaWiki. What constitutes a core functionality of MediaWiki and what doesn't will need to be further defined elsewhere, and is beyond the scope of this document.

@Anomie @daniel do you think it would address your concerns and clarify the requirements a bit?

Sounds good to me, thanks.

Joe renamed this task from RfC: Standards for external services that integrate with MediaWiki to RfC: Standards for external services in the Wikimedia infrastructure. (Jan 23 2019, 10:13 AM)

@Joe In the TechCom meeting today I agreed to collaborate with you to summarise the RFC here in the task, with a problem statement and proposed solution.

kchapman subscribed.

TechCom is placing this on Last Call ending March 6 1pm PST (21:00 UTC, 22:00 CET)

Would it be possible to clarify the wording on "There is no existing FLOSS software that provides the same functionality"? I believe the intent here is about surveying the FLOSS ecosystem for well crafted, well maintained, architecturally compatible FLOSS software that provides comparable functionality before specifying and building new non-trivial standalone services.

I think that the

"Collect RED metrics; be able to export those metrics according to WMF standards specified in the implementation guidelines"

should be rephrased to be more generic, e.g.

"Be able to collect and expose operational metrics according to the current WMF standards specified in the implementation guidelines"

and then specify in the implementation guidelines that we want RED metrics, or their larger counterpart, the 4 golden signals[1], or perhaps radically different approaches.

[1] https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
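To make the RED framing concrete, here is a minimal, library-free Python sketch of a collector tracking Rate, Errors, and Duration. This is purely illustrative; a real service would export these through whatever metrics stack the implementation guidelines prescribe (e.g. Prometheus) rather than hand-rolling a collector:

```python
class RedMetrics:
    """Track the RED signals: request Rate, Error count, and Duration."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.durations = []  # seconds per request

    def observe(self, duration, error=False):
        """Record one completed request and whether it failed."""
        self.requests += 1
        if error:
            self.errors += 1
        self.durations.append(duration)

    def rate(self, window_seconds):
        """Requests per second over the given window."""
        return self.requests / window_seconds

    def error_ratio(self):
        """Fraction of requests that ended in error."""
        return self.errors / self.requests if self.requests else 0.0

    def p50_duration(self):
        """Median request duration, a crude latency summary."""
        ordered = sorted(self.durations)
        return ordered[len(ordered) // 2] if ordered else 0.0


metrics = RedMetrics()
metrics.observe(0.120)
metrics.observe(0.300, error=True)
metrics.observe(0.080)
```

The "4 golden signals" variant would add a saturation gauge on top of the three tracked here.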

"Log all requests received via the production logging facilities"

Should we make this a bit more generic? e.g.

"Log actions via the production logging facilities"

I am also a bit skeptical about this:

"Be able to perform requests via TLS to a specific hostname/ip provided via configuration"

In case it isn't clear from the bolding, it's the "via TLS" part I am talking about. The rest of the sentence looks fine to me.

Not that I don't want services to be able to do TLS, but TLS is a thing that has been proven hard to do correctly[1], [2]. It's something that perhaps we should invest in the infrastructure to do correctly and not burden service developers with.

[1] https://github.com/nodejs/node/issues/4175
[2] https://maulwuff.de/research/ssl-debugging.html#hdr3
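One illustration of why "correctly" is the operative word: even with Python's standard library, a service author has to know to use the secure-default context rather than building an `SSLContext` by hand. Stdlib-only sketch; the endpoint in the comment is hypothetical:

```python
import ssl

# ssl.create_default_context() enables the safe defaults that a hand-rolled
# SSLContext would not give you: certificate verification against the system
# trust store and hostname checking.
ctx = ssl.create_default_context()
assert ctx.check_hostname is True
assert ctx.verify_mode == ssl.CERT_REQUIRED

# A service configured to speak TLS to a specific host would then do, e.g.:
# with socket.create_connection(("service.example.internal", 443)) as sock:
#     with ctx.wrap_socket(sock, server_hostname="service.example.internal") as tls:
#         ...
```

Getting every service team to know (and keep knowing) details like this is exactly the burden that pushing TLS into infrastructure middleware would avoid.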

"Log all requests received via the production logging facilities"

Should we make this a bit more generic? e.g.

"Log actions via the production logging facilities"

Agreed, will amend.

I am also a bit skeptical about this:

"Be able to perform requests via TLS to a specific hostname/ip provided via configuration"

In case it isn't clear from the bolding, it's the "via TLS" part I am talking about. The rest of the sentence looks fine to me.

Not that I don't want services to be able to do TLS, but TLS is a thing that has been proven hard to do correctly[1], [2]. It's something that perhaps we should invest in the infrastructure to do correctly and not burden service developers with.

[1] https://github.com/nodejs/node/issues/4175
[2] https://maulwuff.de/research/ssl-debugging.html#hdr3

To be perfectly clear, I don't think this, or circuit-breaking, will necessarily be implemented inside the service. The infrastructure will provide middleware to do it, but services will need to have interfaces compatible with such infrastructural middleware, or implement the functionality themselves.

Take as an example Kask, the new Cassandra-backed k-v storage service: we might not want a middleware mediating calls there. The service doesn't need to do any circuit-breaking, since it doesn't perform any RPC, and it can easily expose TLS natively. So it would fulfill the requirements without the need for infrastructural helpers.

Does this address your concern?

I think that the

"Collect RED metrics; be able to export those metrics according to WMF standards specified in the implementation guidelines"

should be rephrased to be more generic, e.g.

"Be able to collect and expose operational metrics according to the current WMF standards specified in the implementation guidelines"

and then specify in the implementation guidelines that we want RED metrics, or their larger counterpart, the 4 golden signals[1] or perhaps radically different approaches.

[1] https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals

Sure, this was a case where I didn't make the requirement generic enough (given we're delegating this to implementation guidelines).

Would it be possible to clarify the wording on "There is no existing FLOSS software that provides the same functionality"? I believe the intent here is about surveying the FLOSS ecosystem for well crafted, well maintained, architecturally compatible FLOSS software that provides comparable functionality before specifying and building new non-trivial standalone services.

Sure, I'll reuse what you wrote verbatim - it's much clearer that way!

Production deployment
<snip>
Have backups if the service stores any data

Maybe we could add "and a restoration/emergency plan", since sometimes it might not be straightforward what to do with a backup, or what is included in one.

I am also a bit skeptical about this:

"Be able to perform requests via TLS to a specific hostname/ip provided via configuration"

In case it isn't clear from the bolding, it's the "via TLS" part I am talking about. The rest of the sentence looks fine to me.

Not that I don't want services to be able to do TLS, but TLS is a thing that has been proven hard to do correctly[1], [2]. It's something that perhaps we should invest in the infrastructure to do correctly and not burden service developers with.

[1] https://github.com/nodejs/node/issues/4175
[2] https://maulwuff.de/research/ssl-debugging.html#hdr3

To be perfectly clear, I don't think this, or circuit-breaking, will necessarily be implemented inside the service. The infrastructure will provide middleware to do it, but services will need to have interfaces compatible with such infrastructural middleware, or implement the functionality themselves.

Yes, agreed on that.

Take as an example Kask, the new Cassandra-backed k-v storage service: we might not want a middleware mediating calls there. The service doesn't need to do any circuit-breaking, since it doesn't perform any RPC, and it can easily expose TLS natively. So it would fulfill the requirements without the need for infrastructural helpers.

Well, the "easily" part is debatable, nothing seems to be easy with TLS these days. But I get your point.

Does this address your concern?

I would suggest the following change of wording to 2 different sentences

  • Be able to perform requests to a specific hostname/ip provided via configuration
  • Be able to use infrastructure middleware for inter-service communication functionalities including, but not limited to, encryption and circuit-breaking. Alternatively, the service MUST implement those functionalities internally.

I am ambivalent on the MUST, maybe we should relax it to SHOULD.
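For reference, the kind of circuit-breaking such middleware (or a service implementing it internally) would provide can be sketched in a few lines of Python. The thresholds and timings are arbitrary illustrations, not a prescription:

```python
import time


class CircuitBreaker:
    """Tiny closed/open circuit breaker: after `max_failures` consecutive
    failures the circuit opens and calls are rejected outright, until
    `reset_after` seconds have passed and one trial call is allowed through."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request rejected")
            # Half-open: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The interesting design question for this RfC is only where this logic lives (sidecar/middleware vs. in-service), not its exact shape.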

Production deployment
<snip>
Have backups if the service stores any data

Maybe we could add "and a restoration/emergency plan", since sometimes it might not be straightforward what to do with a backup, or what is included in one.

ack, done.

I would suggest the following change of wording to 2 different sentences

  • Be able to perform requests to a specific hostname/ip provided via configuration
  • Be able to use infrastructure middleware for inter-service communication functionalities including, but not limited to, encryption and circuit-breaking. Alternatively, the service MUST implement those functionalities internally.

I am ambivalent on the MUST, maybe we should relax it to SHOULD.

I like this a lot, I simplified the requirements (and yes I agree on "SHOULD", there are always special cases to consider).

Last Call extended by one week. Now ending at: March 13 11pm PST (March 14 7:00 UTC, 8:00 CET)

Is the canonical location of the text here on the task, or on the wiki?

Is this now documented somewhere on mediawiki.org? I don't see it linked from https://www.mediawiki.org/wiki/Development_policy.

@Joe: Do you (or anyone else) know where this is documented?

@Aklapper the RfC has been edited to reflect what's on phabricator at https://www.mediawiki.org/wiki/Requests_for_comment/Standards_for_external_services

I didn't add it to the development policy page though. @daniel I'll create a separate page and link it in the development policy page.

I didn't add it to the development policy page though. @daniel I'll create a separate page and link it in the development policy page.

Sounds good!