
Extract normative algorithms defined in specs #1614

Merged
merged 17 commits into main from algorithms on Jul 11, 2024

Conversation

tidoust
Member

@tidoust tidoust commented Jul 8, 2024

This adds a browserlib module that creates extracts filled with information about algorithms defined in the spec. The extracts are rather raw: they just capture the tree structure of algorithms and copy the HTML for each step. This makes the resulting extracts less directly useful but also makes it possible to run all sorts of analyses on them (see related PR in Strudy: w3c/strudy#645 and the current results of running the analysis).

This would add about 50MB of additional data to a Webref crawl result. That's significant, roughly equivalent to the IDs extracts, which are the heaviest for now.

The structure of the extracts is very likely going to change substantively as we learn from experience!

There are plenty of things that could be improved. The code contains TODOs for the main ones.

An algorithm extract is essentially an object with the following keys:

  • name: The name of the algorithm, when one exists
  • href: The URL with fragment to reach the algorithm, when one exists
  • html: Some introductory prose for the algorithm. That prose may well contain actual algorithmic operations, e.g.: "When invoked, run the following
    steps in parallel". href/src attributes in the HTML have absolute URLs.
  • steps: Atomic algorithm steps.

Each step is essentially an object that follows the same structure as an algorithm, except that it does not have name and href keys, and may also have the following keys (an invented example combining the algorithm and step keys follows this list):

  • operation: Gives the name of the main operation performed by the step, when one was identified. So far, that's only for "switch".
  • case: Used in switch steps to identify the switch condition that triggers the step.
  • ignored: Ordered lists found at the step level that do not look like algorithm steps. Or maybe they are? The lists should get reviewed: they usually describe inputs/outputs or conditions, but they may signal parts where the extraction logic needs to be improved. The lists are reported as text prose.
  • additional: Each step should contain one and only one algorithm. When other algorithms are found at the same level, they get reported in that property. That usually signals that the spec could be improved because it fails to use different list items for different steps, and/or that the extraction logic needs to be smarter.
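
For illustration, an invented extract entry combining these keys could look like this (the algorithm name, URL, and step prose below are made up; real extracts carry actual spec content):

    [
      {
        name: 'parse a widget',
        href: 'https://example.org/spec/#parse-a-widget',
        html: 'To <dfn id="parse-a-widget">parse a widget</dfn>, given a string <var>input</var>, run the following steps:',
        steps: [
          { html: 'Let <var>result</var> be a new widget.' },
          { html: 'If <var>input</var> is the empty string, return failure.' },
          { html: 'Return <var>result</var>.' }
        ]
      }
    ]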

Member

@dontcallmedom dontcallmedom left a comment

wow!

results from a first review (hoping that's useful); if you could not force-commit moving forward, that'd be helpful :)

I hope this will be coming with tests given all the funky archaeology you must have done on this!

* The list is completed with a few branching operations that are not verbs:
* "for", "if", "while".
*
* Using a growing list of verbs may not be a good idea. That said, it is an
Member

maybe something for Infra to integrate at some point?

Member Author

Worth exploring, indeed. Ideally, all operations would have some specific meaning defined somewhere (and there would be a restricted number of them). Now, on top of core operations, some of these verbs are the beginning of a named algorithm that the step calls. Such verbs don't need to be defined in Infra. The extraction logic is not smart enough to detect that for now.

tidoust added a commit to tidoust/parallel-promise that referenced this pull request Jul 8, 2024
The extracts are the results of running the following command:

node reffy --module algorithms:src/browserlib/extract-algorithms.mjs --output data

... with a version of Reffy that has the extract-algorithms module, see:
w3c/reffy#1614
The `cloneAndClean` logic was completed to drop a few additional informative
bits and to strip the HTML of comments as well, as was done in dfns extraction.
The function is now called from dfns extraction, Web IDL extraction and
algorithms extraction.
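
As a rough illustration of that clone-and-clean step (a minimal sketch, not the actual cloneAndClean module; the aside selectors are assumptions):

function cloneAndCleanSketch(el) {
  const clone = el.cloneNode(true);
  // Drop HTML comments from the clone
  const walker = clone.ownerDocument.createTreeWalker(clone, NodeFilter.SHOW_COMMENT);
  const comments = [];
  while (walker.nextNode()) {
    comments.push(walker.currentNode);
  }
  comments.forEach(comment => comment.remove());
  // Drop informative asides (selectors assumed for illustration)
  clone.querySelectorAll('.note, .example, .informative').forEach(aside => aside.remove());
  return clone;
}
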
The detection only looked at the first step in a list to detect an operation.
This wasn't the intent. The code now looks at all steps. This allows the code
to extract more algorithms and sets of steps than before (dropping a number of
cases that used to report `ignored`).

Now, this update also means that various lists that look like steps but aren't
actual steps get extracted and reported as algorithms. That does not affect the
existing analysis so far, but we'll need to iterate on the logic, for sure.
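
To make the change concrete, here is a simplified sketch of the idea (illustrative only; the verb list and function below are assumptions, not the module's actual code):

// An <ol> looks like a set of algorithm steps if any of its items,
// not just the first one, starts with a known operation verb.
const operationVerbs = ['let', 'set', 'return', 'run', 'if', 'for', 'while', 'switch'];
function looksLikeAlgorithmSteps(ol) {
  return [...ol.children].some(li => {
    const firstWord = (li.textContent.trim().split(/\s+/)[0] || '').toLowerCase();
    return operationVerbs.includes(firstWord.replace(/[,:]$/, ''));
  });
}
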
relativeToAbsolute[linkEl.getAttribute(attr)] = url.toString();
}

const clone = cloneAndClean(el);
Member

to help (somewhat) reduce the weight of the extracts, I wonder if we shouldn't apply the same cleaning as for definition content (and thus migrate that code into the cloneAndClean module)

Member Author

That would reduce the weight from ~48MB to ~32MB.

I had considered it but since I did not really know how to scope the analyses we would end up doing against algorithms, I thought I'd preserve the whole thing. With definitions, we want to extract prose that can be inserted somewhere else. We need to be conservative. For analysis, the more the merrier.

There are algorithm steps that contain formulas with syntax highlighting that end up being represented through a <c-> element, others, such as in Web IDL, that use <emu-val> and <emu-const>, and some specs, such as nav-speculation, seem to integrate recent changes through <ins>. Dropping those elements would cut words in the middle of a sentence. The "fingerprint" image that is sometimes used could be worth preserving for study as well.

Now, the vast majority of gains seems to come from cutting attributes. Perhaps we can just do that, but preserve elements for now.

(Also, note there's a bug in the extract-dfns function: el.closest('[data-reffy-page]') is unlikely to return an element, because el is a detached clone. I'm surprised that this does not seem to have triggered any issue so far. I'll need to fix that, as done in the extract-algorithms module.)
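
A rough sketch of that "cut attributes, preserve elements" idea (illustrative only; the allow-list below is an assumption):

function stripAttributes(clone, keep = ['id', 'href', 'src']) {
  // Walk the detached clone and drop every attribute not in the allow-list
  for (const el of [clone, ...clone.querySelectorAll('*')]) {
    for (const attr of [...el.attributes]) {
      if (!keep.includes(attr.name)) {
        el.removeAttribute(attr.name);
      }
    }
  }
  return clone;
}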

There should be no normative content in any of these asides. Using one list
avoids partially duplicating the logic, as was done until now!
If an ordered list gets captured as introductory prose, that means it was
ignored as an algorithm. Such lists should already get reported in the
`ignored` property and likely don't introduce the algorithm per se.
@tidoust tidoust marked this pull request as ready for review July 10, 2024 08:41
Member

@dontcallmedom dontcallmedom left a comment

noted a few bugs while exploring, will suggest changes or discussion in a bit

The code was assuming relative links would only be to same-page anchors, but in a multi-page spec, they could link to a different page of the spec as well
as described in https://infra.spec.whatwg.org/#algorithm-declaration
The extraction criteria ('To ' followed by an exported definition of dfn-type 'dfn' or 'abstract-op') are restrictive enough to avoid false positives, but would benefit from checking on real data
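
For illustration, those criteria might translate into a selector list along these lines (a guess, not necessarily the actual definition in extract-algorithms.mjs):

// Exported definitions of dfn-type "dfn" or "abstract-op"
const candidateDfnSelectors = [
  'dfn[data-export][data-dfn-type="dfn"]',
  'dfn[data-export][data-dfn-type="abstract-op"]'
];
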
@dontcallmedom dontcallmedom self-requested a review July 10, 2024 10:34
Member

@dontcallmedom dontcallmedom left a comment

changes made - the second one in particular deserves review

@dontcallmedom dontcallmedom self-requested a review July 10, 2024 10:35
@dontcallmedom
Member

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet whether we can do something about it

const probableOneLine = [...section.querySelectorAll(candidateDfnSelectors.map(s => `p:has(${s})`).join(','))]
  .filter(p => p.textContent.startsWith("To " + p.querySelector(candidateDfnSelectors.join(',')).textContent))
  .map(p => {
    return { rationale: '"To <dfn>"', root: p };
Member Author

Isn't that going to extract the introductory paragraph of multi-step algorithms too, which will also be extracted with an <ol> as root?

<p>To <dfn ...>do plenty of things</dfn>, run the following steps:</p>
<ol><li>...</li></ol>

The duplicate filtering some lines below won't catch that because the <p> and <ol> elements are disjoint.
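
For illustration only, a filter along these lines could catch that case (a hypothetical sketch reusing names from the excerpt above; the actual fix may differ):

// Drop "To <dfn>" paragraphs that merely introduce a following list of steps,
// since the list-based extraction will capture those algorithms anyway.
const trulyOneLine = probableOneLine.filter(({ root }) =>
  !(root.nextElementSibling && root.nextElementSibling.nodeName === 'OL'));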

Member

I had (mistakenly) assumed that the duplication filter would handle it (or that the test would yell at me otherwise); will look into fixing it

Member

done in ee10949

@tidoust
Member Author

tidoust commented Jul 10, 2024

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet whether we can do something about it

That's precisely the sort of steps that were ignored when the code was only checking the first step in a list. But then extraction missed a number of real algorithms as a result. There's room for improvement, for sure.

@tidoust
Member Author

tidoust commented Jul 10, 2024

I mentioned updating the README, but that's more for Webref in practice. Now, a couple of additional things to do before merge:

  1. src/browserlib/reffy.json needs to be completed with the new module, otherwise it won't run by default. I'll do it.
  2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

html: 'To <dfn data-export="" data-dfn-type="dfn" id="algo-id">do nothing</dfn>, keep calm and carry on.',
rationale: '"To <dfn>"',
steps: [ {
html: 'To <dfn data-export="" data-dfn-type="dfn" id="algo-id">do nothing</dfn>, keep calm and carry on.'
Member Author

Conceptually, do we really want to duplicate the text both in the root html property, and as a single step? Wouldn't it be clearer to not have any steps property in such cases? I.e., to end up with:

    [
      {
        name: 'do nothing',
        href: 'about:blank#algo-id',
        html: 'To <dfn data-export="" data-dfn-type="dfn" id="algo-id">do nothing</dfn>, keep calm and carry on.',
        rationale: '"To <dfn>"'
      }
    ]

Member

I wondered about it, but started with the smallest code change. I think the options are:

  • split the actual procedural stuff ("keep calm and carry on") into the step (ideal, but maybe hard to do robustly)
  • not have steps (which at least provides a simple signal that this is a "short form" algorithm)
  • duplicate root markup in a single step

I'm happy to use the 2nd option at least as a starting point, which I agree would be an improvement.

Member Author

+1 for the second option. It has the advantage of being consistent with the way a step in an algorithm is extracted when it does not define substeps: it gets reported with an html property and no steps.

Member

done in 8487c8e

@tidoust
Member Author

tidoust commented Jul 10, 2024

1. `src/browserlib/reffy.json` needs to be completed with the new module, otherwise it won't run by default. I'll do it.

Done.

2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

Well, that no longer seems to happen. Now, I don't understand how that error could appear in the first place (plus, it seemed reproducible), or what change made the error disappear.

@dontcallmedom
Member

feel free to merge if you're satisfied with my changes to address your comments

dontcallmedom and others added 2 commits July 10, 2024 17:39
This re-writes the "one-step" logic slightly to:
1. filter algorithms that are not "one-step" ones directly from within
the `findAlgorithms` function
2. preserve the document order for algorithms
3. not output an empty `steps` property, consistent with how atomic steps
are serialized.

The update also fixes the serialization of switch cases, which introduced an
additional steps level that wasn't needed. Switch cases are now directly
serialized at the `case` level.
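
For illustration, a switch step serialized that way might look like the following (the prose and cases are invented; only the operation/case shape reflects the description above):

    {
      operation: 'switch',
      html: 'Switch on <var>type</var>:',
      steps: [
        { case: 'a string', html: 'Return <var>type</var>.' },
        { case: 'a number', html: 'Return the result of running serialize on <var>type</var>.' }
      ]
    }
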
@tidoust
Member Author

tidoust commented Jul 11, 2024

I'm running a crawl locally to review extracts one more time before merge.

When an explicitly flagged algorithm contains a `<dfn>`, that definition is now
used as the anchor to find the introductory paragraph. This is useful for cases
where the spec adds some intermediary paragraph between the actual introductory
paragraph and the steps, as in:
https://url.spec.whatwg.org/#percent-decode

The new logic is not perfect. It cannot spot the cases where the algorithm is
not explicitly linked to the steps, as in:
https://w3c.github.io/webappsec-csp/#abstract-opdef-parse-a-serialized-csp

In that case, the extract will contain two algorithms: the first one with the
actual introductory paragraph and dfn, thought to be a "single-step" algorithm,
and the second one with the actual steps but the wrong introductory paragraph
(and no dfn), because the logic only checks the previous paragraph.

The URL spec also uses `data-algorithm-for` to create a namespace for the
algorithms. Now supported. Example:
https://url.spec.whatwg.org/#percent-decode
https://url.spec.whatwg.org/#string-percent-decode
One spec had an intermediary note paragraph that got picked up as the
introductory paragraph:
https://w3c.github.io/webappsec-credential-management/#abstract-opdef-create-a-passwordcredential-from-an-htmlformelement

The code now ignores such informative paragraphs.
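
A minimal sketch of that behaviour (illustrative only; the class names checked here are assumptions):

function findIntroParagraph(stepsList) {
  let prev = stepsList.previousElementSibling;
  // Skip informative notes sitting between the intro paragraph and the steps
  while (prev && (prev.classList.contains('note') || prev.classList.contains('informative'))) {
    prev = prev.previousElementSibling;
  }
  return prev && prev.nodeName === 'P' ? prev : null;
}
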
@tidoust
Member Author

tidoust commented Jul 11, 2024

I made a few adjustments, with related tests. The single-step algorithms seem to get correctly extracted without creating duplicates now (with the CSP exception, which I'm happy to live with for now). Good to merge?

Member

@dontcallmedom dontcallmedom left a comment

congrats again for an amazing source of yak wool!

@tidoust tidoust merged commit 1a6de0f into main Jul 11, 2024
1 check passed
@tidoust tidoust deleted the algorithms branch July 11, 2024 09:59
tidoust added a commit that referenced this pull request Jul 11, 2024
New feature:
- Extract normative algorithms defined in specs (#1614)

The algorithms extracts should be seen as unstable for now. The structure may
evolve and there are a number of TODOs to improve the extraction.

Dependencies bumped:
- Bump puppeteer from 22.12.1 to 22.13.0 (#1616)
- Bump web-specs from 3.12.1 to 3.13.1 (#1617)
- Bump mocha from 10.5.1 to 10.6.0 (#1612)
- Bump rollup from 4.18.0 to 4.18.1 (#1615)