
Extract normative algorithms defined in specs #1614

Merged
merged 17 commits into main from algorithms on Jul 11, 2024

Conversation

tidoust
Member

@tidoust tidoust commented Jul 8, 2024

This adds a browserlib module that creates extracts filled with information about algorithms defined in the spec. The extracts are rather raw: they just capture the tree structure of algorithms and copy the HTML for each step. This makes the resulting extracts less directly useful but also makes it possible to run all sorts of analyses on them (see related PR in Strudy: w3c/strudy#645 and the current results of running the analysis).

This would add about 50MB of additional data to a Webref crawl result. That's significant, roughly equivalent to the IDs extracts, which are the heaviest for now.

The structure of the extracts is very likely going to change substantively as we learn from experience!

There are plenty of things that could be improved. The code contains TODOs for the main ones.

An algorithm extract is essentially an object with the following keys:

  • name: The name of the algorithm, when one exists
  • href: The URL with fragment to reach the algorithm, when one exists
  • html: Some introductory prose for the algorithm. That prose may well contain actual algorithmic operations, e.g.: "When invoked, run the following
    steps in parallel". href/src attributes in the HTML have absolute URLs.
  • steps: Atomic algorithm steps.

Each step is essentially an object that follows the same structure as an algorithm, except that it does not have name and href keys, and may also have the following keys (an invented example combining the algorithm and step keys follows this list):

  • operation: Gives the name of the main operation performed by the step, when one was identified. So far, that's only for "switch".
  • case: Used in switch steps to identify the switch condition that triggers the step.
  • ignored: Ordered lists found at the step level that do not look like algorithm steps. Or maybe they are? The lists should get reviewed: they usually describe inputs/outputs or conditions, but they may signal parts where the extraction logic needs to be improved. The lists are reported as text prose.
  • additional: Each step should contain one and only one algorithm. When other algorithms are found at the same level, they get reported in that property. That usually signals that the spec could be improved because it fails to use different list items for different steps, and/or that the extraction logic needs to be smarter.
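
For illustration, an invented extract entry combining these keys could look like this (the algorithm name, URL, and step prose below are made up; real extracts carry actual spec content):

    [
      {
        name: 'parse a widget',
        href: 'https://example.org/spec/#parse-a-widget',
        html: 'To <dfn id="parse-a-widget">parse a widget</dfn>, given a string <var>input</var>, run the following steps:',
        steps: [
          { html: 'Let <var>result</var> be a new widget.' },
          { html: 'If <var>input</var> is the empty string, return failure.' },
          { html: 'Return <var>result</var>.' }
        ]
      }
    ]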

Member

@dontcallmedom dontcallmedom left a comment

wow!

results from a first review (hoping that's useful); if you could not force-commit moving forward, that'd be helpful :)

I hope this will be coming with tests given all the funky archaeology you must have done on this!

* The list is completed with a few branching operations that are not verbs:
* "for", "if", "while".
*
* Using a growing list of verbs may not be a good idea. That said, it is an
Member

maybe something for Infra to integrate at some point?

Member Author

Worth exploring, indeed. Ideally, all operations would have some specific meaning defined somewhere (and there would be a restricted number of them). Now, on top of core operations, some of these verbs are the beginning of a named algorithm that the step calls. Such verbs don't need to be defined in Infra. The extraction logic is not smart enough to detect that for now.

tidoust added a commit to tidoust/parallel-promise that referenced this pull request Jul 8, 2024
The extracts are the results of running the following command:

node reffy --module algorithms:src/browserlib/extract-algorithms.mjs --output data

... with a version of Reffy that has the extract-algorithms module, see:
w3c/reffy#1614
The `cloneAndClean` logic was completed to drop a few additional informative
bits and to strip the HTML of comments as well, as was done in dfns extraction.
The function is now called from dfns extraction, Web IDL extraction and
algorithms extraction.
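
As a rough illustration of that clone-and-clean step (a minimal sketch, not the actual cloneAndClean module; the aside selectors are assumptions):

function cloneAndCleanSketch(el) {
  const clone = el.cloneNode(true);
  // Drop HTML comments from the clone
  const walker = clone.ownerDocument.createTreeWalker(clone, NodeFilter.SHOW_COMMENT);
  const comments = [];
  while (walker.nextNode()) {
    comments.push(walker.currentNode);
  }
  comments.forEach(comment => comment.remove());
  // Drop informative asides (selectors assumed for illustration)
  clone.querySelectorAll('.note, .example, .informative').forEach(aside => aside.remove());
  return clone;
}
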
The detection only looked at the first step in a list to detect an operation.
This wasn't the intent. The code now looks at all steps. This allows the code
to extract more algorithms and sets of steps than before (dropping a number of
cases that used to report `ignored`).

Now, this update also means that various lists that look like steps but aren't
actual steps get extracted and reported as algorithms. That does not affect the
existing analysis so far, but we'll need to iterate on the logic, for sure.
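
To make the change concrete, here is a simplified sketch of the idea (illustrative only; the verb list and function below are assumptions, not the module's actual code):

// An <ol> looks like a set of algorithm steps if any of its items,
// not just the first one, starts with a known operation verb.
const operationVerbs = ['let', 'set', 'return', 'run', 'if', 'for', 'while', 'switch'];
function looksLikeAlgorithmSteps(ol) {
  return [...ol.children].some(li => {
    const firstWord = (li.textContent.trim().split(/\s+/)[0] || '').toLowerCase();
    return operationVerbs.includes(firstWord.replace(/[,:]$/, ''));
  });
}
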
relativeToAbsolute[linkEl.getAttribute(attr)] = url.toString();
}

const clone = cloneAndClean(el);
Member

to help (somewhat) reduce the weight of the extracts, I wonder if we shouldn't apply the same cleaning as for definition content (and thus migrate that code into the cloneAndClean module)

Member Author

That would reduce the weight from ~48MB to ~32MB.

I had considered it but since I did not really know how to scope the analyses we would end up doing against algorithms, I thought I'd preserve the whole thing. With definitions, we want to extract prose that can be inserted somewhere else. We need to be conservative. For analysis, the more the merrier.

There are algorithm steps that contain formulas with syntax highlighting that end up being represented through a <c-> element, others, such as in Web IDL, that use <emu-val> and <emu-const>, and some specs, such as nav-speculation, seem to integrate recent changes through <ins>. Dropping those elements would cut words in the middle of a sentence. The "fingerprint" image that is sometimes used could be worth preserving for study as well.

Now, the vast majority of gains seems to come from cutting attributes. Perhaps we can just do that, but preserve elements for now.

(Also, note there's a bug in the extract-dfns function: el.closest('[data-reffy-page]') is unlikely to return an element, because el is a detached clone. I'm surprised that this does not seem to have triggered any issue so far. I'll need to fix that, as done in the extract-algorithms module.)
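
A rough sketch of that "cut attributes, preserve elements" idea (illustrative only; the allow-list below is an assumption):

function stripAttributes(clone, keep = ['id', 'href', 'src']) {
  // Walk the detached clone and drop every attribute not in the allow-list
  for (const el of [clone, ...clone.querySelectorAll('*')]) {
    for (const attr of [...el.attributes]) {
      if (!keep.includes(attr.name)) {
        el.removeAttribute(attr.name);
      }
    }
  }
  return clone;
}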

There should be no normative content in any of these asides. Using one list
avoids partially duplicating the logic, as was done until now!
If an ordered list gets captured as introductory prose, that means it was
ignored as an algorithm. Such lists should already get reported in the
`ignored` property and likely don't introduce the algorithm per se.
@tidoust tidoust marked this pull request as ready for review July 10, 2024 08:41
Member

@dontcallmedom dontcallmedom left a comment

noted a few bugs while exploring, will suggest changes or discussion in a bit

The code was assuming relative links would only be to same-page anchors, but in a multi-page spec, they could link to a different page of the spec as well
as described in https://infra.spec.whatwg.org/#algorithm-declaration
The extraction criteria ('To ' followed by an exported definition of dfn-type 'dfn' or 'abstract-op') are restrictive enough to avoid false positives, but would benefit from checking on real data
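
For illustration, those criteria might translate into a selector list along these lines (a guess, not necessarily the actual definition in extract-algorithms.mjs):

// Exported definitions of dfn-type "dfn" or "abstract-op"
const candidateDfnSelectors = [
  'dfn[data-export][data-dfn-type="dfn"]',
  'dfn[data-export][data-dfn-type="abstract-op"]'
];
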
@dontcallmedom dontcallmedom self-requested a review July 10, 2024 10:34
Member

@dontcallmedom dontcallmedom left a comment

changes made - the second one in particular deserves review

@dontcallmedom dontcallmedom self-requested a review July 10, 2024 10:35
@dontcallmedom
Member

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet whether we can do something about it

const probableOneLine = [...section.querySelectorAll(candidateDfnSelectors.map(s => `p:has(${s})`).join(','))]
  .filter(p => p.textContent.startsWith("To " + p.querySelector(candidateDfnSelectors.join(',')).textContent))
  .map(p => {
    return { rationale: '"To <dfn>"', root: p };
Member Author

Isn't that going to extract the introductory paragraph of multi-step algorithms too, which will also be extracted with an <ol> as root?

<p>To <dfn ...>do plenty of things</dfn>, run the following steps:</p>
<ol><li>...</li></ol>

The duplicate filtering some lines below won't catch that because the <p> and <ol> elements are disjoint.
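
For illustration only, a filter along these lines could catch that case (a hypothetical sketch reusing names from the excerpt above; the actual fix may differ):

// Drop "To <dfn>" paragraphs that merely introduce a following list of steps,
// since the list-based extraction will capture those algorithms anyway.
const trulyOneLine = probableOneLine.filter(({ root }) =>
  !(root.nextElementSibling && root.nextElementSibling.nodeName === 'OL'));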

Member

I had (mistakenly) assumed that the duplication filter would handle it (or that the test would yell at me otherwise); will look into fixing it

Member

done in ee10949

@tidoust
Member Author

tidoust commented Jul 10, 2024

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet whether we can do something about it

That's precisely the sort of steps that were ignored when the code was only checking the first step in a list. But then extraction missed a number of real algorithms as a result. There's room for improvement, for sure.

@tidoust
Member Author

tidoust commented Jul 10, 2024

I mentioned updating the README, but that's more for Webref in practice. Now, a couple of additional things to do before merge:

  1. src/browserlib/reffy.json needs to be completed with the new module, otherwise it won't run by default. I'll do it.
  2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

html: 'To <dfn data-export="" data-dfn-type="dfn" id="algo-id">do nothing</dfn>, keep calm and carry on.',
rationale: '"To <dfn>"',
steps: [ {
html: 'To <dfn data-export="" data-dfn-type="dfn" id="algo-id">do nothing</dfn>, keep calm and carry on.'
Member Author

Conceptually, do we really want to duplicate the text both in the root html property, and as a single step? Wouldn't it be clearer to not have any steps property in such cases? I.e., to end up with:

    [
      {
        name: 'do nothing',
        href: 'about:blank#algo-id',
        html: 'To <dfn data-export="" data-dfn-type="dfn" id="algo-id">do nothing</dfn>, keep calm and carry on.',
        rationale: '"To <dfn>"'
      }
    ]

Member

I wondered about it, but started with the smallest code change. I think the options are:

  • split the actual procedural stuff ("keep calm and carry on") into the step (ideal, but maybe hard to do robustly)
  • not have steps (which at least provides a simple signal that this is a "short form" algorithm)
  • duplicate root markup in a single step

I'm happy to use the 2nd option at least as a starting point, which I agree would be an improvement.

Member Author

+1 for the second option. It has the advantage of being consistent with the way a step in an algorithm is extracted when it does not define substeps: it gets reported with an html property and no steps.

Member

done in 8487c8e

@tidoust
Member Author

tidoust commented Jul 10, 2024

1. `src/browserlib/reffy.json` needs to be completed with the new module, otherwise it won't run by default. I'll do it.

Done.

2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

Well, that no longer seems to happen. Now, I don't understand how that error could appear in the first place (plus, it seemed reproducible), or what change made the error disappear.

@dontcallmedom
Member

feel free to merge if you're satisfied with my changes to address your comments

dontcallmedom and others added 2 commits July 10, 2024 17:39
This re-writes the "one-step" logic slightly to:
1. filter algorithms that are not "one-step" ones directly from within
the `findAlgorithms` function
2. preserve the document order for algorithms
3. not output an empty `steps` property, consistent with how atomic steps
are serialized.

The update also fixes the serialization of switch cases, which introduced an
additional steps level that wasn't needed. Switch cases are now directly
serialized at the `case` level.
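
For illustration, a switch step serialized that way might look like the following (the prose and cases are invented; only the operation/case shape reflects the description above):

    {
      operation: 'switch',
      html: 'Switch on <var>type</var>:',
      steps: [
        { case: 'a string', html: 'Return <var>type</var>.' },
        { case: 'a number', html: 'Return the result of running serialize on <var>type</var>.' }
      ]
    }
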
@tidoust
Member Author

tidoust commented Jul 11, 2024

I'm running a crawl locally to review extracts one more time before merge.

When an explicitly flagged algorithm contains a `<dfn>`, that definition is now
used as the anchor to find the introductory paragraph. This is useful for cases
where the spec adds some intermediary paragraph between the actual introductory
paragraph and the steps, as in:
https://url.spec.whatwg.org/#percent-decode

The new logic is not perfect. It cannot spot the cases where the algorithm is
not explicitly linked to the steps, as in:
https://w3c.github.io/webappsec-csp/#abstract-opdef-parse-a-serialized-csp

In that case, the extract will contain two algorithms: the first one with the
actual introductory paragraph and dfn, thought to be a "single-step" algorithm,
and the second one with the actual steps but the wrong introductory paragraph
(and no dfn), because the logic only checks the previous paragraph.

The URL spec also uses `data-algorithm-for` to create a namespace for the
algorithms. Now supported. Example:
https://url.spec.whatwg.org/#percent-decode
https://url.spec.whatwg.org/#string-percent-decode
One spec had an intermediary note paragraph that got picked up as the
introductory paragraph:
https://w3c.github.io/webappsec-credential-management/#abstract-opdef-create-a-passwordcredential-from-an-htmlformelement

The code now ignores such informative paragraphs.
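
A minimal sketch of that behaviour (illustrative only; the class names checked here are assumptions):

function findIntroParagraph(stepsList) {
  let prev = stepsList.previousElementSibling;
  // Skip informative notes sitting between the intro paragraph and the steps
  while (prev && (prev.classList.contains('note') || prev.classList.contains('informative'))) {
    prev = prev.previousElementSibling;
  }
  return prev && prev.nodeName === 'P' ? prev : null;
}
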
@tidoust
Member Author

tidoust commented Jul 11, 2024

I made a few adjustments, with related tests. The single-step algorithms seem to get correctly extracted without creating duplicates now (with the CSP exception, which I'm happy to live with for now). Good to merge?

Member

@dontcallmedom dontcallmedom left a comment

congrats again for an amazing source of yak wool!

@tidoust tidoust merged commit 1a6de0f into main Jul 11, 2024
1 check passed
@tidoust tidoust deleted the algorithms branch July 11, 2024 09:59
tidoust added a commit that referenced this pull request Jul 11, 2024
New feature:
- Extract normative algorithms defined in specs (#1614)

The algorithms extracts should be seen as unstable for now. The structure may
evolve and there are a number of TODOs to improve the extraction.

Dependencies bumped:
- Bump puppeteer from 22.12.1 to 22.13.0 (#1616)
- Bump web-specs from 3.12.1 to 3.13.1 (#1617)
- Bump mocha from 10.5.1 to 10.6.0 (#1612)
- Bump rollup from 4.18.0 to 4.18.1 (#1615)