Add CI for checking for broken links manually, weekly and in PRs #1633
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,161 @@ | ||
| name: Check URLs | ||
|
|
||
| on: | ||
| workflow_dispatch: | ||
| schedule: | ||
| - cron: '17 5 * * 0' # 5:17 AM UTC every Sunday | ||
| pull_request: | ||
| branches: [ main ] | ||
| env: | ||
| ignore_file_patterns: | | ||
| docs | ||
| images | ||
| utils | ||
| Events | ||
|
|
||
| jobs: | ||
| check-urls: | ||
| runs-on: ubuntu-latest | ||
|
|
||
| steps: | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v2 | ||
| with: | ||
| python-version: '3.9' | ||
|
|
||
| - name: Install dependencies | ||
| run: | | ||
| python -m pip --no-cache-dir --disable-pip-version-check install --upgrade pip | ||
| python -m pip --no-cache-dir --disable-pip-version-check install linkchecker | ||
|
|
||
| - name: Reformat environment variables | ||
| id: setup_vars | ||
| run: | | ||
| tmp=$(echo "${{ env.ignore_file_patterns }}" | tr '\n' ' ') | ||
| echo "ignore_file_patterns=$tmp" >> $GITHUB_OUTPUT | ||
|
|
||
| - name: Checkout Repo for PR branch | ||
| if: ${{ github.event_name == 'pull_request' }} | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Checkout Repo for link check | ||
| if: ${{ github.event_name != 'pull_request' }} | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| ref: 'sched-link-checks' | ||
|
|
||
| - name: Sync main to link check branch | ||
| if: ${{ github.event_name != 'pull_request' }} | ||
| run: | | ||
| git config user.name 'github-actions' | ||
| git config user.email 'github-actions@github.com' | ||
| git fetch origin main | ||
| git merge origin/main --no-edit -X theirs | ||
| git push origin sched-link-checks | ||
|
|
||
| - name: Get Changed Files (for PRs) | ||
| if: ${{ github.event_name == 'pull_request' }} | ||
| id: changed-files | ||
| uses: tj-actions/changed-files@v42 | ||
| with: | ||
| separator: ' ' | ||
|
|
||
| - name: Generate lists of files to check and ignore | ||
| id: file_list | ||
| run: | | ||
| if [ "${{ github.event_name }}" = "pull_request" ]; then | ||
| echo "files=${{ steps.changed-files.outputs.all_changed_files }}" >> $GITHUB_OUTPUT | ||
| echo "ignore_file_patterns=" >> $GITHUB_OUTPUT | ||
| else | ||
| echo "files=" >> $GITHUB_OUTPUT | ||
| echo "ignore_file_patterns=${{ steps.setup_vars.outputs.ignore_file_patterns }}" >> $GITHUB_OUTPUT | ||
| fi | ||
|
|
||
| - name: Check URLs in selected files | ||
| run: | | ||
| for f in ${{ steps.file_list.outputs.files }}; do | ||
| if [ "${f##*.}" != "md" ]; then | ||
| continue | ||
| fi | ||
| for ef in ${{ steps.file_list.outputs.ignore_file_patterns }}; do | ||
| if [ "$ef" = "$f" ]; then | ||
| continue 2 # ignore this file | ||
| fi | ||
| fd=$(echo $f | cut -d'/' -f1) | ||
| if [ "$ef" = "$fd" ]; then | ||
| continue 2 # ignore this dir | ||
| fi | ||
| done | ||
| linkchecker -f utils/LinkChecker/.linkcheckerrc file://$(pwd)/$f >> linkchecker.out || true | ||
| cat linkchecker.out >> linkchecker-all.out | ||
| done | ||
|
|
||
| - name: Process log | ||
| run: | | ||
| python utils/LinkChecker/cklcresults.py ${{ github.event_name }} | ||
|
|
||
| - name: Upload artifact | ||
| if: ${{ github.event_name == 'pull_request' }} | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: bad-links | ||
| path: bad_links.txt | ||
|
Comment on lines +99 to +103
Member
Note that I think you might only want to upload the list of bad links for files on the 'main' branch, not a topic branch. Also, don't you want the full list of *.md files being processed when you generate the list of bad links?
Member
Where is the |
||
|
|
||
| - name: Update link logs | ||
| if: ${{ github.event_name != 'pull_request' }} | ||
| run: | | ||
| git commit -m 'Update link logs' | ||
| git push origin sched-link-checks | ||
|
|
||
| # | ||
| # Keep the recurring failures and definitely bad lists in repo on | ||
| # branch manage-broken-links | ||
|
Comment on lines +112 to +113
Member
This comment is not correct, is it? GitHub Actions' job artifacts system is being used instead, right? |
||
| # | ||
| # Download those files before starting | ||
| # | ||
| # If a link "works" (200) remove it from "recurring failures" list | ||
| # If a link "does not work" (!= 200) | ||
| # - if it is already on recurring failures list | ||
| # - if it is too old, flag it as "definitely bad", else nothing | ||
| # | ||
| # - if it is not already on recurring failures list, add it to "new" and "recurring failures" list and date it | ||
|
|
||
| # | ||
| # Upload the recurring failures and definitely bad lists to somewhere | ||
| # Report success if definitely bad list is empty, otherwise failure | ||
| # generate email with links or actual data | ||
| # | ||
| # you have to use file:// URLs on the command line when invoking the checker | ||
|
|
||
| # bare URLs in markdown are not actually links and will not be checked. Many | ||
| # markdown renderers and browsers will recognize these and handle them as links | ||
| # but that is by convention only. There is no markdown standard for how bare | ||
| # URLs in markdown are handled. The only standard is to enclose them in `<` and | ||
| # `>` chars. | ||
|
|
||
| # | ||
| # Description: | ||
| # | ||
| # Triggers in one of three ways: 1) manually, 2) on a weekly schedule (Sundays | ||
| # at 5:17 AM), or 3) on pull requests. | ||
| # | ||
| # Stores ignore pattern cases in env. variables and then reformats those (because | ||
| # they have newlines) into a space-separated single-line string that can be digested | ||
| # as inputs to other actions. | ||
| # | ||
| # For PRs, uses changed-files action to get list of changed files and passes this | ||
| # to urlchecker via the `include_files` param. Also, the ignore file patterns are set | ||
| # to an empty string for PRs because we think URLs anywhere in PRs should be checked. | ||
| # | ||
| # For scheduled or manual triggers, uses fact that empty `include_files` param | ||
| # causes urlchecker to process *all* files that match in `file_type` param but do | ||
| # not match any `exclude_files` patterns. These file patterns for exclude work | ||
| # more or less like file globs. So, specifying the initial part of the string | ||
| # for a file (path) name is sufficient to ignore the file. | ||
| # | ||
| # We include Events in file patterns to ignore because, of all the content we host, | ||
| # we suspect Event URLs are the most likely to go stale quickly **and** because | ||
| # URL validity is important only during the short window prior to the event. | ||
| # That said, we don't want to ignore Events in PRs, and we do not, as per above. | ||
| # | ||
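The bookkeeping rules listed in the workflow comments (drop a link from the recurring-failures list once it returns 200, date a link on its first failure, and flag it as definitely bad once it has failed for too long) could be sketched in Python roughly as below. This is only an illustration, not the PR's implementation: the function name `update_failures`, the dict-based record, and the 28-day `MAX_AGE` threshold are all assumptions, since the source does not say how old "too old" is or how the lists are stored.

```python
from datetime import date, timedelta

# Assumption: a failure that has persisted longer than this is "definitely bad".
MAX_AGE = timedelta(days=28)


def update_failures(recurring, results, today, max_age=MAX_AGE):
    """Apply one link-check run to the recurring-failures record.

    recurring: dict mapping URL -> date the failure was first seen
    results:   dict mapping URL -> HTTP status code from this run
    Returns (recurring, new_failures, definitely_bad).
    """
    recurring = dict(recurring)  # work on a copy, leave the caller's dict alone
    new_failures, definitely_bad = [], []
    for url, status in results.items():
        if status == 200:
            # The link works again, so forget any earlier failure.
            recurring.pop(url, None)
        elif url in recurring:
            # Known failure: escalate only once it has persisted too long.
            if today - recurring[url] > max_age:
                definitely_bad.append(url)
        else:
            # First failure: record it with today's date.
            recurring[url] = today
            new_failures.append(url)
    return recurring, new_failures, definitely_bad


if __name__ == "__main__":
    known = {"https://example.com/old": date(2024, 1, 7)}
    run = {"https://example.com/old": 404, "https://example.com/new": 500}
    recurring, new, bad = update_failures(known, run, date(2024, 3, 31))
    print("new:", new)
    print("definitely bad:", bad)
```

Keeping the record as a plain URL-to-date mapping means it can be serialized to a small text or JSON file between runs. A job could then report failure whenever the `definitely_bad` list is non-empty, as the comments propose.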
Seems like it might be good to break this out into a Python script taking arguments so it can be developed and tested locally. Also, the Python implementation for some of these operations is a bit cleaner than bash commands.
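As one possible reading of this suggestion, the file-selection part of the "Check URLs in selected files" step might look like the sketch below. It approximates the shell loop: keep only `.md` files and skip any file that equals an ignore pattern or whose first path component equals one. The function name `select_files` and the `--ignore` option are hypothetical, and edge cases such as a leading `./` may behave slightly differently from the `cut`-based shell version.

```python
import argparse
from pathlib import PurePosixPath


def select_files(files, ignore_patterns):
    """Return the Markdown files that are not excluded by an ignore pattern."""
    ignored = set(ignore_patterns)
    selected = []
    for name in files:
        path = PurePosixPath(name)
        if path.suffix != ".md":
            continue  # only Markdown files are link-checked
        if name in ignored or path.parts[0] in ignored:
            continue  # the file itself or its top-level directory is ignored
        selected.append(name)
    return selected


def main(argv=None):
    parser = argparse.ArgumentParser(description="List Markdown files to link-check.")
    parser.add_argument("files", nargs="*", help="candidate file paths")
    parser.add_argument("--ignore", action="append", default=[],
                        help="file or top-level directory to skip (repeatable)")
    args = parser.parse_args(argv)
    for name in select_files(args.files, args.ignore):
        print(name)


if __name__ == "__main__":
    main()
```

Because `select_files` is a plain function, it can be exercised locally with a handful of paths before the workflow ever runs, which is the main benefit the suggestion points to.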