Given the usefulness of having metadata, including the SHA-1, for deleted files, and given that it is available on the Toolserver, it should be exposed on Labs.
Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=49088
Status | Assigned | Task
---|---|---
Resolved | bd808 | T60788 Toolserver migration to Tools (tracking)
Resolved | bd808 | T60791 Missing Toolserver features in Tools (tracking)
Resolved | jcrespo | T50930 Database replication problems - production and labs (tracking)
Resolved | None | T63813 filearchive table not available on labs
That information is not available to normal users on the project, and therefore requires an okay by Legal to clear. Toolserver had imperfectly sanitized replication, and there were quite a few things available there that never should have been without clearance. :-)
Adding Luis to the bug so that they can opine.
Is https://www.mediawiki.org/wiki/Manual:Filearchive_table the best place to figure out what is actually in the relevant table? And do we want all fields or just some?
I would prefer as much as possible. The only field that should contain sensitive information is fa_description.
The current toolserver view seems to be everything but fa_description and fa_sha1.
mysql> describe filearchive;
Field | Type | Null | Key | Default | Extra
---|---|---|---|---|---
fa_id | int(11) | NO | | 0 |
fa_name | varbinary(255) | NO | | |
fa_archive_name | varbinary(255) | YES | | |
fa_storage_group | varbinary(16) | YES | | NULL |
fa_storage_key | varbinary(64) | YES | | |
fa_deleted_user | int(11) | YES | | NULL |
fa_deleted_timestamp | varbinary(14) | YES | | |
fa_deleted_reason | blob | YES | | NULL |
fa_size | int(8) unsigned | YES | | 0 |
fa_width | int(5) | YES | | 0 |
fa_height | int(5) | YES | | 0 |
fa_metadata | mediumblob | YES | | NULL |
fa_bits | int(3) | YES | | 0 |
fa_media_type | enum('UNKNOWN','BITMAP','DRAWING','AUDIO','VIDEO','MULTIMEDIA','OFFICE','TEXT','EXECUTABLE','ARCHIVE') | YES | | NULL |
fa_major_mime | enum('unknown','application','audio','image','text','video','message','model','multipart') | YES | | unknown |
fa_minor_mime | varbinary(32) | YES | | unknown |
fa_user | int(5) unsigned | YES | | 0 |
fa_user_text | varbinary(255) | YES | | |
fa_timestamp | varbinary(14) | YES | | |
fa_deleted | tinyint(1) unsigned | NO | | 0 |
20 rows in set (0.00 sec)
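For illustration only, here is a minimal sketch of how a view can expose most of a table while leaving out one sensitive column, which is roughly what is being asked for here (everything except fa_description). It uses Python's sqlite3 module rather than MySQL, the table is a trimmed-down stand-in with only a few of the real columns, and the view name `filearchive_public` is hypothetical; it is not how the Labs replicas are actually sanitized.

```python
import sqlite3

# Hypothetical, trimmed-down stand-in for filearchive (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE filearchive (
        fa_id INTEGER PRIMARY KEY,
        fa_name TEXT NOT NULL,
        fa_description TEXT,
        fa_sha1 TEXT,
        fa_deleted INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute(
    "INSERT INTO filearchive VALUES (1, 'Example.jpg', 'private note', 'abc123', 0)"
)

# Hypothetical view name: every column except fa_description is exposed.
conn.execute("""
    CREATE VIEW filearchive_public AS
    SELECT fa_id, fa_name, fa_sha1, fa_deleted
    FROM filearchive
""")


def view_columns(connection, view):
    """Return the column names a client sees when selecting from the view."""
    cursor = connection.execute(f"SELECT * FROM {view} LIMIT 0")
    return [col[0] for col in cursor.description]


print(view_columns(conn, "filearchive_public"))
```

A client that can only query the view never sees fa_description, while fa_sha1 stays available.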
I know we've seen crazy things be put in filenames before - is that oversightable? Otherwise, agree that fa_sha1 should not be problematic.
Oversight no longer exists, but pretty much anything can be rev_del'ed if that is what you are referring to. However I have never seen a case of a file name being problematic.
I think it was James who told me that there have been crazy file names in the past, but that may be a fever dream - James?
With regard to fa_description: is that normally publicly visible? I.e., would sensitive information in it be rev_del'd as part of normal site moderation/oversight? With other sensitive fields, one option is simply to respect revdel and keep the hidden content from being propagated.
There is, IMO, a plausible issue with the SHA, but I don't know whether it is relevant for Legal: its primary use case is (of course) to identify files that were previously uploaded and then deleted, so it necessarily allows any third party to determine whether any specific file whose hash they have has been uploaded in the past.
Could this be used by, say, a government agency to find who uploaded some files that they were displeased with?
At best they could tell that some file with the same /name/ existed; the SHA will confirm content. AFAIK, uploading doesn't check against deleted files' SHAs.
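To make the concern concrete, here is a rough sketch of the check a third party could run if the hashes were exposed. It assumes fa_sha1 uses the same encoding MediaWiki uses for img_sha1 (the SHA-1 digest written in base 36 and zero-padded to 31 characters); the function names and the `exposed_hashes` set are hypothetical stand-ins for a query against the exposed table.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes) -> str:
    """SHA-1 of data as a zero-padded, 31-character base-36 string."""
    value = int.from_bytes(hashlib.sha1(data).digest(), "big")
    digits = []
    while value:
        value, remainder = divmod(value, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits)).rjust(31, "0")


def was_previously_uploaded(data: bytes, exposed_hashes: set) -> bool:
    """True if the exact content of data matches a hash in exposed_hashes."""
    return sha1_base36(data) in exposed_hashes


# Hypothetical data: the set stands in for the fa_sha1 values of deleted files.
exposed_hashes = {sha1_base36(b"contents of a deleted file")}
print(was_previously_uploaded(b"contents of a deleted file", exposed_hashes))
print(was_previously_uploaded(b"some other file", exposed_hashes))
```

Anyone holding a copy of a file can run this check without any special access; only an exact byte-for-byte match is detected.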
(In reply to Marc A. Pelletier from comment #11)
> At best they could tell that some file with the same /name/ existed; the SHA will confirm content. AFAIK, uploading doesn't check against deleted files' SHAs.
I may be wrong, but I believe it does (and tells you that the same file is uploaded at X, and I 'think' that one was deleted before, though I'd have to double-check that).
(In reply to Marc A. Pelletier from comment #11)
> AFAIK, uploading doesn't check against deleted files' SHAs.
It does. And it tells you the title. From the title, look up the (public) logs and you have that user.