
sync command copies object every time #676

Closed
bounlu opened this issue Oct 25, 2023 · 2 comments · Fixed by #740

bounlu commented Oct 25, 2023

Native aws s3 sync command copies the file from S3 bucket to local only once (as expected):

aws s3 sync s3://data/testfile .

But s5cmd sync copies the file every time and overwrites it:

s5cmd sync s3://data/testfile .

@ehudkaldor commented

Same here.

thiell commented Jun 14, 2024

Same with s5cmd 2.2.2 against MinIO, even with --size-only:

$ ./bin/aws s3api head-object --bucket 's5cmd-test' --key 'path/to/libmkl_vml_cmpt.so.1' --query ContentLength
7756240
$ stat -c '%s' /home/groups/.snapshot/groups.daily.latest/path/to/libmkl_vml_cmpt.so.1
7756240

But every time I sync:

$ ./s5cmd_2.2.2 --log=info sync --size-only /home/groups/.snapshot/groups.daily.latest/path/to/libmkl_vml_cmpt.so.1 s3://s5cmd-test/path/to/libmkl_vml_cmpt.so.1
cp /home/groups/.snapshot/groups.daily.latest/path/to/libmkl_vml_cmpt.so.1 s3://s5cmd-test/path/to/libmkl_vml_cmpt.so.1
$

@ilkinulas ilkinulas added the sync label Jun 23, 2024
@ilkinulas ilkinulas added this to s5cmd Jun 28, 2024
@ilkinulas ilkinulas moved this to Planned in s5cmd Jul 5, 2024
ilkinulas pushed a commit that referenced this issue Jul 12, 2024
The problem is mainly caused by the `compareObjects` function in
`command/sync.go`, where `s5cmd` compares the source and destination paths
and extracts the files present only in the source or only in the
destination (also accounting for nested folders, i.e. a name together with
its prefixes), along with the objects common to both. If both sides are
non-objects (a wildcard expression, a prefix, or a bucket), getting the
relative paths of files with `src.URL.Relative()` yields compatible,
comparable paths, and no problem arises, at least within the scope of this
issue.

However, when an object is selected as the source, it is not assigned a
relative path using `func (u *url.URL) SetRelative(base *url.URL)`, so
the `src.URL.Relative()` function returns its absolute path.
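
The fallback behavior described above can be mimicked with a minimal stand-in type. This is a sketch of the described semantics only; the `URL` struct and its fields here are hypothetical and are not s5cmd's actual `url` package:

```go
package main

import (
	"fmt"
	"strings"
)

// URL is a minimal stand-in for s5cmd's url.URL, modeling only the
// behavior described above: a relative path that is set explicitly,
// with a fallback to the absolute path.
type URL struct {
	Path     string
	relative string
}

// SetRelative records the path of u relative to base.
func (u *URL) SetRelative(base *URL) {
	u.relative = strings.TrimPrefix(u.Path, base.Path)
}

// Relative returns the recorded relative path, falling back to the
// absolute path when SetRelative was never called. That fallback is
// exactly what bites single-object sources.
func (u *URL) Relative() string {
	if u.relative != "" {
		return u.relative
	}
	return u.Path
}

func main() {
	// Object discovered under a prefix source: SetRelative was called.
	obj := &URL{Path: "folder/foo.txt"}
	obj.SetRelative(&URL{Path: "folder/"})
	fmt.Println(obj.Relative()) // foo.txt

	// Single-object source: SetRelative was never called, so the
	// "relative" path is really the absolute one.
	single := &URL{Path: "folder/foo.txt"}
	fmt.Println(single.Relative()) // folder/foo.txt
}
```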

Say the source file has an absolute path of `folder/foo.txt`. The
algorithm compares it against `s3://bucket/remoteFolder/` and looks for
the item `s3://bucket/remoteFolder/folder/foo.txt`. Barring the edge case
where a duplicate item with that full path happens to exist under the
searched prefix, the files never match.

While copying files, `s5cmd` does not use relative paths, so `foo.txt` is
still written to the intended path in the remote; but because the
comparison never matches, the copy is repeated on every sync operation.

The problem is solved by taking the source object's path as its name,
which makes the algorithm simply look for a matching file named `foo.txt`
in the destination, as intended.
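
The bug and the fix can be sketched side by side. `syncKey` and `destKeys` below are hypothetical stand-ins for the comparison inside `compareObjects`, not s5cmd's actual API:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// destKeys simulates the relative keys listed under the destination
// prefix s3://bucket/remoteFolder/.
var destKeys = map[string]bool{"foo.txt": true}

// syncKey returns the key used to look up a single-object source in
// the destination listing.
func syncKey(srcPath string, fixed bool) string {
	if fixed {
		// The fix: use the object's own name so it can match the
		// relative key on the destination side.
		return filepath.Base(srcPath)
	}
	// The bug: a single-object source kept its absolute path,
	// e.g. "folder/foo.txt", which never matches "foo.txt".
	return srcPath
}

func main() {
	src := "folder/foo.txt"
	fmt.Println(destKeys[syncKey(src, false)]) // false: no match, file re-copied every sync
	fmt.Println(destKeys[syncKey(src, true)])  // true: match, file skipped as expected
}
```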

This PR also adds new test cases for the sync command. Previously, the
tests failed to cover the case where the source is a single object under
a prefix, as opposed to an object directly in a bucket or multiple
objects selected by a wildcard or prefix expression.

If an object sits directly under the `s3://bucket/` path, its relative
path equals its absolute path, so the two sides match during comparison;
that is why this particular case never re-copied the file. The new test
cases cover all scenarios.
Resolves: #676.
@github-project-automation github-project-automation bot moved this from Planned to Done in s5cmd Jul 12, 2024
tarikozyurtt added a commit to tarikozyurtt/s5cmd that referenced this issue Jul 12, 2024