Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement EulerFD approximate discovering FD algorithm #414

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

mitya-y
Copy link

@mitya-y mitya-y commented May 11, 2024

Implement approximate FD (Functional Dependencies) discovery algorithm based on the article "EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies" by Qiongqiong Lin, Yunfan Gu, Jingyan Sai.
Add unit tests for the approximate FD algorithms, including a custom random option to ensure consistent test results across different systems. Utilize a custom random function to calculate answers of the current EulerFD version for each test dataset with a selected seed.
Integrate the EulerFD algorithm into Python bindings and the Python console.

For more information on EulerFD, refer to the presentation: EulerFD Overview.
Detailed development information and test results can be found here: Development Details.

@mitya-y mitya-y force-pushed the euler-fd-development branch 3 times, most recently from 4c95ec1 to 2af6292 Compare May 13, 2024 17:23
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.h Outdated Show resolved Hide resolved
}

void SearchTreeEulerFD::UpdateInterAndUnion(std::shared_ptr<Node> const& node) {
auto node_copy = node;

This comment was marked as resolved.

This comment was marked as resolved.

}

std::shared_ptr<SearchTreeEulerFD::Node> SearchTreeEulerFD::FindNode(Bitset const& set) {
auto current_node = root_;

This comment was marked as resolved.

This comment was marked as resolved.

std::sort(neg.begin(), neg.end(), [](Bitset const &left, Bitset const &right) {
return left.count() > right.count();
});
fd_num = Invert(real_rhs, neg);

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

src/core/config/custom_random/option.cpp Outdated Show resolved Hide resolved
src/core/config/descriptions.h Outdated Show resolved Hide resolved
src/python_bindings/py_util/py_to_any.cpp Outdated Show resolved Hide resolved
src/tests/test_fd_util.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved

// in each column mapping string values into integer values.
// using only hash isnt good idea because colisions dont processing
std::vector<std::unordered_map<std::string, size_t>> columns(number_of_attributes_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to use string_view to alleviate unnecessary copy

Copy link
Author

@mitya-y mitya-y Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no unnecessary copy, because all maps "indexes" (strings) was moved (values[std::move(line[i])] = id, line 54) from std::vector<std::string>, on which allocation I can't influence, because it is a result of input_table_->GetNextRow() method.

src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved

class MLFQ {
private:
using Queue = std::pair<std::queue<Cluster *>, double>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot find where double is used

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be barrier values, but now it isn't used, because I use log instead.
Should I remove it?


namespace algos {

void Cluster::ShuffleData(RandomStrategy rand) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use std::shuffle instead of custom shuffling?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).

This is an important piece of information, so I believe it is a good idea to leave it as a comment in the code somewhere near this method.

}

double Cluster::GetAverage() const {
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

Copy link
Author

@mitya-y mitya-y Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where const?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const double sum = ....

just for readability, not necessary


Cluster *MLFQ::Get() {
if (actual_queue_ >= 0) {
Cluster *save = queues_[actual_queue_].first.front();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why such name save?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I save a pointer on cluster in this variable, and return it in end of function.

double EulerFD::SamlingInCluster(Cluster *cluster) {
return cluster->Sample([this](size_t t1, size_t t2) -> size_t {
Bitset agree_set = BuildAgreeSet(t1, t2);
auto &&[_, result] = invalids_.insert(agree_set);

This comment was marked as resolved.

src/core/algorithms/fd/eulerfd/eulerfd.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.h Outdated Show resolved Hide resolved
break;
}

tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));
tuples_.emplace_back(number_of_attributes_);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think first is variant is better, because it is more explicit, but hasn't overhead, because will be called move constructor.

…for set random strategy and seed (it will be necessary for writing tests)
…om random from utilities. Check EulerFD in these test cases : calculating answer of algorithm in custom seed
…D class at python bindings. Also for loading EulerFD add custom random option at GetPyType function and in test_bindings.py set default value for it.
…om python tuple to custom random option and register in EulerFD kEqualNull option
@mitya-y mitya-y force-pushed the euler-fd-development branch from 2af6292 to 012944f Compare September 23, 2024 09:28
tuples_.back()[i] = it->second;
} else {
size_t id = current_id[i] ;
values[std::move(line[i])] = id;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line is const, move will have no effect here

}

double Cluster::GetAverage() const {
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const double sum = ....

just for readability, not necessary

#include <utility>

namespace config {
using CustomRandomFlagType = std::pair<bool, int>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using CustomRandomFlagType = std::pair<bool, int>;
using CustomRandomFlagType = std::optional<int>;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pybind11's standard conversions work fine with either of those

is_first_sample_ = true;

mlfq_.Clear();
effective_treshold_ = kInitialEffectiveThreshold;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
effective_treshold_ = kInitialEffectiveThreshold;
effective_threshold_ = kInitialEffectiveThreshold;

using ::testing::ContainerEq, ::testing::Eq;

namespace tests {
static std::vector<unsigned int> BitsetToIndexVector(boost::dynamic_bitset<> const& bitset) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already available in src/core/util/bitset_utils.h

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants