Implement EulerFD approximate discovering FD algorithm #414

mitya-y · 2024-05-11T15:13:00Z

Implement an approximate FD (functional dependency) discovery algorithm based on the article "EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies" by Qiongqiong Lin, Yunfan Gu, Jingyan Sai.
Add unit tests for the approximate FD algorithms, including a custom random option that keeps test results consistent across different systems. The expected answers for each test dataset are computed by the current EulerFD version using the custom random function with a selected seed.
Integrate the EulerFD algorithm into the Python bindings and the Python console.

For more information on EulerFD, refer to the presentation: EulerFD Overview.
Detailed development information and test results can be found here: Development Details.

src/core/algorithms/fd/eulerfd/cluster.cpp

src/core/algorithms/fd/eulerfd/mlfq.h

src/core/algorithms/fd/eulerfd/search_tree.cpp

+}
+
+void SearchTreeEulerFD::UpdateInterAndUnion(std::shared_ptr<Node> const& node) {
+    auto node_copy = node;


src/core/algorithms/fd/eulerfd/search_tree.cpp

+}
+
+std::shared_ptr<SearchTreeEulerFD::Node> SearchTreeEulerFD::FindNode(Bitset const& set) {
+    auto current_node = root_;


src/core/algorithms/fd/eulerfd/eulerfd.cpp

+        std::sort(neg.begin(), neg.end(), [](Bitset const &left, Bitset const &right) {
+            return left.count() > right.count();
+        });
+        fd_num  = Invert(real_rhs, neg);


src/core/config/custom_random/option.cpp

src/core/config/descriptions.h

src/python_bindings/py_util/py_to_any.cpp

src/tests/test_fd_util.h

src/core/algorithms/fd/eulerfd/cluster.cpp

src/core/algorithms/fd/eulerfd/mlfq.h

src/core/algorithms/fd/eulerfd/mlfq.cpp

src/core/algorithms/fd/eulerfd/search_tree.h

src/core/algorithms/fd/eulerfd/search_tree.cpp

src/core/algorithms/fd/eulerfd/eulerfd.cpp

slesarev-hub · 2024-07-09T18:52:31Z

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+
+    // in each column mapping string values into integer values.
+    // using only hash isnt good idea because colisions dont processing
+    std::vector<std::unordered_map<std::string, size_t>> columns(number_of_attributes_);


Try to use string_view to avoid unnecessary copies.

There is no unnecessary copy, because all map keys (strings) are moved (values[std::move(line[i])] = id, line 54) from a std::vector<std::string> whose allocation I can't influence, since it is the result of the input_table_->GetNextRow() method.
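To make the exchange above concrete, here is a minimal sketch of per-column dictionary encoding that moves each string into the map only when it is first seen. `ColumnEncoder` and `Encode` are hypothetical names for illustration and are not the PR's code; the sketch assumes the row is owned by the caller and can be handed over by value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-column encoding discussed above.
class ColumnEncoder {
public:
    explicit ColumnEncoder(std::size_t number_of_attributes)
        : columns_(number_of_attributes), next_id_(number_of_attributes, 0) {}

    // Takes the row by value so its strings can be moved into the maps.
    std::vector<std::size_t> Encode(std::vector<std::string> row) {
        assert(row.size() >= columns_.size());
        std::vector<std::size_t> encoded(columns_.size());
        for (std::size_t i = 0; i < columns_.size(); ++i) {
            // try_emplace leaves row[i] untouched when the key already exists,
            // so a string is moved only on its first occurrence in the column.
            auto [it, inserted] = columns_[i].try_emplace(std::move(row[i]), next_id_[i]);
            if (inserted) ++next_id_[i];
            encoded[i] = it->second;
        }
        return encoded;
    }

private:
    // One value-to-id dictionary per column, plus the next free id per column.
    std::vector<std::unordered_map<std::string, std::size_t>> columns_;
    std::vector<std::size_t> next_id_;
};
```

Using full strings as keys, rather than only their hashes, keeps distinct values distinct even when their hashes collide, which is the concern the code comment in the diff raises.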

src/core/algorithms/fd/eulerfd/eulerfd.cpp

slesarev-hub · 2024-08-05T20:10:23Z

src/core/algorithms/fd/eulerfd/mlfq.h

+
+class MLFQ {
+private:
+    using Queue = std::pair<std::queue<Cluster *>, double>;


Cannot find where double is used

It was meant to hold barrier values, but it is not used now because I use a logarithm instead.
Should I remove it?

slesarev-hub · 2024-08-05T20:14:16Z

src/core/algorithms/fd/eulerfd/cluster.cpp

+
+namespace algos {
+
+void Cluster::ShuffleData(RandomStrategy rand) {


Could we use std::shuffle instead of custom shuffling?

No, because the implementation of std::shuffle depends on the STL in use, and we want the same permutation of the array on every platform and compiler (this is necessary for consistent hash values in the tests).

This is an important piece of information, so I believe it is a good idea to leave it as a comment in the code somewhere near this method.
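As an illustration of the reasoning in this thread, the sketch below pairs a fixed-recipe generator with a hand-written Fisher-Yates shuffle, so the resulting permutation depends only on the seed and not on the standard library implementation. `PortableRandom` and `PortableShuffle` are hypothetical names; the PR's actual RandomStrategy may work differently, and the LCG constants are only one common choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical generator: a 32-bit linear congruential generator whose output
// is fully determined by the seed. Not the PR's RandomStrategy.
class PortableRandom {
public:
    explicit PortableRandom(std::uint32_t seed) : state_(seed) {}

    std::uint32_t Next() {
        // Computed in 64 bits and masked, so the result does not depend on
        // the width of int on the target platform.
        std::uint64_t next = static_cast<std::uint64_t>(state_) * 1664525u + 1013904223u;
        state_ = static_cast<std::uint32_t>(next & 0xFFFFFFFFu);
        return state_;
    }

private:
    std::uint32_t state_;
};

// Fisher-Yates shuffle written out by hand. std::shuffle is avoided because
// the standard does not fix how it consumes random numbers.
template <typename T>
void PortableShuffle(std::vector<T>& data, PortableRandom& rand) {
    for (std::size_t i = data.size(); i > 1; --i) {
        std::size_t j = rand.Next() % i;  // index in [0, i); small modulo bias accepted
        assert(j < i);
        std::swap(data[i - 1], data[j]);
    }
}
```

Two runs with the same seed produce the same permutation on any conforming compiler, which is the property the tests rely on. The same caveat applies to std::uniform_int_distribution, whose algorithm is also implementation-defined.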

slesarev-hub · 2024-08-05T20:16:36Z

src/core/algorithms/fd/eulerfd/cluster.cpp

+}
+
+double Cluster::GetAverage() const {
+    double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);


where const?

const double sum = ....

just for readability, not necessary

slesarev-hub · 2024-08-05T20:40:02Z

src/core/algorithms/fd/eulerfd/mlfq.cpp

+
+Cluster *MLFQ::Get() {
+    if (actual_queue_ >= 0) {
+        Cluster *save = queues_[actual_queue_].first.front();


Why is it named save?

Because I save a pointer to the cluster in this variable and return it at the end of the function.

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+double EulerFD::SamlingInCluster(Cluster *cluster) {
+    return cluster->Sample([this](size_t t1, size_t t2) -> size_t {
+        Bitset agree_set = BuildAgreeSet(t1, t2);
+        auto &&[_, result] = invalids_.insert(agree_set);


src/core/algorithms/fd/eulerfd/eulerfd.h

slesarev-hub · 2024-08-05T20:59:48Z

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+            break;
+        }
+
+        tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));


Suggested change

tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));

tuples_.emplace_back(number_of_attributes_);

I think the first variant is better because it is more explicit, and it adds no meaningful overhead because the move constructor will be called.
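For comparison, a small sketch of the two spellings under discussion. `Tuple`, `BuildExplicit` and `BuildForwarded` are hypothetical names used only for this illustration; both functions produce the same container contents.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tuple = std::vector<std::size_t>;

// Explicit form: constructs a temporary Tuple, which emplace_back then
// move-constructs into the container.
inline std::vector<Tuple> BuildExplicit(std::size_t n) {
    std::vector<Tuple> tuples;
    tuples.emplace_back(Tuple(n));
    assert(tuples.back().size() == n);
    return tuples;
}

// Forwarding form: emplace_back passes n straight to Tuple's size
// constructor, so the element is built in place without a temporary.
inline std::vector<Tuple> BuildForwarded(std::size_t n) {
    std::vector<Tuple> tuples;
    tuples.emplace_back(n);
    assert(tuples.back().size() == n);
    return tuples;
}
```

The difference is one vector move, which is cheap, so the choice is mostly about readability: the explicit form names the element type, while the forwarding form is shorter.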

slesarev-hub · 2024-08-07T22:09:51Z

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+                tuples_.back()[i] = it->second;
+            } else {
+                size_t id = current_id[i]  ;
+                values[std::move(line[i])] = id;


line is const, move will have no effect here
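The remark can be checked in isolation. In the sketch below, `InsertFromConstRow` and `InsertFromMutableRow` are hypothetical helpers written for this illustration: applying std::move to an element of a const vector yields a `std::string const&&`, which can only bind to the copying overload of `operator[]`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>

using IdMap = std::unordered_map<std::string, std::size_t>;

// The row is const, so std::move produces a const rvalue reference and the
// key is copied into the map; the source string is left unchanged.
inline void InsertFromConstRow(IdMap& values, std::vector<std::string> const& row,
                               std::size_t i, std::size_t id) {
    assert(i < row.size());
    static_assert(std::is_same_v<decltype(std::move(row[i])), std::string const&&>);
    values[std::move(row[i])] = id;
}

// The row is mutable, so std::move produces std::string&& and the map may
// take over the string's buffer when it inserts a new key.
inline void InsertFromMutableRow(IdMap& values, std::vector<std::string>& row,
                                 std::size_t i, std::size_t id) {
    assert(i < row.size());
    static_assert(std::is_same_v<decltype(std::move(row[i])), std::string&&>);
    values[std::move(row[i])] = id;
}
```

Both versions compile and behave the same from the map's point of view; only the second one can avoid the copy.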

slesarev-hub · 2024-10-12T18:33:53Z

src/core/algorithms/fd/eulerfd/cluster.cpp

+}
+
+double Cluster::GetAverage() const {
+    double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);


const double sum = ....

just for readability, not necessary

BUYT-1 · 2024-10-24T09:56:50Z

src/core/config/custom_random/type.h

+#include <utility>
+
+namespace config {
+using CustomRandomFlagType = std::pair<bool, int>;


Suggested change

using CustomRandomFlagType = std::pair<bool, int>;

using CustomRandomFlagType = std::optional<int>;

pybind11's standard conversions work fine with either of those
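A brief sketch of how the two representations compare at the point of use. It assumes the bool in the pair means "use the custom seed" and the int is the seed; that reading, and the names `PairFlag`, `OptionalFlag`, `SeedFromPair`, `SeedFromOptional` and `CheckEquivalence`, are assumptions made for this illustration.

```cpp
#include <cassert>
#include <optional>
#include <utility>

using PairFlag = std::pair<bool, int>;      // assumed meaning: {use custom seed, seed}
using OptionalFlag = std::optional<int>;    // empty means "no custom seed"

// With the pair, the caller has to remember to check .first before
// trusting .second.
inline int SeedFromPair(PairFlag const& flag, int fallback) {
    return flag.first ? flag.second : fallback;
}

// With the optional, the "no seed" state is part of the type.
inline int SeedFromOptional(OptionalFlag const& flag, int fallback) {
    return flag.value_or(fallback);
}

// Usage: both representations give the same answer for matching inputs.
inline void CheckEquivalence() {
    assert(SeedFromPair({true, 5}, 3) == SeedFromOptional(5, 3));
    assert(SeedFromPair({false, 5}, 3) == SeedFromOptional(std::nullopt, 3));
}
```

The optional also removes the meaningless state in which the flag is false but a seed value is still stored.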

BUYT-1 · 2024-10-24T09:59:37Z

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+    is_first_sample_ = true;
+
+    mlfq_.Clear();
+    effective_treshold_ = kInitialEffectiveThreshold;


Suggested change

effective_treshold_ = kInitialEffectiveThreshold;

effective_threshold_ = kInitialEffectiveThreshold;

BUYT-1 · 2024-10-26T16:58:47Z

src/tests/test_fd_approximate.cpp

+using ::testing::ContainerEq, ::testing::Eq;
+
+namespace tests {
+static std::vector<unsigned int> BitsetToIndexVector(boost::dynamic_bitset<> const& bitset) {


Already available in src/core/util/bitset_utils.h

mitya-y force-pushed the euler-fd-development branch 3 times, most recently from 4c95ec1 to 2af6292 Compare May 13, 2024 17:23

Firsov62121 reviewed May 31, 2024

slesarev-hub reviewed Jul 6, 2024

slesarev-hub reviewed Jul 9, 2024

slesarev-hub reviewed Aug 5, 2024

mitya-y added 8 commits September 21, 2024 09:55

Implement the algorithm according to the EulerFD article, add option for set random strategy and seed (it will be necessary for writing tests)

e121830

Create test cases for approximate fd discovering algortihms. Use custom random from utilities. Check EulerFD in these test cases : calculating answer of algorithm in custom seed

9f7de02

Add EulerFD algorithm at python console (with short help). Add EulerFD class at python bindings. Also for loading EulerFD add custom random option at GetPyType function and in test_bindings.py set default value for it.

102a476

Fix python bindings algortitms configurating tests. Add converting from python tuple to custom random option and register in EulerFD kEqualNull option

8fbeb64

Fix code style using project clang format

b54cb9a

Fix clang-tidy naming warnings and errors

1bd5397

Again fix code style (this commit will be merged with previous 2)

c33d3e9

Changes after review

012944f

mitya-y force-pushed the euler-fd-development branch from 2af6292 to 012944f Compare September 23, 2024 09:28

slesarev-hub approved these changes Oct 16, 2024

BUYT-1 requested changes Oct 25, 2024

BUYT-1 requested changes Oct 26, 2024

slesarev-hub Jul 9, 2024

mitya-y Sep 21, 2024

slesarev-hub Aug 5, 2024

mitya-y Sep 21, 2024

slesarev-hub Aug 5, 2024

mitya-y Sep 21, 2024

chernishev Sep 22, 2024

slesarev-hub Aug 5, 2024

mitya-y Sep 21, 2024

slesarev-hub Oct 12, 2024

slesarev-hub Aug 5, 2024

mitya-y Sep 21, 2024

slesarev-hub Aug 5, 2024

mitya-y Sep 22, 2024

slesarev-hub Aug 7, 2024

slesarev-hub Oct 12, 2024

BUYT-1 Oct 24, 2024

BUYT-1 Oct 24, 2024

BUYT-1 Oct 24, 2024

BUYT-1 Oct 26, 2024

