-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement EulerFD approximate discovering FD algorithm #414
base: main
Are you sure you want to change the base?
Conversation
4c95ec1
to
2af6292
Compare
} | ||
|
||
void SearchTreeEulerFD::UpdateInterAndUnion(std::shared_ptr<Node> const& node) { | ||
auto node_copy = node; |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
} | ||
|
||
std::shared_ptr<SearchTreeEulerFD::Node> SearchTreeEulerFD::FindNode(Bitset const& set) { | ||
auto current_node = root_; |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
std::sort(neg.begin(), neg.end(), [](Bitset const &left, Bitset const &right) { | ||
return left.count() > right.count(); | ||
}); | ||
fd_num = Invert(real_rhs, neg); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
||
// in each column mapping string values into integer values. | ||
// using only hash isnt good idea because colisions dont processing | ||
std::vector<std::unordered_map<std::string, size_t>> columns(number_of_attributes_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to use string_view
to alleviate unnecessary copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no unnecessary copy, because all maps "indexes" (strings) was moved (values[std::move(line[i])] = id
, line 54) from std::vector<std::string>
, on which allocation I can't influence, because it is a result of input_table_->GetNextRow()
method.
|
||
class MLFQ { | ||
private: | ||
using Queue = std::pair<std::queue<Cluster *>, double>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot find where double
is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It must be barrier values, but now it isn't used, because I use log instead.
Should I remove it?
|
||
namespace algos { | ||
|
||
void Cluster::ShuffleData(RandomStrategy rand) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use std::shuffle
instead of custom shuffling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).
This is an important piece of information, so I believe it is a good idea to leave it as a comment in the code somewhere near this method.
} | ||
|
||
double Cluster::GetAverage() const { | ||
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const double sum = ....
just for readability, not necessary
|
||
Cluster *MLFQ::Get() { | ||
if (actual_queue_ >= 0) { | ||
Cluster *save = queues_[actual_queue_].first.front(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why such name save
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I save a pointer on cluster in this variable, and return it in end of function.
double EulerFD::SamlingInCluster(Cluster *cluster) { | ||
return cluster->Sample([this](size_t t1, size_t t2) -> size_t { | ||
Bitset agree_set = BuildAgreeSet(t1, t2); | ||
auto &&[_, result] = invalids_.insert(agree_set); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
break; | ||
} | ||
|
||
tuples_.emplace_back(std::vector<size_t>(number_of_attributes_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuples_.emplace_back(std::vector<size_t>(number_of_attributes_)); | |
tuples_.emplace_back(number_of_attributes_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think first is variant is better, because it is more explicit, but hasn't overhead, because will be called move constructor.
…for set random strategy and seed (it will be necessary for writing tests)
…om random from utilities. Check EulerFD in these test cases : calculating answer of algorithm in custom seed
…D class at python bindings. Also for loading EulerFD add custom random option at GetPyType function and in test_bindings.py set default value for it.
…om python tuple to custom random option and register in EulerFD kEqualNull option
2af6292
to
012944f
Compare
tuples_.back()[i] = it->second; | ||
} else { | ||
size_t id = current_id[i] ; | ||
values[std::move(line[i])] = id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line
is const, move will have no effect here
} | ||
|
||
double Cluster::GetAverage() const { | ||
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const double sum = ....
just for readability, not necessary
#include <utility> | ||
|
||
namespace config { | ||
using CustomRandomFlagType = std::pair<bool, int>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using CustomRandomFlagType = std::pair<bool, int>; | |
using CustomRandomFlagType = std::optional<int>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pybind11's standard conversions work fine with either of those
is_first_sample_ = true; | ||
|
||
mlfq_.Clear(); | ||
effective_treshold_ = kInitialEffectiveThreshold; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
effective_treshold_ = kInitialEffectiveThreshold; | |
effective_threshold_ = kInitialEffectiveThreshold; |
using ::testing::ContainerEq, ::testing::Eq; | ||
|
||
namespace tests { | ||
static std::vector<unsigned int> BitsetToIndexVector(boost::dynamic_bitset<> const& bitset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already available in src/core/util/bitset_utils.h
Implement approximate FD (Functional Dependencies) discovery algorithm based on the article "EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies" by Qiongqiong Lin, Yunfan Gu, Jingyan Sai.
Add unit tests for the approximate FD algorithms, including a custom random option to ensure consistent test results across different systems. Utilize a custom random function to calculate answers of the current EulerFD version for each test dataset with a selected seed.
Integrate the EulerFD algorithm into Python bindings and the Python console.
For more information on EulerFD, refer to the presentation: EulerFD Overview.
Detailed development information and test results can be found here: Development Details.