Memory leak and crash, now and 2 years ago, tfjs-node #8326

Open
borodadada opened this issue Jul 6, 2024 · 8 comments
Assignees
Labels
comp:node.js type:bug Something isn't working

Comments

@borodadada

borodadada commented Jul 6, 2024

The problem has not gone away: it is the same now as it was 2 years ago. Back then the memory leak was visible in Task Manager, with the process's memory growing constantly. Now that is not the case and everything looks fine, but the result is the same: after a while everything crashes. I managed to take screenshots at the moment it all started. The screenshots show my working environment, NOT the TEST that I posted. The test itself is as close to it and as simplified as possible; I will post results for this test later.
I'm not doing anything, just taking screenshots.

There are 4 identical copies of the program running on the computer, and one of them begins to fail. This happens when the number of epochs is measured in millions. For the test you can run just one copy. On a modern processor, the procedure usually takes 4-6 hours; on an old one, more than a day.

The memory leak starts and grows quickly, as the screenshot shows:
[screenshot: Snipaste_2024-07-06_07-28-22]

The failing process. Note the 3 other processes, which usually take from 100 to 200 megabytes each:
[screenshot: Snipaste_2024-07-06_07-28-54]

Memory is full:
[screenshot: Snipaste_2024-07-06_07-30-42]

Afterwards:
[screenshot: Snipaste_2024-07-06_07-30-58]

All Node.js processes have closed:
[screenshot: Snipaste_2024-07-06_07-31-53]

The logs are empty; they are shown open in the editor:
[screenshot: Snipaste_2024-07-06_07-37-51]

Here is simple test code; just copy and paste it:

TEST CODE

const tf = require('@tensorflow/tfjs-node');

const size = 50
const units = 100

const letsgo = async function(){

    const model = tf.sequential();
    model.add( tf.layers.dense({ inputShape: [units], units, activation: 'linear', useBias: true }));
    model.add( tf.layers.dense({ units, activation: 'linear', useBias: true }));
    model.add( tf.layers.dense({ units, activation: 'linear', useBias: true }));
    model.compile({ optimizer: tf.train.adam(0.005, 0.9, 0.999), loss: tf.losses.absoluteDifference });

    let a = []
    let b = []
    for (let i = 0; i < size; i++) {
        let aa = []
        let bb = []
        for (let ii = 0; ii < units; ii++) {
            aa.push( Math.random() )
            bb.push( Math.random() )
        }
        a.push(aa)
        b.push(bb)
    }

    let xs = tf.tensor2d( a );
    let ys = tf.tensor2d( b );

    await model.fit(xs, ys, {
        epochs: 50000000,
        shuffle: false,
        verbose: 0,
        callbacks:{
            onTrainBegin: ()=>{
                console.log('start')
            },
            onTrainEnd: ()=>{
                console.log('done')
            },
            onEpochEnd: async (epoch, logs)=>{
                if( epoch % 100000 === 0 )
                    console.log(epoch, logs.loss)
            }
        }
    })
}

const loop = async function(){
    for (let i = 0; i < 1; i++) {
        await letsgo()
    }
}

loop()

System information

  • Windows 11 x64
  • node-v19.9.0-x64
  • node-v20.15.0-x64
  • "@tensorflow/tfjs": "^4.20.0",
  • "@tensorflow/tfjs-node": "^4.20.0",

Okay, these are the results from the test code:

Modern PC (Intel 13700): crash after 4.4 million epochs

[screenshot: Snipaste_2024-07-06_12-38-43]

Old PC (Intel 3770, Windows 10 x64, Node.js 20.10.0): crash after 4.4 million epochs

[screenshot: Snipaste_2024-07-06_16-49-30]

I can't do the calculations because the program always crashes, and I need many more epochs than here!!! I really hope you fix this, it's a disaster that this bug hasn't been fixed for years!

borodadada added the type:bug (Something isn't working) label on Jul 6, 2024
borodadada changed the title from "Memory leak and crash, now and 2 years ago" to "Memory leak and crash, now and 2 years ago, tfjs-node" on Jul 6, 2024
gaikwadrahul8 self-assigned this on Jul 8, 2024
@gaikwadrahul8
Contributor

Hi, @borodadada

I apologize for the delay in my response, and thank you for bringing this issue to our attention. As far as I know, to avoid memory leaks you'll have to use tf.tidy, which executes the provided function fn and, after it has executed, cleans up all intermediate tensors allocated by fn except those returned by fn. fn must not return a Promise (async functions are not allowed). The returned result can be a complex object.

Using this method helps avoid memory leaks. In general, wrap calls to operations in tf.tidy() for automatic memory cleanup.

NOTE: Variables do not get cleaned up when inside a tidy(). If you want to dispose of variables, use tf.disposeVariables() or call dispose() directly on the variables; see tf.dispose.
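
A minimal sketch of the tf.tidy pattern described above, under stated assumptions: the helper name `scaledSum` is hypothetical, and `tf` is passed in as a parameter so the sketch does not depend on which tfjs package is installed. This pattern applies to tensors you create yourself; it cannot wrap `model.fit`, because `fit` returns a Promise and tf.tidy does not accept that.

```javascript
// Hypothetical helper: multiplies every value by `factor` and sums the result.
// `tf` is the TensorFlow.js module, for example require('@tensorflow/tfjs-node').
function scaledSum(tf, values, factor) {
  // tf.tidy runs the callback synchronously. Every tensor created inside it
  // is disposed when the callback returns, except the returned tensor.
  const total = tf.tidy(() => {
    const input = tf.tensor1d(values); // intermediate, cleaned up by tidy
    const scaled = input.mul(factor);  // intermediate, cleaned up by tidy
    return scaled.sum();               // returned, so it survives tidy
  });

  // The returned tensor is now our responsibility: read it, then dispose it.
  const value = total.dataSync()[0];
  total.dispose();
  return value;
}
```

Usage would look like `scaledSum(tf, [1, 2, 3], 2)` after loading `tf` with `require('@tensorflow/tfjs-node')`.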

You can also use tf.memory which returns memory info at the current time in the program.
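
As a rough sketch of how tf.memory could be used with the test code in this issue, the callback below logs the tensor count next to the Node.js process memory every N epochs. The helper name `makeMemoryLogger` and its parameters are hypothetical; `tf.memory()` and `process.memoryUsage()` are the only library calls it relies on. If `tensors` stays flat while `rssMB` keeps climbing, the growth is probably not caused by undisposed tensors in user code.

```javascript
// Hypothetical helper: builds a callbacks object for model.fit that reports
// memory figures every `every` epochs. `tf` is the TensorFlow.js module.
function makeMemoryLogger(tf, every = 100000, log = console.log) {
  const toMB = (bytes) => (bytes / (1024 * 1024)).toFixed(1);
  return {
    onEpochEnd: (epoch, logs) => {
      if (epoch % every !== 0) return;
      // Tensors currently tracked by TensorFlow.js.
      const { numTensors, numBytes } = tf.memory();
      // Memory of the whole Node.js process, including native allocations.
      const { rss, heapUsed } = process.memoryUsage();
      log(
        `epoch=${epoch} loss=${logs.loss} tensors=${numTensors} ` +
        `tensorMB=${toMB(numBytes)} rssMB=${toMB(rss)} heapMB=${toMB(heapUsed)}`
      );
    },
  };
}
```

It could be passed to the existing call as `callbacks: makeMemoryLogger(tf, 100000)`.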

Could you please give it a try after adding tf.tidy and tf.dispose to your code and see whether the memory leak still happens?

If I have missed something here please let me know.

Thank you for your cooperation and patience.

@borodadada
Author

borodadada commented Jul 12, 2024

I don't understand how a memory leak can arise here. All the algorithm needs to do is change the coefficients, then feed back and compare the result; it should run indefinitely without leaking memory.
What you wrote has nothing to do with the crash of the program, because the code is as simple as possible and contains nothing except the fit function, which is what crashes.
Or maybe I don't understand something.

@mightyplow

You have to dispose of xs and ys after the fit step. Otherwise tfjs creates new tensors in every loop step, and they fill up memory with each step if you don't dispose of them.

@borodadada
Author

borodadada commented Jul 31, 2024

Using my example, can you show how it should look?
The crash occurs while the fit function is executing. There is no loop: the data was declared once, and after that the training process started.

@mightyplow

mightyplow commented Jul 31, 2024

It should be like this:

let xs = tf.tensor2d( a );
let ys = tf.tensor2d( b );

await model.fit(xs, ys, {
        epochs: 50000000,
        shuffle: false,
        verbose: 0,
        callbacks:{
            onTrainBegin: ()=>{
                console.log('start')
            },
            onTrainEnd: ()=>{
                console.log('done')
            },
            onEpochEnd: async (epoch, logs)=>{
                if( epoch % 100000 === 0 )
                    console.log(epoch, logs.loss)
            }
        }
})

xs.dispose();
ys.dispose();

This way the tensors become unusable and are freed by tfjs.

By the way that doesn't mean that there isn't any other memory leak. I stumbled across your comment because I'm also hunting a memory issue. But disposing unused tensors will at least wipe out one possible reason.
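
A small variation on the snippet above, as a sketch only: wrapping the call in try/finally disposes the tensors even when `model.fit` rejects with an error. The helper name `fitAndDispose` is hypothetical. It does not help if the whole process is killed while `fit` is still running, because the `finally` block never executes in that case.

```javascript
// Hypothetical helper: runs model.fit and always disposes the input tensors
// afterwards, whether fit resolves or rejects.
async function fitAndDispose(model, xs, ys, fitArgs) {
  try {
    // Resolves with the History object that model.fit returns.
    return await model.fit(xs, ys, fitArgs);
  } finally {
    // Runs on success and on a thrown error, but not on a hard crash.
    xs.dispose();
    ys.dispose();
  }
}
```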

@borodadada
Author

borodadada commented Aug 1, 2024

That is what I am telling you: execution never reaches this point.
The program crashes inside model.fit:

await model.fit(xs, ys, {
        epochs: 50000000,
        shuffle: false,
        verbose: 0,
        callbacks:{
            onTrainBegin: ()=>{
                console.log('start')
            },
            onTrainEnd: ()=>{
                console.log('done')
            },
            onEpochEnd: async (epoch, logs)=>{
                if( epoch % 100000 === 0 )
                    console.log(epoch, logs.loss)
            }
        }
})

This part of the code never runs, because it only fires after the fit function returns:

xs.dispose();
ys.dispose();

If you have the opportunity to run the code, you will see everything for yourself.

Now I'll run the test with your changes and write back a little later.

@mightyplow

Oh sorry, my fault. I didn't notice the number of epochs. Then you're right, and it looks like some internal problem. I'll try it out and see what happens.
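
One experiment that might help narrow this down, offered as an untested sketch: split the run into several shorter `model.fit` calls instead of a single call with 50000000 epochs. The helper name `fitInChunks` and its parameters are hypothetical, and nothing in this thread confirms that chunking avoids the crash. Each `fit` call returns its own History object, and the epoch number passed to callbacks restarts at 0 in every chunk.

```javascript
// Hypothetical helper: trains for `totalEpochs` by calling model.fit
// repeatedly with at most `chunkEpochs` epochs per call.
async function fitInChunks(model, xs, ys, totalEpochs, chunkEpochs, fitArgs = {}) {
  let done = 0;
  while (done < totalEpochs) {
    const epochs = Math.min(chunkEpochs, totalEpochs - done);
    // The model keeps its weights between calls, so training continues
    // from where the previous chunk stopped.
    await model.fit(xs, ys, { ...fitArgs, epochs });
    done += epochs;
  }
  return done; // total number of epochs requested across all chunks
}
```

With the test code from this issue, a call could look like `await fitInChunks(model, xs, ys, 50000000, 1000000, { shuffle: false, verbose: 0 })`.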

@borodadada
Author

Thank you, I'll wait for the result. This problem is bothering me a lot.

3 participants