
How can I optimize text-processing speed and memory usage in Python?

rechargeplan edited on Thu, 25 Aug 2022

I have a large file (795 GB) with 7 columns. When columns 1, 2, 3, 6 and 7 of two rows are identical, the values in columns 4 and 5 should be summed. I wrote a simple version that works, but the server has only 200 GB of memory. Does that mean the file cannot be read in? My code is as follows:

# -*- coding: utf-8 -*-
__author__ = ' author'
__author_email__ = '[email protected]'

def add(line, anno):
    # Columns 1, 2, 3, 6 and 7 form the key; columns 4 and 5 are summed.
    chr, position, strand, methy_read, all_read, methy_nt, nt = line.strip().split()
    key = (chr, position, strand, methy_nt, nt)
    methy_sum, all_sum = anno.get(key, (0, 0))
    # Store a plain tuple: in Python 3 a bare map() is a lazy iterator,
    # so assigning it would chain one unevaluated map object per line.
    anno[key] = (methy_sum + int(methy_read), all_sum + int(all_read))
    return anno

with open('test.tab', 'r') as f:
    dict1 = {}
    for line in f:
        add(line, dict1)

    for key, value in dict1.items():
        key = list(key)
        value = list(value)
        print(*key, *value, sep='\t')

2 Replies
commented on Thu, 25 Aug 2022

If the number of distinct (1, 2, 3, 6, 7) combinations is not too large, you are unlikely to run out of memory, because the dictionary holds one entry per combination rather than one per line. Of course, the data file itself may be at fault, for example if line breaks are missing.
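Both possibilities can be checked on a sample before committing to a full run. The following is only a rough sketch: the function name `profile_lines`, its `expected_fields` parameter, and the sample rows are hypothetical and not taken from the original post. It counts how many lines do not have exactly 7 fields (which is what fused records from a missing line break would look like) and how many distinct keys appear.

```python
def profile_lines(lines, expected_fields=7):
    """Return (total lines, malformed lines, distinct keys) for an iterable of lines."""
    total = 0
    malformed = 0
    keys = set()
    for line in lines:
        total += 1
        fields = line.split()
        if len(fields) != expected_fields:
            # For example, two records fused together by a missing line break.
            malformed += 1
            continue
        # Columns 1, 2, 3, 6 and 7 form the grouping key.
        keys.add((fields[0], fields[1], fields[2], fields[5], fields[6]))
    return total, malformed, len(keys)


# Hypothetical sample rows; the last one simulates a missing line break.
sample = [
    "chr1\t100\t+\t3\t10\tCG\tCGA\n",
    "chr1\t100\t+\t2\t5\tCG\tCGA\n",
    "chr1\t200\t-\t1\t4\tCHH\tCAT\n",
    "chr1\t300\t+\t1\t2\tCG\tCGT chr1\t400\t+\t1\t2\tCG\tCGA\n",
]
print(profile_lines(sample))  # prints (4, 1, 2)
```

On the real file, one could pass something like `itertools.islice(f, 1_000_000)` instead of `sample`. If the number of distinct keys stays close to the number of lines, the dictionary will grow roughly with the file and an in-memory approach is unlikely to fit.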

There may also be a problem in the code itself. I found your version hard to follow, so I rewrote it for reference:

import operator
from collections import defaultdict

def readTabFile(filename):
    # One (methy_read, all_read) running total per distinct key.
    anno = defaultdict(lambda: (0, 0))

    def add(line):
        chr, position, strand, methy_read, all_read, methy_nt, nt = line.strip().split()
        k = (chr, position, strand, methy_nt, nt)
        v = (int(methy_read), int(all_read))
        # anno is mutated, never rebound, so no nonlocal declaration is needed.
        anno[k] = tuple(map(operator.add, anno[k], v))

    # Stream the file line by line instead of loading it all at once.
    with open(filename, 'r') as f:
        for line in f:
            add(line)
    return anno

for k, v in readTabFile('test.tab').items():
    print(*k, *v, sep='\t')

commented on Thu, 25 Aug 2022

From the standpoint of optimization and performance, you should consider reading the file in chunks and processing it with multiple processes: https://www.jianshu.com/p/445...
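The linked article is not reproduced here, so the following is only one possible reading of that suggestion, not its method. All function names (`iter_chunks`, `aggregate_chunk`, `merge_partials`, `aggregate_in_chunks`) and the sample rows are hypothetical. Each chunk of lines is summed into a partial dictionary, and the partial dictionaries are then merged. The `map_fn` parameter defaults to the built-in `map`, which runs the chunks serially; the `imap` method of a `multiprocessing.Pool` takes the same two arguments and could be passed instead to spread chunks across processes.

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size lines from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def aggregate_chunk(chunk):
    """Sum columns 4 and 5 per (1, 2, 3, 6, 7) key within one chunk."""
    partial = {}
    for line in chunk:
        chrom, position, strand, methy_read, all_read, methy_nt, nt = line.split()
        key = (chrom, position, strand, methy_nt, nt)
        methy_sum, all_sum = partial.get(key, (0, 0))
        partial[key] = (methy_sum + int(methy_read), all_sum + int(all_read))
    return partial


def merge_partials(partials):
    """Combine per-chunk dictionaries into one dictionary of totals."""
    totals = {}
    for partial in partials:
        for key, (methy_read, all_read) in partial.items():
            methy_sum, all_sum = totals.get(key, (0, 0))
            totals[key] = (methy_sum + methy_read, all_sum + all_read)
    return totals


def aggregate_in_chunks(lines, chunk_size=100_000, map_fn=map):
    """Aggregate chunk by chunk; map_fn decides how chunks are dispatched."""
    return merge_partials(map_fn(aggregate_chunk, iter_chunks(lines, chunk_size)))


# Hypothetical sample rows; a real run would pass an open file object instead.
sample = [
    "chr1\t100\t+\t3\t10\tCG\tCGA\n",
    "chr1\t200\t-\t1\t4\tCHH\tCAT\n",
    "chr1\t100\t+\t2\t5\tCG\tCGA\n",
]
totals = aggregate_in_chunks(sample, chunk_size=2)
for key, value in totals.items():
    print(*key, *value, sep="\t")
```

Note that the merged dictionary still holds one entry per distinct key, so chunking and multiprocessing can shorten the run time but do not by themselves lower peak memory when the number of distinct keys is the problem. How many chunks a pool keeps in flight at once would also need checking before a real run on a file of this size.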
