Handling very large files with openpyxl in Python
Sometimes we have to work with huge Excel files that contain many rows and columns. Loading such a file fully into memory can use up the available RAM and slow down or even crash the Python process. The openpyxl library provides optimized modes for exactly this situation. In this tutorial, we will learn to handle large Excel files with openpyxl in Python.
Loading large files using the read-only mode
Openpyxl is a library used to read and write Excel files in Python. By default, openpyxl loads the whole workbook into memory, which may not be practical for extremely large files. For such cases, openpyxl provides two optimized modes, read-only and write-only, through which we can read or write large files with nearly constant memory consumption.
In this example, our dataset contains 1192 rows and is about 1 MB in size. A file this small would also load in the default mode, but the same approach applies to much larger files, where the default mode can use too much memory. So we open the workbook in read_only mode and print every cell of every row in the active worksheet.
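from openpyxl import load_workbook

# Open the workbook in read-only mode so rows are read lazily
wb = load_workbook(filename='dataset.xlsx', read_only=True)
ws = wb.active

for row in ws.rows:
    for cell in row:
        print(cell.value)

# A read-only workbook keeps the file open, so close it explicitly
wb.close()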
Output:
https://www.udemy.com/trading-forex-lanalyse-des-volumes/
70
4
3
13
All Levels
0.97
2
2017-06-29T21:32:21Z
Business Finance
1239206
Win 90% of Trades; Quantitative Investments Models in Excel
https://www.udemy.com/quantitative-investments/
200
4
1
16
All Levels
0.97
1.5
2017-06-28T15:09:21Z
Business Finance
625670
Technical Charting
https://www.udemy.com/technical-charting/
20
4
1
6
All Levels
0.97
1
2015-11-02T20:23:54Z
Business Finance
1153854
Working Capital assessment for bankers, credit analysts
https://www.udemy.com/working-capital-assessment-for-bankers-credit-analysts/
20
4
0
29
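Besides read-only mode, openpyxl also offers a write-only mode for producing large files. The following is a minimal sketch of that mode, not part of the original dataset example: the output path, the sheet title "data" and the generated rows are assumptions made purely for illustration. In write-only mode the workbook starts without any sheets, and rows can only be added with append().

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Hypothetical output location, used only for this illustration
out_path = os.path.join(tempfile.mkdtemp(), "large_output.xlsx")

# A write-only workbook has no default sheet, so we create one
wb = Workbook(write_only=True)
ws = wb.create_sheet(title="data")

# Rows are streamed to the file one at a time with append()
ws.append(["row_id", "value"])
for i in range(1, 1001):
    ws.append([i, i * 2])

wb.save(out_path)

# Read the file back in read-only mode and count the rows
wb_check = load_workbook(filename=out_path, read_only=True)
ws_check = wb_check["data"]
row_count = sum(1 for _ in ws_check.iter_rows(values_only=True))
wb_check.close()

print(row_count)
```

A write-only workbook can be saved only once, so all the rows should be appended before calling save().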
Slicing and extending data from huge Excel files using openpyxl
We will import the range_boundaries function, which converts a range string such as 'A1:E3' into a tuple of cell boundaries: (min_col, min_row, max_col, max_row). Next, we use the iter_rows() function to iterate through the cells of every row that lies within those boundaries. We then slice each row_cells tuple to keep the values up to column E. We can also apply the extend function to add more values to this list: here we append the values from column C to the end of the range to every row. Lastly, we print the output data.
from openpyxl.utils import range_boundaries
from openpyxl import load_workbook

wb = load_workbook(filename='dataset.xlsx')
ws = wb.active

# Convert the range string into numeric boundaries: (1, 1, 5, 3)
min_col, min_row, max_col, max_row = range_boundaries('A1:E3')

for row_cells in ws.iter_rows(min_col=min_col, min_row=min_row,
                              max_col=max_col, max_row=max_row):
    # Slice column values up to E
    data = [cell.value for cell in row_cells[:5]]
    # Extend data with sliced column values from C up to the end
    data.extend([cell.value for cell in row_cells[2:]])
    print(data)
Output:
['course_id', 'course_title', 'url', 'price', 'num_subscribers', 'url', 'price', 'num_subscribers']
[49798, 'Bitcoin or How I Learned to Stop Worrying and Love Crypto', 'https://www.udemy.com/bitcoin-or-how-i-learned-to-stop-worrying-and-love-crypto/', 0, 65576, 'https://www.udemy.com/bitcoin-or-how-i-learned-to-stop-worrying-and-love-crypto/', 0, 65576]
[48841, 'Accounting in 60 Minutes - A Brief Introduction', 'https://www.udemy.com/accounting-in-60-minutes-a-brief-introduction/', 0, 56659, 'https://www.udemy.com/accounting-in-60-minutes-a-brief-introduction/', 0, 56659]
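The same slicing can also be combined with read-only mode, which may be preferable when the file is really large. The sketch below is one possible way to do this, under a few assumptions: it builds a small stand-in workbook (sample.xlsx with made-up course rows, not the tutorial's dataset) so that it can run on its own, and it passes values_only=True to iter_rows() so that each row arrives as a tuple of plain values instead of cell objects.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook
from openpyxl.utils import range_boundaries

# Build a small stand-in workbook (hypothetical data for illustration only)
path = os.path.join(tempfile.mkdtemp(), "sample.xlsx")
wb_out = Workbook()
ws_out = wb_out.active
ws_out.append(["course_id", "course_title", "url", "price", "num_subscribers", "level"])
ws_out.append([1, "Course A", "https://example.com/a", 0, 100, "All Levels"])
ws_out.append([2, "Course B", "https://example.com/b", 20, 50, "Beginner"])
ws_out.append([3, "Course C", "https://example.com/c", 35, 10, "Expert"])
wb_out.save(path)

# 'A1:E3' becomes (min_col, min_row, max_col, max_row) = (1, 1, 5, 3)
bounds = range_boundaries("A1:E3")
min_col, min_row, max_col, max_row = bounds

# Read only the requested range, lazily, as tuples of plain values
wb = load_workbook(filename=path, read_only=True)
ws = wb.active
rows = []
for values in ws.iter_rows(min_col=min_col, min_row=min_row,
                           max_col=max_col, max_row=max_row,
                           values_only=True):
    data = list(values)        # columns A to E
    data.extend(values[2:])    # append columns C to E again
    rows.append(data)
wb.close()

for data in rows:
    print(data)
```

Because the range stops at row 3 and column E, the fourth data row and the level column of the stand-in workbook are never returned.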
Thus we have reached the end of this tutorial on how to handle large files in Python using openpyxl. To learn more about openpyxl, see: How to append data in Excel using openpyxl in Python