隨著國(guó)產(chǎn)化的呼聲越來(lái)越高,國(guó)家也越來(lái)越重視,數(shù)據(jù)庫(kù)國(guó)產(chǎn)化一直是大家遇到的重要困難。那么個(gè)玩意,國(guó)內(nèi)就是沒(méi)有可以替代的好產(chǎn)品,主要是國(guó)內(nèi)的環(huán)境,大家都懂的,領(lǐng)導(dǎo)們不懂技術(shù),都只看形式,以及想要看到的數(shù)據(jù),或者為了騙取經(jīng)費(fèi),科研人員忙著應(yīng)付各類的填表,報(bào)告,制度的約束,沒(méi)有很好的科研環(huán)境。所以國(guó)內(nèi)很難誕生跨時(shí)代的技術(shù)產(chǎn)品。
我們找到一篇外文,供給大家參考一下。
翻譯
這個(gè)是好問(wèn)題,值得長(zhǎng)期回答.
大部分?jǐn)?shù)據(jù)庫(kù)是由c編寫(xiě)的,通過(guò)b樹(shù)存儲(chǔ)數(shù)據(jù).在過(guò)去的一段時(shí)間李,有個(gè)產(chǎn)品叫C-Isam(c的library 用于索引順序訪問(wèn)方法),這是一個(gè)低水平就能幫助c語(yǔ)言程序員寫(xiě)數(shù)據(jù)以b 樹(shù)的數(shù)據(jù)格式.所以你需要知道b樹(shù)并且立即他是什么.
大部分?jǐn)?shù)據(jù)庫(kù)存儲(chǔ)數(shù)據(jù)的時(shí)候都會(huì)構(gòu)建索引.讓我們假設(shè)一行記錄是800字節(jié),你寫(xiě)5行到一個(gè)文件.付過(guò)行包含類似firstname,lastname,address等字段,并且你想要通過(guò)lastname搜索一條特殊的記錄,你可以打開(kāi)文件,順序搜索每一條記錄,但是這個(gè)很慢.相反的,你打開(kāi)索引文件,索引文件包含了lastname和行記錄在數(shù)據(jù)文件的位置.然后當(dāng)你你定位到你打開(kāi)的文件,調(diào)到剛才索引找到的問(wèn)題,讀取數(shù)據(jù).因?yàn)樗饕龜?shù)據(jù)非常小,他可以非??斓恼业綌?shù)據(jù).并且索引文件按照b樹(shù)組織數(shù)據(jù),他可以非??觳⑶矣行У牟檎?
所以你要理解,對(duì)于一張表,會(huì)有一份數(shù)據(jù)文件和一份(或多分)索引文件.第一個(gè)索引可以是lastname,下一個(gè)可能是SS number.當(dāng)用戶定義他們的查詢?nèi)カ@取一些數(shù)據(jù),他們會(huì)決定使用哪個(gè)索引去查找.如果你能找到任何信息在 C-ISAM 上,你將會(huì)很好的理解這個(gè)概念.
一旦你存儲(chǔ)數(shù)據(jù),索引文件,使用ISAM去達(dá)到想要通過(guò)一個(gè)值找到一條記錄,或者插入一條新的記錄.然而現(xiàn)代數(shù)據(jù)庫(kù)服務(wù)器都支持sql,所以你需要一個(gè)sql解析器,他會(huì)轉(zhuǎn)化sql語(yǔ)句到一系列相關(guān)的get.sql 可能join 2張表 ,所以哪一個(gè)表需要先去讀的優(yōu)化是有必要的(一般是根據(jù)索引命中的行數(shù)),并且如何關(guān)聯(lián)第二張表的數(shù)據(jù).sql能夠插入數(shù)據(jù),所以你需要去解析到put語(yǔ)句,但是他支持結(jié)合多個(gè)插入語(yǔ)句到事務(wù),所以你需要事務(wù)管理器來(lái)管理事務(wù),你需要事務(wù)日志去存儲(chǔ)事務(wù).
可能你需要回滾/重新存儲(chǔ) 命令去回滾你的數(shù)據(jù)文件和索引文件,可能同樣包含你的事務(wù)日志文件.如果你真的想要這么做,你可以寫(xiě)一些復(fù)制工具去備份你的數(shù)據(jù)庫(kù)到另一個(gè)服務(wù)器. 記錄如果你的客戶端程序駐留在單獨(dú)的機(jī)器上,那么你的數(shù)據(jù)庫(kù)服務(wù)器需要寫(xiě)一個(gè)連接管理器用于發(fā)送sql請(qǐng)求通過(guò)tcp/ip協(xié)議到你的服務(wù)器.授權(quán)需要一些認(rèn)證,解析請(qǐng)求,執(zhí)行g(shù)ets,發(fā)送執(zhí)行完的結(jié)果數(shù)據(jù)給客戶端.
所以這些數(shù)據(jù)庫(kù)服務(wù)器能后一直工作,尤其是為一個(gè)人.但是你創(chuàng)建一個(gè)簡(jiǎn)單的版本為這些工具在一個(gè)時(shí)間.以如何存儲(chǔ)數(shù)據(jù)和索引開(kāi)始,如何取回?cái)?shù)據(jù)通過(guò)ISAM接口.
這里有一些樹(shù)–找一些舊書(shū)關(guān)于mysql和msql,在google上找btrees和isam,找開(kāi)源的isam 的C實(shí)現(xiàn). 對(duì)c操作文件io有一個(gè)好的理解在linux服務(wù)器上.許多商業(yè)數(shù)據(jù)庫(kù)現(xiàn)在甚至不使用文件系統(tǒng)存儲(chǔ)數(shù)據(jù),由于緩存問(wèn)題–他們直接寫(xiě)行數(shù)據(jù)到硬盤.你最初只是想寫(xiě)入文件.
希望本文對(duì)你有一點(diǎn)幫助
數(shù)據(jù)庫(kù)的最簡(jiǎn)單實(shí)現(xiàn)
數(shù)據(jù)庫(kù)的最簡(jiǎn)單實(shí)現(xiàn)
所有應(yīng)用軟件之中,數(shù)據(jù)庫(kù)可能是最復(fù)雜的。
MySQL的手冊(cè)有3000多頁(yè),PostgreSQL的手冊(cè)有2000多頁(yè),Oracle的手冊(cè)更是比它們相加還要厚。
但是,自己寫(xiě)一個(gè)最簡(jiǎn)單的數(shù)據(jù)庫(kù),做起來(lái)并不難。Reddit上面有一個(gè)帖子,只用了幾百個(gè)字,就把原理講清楚了。下面是我根據(jù)這個(gè)帖子整理的內(nèi)容。
一、數(shù)據(jù)以文本形式保存
第一步,就是將所要保存的數(shù)據(jù),寫(xiě)入文本文件。這個(gè)文本文件就是你的數(shù)據(jù)庫(kù)。
為了方便讀取,數(shù)據(jù)必須分成記錄,每一條記錄的長(zhǎng)度規(guī)定為等長(zhǎng)。比如,假定每條記錄的長(zhǎng)度是800字節(jié),那么第5條記錄的開(kāi)始位置就在3200字節(jié)。
大多數(shù)時(shí)候,我們不知道某一條記錄在第幾個(gè)位置,只知道主鍵(primary key)的值。這時(shí)為了讀取數(shù)據(jù),可以一條條比對(duì)記錄。但是這樣做效率太低,實(shí)際應(yīng)用中,數(shù)據(jù)庫(kù)往往采用B樹(shù)(B-tree)格式儲(chǔ)存數(shù)據(jù)。
二、什么是B樹(shù)?
要理解B樹(shù),必須從二叉查找樹(shù)(Binary search tree)講起。
二叉查找樹(shù)是一種查找效率非常高的數(shù)據(jù)結(jié)構(gòu),它有三個(gè)特點(diǎn)。
(1)每個(gè)節(jié)點(diǎn)最多只有兩個(gè)子樹(shù)。
(2)左子樹(shù)都為小于父節(jié)點(diǎn)的值,右子樹(shù)都為大于父節(jié)點(diǎn)的值。
(3)在n個(gè)節(jié)點(diǎn)中找到目標(biāo)值,一般只需要log(n)次比較。
二叉查找樹(shù)的結(jié)構(gòu)不適合數(shù)據(jù)庫(kù),因?yàn)樗牟檎倚逝c層數(shù)相關(guān)。越處在下層的數(shù)據(jù),就需要越多次比較。極端情況下,n個(gè)數(shù)據(jù)需要n次比較才能找到目標(biāo)值。對(duì)于數(shù)據(jù)庫(kù)來(lái)說(shuō),每進(jìn)入一層,就要從硬盤讀取一次數(shù)據(jù),這非常致命,因?yàn)橛脖P的讀取時(shí)間遠(yuǎn)遠(yuǎn)大于數(shù)據(jù)處理時(shí)間,數(shù)據(jù)庫(kù)讀取硬盤的次數(shù)越少越好。
B樹(shù)是對(duì)二叉查找樹(shù)的改進(jìn)。它的設(shè)計(jì)思想是,將相關(guān)數(shù)據(jù)盡量集中在一起,以便一次讀取多個(gè)數(shù)據(jù),減少硬盤操作次數(shù)。
B樹(shù)的特點(diǎn)也有三個(gè)。
(1)一個(gè)節(jié)點(diǎn)可以容納多個(gè)值。比如上圖中,最多的一個(gè)節(jié)點(diǎn)容納了4個(gè)值。
(2)除非數(shù)據(jù)已經(jīng)填滿,否則不會(huì)增加新的層。也就是說(shuō),B樹(shù)追求"層"越少越好。
(3)子節(jié)點(diǎn)中的值,與父節(jié)點(diǎn)中的值,有嚴(yán)格的大小對(duì)應(yīng)關(guān)系。一般來(lái)說(shuō),如果父節(jié)點(diǎn)有a個(gè)值,那么就有a+1個(gè)子節(jié)點(diǎn)。比如上圖中,父節(jié)點(diǎn)有兩個(gè)值(7和16),就對(duì)應(yīng)三個(gè)子節(jié)點(diǎn),第一個(gè)子節(jié)點(diǎn)都是小于7的值,最后一個(gè)子節(jié)點(diǎn)都是大于16的值,中間的子節(jié)點(diǎn)就是7和16之間的值。
這種數(shù)據(jù)結(jié)構(gòu),非常有利于減少讀取硬盤的次數(shù)。假定一個(gè)節(jié)點(diǎn)可以容納100個(gè)值,那么3層的B樹(shù)可以容納100萬(wàn)個(gè)數(shù)據(jù),如果換成二叉查找樹(shù),則需要20層!假定操作系統(tǒng)一次讀取一個(gè)節(jié)點(diǎn),并且根節(jié)點(diǎn)保留在內(nèi)存中,那么B樹(shù)在100萬(wàn)個(gè)數(shù)據(jù)中查找目標(biāo)值,只需要讀取兩次硬盤。
三、索引
數(shù)據(jù)庫(kù)以B樹(shù)格式儲(chǔ)存,只解決了按照"主鍵"查找數(shù)據(jù)的問(wèn)題。如果想查找其他字段,就需要建立索引(index)。
所謂索引,就是以某個(gè)字段為關(guān)鍵字的B樹(shù)文件。假定有一張"雇員表",包含了員工號(hào)(主鍵)和姓名兩個(gè)字段。可以對(duì)姓名建立索引文件,該文件以B樹(shù)格式對(duì)姓名進(jìn)行儲(chǔ)存,每個(gè)姓名后面是其在數(shù)據(jù)庫(kù)中的位置(即第幾條記錄)。查找姓名的時(shí)候,先從索引中找到對(duì)應(yīng)第幾條記錄,然后再?gòu)谋砀裰凶x取。
這種索引查找方法,叫做"索引順序存取方法"(Indexed Sequential Access Method),縮寫(xiě)為ISAM。它已經(jīng)有多種實(shí)現(xiàn)(比如C-ISAM庫(kù)和D-ISAM庫(kù)),只要使用這些代碼庫(kù),就能自己寫(xiě)一個(gè)最簡(jiǎn)單的數(shù)據(jù)庫(kù)。
四、高級(jí)功能
部署了最基本的數(shù)據(jù)存?。òㄋ饕┮院螅€可以實(shí)現(xiàn)一些高級(jí)功能。
(1)SQL語(yǔ)言是數(shù)據(jù)庫(kù)通用操作語(yǔ)言,所以需要一個(gè)SQL解析器,將SQL命令解析為對(duì)應(yīng)的ISAM操作。
(2)數(shù)據(jù)庫(kù)連接(join)是指數(shù)據(jù)庫(kù)的兩張表通過(guò)"外鍵",建立連接關(guān)系。你需要對(duì)這種操作進(jìn)行優(yōu)化。
(3)數(shù)據(jù)庫(kù)事務(wù)(transaction)是指批量進(jìn)行一系列數(shù)據(jù)庫(kù)操作,只要有一步不成功,整個(gè)操作都不成功。所以需要有一個(gè)"操作日志",以便失敗時(shí)對(duì)操作進(jìn)行回滾。
(4)備份機(jī)制:保存數(shù)據(jù)庫(kù)的副本。
(5)遠(yuǎn)程操作:使得用戶可以在不同的機(jī)器上,通過(guò)TCP/IP協(xié)議操作數(shù)據(jù)庫(kù)。
外文原文如下,翻譯可能不好,可以參考:
How do you build a database? (self.Database)
Its a great question, and deserves a long answer.
Most database servers are built in C, and store data using B-tree type constructs.
In the old days there was a product called C-Isam (c library for an indexed sequential access method) which is a low level library to help C programmers write data in B-tree format.
So you need to know about btrees and understand what these are.
Most databases store data separate to indexes. Lets assume a record (or row) is 800 bytes long and you write 5 rows of data to a file.
If the row contains columns such as first name, last name, address etc. and you want to search for a specific record by last name, you can open the file and sequentially search through each record but this is very slow.
Instead you open an index file which just contains the lastname and the position of the record in the data file.
Then when you have the position you open the data file, lseek to that position and read the data.
Because index data is very small it is much quicker to search through index files.
Also as the index files are stored in btrees in it very quick to effectively do a quicksearch (divide and conquer) to find the record you are looking for.
So you understand for one “table” you will have a data file with the data and one (or many) index files.
The first index file could be for lastname, the next could be to search by SS number etc.
When the user defines their query to get some data, they decide which index file to search through.
If you can find any info on C-ISAM (there used to be an open source version (or cheap commercial) called D-ISAM) you will understand this concept quite well.
Once you have stored data and have index files, using an ISAM type approach allows you to GET a record based on a value, or PUT a new record.
However modern database servers all support SQL, so you need an SQL parser that translates the SQL statement into a sequence of related GETs.
SQL may join 2 tables so an optimizer is also needed to decide which table to read first (normally based on number of rows in each table and indexes available) and how to relate it to the next table.
SQL can INSERT data so you need to parse that into PUT statements but it can also combine multiple INSERTS into transactions so you need a transaction manager to control this, and you will need transaction logs to store wip/completed transactions.
It is possible you will need some backup/restore commands to backup your data files and index files and maybe also your transaction log files, and if you really want to go for it you could write some replication tools to read your transaction log and replicate the transactions to a backup database on a different server.
Note if you want your client programs (for example an SQL UI like phpmyadmin) to reside on separate machine than your database server you will need to write a connection manager that sends the SQL requests over TCP/IP to your server, then authenticate it using some credentials, parse the request, run your GETS and send back the data to the client.
So these database servers can be a lot of work, especially for one person.
But you can create simple versions of these tools one at a time.
Start with how to store data and indexes, and how to retrieve data using an ISAM type interface.
There are books out there - look for older books on mysql and msql, look for anything on google re btrees and isam, look for open source C libraries that already do isam.
Get a good understanding on file IO on a linux machine using C.
Many commercial databases now dont even use the filesystem for their data files because of cacheing issues - they write directly to raw disk.
You want to just write to files initially.
I hope this helps a little bit.
2024-07-11 16:25:16
2024-07-08 16:52:55
2024-07-01 11:17:11
2024-05-17 16:26:26
2024-05-15 14:37:53
2024-05-09 18:08:16
2024-04-29 16:29:55
2024-04-24 15:58:15
2024-04-22 17:27:24
2024-03-18 15:17:11
本文版權(quán)歸作者所有!如有侵權(quán),請(qǐng)聯(lián)系管理員刪除。文章僅代表作者觀點(diǎn),不代表行迪醫(yī)管立場(chǎng)。
嘗試看看下列相關(guān)的交流摘要推薦
二叉樹(shù) 硬盤 記錄 查找 文件 節(jié)點(diǎn) 索引 數(shù)據(jù) 數(shù)據(jù)庫(kù) 葉亞維 三甲醫(yī)院 醫(yī)院排名 吳洪川 指標(biāo) 名院名科 國(guó)家衛(wèi)健委 院辦 醫(yī)共體 運(yùn)營(yíng)管理委員會(huì) 混雜性
網(wǎng)友評(píng)論
未登錄