2012-04-08 45 views
1

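What is a good way to set up ranges on a number line? I have some data like the list below, where the second field is the probability of the first field, so "0: 0.017" means 0 occurs with probability 0.017. All the probabilities sum to 1.

My question is: how do I build a "range line" from the probabilities so that I can find the lower and upper bound for each character? So 0 would be [0, 0.017), [0.017, 0.022), and so on.

I want to implement arithmetic coding.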

(0: 0.017, 
1: 0.022, 
2: 0.033, 
3: 0.033, 
4: 0.029, 
5: 0.028, 
6: 0.035, 
7: 0.032, 
8: 0.028, 
9: 0.027, 
a: 0.019, 
b: 0.022, 
c: 0.029, 
d: 0.03, 
e: 0.028, 
f: 0.035, 
g: 0.026, 
h: 0.037, 
i: 0.029, 
j: 0.025, 
k: 0.025, 
l: 0.037, 
m: 0.025, 
n: 0.023, 
o: 0.026, 
p: 0.035, 
q: 0.033, 
r: 0.031, 
s: 0.023, 
t: 0.022, 
u: 0.038, 
v: 0.022, 
w: 0.016, 
x: 0.026, 
y: 0.021, 
z: 0.033,) 

Edit:

Never mind, I figured it out, I was just messing up some boring math... Thanks for all the input!

+0

Is the data in a text file? Or is this some kind of data structure? – George 2012-04-08 01:19:59

+0

@George It needs to be a data structure; I get the probabilities from a text file of random characters/digits – iCodeLikeImDrunk 2012-04-08 01:25:35

+3

*"0 would be [0, 0.017), [0.017, 0.022)"* - Don't you mean "0 would be [0, 0.017), 1 would be [0.017, 0.017 + 0.022), 2 would be [0.017 + 0.022, 0.017 + 0.022 + 0.033)"? – ninjagecko 2012-04-08 01:28:26

Answers

2
# The data is input in '1: 0.022,' format
def process_data(line):
    # build a cleaned-up copy of the line
    result_line = ''
    for character in line:
        # keep it if it is either a number or a letter
        if character.isdigit() or character.isalpha():
            result_line += character
        # we want the decimal point
        elif character == '.':
            result_line += character
        # else we replace it with space ' '
        else:
            result_line += ' '
    return result_line

my_list = []

with open('input.txt') as file:
    for lines in file:
        processed_line = process_data(lines)
        # temp_list has ['letter', 'frequency']
        temp_list = processed_line.split()
        # skip blank or malformed lines instead of raising IndexError
        if len(temp_list) != 2:
            continue
        value = temp_list[0]
        # Cast to float, since split() returns strings
        frequency = float(temp_list[1])
        my_list.append([value, frequency])

print(my_list)

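From this point you can work out what to do with your values. I commented the code (admittedly a very simple, naive way of handling the input file). But my_list is now clean and well formatted, with each entry holding a string (the value) and a float (the frequency). Hope this helps.

Output from the code above: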

[['0', 0.017], ['1', 0.022], ['2', 0.033], ['3', 0.033], 
['4', 0.029], ['5', 0.028], ['6', 0.035], ['7', 0.032], 
['8', 0.028], ['9', 0.027], ['a', 0.019], ['b', 0.022], 
['c', 0.029], ['d', 0.03], ['e', 0.028], ['f', 0.035], 
['g', 0.026], ['h', 0.037], ['i', 0.029], ['j', 0.025], 
['k', 0.025], ['l', 0.037], ['m', 0.025], ['n', 0.023], 
['o', 0.026], ['p', 0.035], ['q', 0.033], ['r', 0.031], 
['s', 0.023], ['t', 0.022], ['u', 0.038], ['v', 0.022], 
['w', 0.016], ['x', 0.026], ['y', 0.021], ['z', 0.033]] 

Then...

# Took a page out of TokenMacGuy, credit to him
distribution = []
distribution.append(0.00)
total = 0.0 # Create a float here

for entry in my_list:
    distribution.append(entry[1])
    total += entry[1] # add this entry's frequency (not the stale `frequency` variable)
    total = round(total, 3) # Round to 3 decimal places

distribution.append(1.00) # Pad the end with 1.00 (the loop below reads distribution[count+1])
print(distribution) # Print to check

The output is here:

[0.0, 0.017, 0.022, 0.033, 0.033, 0.029, 0.028, 0.035, 0.032, 
0.028, 0.027, 0.019, 0.022, 0.029, 0.03, 0.028, 0.035, 0.026, 
0.037, 0.029, 0.025, 0.025, 0.037, 0.025, 0.023, 0.026, 0.035, 
0.033, 0.031, 0.023, 0.022, 0.038, 0.022, 0.016, 0.026, 0.021, 
0.033, 1.0] 

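Finally, to print the final result: nothing special there, I used pattern.format to make the lines look nicer. This is pretty much computed the way ninjagecko described. I had to pad the distribution with 0.00 and 1.00, because the frequencies alone do not include them. It is a very straightforward implementation once we have figured out how to handle the probabilities.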

pattern = '{0}: [{1:1.3f}, {2:1.3f})' 
count = 1 # a counter to keep track of the index 

pre_p = distribution[0] 
p = distribution[1] 

# Here we will print it out at the end in the format you said in the question 
for entry in my_list: 
    print(pattern.format(entry[0], pre_p, p)) 
    pre_p += distribution[count] 
    p += distribution[count+1] 
    count = count + 1 

Output:

0: [0.000, 0.017) 
1: [0.017, 0.039) 
2: [0.039, 0.072) 
3: [0.072, 0.105) 
4: [0.105, 0.134) 
5: [0.134, 0.162) 
6: [0.162, 0.197) 
7: [0.197, 0.229) 
8: [0.229, 0.257) 
9: [0.257, 0.284) 
a: [0.284, 0.303) 
b: [0.303, 0.325) 
c: [0.325, 0.354) 
d: [0.354, 0.384) 
e: [0.384, 0.412) 
f: [0.412, 0.447) 
g: [0.447, 0.473) 
h: [0.473, 0.510) 
i: [0.510, 0.539) 
j: [0.539, 0.564) 
k: [0.564, 0.589) 
l: [0.589, 0.626) 
m: [0.626, 0.651) 
n: [0.651, 0.674) 
o: [0.674, 0.700) 
p: [0.700, 0.735) 
q: [0.735, 0.768) 
r: [0.768, 0.799) 
s: [0.799, 0.822) 
t: [0.822, 0.844) 
u: [0.844, 0.882) 
v: [0.882, 0.904) 
w: [0.904, 0.920) 
x: [0.920, 0.946) 
y: [0.946, 0.967) 
z: [0.967, 1.000) 

The full source is here: http://codepad.org/a6YkHhed
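For comparison, the same half-open intervals can probably be built in a few lines with `itertools.accumulate`. This is only a sketch, not the code behind the codepad link: the helper name `build_intervals` and the three-row sample are made up for illustration, and rounding to three decimals is an assumption that matches the precision of the data in the question.

```python
from itertools import accumulate

def build_intervals(pairs, ndigits=3):
    """Map each symbol to its half-open interval [low, high)."""
    # Running totals give the upper bound of each symbol's interval.
    uppers = [round(t, ndigits) for t in accumulate(p for _, p in pairs)]
    # Each lower bound is the previous symbol's upper bound, starting at 0.
    lowers = [0.0] + uppers[:-1]
    return {sym: (lo, hi) for (sym, _), lo, hi in zip(pairs, lowers, uppers)}

# First three rows of the question's data, used as a small sample.
sample = [('0', 0.017), ('1', 0.022), ('2', 0.033)]
intervals = build_intervals(sample)
for sym, (lo, hi) in intervals.items():
    print('{0}: [{1:.3f}, {2:.3f})'.format(sym, lo, hi))
```

Rounding each running total keeps floating-point drift out of the bounds, which matters if the bounds are later compared for equality.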

1

Create a dictionary where the keys are your characters and each value is a pair defining the lower and upper bounds.

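prev_p = 0.0
bounds = {}
# a_file and parse_the_line are placeholders for your own path and parser
with open(a_file) as f:
    for line in f:
        character, p = parse_the_line(line)
        # the upper bound is the running total, not the raw probability
        bounds[character] = (prev_p, prev_p + p)
        prev_p += p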
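Since the goal in the question is arithmetic coding, here is a rough sketch of how a `bounds` dictionary like this is typically used on the encoding side: the current interval is narrowed once per symbol. The function name `encode` and the three-symbol alphabet are hypothetical, picked so the arithmetic is exact in binary floating point; this is not a complete coder.

```python
def encode(message, bounds):
    """Narrow [low, high) once per symbol of the message."""
    low, high = 0.0, 1.0
    for symbol in message:
        sym_low, sym_high = bounds[symbol]
        width = high - low
        # Rescale the symbol's interval into the current interval.
        high = low + width * sym_high
        low = low + width * sym_low
    return low, high

# Hypothetical alphabet: a -> [0, 0.5), b -> [0.5, 0.75), c -> [0.75, 1).
bounds = {'a': (0.0, 0.5), 'b': (0.5, 0.75), 'c': (0.75, 1.0)}
low, high = encode('ab', bounds)
print(low, high)  # 0.25 0.375
```

Any number inside the final interval identifies the message once its length is known. Plain floats run out of precision after a few dozen symbols, so practical coders use integer arithmetic with renormalization instead.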
2

Converting your data to Python is left as an exercise:

>>> corpus = [('0', 0.017), ('1', 0.022), ('2', 0.033), ('3', 0.033), ('4', 0.029), 
...   ('5', 0.028), ('6', 0.035), ('7', 0.032), ('8', 0.028), ('9', 0.027), 
...   ('a', 0.019), ('b', 0.022), ('c', 0.029), ('d', 0.030), ('e', 0.028), 
...   ('f', 0.035), ('g', 0.026), ('h', 0.037), ('i', 0.029), ('j', 0.025), 
...   ('k', 0.025), ('l', 0.037), ('m', 0.025), ('n', 0.023), ('o', 0.026), 
...   ('p', 0.035), ('q', 0.033), ('r', 0.031), ('s', 0.023), ('t', 0.022), 
...   ('u', 0.038), ('v', 0.022), ('w', 0.016), ('x', 0.026), ('y', 0.021), 
...   ('z', 0.033)] 

Create a cumulative sum:

>>> distribution = [] 
>>> total = 0.0 
>>> for letter, frequency in corpus: 
...  distribution.append(total) 
...  total += frequency 
... 

Actually using this kind of data is the bread and butter of the bisect module.

>>> import bisect, random 
>>> def random_letter(): 
...  value = random.random() 
...  index = bisect.bisect(distribution, value) - 1 
...  return corpus[index][0] 
... 
>>> [random_letter() for n in range(10)] # doctest: +SKIP 
['d', '6', 'p', 'c', '8', 'f', '7', 'm', 'z', '7'] 
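The same lookup works for a fixed value rather than a random one, which is roughly what the decoding side of arithmetic coding needs. The sketch below assumes a small hypothetical table (`symbols`, `lower_bounds`) instead of the full corpus, and the helper name `symbol_for` is made up; it mainly shows how `bisect.bisect_right` behaves exactly on an interval boundary.

```python
import bisect

# Hypothetical table: lower_bounds[i] is where the interval of symbols[i] starts.
symbols = ['a', 'b', 'c']
lower_bounds = [0.0, 0.5, 0.75]

def symbol_for(value):
    """Return the symbol whose interval [low, high) contains value."""
    if not 0.0 <= value < 1.0:
        raise ValueError('value must be in [0, 1)')
    # bisect_right counts the lower bounds <= value; the last of those
    # bounds starts the interval that contains value.
    index = bisect.bisect_right(lower_bounds, value) - 1
    return symbols[index]

print(symbol_for(0.3), symbol_for(0.5), symbol_for(0.99))  # a b c
```

Because the intervals are closed on the left, a value equal to a lower bound (0.5 here) belongs to the symbol that starts there, not to the one before it.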